PyThaiNLP / Pythainlp
PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK with a focus on the Thai language.
A Thai-language version of this document is available at README_TH.MD.
News
We are conducting a 2-minute survey to learn more about your experience using the library and what you expect the library to be able to do. Take part in this survey.
Version | Description | Status |
---|---|---|
2.2.6 | Stable | Change Log |
dev | Release Candidate for 2.3 | Change Log |
Please follow our PyThaiNLP Facebook page for more updates.
Getting Started with PyThaiNLP
The PyThaiNLP Get Started tutorial walks through the features of PyThaiNLP, and we also have tutorials for specific tasks. Please visit our tutorial page.
The latest documentation is available at https://pythainlp.github.io/docs/2.2/.
We try to make the package as easy to use as possible; therefore, some additional data (like word lists and language models) may be downloaded automatically at runtime. PyThaiNLP caches additional data under the directory `~/pythainlp-data` by default, but the user can change this location by setting the environment variable `PYTHAINLP_DATA_DIR`. See the corpus catalog at PyThaiNLP/pythainlp-corpus.
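The lookup described above can be sketched as follows. This is not PyThaiNLP's own code: `resolve_data_dir` is a hypothetical helper that only mirrors the documented behaviour (use `PYTHAINLP_DATA_DIR` if set, otherwise fall back to `~/pythainlp-data`).

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where additional data would be cached.

    Hypothetical helper, written only to illustrate the documented rule.
    """
    # Use the real process environment unless a mapping is passed in.
    env = os.environ if env is None else env
    custom = env.get("PYTHAINLP_DATA_DIR")
    if custom:
        # A user-supplied location takes precedence over the default.
        return Path(custom).expanduser()
    # Documented default: ~/pythainlp-data
    return Path.home() / "pythainlp-data"


print(resolve_data_dir())
```

To redirect the cache in practice, it is safest to set the environment variable before the library is imported, for example in the shell that launches Python.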
Capabilities
PyThaiNLP provides standard NLP functions for Thai, such as part-of-speech tagging and linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via a command-line interface.
List of Features
- Convenient character and word classes, like Thai consonants (`pythainlp.thai_consonants`), vowels (`pythainlp.thai_vowels`), digits (`pythainlp.thai_digits`), and stop words (`pythainlp.corpus.thai_stopwords`), comparable to constants like `string.ascii_letters`, `string.digits`, and `string.punctuation`
- Thai linguistic unit segmentation/tokenization, including sentence (`sent_tokenize`), word (`word_tokenize`), and subword segmentation based on Thai Character Cluster (`subword_tokenize`)
- Thai part-of-speech tagging (`pos_tag`)
- Thai spelling suggestion and correction (`spell` and `correct`)
- Thai transliteration (`transliterate`)
- Thai soundex (`soundex`) with three engines (`lk82`, `udom83`, `metasound`)
- Thai collation (sort by dictionary order) (`collate`)
- Read out numbers as Thai words (`bahttext`, `num_to_thaiword`)
- Thai datetime formatting (`thai_strftime`)
- Fix for text typed in the wrong keyboard layout, Thai or English (`eng_to_thai`, `thai_to_eng`)
- Command-line interface for basic functions, like tokenization and POS tagging (run `thainlp` in your shell)
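As a small illustration of the character classes, the sketch below uses `pythainlp.thai_digits` the way one would use `string.digits`. The helper `thai_digits_to_arabic` is a hypothetical name introduced here, and the fallback assumes only that Thai digits occupy U+0E50 to U+0E59, so the block runs even without PyThaiNLP installed.

```python
try:
    # Provided by PyThaiNLP: a string holding the ten Thai digits.
    from pythainlp import thai_digits
except ImportError:
    # Fallback for this sketch only: Thai digits are U+0E50..U+0E59.
    thai_digits = "".join(chr(cp) for cp in range(0x0E50, 0x0E5A))

# Translation table mapping each Thai digit to its Arabic counterpart.
_THAI_TO_ARABIC = str.maketrans(thai_digits, "0123456789")


def thai_digits_to_arabic(text: str) -> str:
    """Replace Thai digits with Arabic digits (hypothetical helper)."""
    return text.translate(_THAI_TO_ARABIC)


print(thai_digits_to_arabic("พ.ศ. ๒๕๖๓"))  # prints "พ.ศ. 2563"
```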
Please see our tutorials on how to apply these functions to machine-learning problems.
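As one hedged sketch of such an application, a tokenizer like `word_tokenize` can feed a simple bag-of-words feature extractor. `bag_of_words` is a hypothetical helper, not part of PyThaiNLP; it accepts any callable that splits text into tokens, so it can be tried without the library installed.

```python
from collections import Counter
from typing import Callable, List


def bag_of_words(text: str, tokenize: Callable[[str], List[str]]) -> Counter:
    """Count tokens produced by `tokenize`, skipping whitespace-only tokens."""
    return Counter(tok for tok in tokenize(text) if tok.strip())


try:
    from pythainlp.tokenize import word_tokenize
except ImportError:
    word_tokenize = None  # PyThaiNLP not installed; the demo below is skipped

if word_tokenize is not None:
    # Thai has no spaces between words, so a Thai-aware tokenizer is needed.
    print(bag_of_words("ผมรักภาษาไทย ผมรักแมว", word_tokenize))
```

The resulting counts can be passed to whatever vectorizer or classifier the project already uses. Passing the tokenizer as an argument keeps the feature code independent of the segmentation engine.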
Installation
pip install --upgrade pythainlp
This will install the latest stable release of PyThaiNLP. PyThaiNLP uses pip as its package manager and PyPI as its main distribution channel; see https://pypi.org/project/pythainlp/
Install different releases:
- Stable release:
pip install --upgrade pythainlp
- Pre-release (near ready):
pip install --upgrade --pre pythainlp
- Development (likely to break things):
pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
Installation Options
Some functionalities, like Thai WordNet, may require extra packages. To install those requirements, specify a set of `[name]` immediately after `pythainlp`:
pip install pythainlp[extra1,extra2,...]
List of possible `extras`
- `full` (install everything)
- `attacut` (to support attacut, a fast and accurate tokenizer)
- `benchmarks` (for word tokenization benchmarking)
- `icu` (for ICU, International Components for Unicode, support in transliteration and tokenization)
- `ipa` (for IPA, International Phonetic Alphabet, support in transliteration)
- `ml` (to support ULMFiT models for classification)
- `thai2fit` (for Thai word vector)
- `thai2rom` (for machine-learnt romanization)
- `wordnet` (for Thai WordNet API)
For dependency details, look at the `extras` variable in `setup.py`.
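Before calling a feature that depends on an extra, it can help to check whether the underlying module is importable. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, and the module names `attacut` and `icu` are assumptions about what the corresponding extras install (check `setup.py` for the authoritative list).

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names for the `attacut` and `icu` extras.
print(missing_modules(["attacut", "icu"]))
```

An empty list means every listed module can be found; any name that is returned points to an extra that still needs to be installed.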
Command-Line Interface
Some PyThaiNLP functionality is available from the command line through the `thainlp` command.
For example, displaying a catalog of datasets:
thainlp data catalog
Showing usage help:
thainlp help
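The same commands can be driven from a Python script. The following is a minimal sketch using the standard library; `run_cli` is a hypothetical wrapper, and it assumes only that the `thainlp` executable is on `PATH` once PyThaiNLP is installed.

```python
import shutil
import subprocess
from typing import List, Optional


def run_cli(args: List[str], executable: str = "thainlp") -> Optional[str]:
    """Run the command-line tool and return its standard output.

    Returns None when the executable cannot be found on PATH.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, *args], capture_output=True, text=True, check=False
    )
    return result.stdout


output = run_cli(["help"])
print(output if output is not None else "thainlp is not installed")
```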
Python 2 Users
- PyThaiNLP 2 supports Python 3.6+. Some functions may work with older versions of Python 3, but they are not well tested and will not be supported. See the 1.7 -> 2.0 change log.
- Python 2.7 users can use PyThaiNLP 1.6.
Citations
If you use `PyThaiNLP` in your project or publication, please cite the library as follows:
Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, & Pattarawat Chormai. (2016, Jun 27). PyThaiNLP: Thai Natural Language Processing in Python. Zenodo. http://doi.org/10.5281/zenodo.3519354
or BibTeX entry:
@misc{pythainlp,
author = {Wannaphong Phatthiyaphaibun and Korakot Chaovavanich and Charin Polpanumas and Arthit Suriyawongkul and Lalita Lowphansirikul and Pattarawat Chormai},
title = {{PyThaiNLP: Thai Natural Language Processing in Python}},
month = Jun,
year = 2016,
doi = {10.5281/zenodo.3519354},
publisher = {Zenodo},
url = {http://doi.org/10.5281/zenodo.3519354}
}
Contribute to PyThaiNLP
- Please do fork and create a pull request :)
- For style guide and other information, including references to algorithms we use, please refer to our contributing page.
Who uses PyThaiNLP?
You can read INTHEWILD.md.
Licenses
Material | License |
---|---|
PyThaiNLP Source Code and Notebooks | Apache Software License 2.0 |
Corpora, datasets, and documentation created by PyThaiNLP | Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0) |
Language models created by PyThaiNLP | Creative Commons Attribution 4.0 International Public License (CC-by) |
Other corpora and models that may be included with PyThaiNLP | See Corpus License |
Model Cards
For technical details, caveats, and ethical considerations of the models developed and used in PyThaiNLP, see Model cards.
Sponsors
Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have been supported by VISTEC-depa Thailand Artificial Intelligence Research Institute.