PyThaiNLP / Pythainlp

Licence: apache-2.0
Thai Natural Language Processing in Python.

PyThaiNLP: Thai Natural Language Processing in Python

PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK, with a focus on the Thai language.

For details in Thai, see README_TH.MD.

News

We are conducting a 2-minute survey to learn more about your experience using the library and what you expect it to be able to do. Please take part in this survey.

Version   Description                 Status
2.2.6     Stable                      Change Log
dev       Release Candidate for 2.3   Change Log

Please follow our PyThaiNLP Facebook page for more updates.

Getting Started with PyThaiNLP

We provide the PyThaiNLP Get Started Tutorial for exploring the library's features. We also have tutorials for specific tasks; please visit our tutorial page.

The latest documentation is available at https://pythainlp.github.io/docs/2.2/.

We try to make the package as easy to use as possible; therefore, some additional data (like word lists and language models) may be downloaded automatically at runtime. PyThaiNLP caches this data under the directory ~/pythainlp-data by default, but you can change the location by setting the environment variable PYTHAINLP_DATA_DIR. See the corpus catalog at PyThaiNLP/pythainlp-corpus.
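
The lookup described above can be sketched in a few lines. This is only an illustration of the documented behaviour, not the library's own code: resolve_data_dir is a hypothetical helper, and the paths used are placeholders.

```python
import os


def resolve_data_dir(env=None):
    """Hypothetical helper: mimic the lookup order described above."""
    env = os.environ if env is None else env
    custom = env.get("PYTHAINLP_DATA_DIR")
    if custom:
        # An explicit setting wins over the default location.
        return os.path.abspath(os.path.expanduser(custom))
    # Default location named in this README: ~/pythainlp-data
    return os.path.join(os.path.expanduser("~"), "pythainlp-data")


# To redirect the real cache, set the variable before importing pythainlp:
# os.environ["PYTHAINLP_DATA_DIR"] = "/path/to/cache"
print(resolve_data_dir({"PYTHAINLP_DATA_DIR": "/tmp/thai-data"}))
print(resolve_data_dir({}))
```

Setting the variable before the first import of pythainlp is the cautious choice, since it does not depend on when the library reads it.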

Capabilities

PyThaiNLP provides standard NLP functions for Thai, such as part-of-speech tagging and linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via a command-line interface.

List of Features
  • Convenient character and word classes, like Thai consonants (pythainlp.thai_consonants), vowels (pythainlp.thai_vowels), digits (pythainlp.thai_digits), and stop words (pythainlp.corpus.thai_stopwords), comparable to constants like string.ascii_letters, string.digits, and string.punctuation
  • Thai linguistic unit segmentation/tokenization, including sentence (sent_tokenize), word (word_tokenize), and subword segmentations based on Thai Character Cluster (subword_tokenize)
  • Thai part-of-speech tagging (pos_tag)
  • Thai spelling suggestion and correction (spell and correct)
  • Thai transliteration (transliterate)
  • Thai soundex (soundex) with three engines (lk82, udom83, metasound)
  • Thai collation (sort by dictionary order) (collate)
  • Reading numbers out as Thai words (bahttext, num_to_thaiword)
  • Thai datetime formatting (thai_strftime)
  • Fix for text typed with the wrong Thai/English keyboard layout (eng_to_thai, thai_to_eng)
  • Command-line interface for basic functions, like tokenization and pos tagging (run thainlp in your shell)
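
As a rough illustration of two of the functions listed above, the sketch below tokenizes a short sentence and reads a number out in Thai words. The function names come from the list; the import locations (pythainlp.tokenize and pythainlp.util) are stated here as an assumption about the package layout, and the imports are guarded so the snippet still loads where PyThaiNLP is not installed. The demo wrapper is hypothetical.

```python
# Guarded import: the sketch degrades to a no-op without PyThaiNLP.
try:
    from pythainlp.tokenize import word_tokenize
    from pythainlp.util import num_to_thaiword
except ImportError:
    word_tokenize = None
    num_to_thaiword = None


def demo(text, number):
    """Return (tokens, number_in_words), or None if PyThaiNLP is unavailable."""
    if word_tokenize is None or num_to_thaiword is None:
        return None
    tokens = word_tokenize(text)      # word segmentation with the default engine
    words = num_to_thaiword(number)   # the number spelled out in Thai
    return tokens, words


result = demo("ผมรักภาษาไทย", 21)
print(result)
```

The exact token boundaries depend on the tokenizer engine and dictionary in use, so no expected output is shown.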

Please see our tutorials on how to apply these functions to machine-learning problems.

Installation

pip install --upgrade pythainlp

This installs the latest stable release of PyThaiNLP. PyThaiNLP uses pip as its package manager and PyPI as its main distribution channel; see https://pypi.org/project/pythainlp/.

Install different releases:

  • Stable release: pip install --upgrade pythainlp
  • Pre-release (near ready): pip install --upgrade --pre pythainlp
  • Development (likely to break things): pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip

Installation Options

Some features, like Thai WordNet, may require extra packages. To install them, list the extras you need in square brackets immediately after pythainlp:

pip install pythainlp[extra1,extra2,...]
List of possible `extras`
  • full (install everything)
  • attacut (to support attacut, a fast and accurate tokenizer)
  • benchmarks (for word tokenization benchmarking)
  • icu (for ICU, International Components for Unicode, support in transliteration and tokenization)
  • ipa (for IPA, International Phonetic Alphabet, support in transliteration)
  • ml (to support ULMFiT models for classification)
  • thai2fit (for Thai word vector)
  • thai2rom (for machine-learnt romanization)
  • wordnet (for Thai WordNet API)

For dependency details, see the extras variable in setup.py.
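
Before calling a feature that depends on an extra, it can help to check whether the optional module is importable. The following is a minimal sketch using only the standard library; missing_modules is a hypothetical helper, and "attacut" is assumed to be the import name behind the attacut extra.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# "attacut" is assumed to be the import name behind the attacut extra.
print(missing_modules(["attacut"]))
```

An empty list means every named module was found; any name that is returned points to an extra that still needs to be installed.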

Command-Line Interface

Some PyThaiNLP functions can be used from the command line with the thainlp command.

For example, displaying a catalog of datasets:

thainlp data catalog

Showing usage help:

thainlp help
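
The same command can also be driven from a Python script. The sketch below runs only the thainlp help invocation shown above and returns nothing when the command is not on PATH; thainlp_help is a hypothetical wrapper, not part of the library.

```python
import shutil
import subprocess


def thainlp_help():
    """Return the output of `thainlp help`, or None if the command is absent."""
    if shutil.which("thainlp") is None:
        return None
    done = subprocess.run(["thainlp", "help"], capture_output=True, text=True)
    # Some CLIs print help to stderr, so both streams are returned.
    return done.stdout + done.stderr


help_text = thainlp_help()
print(help_text if help_text is not None else "thainlp is not on PATH")
```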

Python 2 Users

Citations

If you use PyThaiNLP in your project or publication, please cite the library as follows:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, & Pattarawat Chormai. (2016, Jun 27). PyThaiNLP: Thai Natural Language Processing in Python. Zenodo. http://doi.org/10.5281/zenodo.3519354

or BibTeX entry:

@misc{pythainlp,
    author       = {Wannaphong Phatthiyaphaibun and Korakot Chaovavanich and Charin Polpanumas and Arthit Suriyawongkul and Lalita Lowphansirikul and Pattarawat Chormai},
    title        = {{PyThaiNLP: Thai Natural Language Processing in Python}},
    month        = Jun,
    year         = 2016,
    doi          = {10.5281/zenodo.3519354},
    publisher    = {Zenodo},
    url          = {http://doi.org/10.5281/zenodo.3519354}
}

Contribute to PyThaiNLP

  • Please do fork and create a pull request :)
  • For the style guide and other information, including references to the algorithms we use, please refer to our contributing page.

Who uses PyThaiNLP?

You can read INTHEWILD.md.

Licenses

Item                                                          License
PyThaiNLP source code and notebooks                           Apache Software License 2.0
Corpora, datasets, and documentation created by PyThaiNLP     Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0)
Language models created by PyThaiNLP                          Creative Commons Attribution 4.0 International Public License (CC-by)
Other corpora and models that may be included with PyThaiNLP  See Corpus License

Model Cards

For technical details, caveats, and ethical considerations of the models developed and used in PyThaiNLP, see Model cards.

Sponsors

Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have been supported by VISTEC-depa Thailand Artificial Intelligence Research Institute.


Made with ❤️ | PyThaiNLP Team 💻 | "We build Thai NLP" 🇹🇭

We have only one official repository at https://github.com/PyThaiNLP/pythainlp and another mirror at https://gitlab.com/pythainlp/pythainlp
Beware of malware if you use code from mirrors other than the official two at GitHub and GitLab.