Licence: gpl-3.0
Underthesea - Vietnamese NLP Toolkit

Open-source Vietnamese Natural Language Processing Toolkit

Underthesea is:

🌊 A Vietnamese NLP toolkit. Underthesea is a suite of open-source Python modules, data sets, and tutorials supporting research and development in Vietnamese Natural Language Processing. It provides a simple API for quickly applying pretrained NLP models to your Vietnamese text, covering word segmentation, part-of-speech (PoS) tagging, named entity recognition (NER), text classification, and dependency parsing.

🌊 A PyTorch library. Underthesea is backed by PyTorch, one of the most popular deep learning libraries, which makes it easy to train your deep learning models and experiment with new approaches using Underthesea modules and classes.

🌊 Open-source software. Underthesea is published under the GNU General Public License v3.0. Permissions of this strong copyleft license are conditioned on making available the complete source code of licensed works and modifications, which include larger works using a licensed work, under the same license.

💫 Version 1.3.1 out now!

Installation

To install underthesea, simply:

$ pip install underthesea
✨🍰✨

Satisfaction, guaranteed.

Tutorials

1. Sentence Segmentation

Usage

>>> from underthesea import sent_tokenize
>>> text = 'Taylor cho biết lúc đầu cô cảm thấy ngại với cô bạn thân Amanda nhưng rồi mọi thứ trôi qua nhanh chóng. Amanda cũng thoải mái với mối quan hệ này.'

>>> sent_tokenize(text)
[
  "Taylor cho biết lúc đầu cô cảm thấy ngại với cô bạn thân Amanda nhưng rồi mọi thứ trôi qua nhanh chóng.",
  "Amanda cũng thoải mái với mối quan hệ này."
]

2. Word Segmentation

Usage

>>> from underthesea import word_tokenize
>>> sentence = 'Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò'

>>> word_tokenize(sentence)
['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm', 'sò']

>>> word_tokenize(sentence, format="text")
'Chàng_trai 9X Quảng_Trị khởi_nghiệp từ nấm sò'
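
The two output formats are closely related. Judging from the example above, format="text" appears to join the syllables of each multi-syllable word with underscores and separate words with spaces. A minimal sketch of that conversion in plain Python, assuming this convention holds in general (to_text_format is a hypothetical helper, not part of underthesea):

```python
def to_text_format(words):
    """Join the syllables of each word with "_" and the words with spaces."""
    return " ".join(word.replace(" ", "_") for word in words)


# Token list copied from the word_tokenize example above.
words = ['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm', 'sò']
print(to_text_format(words))
# Chàng_trai 9X Quảng_Trị khởi_nghiệp từ nấm sò
```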

3. POS Tagging

Usage

>>> from underthesea import pos_tag
>>> pos_tag('Chợ thịt chó nổi tiếng ở Sài Gòn bị truy quét')
[('Chợ', 'N'),
 ('thịt', 'N'),
 ('chó', 'N'),
 ('nổi tiếng', 'A'),
 ('ở', 'E'),
 ('Sài Gòn', 'Np'),
 ('bị', 'V'),
 ('truy quét', 'V')]
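
Because pos_tag returns plain (word, tag) tuples, the result can be filtered with ordinary Python. The sketch below selects words by tag from the output shown above. words_with_tag is a hypothetical helper, and reading Np as "proper noun" and V as "verb" is an assumption based on common Vietnamese treebank tagsets rather than something this README states.

```python
def words_with_tag(tagged, tag):
    """Return the words whose POS tag equals `tag`, in sentence order."""
    return [word for word, word_tag in tagged if word_tag == tag]


# Output copied from the pos_tag example above.
tagged = [('Chợ', 'N'), ('thịt', 'N'), ('chó', 'N'), ('nổi tiếng', 'A'),
          ('ở', 'E'), ('Sài Gòn', 'Np'), ('bị', 'V'), ('truy quét', 'V')]

print(words_with_tag(tagged, 'Np'))  # ['Sài Gòn']
print(words_with_tag(tagged, 'V'))   # ['bị', 'truy quét']
```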

4. Chunking

Usage

>>> from underthesea import chunk
>>> text = 'Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư?'
>>> chunk(text)
[('Bác sĩ', 'N', 'B-NP'),
 ('bây giờ', 'P', 'I-NP'),
 ('có thể', 'R', 'B-VP'),
 ('thản nhiên', 'V', 'I-VP'),
 ('báo tin', 'N', 'B-NP'),
 ('bệnh nhân', 'N', 'I-NP'),
 ('bị', 'V', 'B-VP'),
 ('ung thư', 'N', 'I-VP'),
 ('?', 'CH', 'O')]
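
The third element of each tuple looks like a BIO tag: B- opens a phrase, I- continues it, and O marks a token outside any phrase. Assuming that reading is right, the tokens can be merged into whole phrases as sketched below (group_chunks is a hypothetical helper, not an underthesea function):

```python
def group_chunks(chunked):
    """Merge BIO-tagged tokens into (phrase, label) pairs; 'O' tokens are skipped."""
    phrases = []
    words, label = [], None
    for word, _pos, tag in chunked:
        prefix, _, tag_label = tag.partition('-')
        if prefix == 'I' and words and tag_label == label:
            words.append(word)  # continue the open phrase
            continue
        if words:  # any other tag closes the open phrase
            phrases.append((' '.join(words), label))
        if prefix in ('B', 'I'):
            words, label = [word], tag_label  # start a new phrase
        else:
            words, label = [], None  # 'O': outside any phrase
    if words:
        phrases.append((' '.join(words), label))
    return phrases


# Output copied from the chunk example above.
chunked = [('Bác sĩ', 'N', 'B-NP'), ('bây giờ', 'P', 'I-NP'),
           ('có thể', 'R', 'B-VP'), ('thản nhiên', 'V', 'I-VP'),
           ('báo tin', 'N', 'B-NP'), ('bệnh nhân', 'N', 'I-NP'),
           ('bị', 'V', 'B-VP'), ('ung thư', 'N', 'I-VP'), ('?', 'CH', 'O')]

for phrase, label in group_chunks(chunked):
    print(label, phrase)
```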

5. Dependency Parsing

Usage

>>> from underthesea import dependency_parse
>>> text = 'Tối 29/11, Việt Nam thêm 2 ca mắc Covid-19'
>>> dependency_parse(text)
[('Tối', 5, 'obl:tmod'),
 ('29/11', 1, 'flat:date'),
 (',', 1, 'punct'),
 ('Việt Nam', 5, 'nsubj'),
 ('thêm', 0, 'root'),
 ('2', 7, 'nummod'),
 ('ca', 5, 'obj'),
 ('mắc', 7, 'nmod'),
 ('Covid-19', 8, 'nummod')]
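
Each tuple holds the token, the position of its head, and the relation label. In the output above the head positions appear to be 1-based, with 0 reserved for the root, as in the CoNLL-U convention. Under that assumption, the head index can be resolved to the head word like this (head_words is a hypothetical helper):

```python
def head_words(parsed):
    """Return (word, relation, head_word) triples; the root's head_word is None."""
    triples = []
    for word, head, relation in parsed:
        # Head positions are assumed to be 1-based, with 0 marking the root.
        head_word = parsed[head - 1][0] if head > 0 else None
        triples.append((word, relation, head_word))
    return triples


# Output copied from the dependency_parse example above.
parsed = [('Tối', 5, 'obl:tmod'), ('29/11', 1, 'flat:date'), (',', 1, 'punct'),
          ('Việt Nam', 5, 'nsubj'), ('thêm', 0, 'root'), ('2', 7, 'nummod'),
          ('ca', 5, 'obj'), ('mắc', 7, 'nmod'), ('Covid-19', 8, 'nummod')]

for word, relation, head_word in head_words(parsed):
    print(f'{word} --{relation}--> {head_word}')
```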

6. Named Entity Recognition

Usage

>>> from underthesea import ner
>>> text = 'Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump'
>>> ner(text)
[('Chưa', 'R', 'O', 'O'),
 ('tiết lộ', 'V', 'B-VP', 'O'),
 ('lịch trình', 'V', 'B-VP', 'O'),
 ('tới', 'E', 'B-PP', 'O'),
 ('Việt Nam', 'Np', 'B-NP', 'B-LOC'),
 ('của', 'E', 'B-PP', 'O'),
 ('Tổng thống', 'N', 'B-NP', 'O'),
 ('Mỹ', 'Np', 'B-NP', 'B-LOC'),
 ('Donald', 'Np', 'B-NP', 'B-PER'),
 ('Trump', 'Np', 'B-NP', 'I-PER')]
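
The last element of each tuple is the entity tag, again in what looks like BIO form (B-PER followed by I-PER for "Donald Trump"). A minimal sketch that collects complete entities from such output, assuming that tagging scheme (extract_entities is a hypothetical helper; it drops an I- tag that has no preceding B- tag):

```python
def extract_entities(tagged):
    """Collect (entity_text, entity_type) pairs from the BIO tags in the last column."""
    entities = []
    inside = False  # True while the previous token belonged to an entity
    for word, _pos, _chunk, tag in tagged:
        if tag.startswith('B-'):
            entities.append([word, tag[2:]])  # start a new entity
            inside = True
        elif tag.startswith('I-') and inside and entities[-1][1] == tag[2:]:
            entities[-1][0] += ' ' + word  # extend the current entity
        else:
            inside = False  # 'O' or an unmatched I- tag
    return [tuple(entity) for entity in entities]


# Output copied from the ner example above.
tagged = [('Chưa', 'R', 'O', 'O'), ('tiết lộ', 'V', 'B-VP', 'O'),
          ('lịch trình', 'V', 'B-VP', 'O'), ('tới', 'E', 'B-PP', 'O'),
          ('Việt Nam', 'Np', 'B-NP', 'B-LOC'), ('của', 'E', 'B-PP', 'O'),
          ('Tổng thống', 'N', 'B-NP', 'O'), ('Mỹ', 'Np', 'B-NP', 'B-LOC'),
          ('Donald', 'Np', 'B-NP', 'B-PER'), ('Trump', 'Np', 'B-NP', 'I-PER')]

print(extract_entities(tagged))
# [('Việt Nam', 'LOC'), ('Mỹ', 'LOC'), ('Donald Trump', 'PER')]
```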

7. Text Classification

Usage

>>> from underthesea import classify

>>> classify('HLV đầu tiên ở Premier League bị sa thải sau 4 vòng đấu')
['The thao']

>>> classify('Hội đồng tư vấn kinh doanh Asean vinh danh giải thưởng quốc tế')
['Kinh doanh']

>>> classify('Lãi suất từ BIDV rất ưu đãi', domain='bank')
['INTEREST_RATE']

8. Sentiment Analysis

Usage

>>> from underthesea import sentiment

>>> sentiment('hàng kém chất lg,chăn đắp lên dính lông lá khắp người. thất vọng')
negative
>>> sentiment('Sản phẩm hơi nhỏ so với tưởng tượng nhưng chất lượng tốt, đóng gói cẩn thận.')
positive

>>> sentiment('Đky qua đường link ở bài viết này từ thứ 6 mà giờ chưa thấy ai lhe hết', domain='bank')
['CUSTOMER_SUPPORT#negative']
>>> sentiment('Xem lại vẫn thấy xúc động và tự hào về BIDV của mình', domain='bank')
['TRADEMARK#positive']
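
With domain='bank', each label in the two examples above combines an aspect and a polarity separated by "#". Assuming every label follows that ASPECT#polarity shape, it can be split into its two parts as sketched here (split_label is a hypothetical helper):

```python
def split_label(label):
    """Split 'ASPECT#polarity' into an (aspect, polarity) pair."""
    aspect, _, polarity = label.rpartition('#')
    return aspect, polarity


# Labels copied from the bank-domain sentiment examples above.
labels = ['CUSTOMER_SUPPORT#negative', 'TRADEMARK#positive']
print([split_label(label) for label in labels])
# [('CUSTOMER_SUPPORT', 'negative'), ('TRADEMARK', 'positive')]
```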

9. Vietnamese NLP Resources

List resources

$ underthesea list-data
| Name         | Type        | License   |   Year | Directory             |
|--------------+-------------+-----------+--------+-----------------------|
| UTS2017-BANK | Categorized | Open      |   2017 | datasets/UTS2017-BANK |
| VNESES       | Plaintext   | Open      |   2012 | datasets/LTA          |
| VNTQ_BIG     | Plaintext   | Open      |   2012 | datasets/LTA          |
| VNTQ_SMALL   | Plaintext   | Open      |   2012 | datasets/LTA          |
| VNTC         | Categorized | Open      |   2007 | datasets/VNTC         |

$ underthesea list-data --all

Download resources

$ underthesea download-data VNTC
100%|██████████| 74846806/74846806 [00:09<00:00, 8243779.16B/s]
Resource VNTC is downloaded in ~/.underthesea/datasets/VNTC folder
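
Going by the message above, downloaded resources end up under ~/.underthesea/datasets/. The sketch below builds that path and checks whether a resource is already present. It assumes the folder layout printed above applies to every resource, and dataset_dir is a hypothetical helper rather than part of the underthesea CLI.

```python
from pathlib import Path


def dataset_dir(name):
    """Return the folder where a downloaded resource is expected to live."""
    return Path.home() / '.underthesea' / 'datasets' / name


path = dataset_dir('VNTC')
print(path, 'exists' if path.is_dir() else 'has not been downloaded yet')
```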

Upcoming Features

  • Machine Translation
  • Text to Speech
  • Automatic Speech Recognition

Contributing

Do you want to contribute to underthesea development? Great! Please read the details in CONTRIBUTING.rst.
