undertheseanlp / Underthesea

Licence: gpl-3.0

Underthesea - Vietnamese NLP Toolkit

Labels

nlp natural-language-processing nlp-library

Open-source Vietnamese Natural Language Processing Toolkit

Underthesea is:

🌊 A Vietnamese NLP toolkit. Underthesea is a suite of open-source Python modules, data sets, and tutorials supporting research and development in Vietnamese Natural Language Processing. We provide an extremely easy API to quickly apply pretrained NLP models to your Vietnamese text for tasks such as word segmentation, part-of-speech (PoS) tagging, named entity recognition (NER), text classification, and dependency parsing.

🌊 A PyTorch library. Underthesea is backed by PyTorch, one of the most popular deep learning libraries, which makes it easy to train your deep learning models and experiment with new approaches using Underthesea modules and classes.

🌊 Open-source software. Underthesea is published under the GNU General Public License v3.0. Permissions of this strong copyleft license are conditioned on making available the complete source code of licensed works and modifications, which include larger works using a licensed work, under the same license.

💫 Version 1.3.1 out now!

Installation

To install underthesea, simply:

$ pip install underthesea
✨🍰✨

Satisfaction, guaranteed.

Tutorials

1. Sentence Segmentation
2. Word Segmentation
3. POS Tagging
4. Chunking
5. Dependency Parsing
6. Named Entity Recognition
7. Text Classification
8. Sentiment Analysis
9. Vietnamese NLP Resources

1. Sentence Segmentation

Usage

>>> from underthesea import sent_tokenize
>>> text = 'Taylor cho biết lúc đầu cô cảm thấy ngại với cô bạn thân Amanda nhưng rồi mọi thứ trôi qua nhanh chóng. Amanda cũng thoải mái với mối quan hệ này.'

>>> sent_tokenize(text)
[
  "Taylor cho biết lúc đầu cô cảm thấy ngại với cô bạn thân Amanda nhưng rồi mọi thứ trôi qua nhanh chóng.",
  "Amanda cũng thoải mái với mối quan hệ này."
]

2. Word Segmentation

Usage

>>> from underthesea import word_tokenize
>>> sentence = 'Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò'

>>> word_tokenize(sentence)
['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm', 'sò']

>>> word_tokenize(sentence, format="text")
'Chàng_trai 9X Quảng_Trị khởi_nghiệp từ nấm sò'
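
The two outputs above suggest that format="text" joins the tokens with spaces and writes each multi-syllable word with underscores. Below is a minimal sketch of that conversion and its inverse, assuming exactly that convention. The helper names tokens_to_text and text_to_tokens are hypothetical and are not part of underthesea.

```python
def tokens_to_text(tokens):
    """Join tokens with spaces, replacing spaces inside a token with underscores."""
    return " ".join(token.replace(" ", "_") for token in tokens)


def text_to_tokens(text):
    """Inverse: split on whitespace, then restore the spaces inside each token."""
    return [token.replace("_", " ") for token in text.split()]


# Token list copied from the word_tokenize output shown above.
tokens = ['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm', 'sò']
text = tokens_to_text(tokens)
print(text)
print(text_to_tokens(text) == tokens)
```

The inverse is only safe when the original text contains no literal underscores, since those would be turned into spaces.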

3. POS Tagging

Usage

>>> from underthesea import pos_tag
>>> pos_tag('Chợ thịt chó nổi tiếng ở Sài Gòn bị truy quét')
[('Chợ', 'N'),
 ('thịt', 'N'),
 ('chó', 'N'),
 ('nổi tiếng', 'A'),
 ('ở', 'E'),
 ('Sài Gòn', 'Np'),
 ('bị', 'V'),
 ('truy quét', 'V')]
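
Since pos_tag returns plain (word, tag) pairs, ordinary Python is enough for post-processing. The sketch below groups the words from the output above by tag. The helper group_by_tag is hypothetical, and reading N as noun and Np as proper noun is an assumption about the tagset rather than something stated here.

```python
from collections import defaultdict

# (word, tag) pairs copied from the pos_tag output shown above.
tagged = [('Chợ', 'N'), ('thịt', 'N'), ('chó', 'N'), ('nổi tiếng', 'A'),
          ('ở', 'E'), ('Sài Gòn', 'Np'), ('bị', 'V'), ('truy quét', 'V')]


def group_by_tag(pairs):
    """Map each tag to the list of words carrying it, in sentence order."""
    groups = defaultdict(list)
    for word, tag in pairs:
        groups[tag].append(word)
    return dict(groups)


words_by_tag = group_by_tag(tagged)
print(words_by_tag['N'])   # words tagged N
print(words_by_tag['Np'])  # words tagged Np
```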

4. Chunking

Usage

>>> from underthesea import chunk
>>> text = 'Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư?'
>>> chunk(text)
[('Bác sĩ', 'N', 'B-NP'),
 ('bây giờ', 'P', 'I-NP'),
 ('có thể', 'R', 'B-VP'),
 ('thản nhiên', 'V', 'I-VP'),
 ('báo tin', 'N', 'B-NP'),
 ('bệnh nhân', 'N', 'I-NP'),
 ('bị', 'V', 'B-VP'),
 ('ung thư', 'N', 'I-VP'),
 ('?', 'CH', 'O')]

5. Dependency Parsing

Usage

>>> from underthesea import dependency_parse
>>> text = 'Tối 29/11, Việt Nam thêm 2 ca mắc Covid-19'
>>> dependency_parse(text)
[('Tối', 5, 'obl:tmod'),
 ('29/11', 1, 'flat:date'),
 (',', 1, 'punct'),
 ('Việt Nam', 5, 'nsubj'),
 ('thêm', 0, 'root'),
 ('2', 7, 'nummod'),
 ('ca', 5, 'obj'),
 ('mắc', 7, 'nmod'),
 ('Covid-19', 8, 'nummod')]
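
In the output above, the second element of each tuple appears to be the 1-based position of the token's head, with 0 marking the root (the fifth token, 'thêm', has head 0). Under that assumption, the following sketch replaces head indices with head words. The helper resolve_heads and the "ROOT" placeholder are hypothetical.

```python
# (token, head_index, relation) tuples copied from the dependency_parse output above.
parsed = [('Tối', 5, 'obl:tmod'), ('29/11', 1, 'flat:date'), (',', 1, 'punct'),
          ('Việt Nam', 5, 'nsubj'), ('thêm', 0, 'root'), ('2', 7, 'nummod'),
          ('ca', 5, 'obj'), ('mắc', 7, 'nmod'), ('Covid-19', 8, 'nummod')]


def resolve_heads(rows):
    """Return (dependent, relation, head_word) triples.

    Assumes head indices are 1-based and that 0 denotes the root.
    """
    words = [word for word, _, _ in rows]
    arcs = []
    for word, head, relation in rows:
        head_word = "ROOT" if head == 0 else words[head - 1]
        arcs.append((word, relation, head_word))
    return arcs


arcs = resolve_heads(parsed)
for dependent, relation, head_word in arcs:
    print(f"{dependent} --{relation}--> {head_word}")
```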

6. Named Entity Recognition

Usage

>>> from underthesea import ner
>>> text = 'Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump'
>>> ner(text)
[('Chưa', 'R', 'O', 'O'),
 ('tiết lộ', 'V', 'B-VP', 'O'),
 ('lịch trình', 'V', 'B-VP', 'O'),
 ('tới', 'E', 'B-PP', 'O'),
 ('Việt Nam', 'Np', 'B-NP', 'B-LOC'),
 ('của', 'E', 'B-PP', 'O'),
 ('Tổng thống', 'N', 'B-NP', 'O'),
 ('Mỹ', 'Np', 'B-NP', 'B-LOC'),
 ('Donald', 'Np', 'B-NP', 'B-PER'),
 ('Trump', 'Np', 'B-NP', 'I-PER')]
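
The last column of the ner output uses B-/I-/O tags, so multi-token entities such as 'Donald Trump' are spread over several rows. A minimal sketch of merging them into (label, text) spans follows, assuming standard BIO semantics. The helper group_bio is hypothetical and not part of underthesea; passing the index of the chunk column instead would group chunk tags the same way.

```python
def group_bio(rows, tag_index):
    """Merge consecutive B-X / I-X rows into (X, 'joined words') spans."""
    spans = []
    current_words, current_label = [], None
    for row in rows:
        prefix, _, label = row[tag_index].partition("-")
        if prefix == "I" and label == current_label:
            # Continuation of the span that is currently open.
            current_words.append(row[0])
            continue
        if current_label is not None:
            # Any other tag closes the open span.
            spans.append((current_label, " ".join(current_words)))
        if prefix in ("B", "I"):
            # Start a new span (a stray I- tag is treated leniently as a start).
            current_words, current_label = [row[0]], label
        else:
            current_words, current_label = [], None
    if current_label is not None:
        spans.append((current_label, " ".join(current_words)))
    return spans


# Rows copied from the ner output shown above.
rows = [('Chưa', 'R', 'O', 'O'), ('tiết lộ', 'V', 'B-VP', 'O'),
        ('lịch trình', 'V', 'B-VP', 'O'), ('tới', 'E', 'B-PP', 'O'),
        ('Việt Nam', 'Np', 'B-NP', 'B-LOC'), ('của', 'E', 'B-PP', 'O'),
        ('Tổng thống', 'N', 'B-NP', 'O'), ('Mỹ', 'Np', 'B-NP', 'B-LOC'),
        ('Donald', 'Np', 'B-NP', 'B-PER'), ('Trump', 'Np', 'B-NP', 'I-PER')]

entities = group_bio(rows, tag_index=3)
print(entities)
```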

7. Text Classification

Usage

>>> from underthesea import classify

>>> classify('HLV đầu tiên ở Premier League bị sa thải sau 4 vòng đấu')
['The thao']

>>> classify('Hội đồng tư vấn kinh doanh Asean vinh danh giải thưởng quốc tế')
['Kinh doanh']

>>> classify('Lãi suất từ BIDV rất ưu đãi', domain='bank')
['INTEREST_RATE']

8. Sentiment Analysis

Usage

>>> from underthesea import sentiment

>>> sentiment('hàng kém chất lg,chăn đắp lên dính lông lá khắp người. thất vọng')
negative
>>> sentiment('Sản phẩm hơi nhỏ so với tưởng tượng nhưng chất lượng tốt, đóng gói cẩn thận.')
positive

>>> sentiment('Đky qua đường link ở bài viết này từ thứ 6 mà giờ chưa thấy ai lhe hết', domain='bank')
['CUSTOMER_SUPPORT#negative']
>>> sentiment('Xem lại vẫn thấy xúc động và tự hào về BIDV của mình', domain='bank')
['TRADEMARK#positive']
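
With domain='bank', the labels shown above look like an aspect and a polarity joined by '#'. The sketch below splits them apart, assuming that ASPECT#polarity shape holds for every label. The helper split_aspect_label is hypothetical and not part of underthesea.

```python
def split_aspect_label(label):
    """Split 'ASPECT#polarity' into an (aspect, polarity) pair."""
    aspect, separator, polarity = label.rpartition("#")
    if not separator:
        raise ValueError(f"expected 'ASPECT#polarity', got {label!r}")
    return aspect, polarity


# Labels copied from the two bank-domain outputs shown above.
labels = ['CUSTOMER_SUPPORT#negative', 'TRADEMARK#positive']
parsed_labels = [split_aspect_label(label) for label in labels]
print(parsed_labels)
```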

9. Vietnamese NLP Resources

List resources

$ underthesea list-data
| Name         | Type        | License   |   Year | Directory             |
|--------------+-------------+-----------+--------+-----------------------|
| UTS2017-BANK | Categorized | Open      |   2017 | datasets/UTS2017-BANK |
| VNESES       | Plaintext   | Open      |   2012 | datasets/LTA          |
| VNTQ_BIG     | Plaintext   | Open      |   2012 | datasets/LTA          |
| VNTQ_SMALL   | Plaintext   | Open      |   2012 | datasets/LTA          |
| VNTC         | Categorized | Open      |   2007 | datasets/VNTC         |

$ underthesea list-data --all

Download resources

$ underthesea download-data VNTC
100%|██████████| 74846806/74846806 [00:09<00:00, 8243779.16B/s]
Resource VNTC is downloaded in ~/.underthesea/datasets/VNTC folder

Upcoming Features

Machine Translation
Text to Speech
Automatic Speech Recognition

Contributing

Do you want to contribute to underthesea development? Great! Please read the details in CONTRIBUTING.rst.
