A Vietnamese natural language processing toolkit (NAACL 2018)

Table of contents

  1. Introduction
  2. Installation
  3. Usage for Python users
  4. Usage for Java users
  5. Experimental results

VnCoreNLP: A Vietnamese natural language processing toolkit

VnCoreNLP is an NLP annotation pipeline for Vietnamese, providing rich linguistic annotations through key NLP components of word segmentation, POS tagging, named entity recognition (NER) and dependency parsing:

  • ACCURATE – VnCoreNLP is the most accurate toolkit for Vietnamese NLP, obtaining state-of-the-art results on standard benchmark datasets.
  • FAST – VnCoreNLP is fast, so it can be used for dealing with large-scale data.
  • EASY-TO-USE – Users do not have to install external dependencies. Users can run processing pipelines from either the command-line or the API.

The general architecture and experimental results of VnCoreNLP can be found in the following related papers:

  1. Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2018. VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, NAACL 2018, pages 56-60. [.bib]
  2. Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras and Mark Johnson. 2018. A Fast and Accurate Vietnamese Word Segmenter. In Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, pages 2582-2587. [.bib]
  3. Dat Quoc Nguyen, Thanh Vu, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2017. From Word Segmentation to POS Tagging for Vietnamese. In Proceedings of the 15th Annual Workshop of the Australasian Language Technology Association, ALTA 2017, pages 108-113. [.bib]

Please CITE paper [1] whenever VnCoreNLP is used to produce published results or incorporated into other software. If you are dealing in depth with either word segmentation or POS tagging, you are encouraged to also cite paper [2] or [3], respectively.

If you are looking for light-weight versions, VnCoreNLP's word segmentation and POS tagging components have also been released as independent packages RDRsegmenter [2] and VnMarMoT [3], respectively.

Installation

  • Python 3.4+ if using a Python wrapper of VnCoreNLP. To install this wrapper, users have to run the following command:

    $ pip3 install vncorenlp

    A special thanks goes to Khoa Duong (@dnanhkhoa) for creating this wrapper!

  • Java 1.8+

  • File VnCoreNLP-1.1.1.jar (27MB) and folder models (115MB) are placed in the same working folder.
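
The layout requirement in the last item can be checked before starting the pipeline. The following is a minimal sketch, assuming the file and folder names given above; check_vncorenlp_layout is a hypothetical helper and is not part of VnCoreNLP or its Python wrapper.

```python
from pathlib import Path


def check_vncorenlp_layout(working_dir, jar_name="VnCoreNLP-1.1.1.jar"):
    """Return the names of required items that are missing from working_dir.

    Hypothetical helper: it only checks that the jar file and the "models"
    folder sit side by side, as the installation notes require.
    """
    root = Path(working_dir)
    missing = []
    if not (root / jar_name).is_file():
        missing.append(jar_name)
    if not (root / "models").is_dir():
        missing.append("models")
    return missing
```

An empty list means both items were found in the given folder.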

Usage for Python users

Assume that the Python wrapper of VnCoreNLP is already installed via: $ pip3 install vncorenlp

Use as a service

  1. Run the following command:
    # To perform word segmentation, POS tagging, NER and then dependency parsing
    $ vncorenlp -Xmx2g <FULL-PATH-to-VnCoreNLP-jar-file> -p 9000 -a "wseg,pos,ner,parse"
    
    # To perform word segmentation, POS tagging and then NER
    # $ vncorenlp -Xmx2g <FULL-PATH-to-VnCoreNLP-jar-file> -p 9000 -a "wseg,pos,ner"
    # To perform word segmentation and then POS tagging
    # $ vncorenlp -Xmx2g <FULL-PATH-to-VnCoreNLP-jar-file> -p 9000 -a "wseg,pos"
    # To perform word segmentation only
    # $ vncorenlp -Xmx500m <FULL-PATH-to-VnCoreNLP-jar-file> -p 9000 -a "wseg"

The service is now available at http://127.0.0.1:9000.

  2. Use the service in your Python code:
from vncorenlp import VnCoreNLP
annotator = VnCoreNLP(address="http://127.0.0.1", port=9000) 

# Input 
text = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

# To perform word segmentation, POS tagging, NER and then dependency parsing
annotated_text = annotator.annotate(text)   

# To perform word segmentation only
word_segmented_text = annotator.tokenize(text)
  • print(annotated_text) # JSON format
{'sentences': [[{'index': 1, 'form': 'Ông', 'posTag': 'Nc', 'nerLabel': 'O', 'head': 4, 'depLabel': 'sub'}, {'index': 2, 'form': 'Nguyễn_Khắc_Chúc', 'posTag': 'Np', 'nerLabel': 'B-PER', 'head': 1, 'depLabel': 'nmod'}, {'index': 3, 'form': 'đang', 'posTag': 'R', 'nerLabel': 'O', 'head': 4, 'depLabel': 'adv'}, {'index': 4, 'form': 'làm_việc', 'posTag': 'V', 'nerLabel': 'O', 'head': 0, 'depLabel': 'root'}, {'index': 5, 'form': 'tại', 'posTag': 'E', 'nerLabel': 'O', 'head': 4, 'depLabel': 'loc'}, {'index': 6, 'form': 'Đại_học', 'posTag': 'N', 'nerLabel': 'B-ORG', 'head': 5, 'depLabel': 'pob'}, {'index': 7, 'form': 'Quốc_gia', 'posTag': 'N', 'nerLabel': 'I-ORG', 'head': 6, 'depLabel': 'nmod'}, {'index': 8, 'form': 'Hà_Nội', 'posTag': 'Np', 'nerLabel': 'I-ORG', 'head': 6, 'depLabel': 'nmod'}, {'index': 9, 'form': '.', 'posTag': 'CH', 'nerLabel': 'O', 'head': 4, 'depLabel': 'punct'}], [{'index': 1, 'form': 'Bà', 'posTag': 'Nc', 'nerLabel': 'O', 'head': 9, 'depLabel': 'sub'}, {'index': 2, 'form': 'Lan', 'posTag': 'Np', 'nerLabel': 'B-PER', 'head': 1, 'depLabel': 'nmod'}, {'index': 3, 'form': ',', 'posTag': 'CH', 'nerLabel': 'O', 'head': 1, 'depLabel': 'punct'}, {'index': 4, 'form': 'vợ', 'posTag': 'N', 'nerLabel': 'O', 'head': 1, 'depLabel': 'nmod'}, {'index': 5, 'form': 'ông', 'posTag': 'Nc', 'nerLabel': 'O', 'head': 4, 'depLabel': 'nmod'}, {'index': 6, 'form': 'Chúc', 'posTag': 'Np', 'nerLabel': 'B-PER', 'head': 5, 'depLabel': 'nmod'}, {'index': 7, 'form': ',', 'posTag': 'CH', 'nerLabel': 'O', 'head': 1, 'depLabel': 'punct'}, {'index': 8, 'form': 'cũng', 'posTag': 'R', 'nerLabel': 'O', 'head': 9, 'depLabel': 'adv'}, {'index': 9, 'form': 'làm_việc', 'posTag': 'V', 'nerLabel': 'O', 'head': 0, 'depLabel': 'root'}, {'index': 10, 'form': 'tại', 'posTag': 'E', 'nerLabel': 'O', 'head': 9, 'depLabel': 'loc'}, {'index': 11, 'form': 'đây', 'posTag': 'P', 'nerLabel': 'O', 'head': 10, 'depLabel': 'pob'}, {'index': 12, 'form': '.', 'posTag': 'CH', 'nerLabel': 'O', 'head': 9, 'depLabel': 'punct'}]]}
  • print(word_segmented_text)
[['Ông', 'Nguyễn_Khắc_Chúc', 'đang', 'làm_việc', 'tại', 'Đại_học', 'Quốc_gia', 'Hà_Nội', '.'], ['Bà', 'Lan', ',', 'vợ', 'ông', 'Chúc', ',', 'cũng', 'làm_việc', 'tại', 'đây', '.']]
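
The annotated output can be traversed with plain Python. The following is a minimal sketch, assuming the dictionary has exactly the structure printed above and that nerLabel follows the B-/I-/O scheme seen there; extract_entities is a hypothetical helper, not a function of the wrapper, and the sample is abridged from the first sentence of the output.

```python
def extract_entities(annotated_text):
    """Collect (entity_text, entity_type) pairs from BIO-style nerLabel values."""
    entities = []
    for sentence in annotated_text["sentences"]:
        current_words, current_type = [], None
        for token in sentence:
            label = token["nerLabel"]
            if label.startswith("B-"):
                # A new entity starts: store the one collected so far, if any.
                if current_words:
                    entities.append((" ".join(current_words), current_type))
                current_words, current_type = [token["form"]], label[2:]
            elif label.startswith("I-") and current_type == label[2:]:
                # Continuation of the entity that is currently open.
                current_words.append(token["form"])
            else:
                # "O" (or an I- label that matches no open entity) closes it.
                if current_words:
                    entities.append((" ".join(current_words), current_type))
                current_words, current_type = [], None
        if current_words:
            entities.append((" ".join(current_words), current_type))
    return entities


# Abridged sample: five tokens taken from the first sentence printed above.
sample = {"sentences": [[
    {"index": 1, "form": "Ông", "posTag": "Nc", "nerLabel": "O", "head": 4, "depLabel": "sub"},
    {"index": 2, "form": "Nguyễn_Khắc_Chúc", "posTag": "Np", "nerLabel": "B-PER", "head": 1, "depLabel": "nmod"},
    {"index": 6, "form": "Đại_học", "posTag": "N", "nerLabel": "B-ORG", "head": 5, "depLabel": "pob"},
    {"index": 7, "form": "Quốc_gia", "posTag": "N", "nerLabel": "I-ORG", "head": 6, "depLabel": "nmod"},
    {"index": 8, "form": "Hà_Nội", "posTag": "Np", "nerLabel": "I-ORG", "head": 6, "depLabel": "nmod"},
]]}

print(extract_entities(sample))
# [('Nguyễn_Khắc_Chúc', 'PER'), ('Đại_học Quốc_gia Hà_Nội', 'ORG')]
```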

Use without the service

from vncorenlp import VnCoreNLP

# To perform word segmentation, POS tagging, NER and then dependency parsing
annotator = VnCoreNLP("<FULL-PATH-to-VnCoreNLP-jar-file>", annotators="wseg,pos,ner,parse", max_heap_size='-Xmx2g') 

# To perform word segmentation, POS tagging and then NER
# annotator = VnCoreNLP("<FULL-PATH-to-VnCoreNLP-jar-file>", annotators="wseg,pos,ner", max_heap_size='-Xmx2g') 
# To perform word segmentation and then POS tagging
# annotator = VnCoreNLP("<FULL-PATH-to-VnCoreNLP-jar-file>", annotators="wseg,pos", max_heap_size='-Xmx2g') 
# To perform word segmentation only
# annotator = VnCoreNLP("<FULL-PATH-to-VnCoreNLP-jar-file>", annotators="wseg", max_heap_size='-Xmx500m') 
    
# Input 
text = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

# To perform word segmentation, POS tagging, NER and then dependency parsing
annotated_text = annotator.annotate(text)

# To perform word segmentation only
word_segmented_text = annotator.tokenize(text) 

For more details, we refer users to https://github.com/dnanhkhoa/python-vncorenlp.

Usage for Java users

Using VnCoreNLP from the command line

You can run VnCoreNLP to annotate an input raw text corpus (e.g. a collection of news content) by using the following commands:

# To perform word segmentation, POS tagging, NER and then dependency parsing
$ java -Xmx2g -jar VnCoreNLP-1.1.1.jar -fin input.txt -fout output.txt
# To perform word segmentation, POS tagging and then NER
$ java -Xmx2g -jar VnCoreNLP-1.1.1.jar -fin input.txt -fout output.txt -annotators wseg,pos,ner
# To perform word segmentation and then POS tagging
$ java -Xmx2g -jar VnCoreNLP-1.1.1.jar -fin input.txt -fout output.txt -annotators wseg,pos
# To perform word segmentation only
$ java -Xmx2g -jar VnCoreNLP-1.1.1.jar -fin input.txt -fout output.txt -annotators wseg
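
The four command variants differ only in the -annotators argument, so they can be generated from one function when batch jobs are scripted. The following is a minimal sketch that uses only the flags shown above; build_vncorenlp_command is a hypothetical helper, and the default jar name and heap size are copied from the commands above.

```python
def build_vncorenlp_command(fin, fout, annotators=None,
                            jar="VnCoreNLP-1.1.1.jar", max_heap="-Xmx2g"):
    """Build the argument list for the java command line shown above.

    When annotators is None, the -annotators flag is left out, which
    corresponds to the first command variant (the full pipeline).
    """
    command = ["java", max_heap, "-jar", jar, "-fin", fin, "-fout", fout]
    if annotators:
        command += ["-annotators", ",".join(annotators)]
    return command


print(" ".join(build_vncorenlp_command("input.txt", "output.txt", ["wseg", "pos"])))
# java -Xmx2g -jar VnCoreNLP-1.1.1.jar -fin input.txt -fout output.txt -annotators wseg,pos
```

The returned list can be passed to subprocess.run(command, check=True) from the folder that contains the jar file and the models folder.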

Using VnCoreNLP from the API

The following code is a simple and complete example:

import vn.pipeline.*;
import java.io.*;
public class VnCoreNLPExample {
    public static void main(String[] args) throws IOException {
    
        // "wseg", "pos", "ner", and "parse" refer to word segmentation, POS tagging, NER and dependency parsing, respectively.
        String[] annotators = {"wseg", "pos", "ner", "parse"}; 
        VnCoreNLP pipeline = new VnCoreNLP(annotators); 
    
        String str = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."; 
        
        Annotation annotation = new Annotation(str); 
        pipeline.annotate(annotation); 
        
        System.out.println(annotation.toString());
        // 1    Ông                 Nc  O       4   sub 
        // 2    Nguyễn_Khắc_Chúc    Np  B-PER   1   nmod
        // 3    đang                R   O       4   adv
        // 4    làm_việc            V   O       0   root
        // ...
        
        // Write to file
        PrintStream outputPrinter = new PrintStream("output.txt");
        pipeline.printToFile(annotation, outputPrinter); 
    
        // You can also get a single sentence to analyze individually 
        Sentence firstSentence = annotation.getSentences().get(0);
        System.out.println(firstSentence.toString());
    }
}

See VnCoreNLP's open-source code in the src folder for API details.
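
The commented output in the Java example shows one token per line with six columns. The following is a minimal sketch for reading such a line back into a dictionary, assuming the columns are separated by whitespace and appear in the order shown; the field names are borrowed from the Python wrapper's output above, and parse_output_line is a hypothetical helper.

```python
def parse_output_line(line):
    """Parse one token line: index, form, POS tag, NER label, head, dependency label.

    Assumption: columns are whitespace-separated. Word forms contain no
    spaces because syllables are joined with underscores.
    """
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 columns, got %d" % len(fields))
    index, form, pos_tag, ner_label, head, dep_label = fields
    return {
        "index": int(index),
        "form": form,
        "posTag": pos_tag,
        "nerLabel": ner_label,
        "head": int(head),
        "depLabel": dep_label,
    }


token = parse_output_line("2    Nguyễn_Khắc_Chúc    Np  B-PER   1   nmod")
print(token["form"], token["nerLabel"])
# Nguyễn_Khắc_Chúc B-PER
```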

Experimental results

We briefly present experimental setups and obtained results in the following subsections. See details in papers [1,2,3] above or at NLP-progress.

Word segmentation

  • Training data: 75k manually word-segmented training sentences from the VLSP 2013 word segmentation shared task.
  • Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task.
Model                           F1 (%)   Speed (words/second)
VnCoreNLP (i.e. RDRsegmenter)   97.90    62k / _
UETsegmenter                    97.87    48k / 33k*
vnTokenizer                     97.33    _ / 5k*
JVnSegmenter-Maxent             97.00    _ / 1k*
JVnSegmenter-CRFs               97.06    _ / 1k*
DongDu                          96.90    _ / 17k*
  • Speed is measured on a personal computer with an Intel Core i7 2.2 GHz processor, unless marked otherwise. * denotes a speed measured on a personal computer with an Intel Core i5 1.80 GHz processor.
  • See paper [2] for more details.
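
For readers unfamiliar with how an F1 score applies to word segmentation, the following is a minimal sketch of one common formulation: a predicted word counts as correct when its syllable span matches a gold word exactly. It is an illustration under that assumption and is not the evaluation script used in the papers; word_spans and segmentation_f1 are hypothetical names, and underscores are assumed to appear only as syllable joiners.

```python
def word_spans(words):
    """Map a segmented sentence to a set of (start, end) syllable spans."""
    spans, start = set(), 0
    for word in words:
        length = len(word.split("_"))  # number of syllables in the word
        spans.add((start, start + length))
        start += length
    return spans


def segmentation_f1(gold_sentences, predicted_sentences):
    """Span-based F1 over all sentences, returned as a fraction in [0, 1]."""
    correct = gold_total = predicted_total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        gold_spans, predicted_spans = word_spans(gold), word_spans(predicted)
        correct += len(gold_spans & predicted_spans)
        gold_total += len(gold_spans)
        predicted_total += len(predicted_spans)
    if correct == 0:
        return 0.0
    precision = correct / predicted_total
    recall = correct / gold_total
    return 2 * precision * recall / (precision + recall)


gold = [["Ông", "Nguyễn_Khắc_Chúc", "đang", "làm_việc"]]
predicted = [["Ông", "Nguyễn_Khắc", "Chúc", "đang", "làm_việc"]]
f1 = segmentation_f1(gold, predicted)  # 3 correct words, 4 gold, 5 predicted
```

In this toy case precision is 3/5 and recall is 3/4, so the F1 score is 2/3.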

POS tagging

  • 27,870 sentences for training and development from the VLSP 2013 POS tagging shared task:
    • 27k sentences are used for training.
    • 870 sentences are used for development.
  • Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task.
Model                       Accuracy (%)   Speed
VnCoreNLP (i.e. VnMarMoT)   95.88          25k
RDRPOSTagger                95.11          180k
BiLSTM-CRF                  95.06          3k
BiLSTM-CRF + CNN-char       95.40          2.5k
BiLSTM-CRF + LSTM-char      95.31          1.5k
  • See paper [3] for more details.

Named entity recognition

  • 16,861 sentences for training and development from the VLSP 2016 NER shared task:
    • 14,861 sentences are used for training.
    • 2k sentences are used for development.
  • Test data: 2,831 test sentences from the VLSP 2016 NER shared task.
  • NOTE that in the VLSP 2016 NER data, each word representing a full personal name is separated into the syllables that constitute the word. The VLSP 2016 NER data also includes gold POS and chunking tags, as reconfirmed by the VLSP 2016 organizers. This scheme results in an unrealistic scenario for a pipeline evaluation:
    • The standard annotation for Vietnamese word segmentation and POS tagging forms each full name as a word token, thus all word segmenters have been trained to output a full name as a word and all POS taggers have been trained to assign a POS label to the entire full name.
    • Gold POS and chunking tags are NOT available in a real-world application.
  • For a realistic scenario, contiguous syllables constituting a full name are merged to form a word. Then, POS tags are predicted by using our tagging component. The results are as follows:
Model                                    F1      Speed
VnCoreNLP                                88.55   18k
BiLSTM-CRF                               86.48   2.8k
BiLSTM-CRF + CNN-char                    88.28   1.8k
BiLSTM-CRF + LSTM-char                   87.71   1.3k
BiLSTM-CRF + predicted POS               86.12   _
BiLSTM-CRF + CNN-char + predicted POS    88.06   _
BiLSTM-CRF + LSTM-char + predicted POS   87.43   _
  • Here, for VnCoreNLP, we include the time POS tagging takes in the speed.
  • See paper [1] for more details.
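
The merging step described in the note above can be illustrated as follows. This is a minimal sketch, assuming the data is available as (syllable, label) pairs with B-PER/I-PER labels on name syllables; that input format and the name merge_name_syllables are hypothetical, and this is not the authors' preprocessing script.

```python
def merge_name_syllables(tagged_syllables, name_type="PER"):
    """Merge contiguous syllables of one name into a single word token.

    The merged token joins its syllables with underscores and keeps the
    B- label of the first syllable.
    """
    begin_label = "B-" + name_type
    inside_label = "I-" + name_type
    merged = []
    for syllable, label in tagged_syllables:
        continues_name = (
            label == inside_label
            and len(merged) > 0
            and merged[-1][1] == begin_label
        )
        if continues_name:
            previous_form, previous_label = merged[-1]
            merged[-1] = (previous_form + "_" + syllable, previous_label)
        else:
            merged.append((syllable, label))
    return merged


syllables = [("Ông", "O"), ("Nguyễn", "B-PER"), ("Khắc", "I-PER"),
             ("Chúc", "I-PER"), ("đang", "O")]
print(merge_name_syllables(syllables))
# [('Ông', 'O'), ('Nguyễn_Khắc_Chúc', 'B-PER'), ('đang', 'O')]
```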

Dependency parsing

  • The last 1020 sentences of the benchmark Vietnamese dependency treebank VnDT are used for testing, while the remaining 9k+ sentences are used for training & development. LAS and UAS scores are computed on all tokens (i.e. including punctuation).
POS tags        Model             LAS (%)   UAS (%)   Speed
Gold POS        VnCoreNLP         73.39     79.02     _
Gold POS        BIST-bmstparser   73.17     79.39     _
Gold POS        BIST-barchybrid   72.53     79.33     _
Gold POS        MSTparser         70.29     76.47     _
Gold POS        MaltParser        69.10     74.91     _
Predicted POS   VnCoreNLP         70.23     76.93     8k
Predicted POS   jPTDP             69.49     77.68     700
  • See paper [1] for more details.
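
As a reminder of what the two scores measure: UAS is the percentage of tokens that receive the correct head, and LAS additionally requires the correct dependency label. The following is a minimal sketch computed over all tokens, punctuation included, as stated above; it assumes tokens are dictionaries with the head and depLabel keys used in the Python output earlier, and attachment_scores is a hypothetical helper rather than the evaluation code used in the paper.

```python
def attachment_scores(gold_tokens, predicted_tokens):
    """Return (LAS, UAS) in percent over all tokens, punctuation included."""
    total = len(gold_tokens)
    if total == 0 or total != len(predicted_tokens):
        raise ValueError("token lists must be non-empty and of equal length")
    uas_hits = las_hits = 0
    for gold, predicted in zip(gold_tokens, predicted_tokens):
        if gold["head"] == predicted["head"]:
            uas_hits += 1  # correct head
            if gold["depLabel"] == predicted["depLabel"]:
                las_hits += 1  # correct head and correct label
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


gold = [{"head": 4, "depLabel": "sub"}, {"head": 1, "depLabel": "nmod"},
        {"head": 4, "depLabel": "adv"}, {"head": 0, "depLabel": "root"}]
predicted = [{"head": 4, "depLabel": "sub"}, {"head": 4, "depLabel": "nmod"},
             {"head": 4, "depLabel": "loc"}, {"head": 0, "depLabel": "root"}]
las, uas = attachment_scores(gold, predicted)
print(las, uas)
# 50.0 75.0
```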