VinAIResearch / Phonlp

Licence: apache-2.0
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing (NAACL 2021)

Programming Languages

python

Projects that are alternatives of or similar to Phonlp

Bert Sklearn
A sklearn wrapper for Google's BERT model
Stars: ✭ 182 (+225%)
Mutual labels:  named-entity-recognition, ner, language-model
Vncorenlp
A Vietnamese natural language processing toolkit (NAACL 2018)
Stars: ✭ 354 (+532.14%)
Mutual labels:  named-entity-recognition, ner, pos-tagging
Bertweet
BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
Stars: ✭ 282 (+403.57%)
Mutual labels:  named-entity-recognition, ner, language-model
Ld Net
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
Stars: ✭ 148 (+164.29%)
Mutual labels:  named-entity-recognition, ner, language-model
Monpa
MONPA is a multi-task model that provides word segmentation, part-of-speech tagging, and named entity recognition for Traditional Chinese
Stars: ✭ 203 (+262.5%)
Mutual labels:  named-entity-recognition, ner, pos-tagging
TweebankNLP
[LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset
Stars: ✭ 84 (+50%)
Mutual labels:  named-entity-recognition, ner, pos-tagging
Phobert
PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)
Stars: ✭ 332 (+492.86%)
Mutual labels:  named-entity-recognition, ner, pos-tagging
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (+542.86%)
Mutual labels:  named-entity-recognition, ner
Bert Multitask Learning
BERT for Multitask Learning
Stars: ✭ 380 (+578.57%)
Mutual labels:  named-entity-recognition, ner
Lightkg
A deep learning framework for knowledge graphs, based on PyTorch and torchtext.
Stars: ✭ 452 (+707.14%)
Mutual labels:  named-entity-recognition, ner
Yedda
YEDDA: A Lightweight Collaborative Text Span Annotation Tool. Code for ACL 2018 Best Demo Paper Nomination.
Stars: ✭ 704 (+1157.14%)
Mutual labels:  named-entity-recognition, ner
Autoner
Learning Named Entity Tagger from Domain-Specific Dictionary
Stars: ✭ 357 (+537.5%)
Mutual labels:  named-entity-recognition, ner
Snips Nlu
Snips Python library to extract meaning from text
Stars: ✭ 3,583 (+6298.21%)
Mutual labels:  named-entity-recognition, ner
Hanlp
Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing
Stars: ✭ 24,626 (+43875%)
Mutual labels:  named-entity-recognition, pos-tagging
Ner blstm Crf
LSTM-CRF for NER with the CoNLL-2002 dataset
Stars: ✭ 51 (-8.93%)
Mutual labels:  named-entity-recognition, ner
Nlp Experiments In Pytorch
PyTorch repository for text categorization and NER experiments in Turkish and English.
Stars: ✭ 35 (-37.5%)
Mutual labels:  named-entity-recognition, ner
Bert Bilstm Crf Ner
TensorFlow solution for the NER task using a BiLSTM-CRF model with Google BERT fine-tuning and private server services
Stars: ✭ 3,838 (+6753.57%)
Mutual labels:  named-entity-recognition, ner
Cluener2020
CLUENER2020: Chinese fine-grained named entity recognition
Stars: ✭ 689 (+1130.36%)
Mutual labels:  named-entity-recognition, ner
Chinesener
Chinese named entity recognition and entity extraction in TensorFlow and PyTorch, using BiLSTM+CRF
Stars: ✭ 938 (+1575%)
Mutual labels:  named-entity-recognition, ner
Entity Recognition Datasets
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Stars: ✭ 891 (+1491.07%)
Mutual labels:  named-entity-recognition, ner

PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

Details of the PhoNLP model architecture and experimental results can be found in the following paper:

@article{PhoNLP,
    title   = {{PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing}},
    author  = {Linh The Nguyen and Dat Quoc Nguyen},
    journal = {arXiv preprint},
    volume  = {arXiv:2101.01476},
    year    = {2021}
}

Please CITE our paper when PhoNLP is used to help produce published results or incorporated into other software.

Although we develop PhoNLP for Vietnamese, the usage examples below can in fact work directly for any other language that has gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing, as well as a pre-trained BERT-based language model available from transformers.

Installation

  • Python version >= 3.6; PyTorch version >= 1.4.0
  • PhoNLP can be installed using pip as follows: pip3 install phonlp
  • Alternatively, PhoNLP can be installed from source with the following commands:
     git clone https://github.com/VinAIResearch/PhoNLP
     cd PhoNLP
     pip3 install -e .
    
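As a quick sanity check (an optional step, assuming a standard Python environment), verify that the package can be imported:

python3 -c "import phonlp"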

Usage example: Command lines

To run the examples below from the command line, please install phonlp from source:

git clone https://github.com/VinAIResearch/PhoNLP
cd PhoNLP
pip3 install -e . 

Training

cd phonlp/models
python3 run_phonlp.py --mode train --save_dir <model_folder_path> \
	--pretrained_lm <transformers_pretrained_model> \
	--lr <float_value> --batch_size <int_value> --num_epoch <int_value> \
	--lambda_pos <float_value> --lambda_ner <float_value> --lambda_dep <float_value> \
	--train_file_pos <path_to_training_file_pos> --eval_file_pos <path_to_validation_file_pos> \
	--train_file_ner <path_to_training_file_ner> --eval_file_ner <path_to_validation_file_ner> \
	--train_file_dep <path_to_training_file_dep> --eval_file_dep <path_to_validation_file_dep>

--lambda_pos, --lambda_ner and --lambda_dep represent mixture weights associated with POS tagging, NER and dependency parsing losses, respectively, and lambda_pos + lambda_ner + lambda_dep = 1.

Example:

cd phonlp/models
python3 run_phonlp.py --mode train --save_dir ./phonlp_tmp \
	--pretrained_lm "vinai/phobert-base" \
	--lr 1e-5 --batch_size 32 --num_epoch 40 \
	--lambda_pos 0.4 --lambda_ner 0.2 --lambda_dep 0.4 \
	--train_file_pos ../sample_data/pos_train.txt --eval_file_pos ../sample_data/pos_valid.txt \
	--train_file_ner ../sample_data/ner_train.txt --eval_file_ner ../sample_data/ner_valid.txt \
	--train_file_dep ../sample_data/dep_train.conll --eval_file_dep ../sample_data/dep_valid.conll

Evaluation

cd phonlp/models
python3 run_phonlp.py --mode eval --save_dir <model_folder_path> \
	--batch_size <int_value> \
	--eval_file_pos <path_to_test_file_pos> \
	--eval_file_ner <path_to_test_file_ner> \
	--eval_file_dep <path_to_test_file_dep> 

Example:

cd phonlp/models
python3 run_phonlp.py --mode eval --save_dir ./phonlp_tmp \
	--batch_size 8 \
	--eval_file_pos ../sample_data/pos_test.txt \
	--eval_file_ner ../sample_data/ner_test.txt \
	--eval_file_dep ../sample_data/dep_test.conll 

Annotate a corpus

cd phonlp/models
python3 run_phonlp.py --mode annotate --save_dir <model_folder_path> \
	--batch_size <int_value> \
	--input_file <path_to_input_file> \
	--output_file <path_to_output_file> 

Example:

cd phonlp/models
python3 run_phonlp.py --mode annotate --save_dir ./phonlp_tmp \
	--batch_size 8 \
	--input_file ../sample_data/input.txt \
	--output_file ../sample_data/output.txt 

The pre-trained PhoNLP model for Vietnamese is available at HERE! (It can also be downloaded automatically via phonlp.download(), as shown below.)

Usage example: Python API

import phonlp
# Automatically download the pre-trained PhoNLP model 
# and save it to a local folder
phonlp.download(save_dir='./pretrained_phonlp')
# Load the pre-trained PhoNLP model
model = phonlp.load(save_dir='./pretrained_phonlp')
# Annotate a corpus where each line represents a word-segmented sentence
model.annotate(input_file='input.txt', output_file='output.txt')
# Annotate a word-segmented sentence
model.print_out(model.annotate(text="Tôi đang làm_việc tại VinAI ."))

By default, the output for each input sentence is formatted with 6 columns representing word index, word form, POS tag, NER label, head index of the current word and its dependency relation type:

1	Tôi	P	O	3	sub
2	đang	R	O	3	adv
3	làm_việc	V	O	0	root
4	tại	E	O	3	loc
5	VinAI	Np	B-ORG	4	prob
6	.	CH	O	3	punct
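For downstream processing, this tab-separated output can be parsed with a few lines of Python. The following is a minimal sketch; the helper name read_phonlp_output and the dictionary keys are illustrative, not part of the PhoNLP API:

# Minimal sketch: parse PhoNLP's tab-separated 6-column output.
# The function name and dictionary keys below are illustrative only.
def read_phonlp_output(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line separates sentences
                if current:
                    sentences.append(current)
                current = []
                continue
            # Columns: word index, word form, POS tag, NER label,
            # head index, dependency relation type.
            index, form, pos, ner, head, deprel = line.split("\t")[:6]
            current.append({"index": int(index), "form": form, "pos": pos,
                            "ner": ner, "head": int(head), "deprel": deprel})
    if current:  # flush the last sentence if the file has no trailing blank line
        sentences.append(current)
    return sentences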

In addition, the output can be formatted in the 10-column CoNLL format, where the last column holds the NER predictions; to enable this, add output_type='conll' to the model.annotate() function. The batch_size parameter of model.annotate() can also be adjusted to fit your machine's memory instead of the default value of 1 (batch_size=1); a larger batch_size leads to faster annotation.
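For example, combining the two parameters described above (the output file name is illustrative):

# Annotate with the 10-column CoNLL output format and a larger batch size
model.annotate(input_file='input.txt', output_file='output.conll',
               output_type='conll', batch_size=8)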
