mukhal / fairseq-tagging

Licence: MIT License
a Fairseq fork for sequence tagging/labeling tasks

A Fairseq fork 🍴 adapted for sequence tagging/labeling tasks (NER, POS tagging, etc.)

Motivation

Fairseq is a great library for building sequence-to-sequence models. Unfortunately, it does not support sequence labeling tasks, so to use Fairseq you have to treat the task as seq2seq. That prevents you from fine-tuning pre-trained models such as RoBERTa, XLM-R, and BERT, and forces you to train an extra decoder network needlessly. I adapted Fairseq for these tasks so that you can use its full power when training on them.

Example: Training tiny BERT on NER (from scratch) on CoNLL-2003

1. Prepare Data

Assuming your data is in the following IOB format (one token per line, tag in the last column, blank line between sentences):

SOCCER NN B-NP O 
JAPAN NNP B-NP B-LOC
GET VB B-VP O
LUCKY NNP B-NP O
WIN NNP I-NP O
, , O O

CHINA NNP B-NP B-PER
IN IN B-PP O
SURPRISE DT B-NP O
DEFEAT NN I-NP O
. . O O

with the 3 splits train, valid and test in path/to/data/conll-2003
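
To sanity-check a split before preprocessing, it can help to load it into memory first. The sketch below is not part of this repository: `read_conll` is a hypothetical helper that assumes whitespace-separated columns, the token in the first column, the NER tag in the last, and blank lines between sentences, as in the sample above.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group whitespace-separated CoNLL rows into sentences.

    Assumes the token is the first column and the NER tag the last one,
    with blank lines separating sentences.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        parts = line.split()
        # CoNLL-2003 files also contain -DOCSTART- marker rows; skip them.
        if parts and parts[0] == "-DOCSTART-":
            continue
        if not parts:  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((parts[0], parts[-1]))
    if current:  # flush the last sentence if there is no trailing blank line
        sentences.append(current)
    return sentences


sample = """SOCCER NN B-NP O
JAPAN NNP B-NP B-LOC
GET VB B-VP O

CHINA NNP B-NP B-PER
IN IN B-PP O
"""
sentences = read_conll(sample.splitlines())
print(len(sentences), sentences[0])
```

For a real file, pass an open file handle instead of `sample.splitlines()`.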

Run

python preprocess.py --seqtag-data-dir path/to/data/conll-2003 \
      --destdir path/to/data/conll-2003 \
      --nwordssrc 30000 \
      --bpe sentencepiece \
      --sentencepiece-model /path/to/sentencepiece.bpe.model
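
When words are split into subword pieces, the word-level tags have to be spread over the pieces somehow. The sketch below shows one common convention (tag on the first piece, a placeholder label on the rest). It is an illustration only and not necessarily what `preprocess.py` does; `align_labels`, `toy_split`, and the `<pad>` label are hypothetical names, and `toy_split` merely stands in for a real SentencePiece model.

```python
from typing import Callable, List, Tuple


def align_labels(
    words: List[str],
    tags: List[str],
    split_word: Callable[[str], List[str]],
    pad_label: str = "<pad>",
) -> Tuple[List[str], List[str]]:
    """Give each word's tag to its first piece and pad_label to the rest."""
    pieces: List[str] = []
    labels: List[str] = []
    for word, tag in zip(words, tags):
        word_pieces = split_word(word)
        if not word_pieces:  # nothing to label for an empty word
            continue
        pieces.extend(word_pieces)
        labels.append(tag)
        labels.extend([pad_label] * (len(word_pieces) - 1))
    return pieces, labels


def toy_split(word: str, size: int = 3) -> List[str]:
    """Stand-in tokenizer: fixed-size chunks, word start marked with '▁'."""
    chunks = [word[i:i + size] for i in range(0, len(word), size)]
    return ["▁" + chunks[0]] + chunks[1:] if chunks else []


pieces, labels = align_labels(["JAPAN", "GET"], ["B-LOC", "O"], toy_split)
print(list(zip(pieces, labels)))
```

Keeping one label per piece means the source and target sequences stay the same length, which is what a per-position tagging loss needs.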

2. Train

Let's train a tiny BERT (L=2, D=128, H=2) model from scratch:

python train.py path/to/data/conll-2003/bin \
      --arch bert_sequence_tagger_tiny \
      --criterion sequence_tagging \
      --max-sentences 16  \
      --task sequence_tagging \
      --max-source-positions 128 \
      -s source.bpe \
      -t target.bpe \
      --no-epoch-checkpoints \
      --lr 0.005 \
      --optimizer adam \
      --clf-report \
      --max-epoch 20 \
      --best-checkpoint-metric F1-score \
      --maximize-best-checkpoint-metric

Training starts:

epoch 001 | loss 2.313 | ppl 4.97 | F1-score 0 | wps 202.2 | ups 9.09 | wpb 18 | bsz 1.5 | num_updates 2 | lr 0.005 | gnorm 4.364 | clip 0 | train_wall 0 | wall 0                            
epoch 002 | valid on 'valid' subset | loss 0.557 | ppl 1.47 | F1-score 0.666667 | wps 549.4 | wpb 18 | bsz 1.5 | num_updates 4 | best_F1-score 0.666667                                       
2020-06-05 22:09:03 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/checkpoint_best.pt (epoch 2 @ 4 updates, score 0.6666666666666666) (writing took 0.09897447098046541 seconds)
epoch 002 | loss 1.027 | ppl 2.04 | F1-score 0 | wps 121.8 | ups 6.77 | wpb 18 | bsz 1.5 | num_updates 4 | lr 0.005 | gnorm 2.657 | clip 0 | train_wall 0 | wall 1  
...
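
The `loss` and `ppl` columns in the log above are consistent with the loss being reported in base 2, so that `ppl = 2 ** loss`. The following small check is inferred from the numbers shown rather than taken from the code; `perplexity` is a hypothetical helper.

```python
def perplexity(loss_base2: float) -> float:
    """Perplexity for a cross-entropy loss measured in bits (base 2)."""
    return 2.0 ** loss_base2


# (loss, ppl) pairs copied from the training log above.
logged = [(2.313, 4.97), (0.557, 1.47), (1.027, 2.04)]
for loss, ppl in logged:
    print(f"loss {loss:.3f} -> ppl {perplexity(loss):.2f} (logged {ppl:.2f})")
```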

3. Predict and Evaluate

python predict.py path/to/data/conll-2003/bin \
         --path checkpoints/checkpoint_last.pt \
         --task sequence_tagging \
         -s source.bpe -t target.bpe \
         --pred-subset test \
         --results-path model_outputs/

This writes source and prediction to model_outputs/test.txt and prints:

           precision    recall  f1-score   support

     PERS     0.7156    0.7506    0.7327       429
      ORG     0.5285    0.5092    0.5187       273
      LOC     0.7275    0.7105    0.7189       342

micro avg     0.6724    0.6743    0.6734      1044
macro avg     0.6706    0.6743    0.6722      1044
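
The report above scores whole entities rather than individual tokens. As a rough illustration of how such scores can be computed from IOB tags, here is a minimal sketch; it is not the evaluation code used by `predict.py`, and `extract_spans` and `precision_recall_f1` are hypothetical helpers that score a single tag sequence without per-type breakdown.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from IOB tags such as B-LOC, I-LOC, O."""
    spans: Set[Span] = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == ent_type
        if start is not None and not continues:
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        # A span opens on B-, or leniently on an I- that has no open span.
        if prefix == "B" or (prefix == "I" and start is None):
            start, ent_type = i, label
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level scores: a prediction counts only if type and span match."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-LOC", "I-LOC", "O", "B-PER", "O"]
pred = ["B-LOC", "I-LOC", "O", "B-ORG", "O"]
p, r, f = precision_recall_f1(gold, pred)
print(f"precision {p:.4f} recall {r:.4f} f1 {f:.4f}")
```

The f1-score is the harmonic mean of precision and recall, which matches the table: for the PERS row, 2 × 0.7156 × 0.7506 / (0.7156 + 0.7506) ≈ 0.7327.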

TODO

  • log F1 metric on validation using seqeval
  • save best model on validation data according to F1 score not loss
  • work with BPE
  • load and finetune pretrained BERT or RoBERTa
  • prediction/evaluation script
  • LSTM models