
himkt / pyner

License: MIT
🌈 Implementation of Neural Network based Named Entity Recognizer (Lample+, 2016) using Chainer.

Programming Languages

python
Makefile
Dockerfile

Projects that are alternatives of or similar to pyner

Neuronlp2
Deep neural models for core NLP tasks (Pytorch version)
Stars: ✭ 397 (+782.22%)
Mutual labels:  named-entity-recognition, sequence-labeling
Anago
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.
Stars: ✭ 1,392 (+2993.33%)
Mutual labels:  named-entity-recognition, sequence-labeling
Seqeval
A Python framework for sequence labeling evaluation (named-entity recognition, POS tagging, etc.)
Stars: ✭ 508 (+1028.89%)
Mutual labels:  named-entity-recognition, sequence-labeling
CrowdLayer
A neural network layer that enables training of deep neural networks directly from crowdsourced labels (e.g. from Amazon Mechanical Turk) or, more generally, labels from multiple annotators with different biases and levels of expertise.
Stars: ✭ 45 (+0%)
Mutual labels:  named-entity-recognition, sequence-labeling
Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.
Stars: ✭ 2,235 (+4866.67%)
Mutual labels:  named-entity-recognition, sequence-labeling
Slot filling and intent detection of slu
slot filling, intent detection, joint training, ATIS & SNIPS datasets, Facebook's multilingual dataset, the MIT corpus, the E-commerce Shopping Assistant (ECSA) dataset, CoNLL2003 NER, ELMo, BERT, XLNet
Stars: ✭ 298 (+562.22%)
Mutual labels:  named-entity-recognition, sequence-labeling
Named entity recognition
Chinese named entity recognition (including implementations of several models: HMM, CRF, BiLSTM, and BiLSTM+CRF)
Stars: ✭ 995 (+2111.11%)
Mutual labels:  named-entity-recognition, sequence-labeling
sequence labeling tf
Sequence Labeling in Tensorflow
Stars: ✭ 18 (-60%)
Mutual labels:  named-entity-recognition, sequence-labeling
Ld Net
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
Stars: ✭ 148 (+228.89%)
Mutual labels:  named-entity-recognition, sequence-labeling
Ncrfpp
NCRF++, a neural sequence labeling toolkit. Easy to use for any sequence labeling task (e.g. NER, POS tagging, segmentation). It includes character LSTM/CNN, word LSTM/CNN, and softmax/CRF components.
Stars: ✭ 1,767 (+3826.67%)
Mutual labels:  named-entity-recognition, sequence-labeling
AlpacaTag
AlpacaTag: An Active Learning-based Crowd Annotation Framework for Sequence Tagging (ACL 2019 Demo)
Stars: ✭ 126 (+180%)
Mutual labels:  named-entity-recognition, sequence-labeling
Multi Task Nlp
multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
Stars: ✭ 221 (+391.11%)
Mutual labels:  named-entity-recognition, sequence-labeling
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification for long and short Chinese texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+235.56%)
Mutual labels:  named-entity-recognition, sequence-labeling
Autoner
Learning Named Entity Tagger from Domain-Specific Dictionary
Stars: ✭ 357 (+693.33%)
Mutual labels:  named-entity-recognition, sequence-labeling
CrossNER
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)
Stars: ✭ 87 (+93.33%)
Mutual labels:  named-entity-recognition, sequence-labeling
Cluener2020
CLUENER2020: Chinese fine-grained named entity recognition
Stars: ✭ 689 (+1431.11%)
Mutual labels:  named-entity-recognition, sequence-labeling
Flair
A very simple framework for state-of-the-art Natural Language Processing (NLP)
Stars: ✭ 11,065 (+24488.89%)
Mutual labels:  named-entity-recognition, sequence-labeling
Neural sequence labeling
A TensorFlow implementation of a neural sequence labeling model, which is able to tackle sequence labeling tasks such as POS tagging, chunking, NER, punctuation restoration, etc.
Stars: ✭ 214 (+375.56%)
Mutual labels:  named-entity-recognition, sequence-labeling
Bilstm Lan
Hierarchically-Refined Label Attention Network for Sequence Labeling
Stars: ✭ 241 (+435.56%)
Mutual labels:  named-entity-recognition, sequence-labeling
Chainer Pose Proposal Net
Chainer implementation of Pose Proposal Networks
Stars: ✭ 119 (+164.44%)
Mutual labels:  chainer

PyNER: Toolkit for sequence labeling in Chainer

PyNER is a sequence labeling toolkit that allows researchers and developers to train and evaluate neural sequence labeling methods.

QuickStart

You can try pyner on a local machine or in a Docker container.

1. Local Machine

  • setup (if you have not installed poetry, please install it first)
poetry install
  • train
# If a GPU is not available, specify `--device -1`
poetry run python pyner/named_entity/train.py config/training/conll2003.lample.yaml --device 0

2. Docker container

  • build container
make build
  • launch container
make start
  • train

You have to execute this command inside the Docker container.

# If a GPU is not available, specify `--device -1`
python3 train.py config/training/conll2003.lample.yaml --device 0

This experiment uses the CoNLL 2003 dataset. Please read the "Prepare dataset" section below.

Prepare dataset

We use the same data format as deep-crf.

$ head -n 15 data/processed/CoNLL2003_BIOES/train.txt
EU      S-ORG
rejects O
German  S-MISC
call    O
to      O
boycott O
British S-MISC
lamb    O
.       O

Peter   B-PER
Blackburn       E-PER

BRUSSELS        S-LOC
1996-08-22      O
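
The format is one token and its tag per line, separated by whitespace, with a blank line between sentences (the tags above follow the BIOES scheme). As a minimal illustration of how such a file can be read in Python (a sketch, not pyner's actual loader):

# Minimal sketch: read a deep-crf style two-column file into (tokens, tags) pairs.
# Illustrative only; pyner's own data loader may differ.
def read_conll(path):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():              # blank line = sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split()         # e.g. "EU  S-ORG"
            tokens.append(token)
            tags.append(tag)
    if tokens:                                # flush the last sentence
        sentences.append((tokens, tags))
    return sentences

# sentences = read_conll("data/processed/CoNLL2003_BIOES/train.txt")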

To reproduce the results in Lample's paper, you have to follow some steps to prepare the datasets.

1. Prepare CoNLL 2003 Dataset

We can't include the CoNLL 2003 dataset in this repository due to its license. Instead, PyNER offers a way to create the dataset from the CoNLL 2003 corpus.

Once you have prepared the CoNLL 2003 dataset, you should have three files like the ones below.

  • eng.iob.testa
  • eng.iob.testb
  • eng.iob.train

Please put them in the same directory (e.g. data/external/CoNLL2003).

$ tree data/external/CoNLL2003
data/external/CoNLL2003
├── eng.iob.testa
├── eng.iob.testb
└── eng.iob.train

Then you can create the dataset for pyner with the following command. After running it, ./data/processed/CoNLL2003 will be generated for you.

$ python bin/parse_CoNLL2003.py \
  --data-dir     data/external/CoNLL2003 \
  --output-dir   data/processed/CoNLL2003 \
  --convert-rule iob2bio
2019-09-24 23:43:39,299 INFO root :create dataset for CoNLL2003
2019-09-24 23:43:39,299 INFO root :create corpus parser
2019-09-24 23:43:39,300 INFO root :parsing corpus for training
2019-09-24 23:44:02,240 INFO root :parsing corpus for validating
2019-09-24 23:44:04,397 INFO root :parsing corpus for testing
2019-09-24 23:44:06,507 INFO root :Create train dataset
2019-09-24 23:44:06,705 INFO root :Create valid dataset
2019-09-24 23:44:06,755 INFO root :Create test dataset
2019-09-24 23:44:06,800 INFO root :Create vocabulary
$
$ tree data/processed/CoNLL2003
data/processed/CoNLL2003
├── test.txt
├── train.txt
├── valid.txt
├── vocab.chars.txt
├── vocab.tags.txt
└── vocab.words.txt
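
The --convert-rule iob2bio option presumably converts the original IOB1 tags of the CoNLL 2003 release into the BIO (IOB2) scheme, in which every entity starts with a B- tag. The following is a sketch of that standard conversion rule, not necessarily the exact code in bin/parse_CoNLL2003.py:

# Sketch of the IOB1 -> BIO (IOB2) conversion; illustrative only.
def iob_to_bio(tags):
    bio = []
    for i, tag in enumerate(tags):
        if tag == "O" or tag.startswith("B-"):
            bio.append(tag)
        elif tag.startswith("I-"):
            # In IOB1 an entity may start with I-; in BIO it must start with B-.
            prev = tags[i - 1] if i > 0 else "O"
            if prev == "O" or prev[2:] != tag[2:]:
                bio.append("B-" + tag[2:])
            else:
                bio.append(tag)
        else:
            raise ValueError(f"unexpected tag: {tag}")
    return bio

# iob_to_bio(["I-ORG", "O", "I-MISC"]) -> ["B-ORG", "O", "B-MISC"]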

2. Prepare pre-trained Word Embeddings used in Lample's paper

Using pre-trained word embeddings significantly improves the performance of NER. Lample et al. also use pre-trained word embeddings: they use Skip-N-Gram embeddings, which can be downloaded from an issue in the official repository. To use them, please run make get-lample before running make build. (If you want to use GloVe embeddings, please run make get-glove instead.)

$ make get-lample
rm -rf data/external/GloveEmbeddings
mkdir -p data/external/LampleEmbeddings
mkdir -p data/processed/LampleEmbeddings
python bin/fetch_lample_embedding.py
python bin/prepare_embeddings.py \
                data/external/LampleEmbeddings/skipngram_100d.txt \
                data/processed/LampleEmbeddings/skipngram_100d \
                --format word2vec
saved model
$
$ ls -1 data/processed/LampleEmbeddings
skipngram_100d
skipngram_100d.vectors.npy
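
The pair of output files (skipngram_100d plus skipngram_100d.vectors.npy) is the layout gensim produces when KeyedVectors.save is used, so bin/prepare_embeddings.py with --format word2vec likely does something along the lines of the sketch below (an assumption, not the script's actual code):

# Sketch: convert word2vec-format text embeddings to gensim's native format.
# Assumes gensim as the backend; prepare_embeddings.py may differ.
from gensim.models import KeyedVectors

def convert(src_txt, dst_path):
    vectors = KeyedVectors.load_word2vec_format(src_txt, binary=False)
    vectors.save(dst_path)  # writes dst_path and dst_path + ".vectors.npy"

# convert("data/external/LampleEmbeddings/skipngram_100d.txt",
#         "data/processed/LampleEmbeddings/skipngram_100d")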

Congratulations! All preparation steps are done. Now you can train Lample's LSTM-CRF. Please run one of the following commands:

  • Local machine: python3 pyner/named_entity/train.py config/training/conll2003.lample.yaml --device 0
  • Docker container: python3 train.py config/training/conll2003.lample.yaml --device 0

Inference and Evaluation

You can test your model using pyner/named_entity/inference.py. The only thing you have to pass to inference.py is the path to the model directory. The model directory is defined in the config file (the output field).

$ cat config/training/conll2003.lample.yaml
iteration: "./config/iteration/long.yaml"
external: "./config/external/conll2003.yaml"
model: "./config/model/lample.yaml"
optimizer: "./config/optimizer/sgd_with_clipping.yaml"
preprocessing: "./config/preprocessing/znorm.yaml"
output: "./model/conll2003.lample"  # model dir is here!!
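
Each entry except output points to another YAML file, so the training configuration is composed from several small files. A rough sketch of how such a hierarchical config can be resolved into a single dictionary (pyner's actual config loader may work differently):

# Sketch: merge the referenced YAML files into one dict. Illustrative only.
import yaml

def load_config(path):
    with open(path) as f:
        top = yaml.safe_load(f)
    config = {}
    for key, value in top.items():
        if isinstance(value, str) and value.endswith(".yaml"):
            with open(value) as f:
                config[key] = yaml.safe_load(f)   # e.g. config["model"] = {...}
        else:
            config[key] = value                   # e.g. config["output"] = "./model/..."
    return config

# config = load_config("config/training/conll2003.lample.yaml")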

If you successfully train the model, some files will be generated under model/conll2003.lample.skipngram.YYYY-MM-DDTxx:xx:xx.xxxxxx.

$ ls -1 model/conll2003.lample.skipngram.2019-09-24T07:02:33.536822
args
log
snapshot_epoch_0001
snapshot_epoch_0002
snapshot_epoch_0003
snapshot_epoch_0004
...
snapshot_epoch_0148
snapshot_epoch_0149
snapshot_epoch_0150
validation.main.fscore.epoch_031.pred  # here!!

Running python3 pyner/named_entity/inference.py will generate prediction results under model/conll2003.lample.skipngram.YYYY-MM-DDTxx:xx:xx.xxxxxx. The file name will be {metrics}.epoch_{xxx}.pred. By default, inference.py checks the training log and selects the model that achieves the highest F1 score on the development set. You can also use other selection criteria, such as the development loss or a specific epoch (a sketch of the default selection logic follows the list below).

  • Dev loss: python3 pyner/named_entity/inference.py --metrics validation/main/loss model/conll2003.lample.skipngram.2019-09-24T07:02:33.536822
  • Specific epoch: python3 pyner/named_entity/inference.py --epoch 1 model/conll2003.lample.skipngram.2019-09-24T07:02:33.536822
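
Assuming the log file in the model directory is Chainer's standard LogReport output (a JSON list of per-epoch records), the default "best F1" selection could be sketched as follows; inference.py's actual implementation may differ:

# Sketch: pick the epoch with the best value of a metric from the Chainer log.
import json
import os

def best_epoch(model_dir, metric="validation/main/fscore", maximize=True):
    with open(os.path.join(model_dir, "log")) as f:
        records = json.load(f)
    records = [r for r in records if metric in r]
    pick = max if maximize else min
    best = pick(records, key=lambda r: r[metric])
    return best["epoch"], best[metric]

# epoch, score = best_epoch("model/conll2003.lample.skipngram.2019-09-24T07:02:33.536822")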

Once you have generated a prediction file, it's time to evaluate the model's performance. conlleval is the standard script for evaluating the CoNLL chunking/NER tasks. First of all, we have to download conlleval: running make get-conlleval downloads conlleval into the current directory. Then, evaluate!

$ ./conlleval < model/conll2003.lample.skipngram.2019-09-24T07:02:33.536822/validation.main.fscore.epoch_139.pred
processed 46435 tokens with 5628 phrases; found: 5651 phrases; correct: 5134.
accuracy:  97.82%; precision:  90.85%; recall:  91.22%; FB1:  91.04
              LOC: precision:  93.41%; recall:  92.18%; FB1:  92.79  1640
             MISC: precision:  80.66%; recall:  80.66%; FB1:  80.66  693
              ORG: precision:  88.72%; recall:  89.79%; FB1:  89.26  1676
              PER: precision:  94.76%; recall:  96.23%; FB1:  95.49  1642

The F1 score on the test set is 91.04, which is approximately the same as the result in Lample's paper (90.94)!
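
If you prefer a pure-Python alternative to the Perl conlleval script, the seqeval package (listed among the similar projects above) computes the same entity-level metrics. A small self-contained example using the toy BIOES sequences from the "Prepare dataset" section (the predicted tags are made up for illustration):

# Toy example of entity-level evaluation with seqeval; not tied to pyner's .pred files.
from seqeval.metrics import classification_report, f1_score

y_true = [["S-ORG", "O", "S-MISC", "O", "O", "O", "S-MISC", "O", "O"],
          ["B-PER", "E-PER"]]
y_pred = [["S-ORG", "O", "S-MISC", "O", "O", "O", "O", "O", "O"],
          ["B-PER", "E-PER"]]

print(f1_score(y_true, y_pred))           # micro-averaged entity-level F1
print(classification_report(y_true, y_pred))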

Reference

  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural Architectures for Named Entity Recognition. In Proceedings of NAACL-HLT, 2016.