
NER-BERT-pytorch

A PyTorch solution to the Named Entity Recognition task, using Google AI's BERT model.

A PyTorch implementation of Chinese named entity recognition using Google AI's BERT model.

You are welcome to watch, star, or fork.

MSRA dataset

Here we take the Chinese MSRA NER data as an example; the approach applies equally well to English NER data.

Named entity recognition was one of the tasks of the Third SIGHAN Chinese Language Processing Bakeoff; we take the simplified-Chinese version of the Microsoft NER dataset as our research object.

Data Formats

The MSRA NER dataset consists of the training set data/msra_train_bio and the test set data/msra_test_bio; no validation set is provided. There are 45000 training samples and 3442 test samples, which we will partition appropriately below.

The dataset contains three types of entities: Person, Organization and Location, with the corresponding abbreviated tags PER, ORG and LOC; characters that are not part of any entity are tagged O.

The format is similar to that of the CoNLL-2002 NER task, adapted for Chinese. The data is presented in two columns, where the first column is the character and the second is its tag. The tags are specified as follows:

Tag    Meaning
O      Not part of a named entity
B-PER  Beginning character of a person name
I-PER  Non-beginning character of a person name
B-ORG  Beginning character of an organization name
I-ORG  Non-beginning character of an organization name
B-LOC  Beginning character of a location name
I-LOC  Non-beginning character of a location name
B-GPE  Beginning character of a geopolitical entity
I-GPE  Non-beginning character of a geopolitical entity
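
For example, a tagged fragment in this format might look like the following (an illustrative, hypothetical sample; sentences are separated by blank lines):

    我 O
    在 O
    北 B-LOC
    京 I-LOC
    工 O
    作 O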

Dataset partition

We randomly select 3000 samples from the training set as the validation set, and the test set is unchanged. Thus, the dataset distribution is as follows.

Dataset         Number
training set    42000
validation set  3000
test set        3442
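
The split is performed by build_msra_dataset_tags.py (step 2 below); a minimal sketch of the idea, assuming sentences is a list of (characters, tags) pairs loaded from data/msra_train_bio:

    import random

    random.seed(0)  # fix the seed for a reproducible split (the seed value is an assumption)

    # hold out 3000 randomly chosen samples for validation, keep the rest for training
    val_idx = set(random.sample(range(len(sentences)), 3000))
    val_set = [s for i, s in enumerate(sentences) if i in val_idx]
    train_set = [s for i, s in enumerate(sentences) if i not in val_idx]  # 42000 samples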

Requirements

This repo was tested on Python 3.5+ and PyTorch 0.4.1/1.0.0. The requirements are:

  • tensorflow >= 1.11.0
  • torch >= 0.4.1
  • pytorch-pretrained-bert == 0.4.0
  • tqdm
  • apex

Note: The tensorflow library is only used to convert the pre-trained model from TensorFlow to PyTorch. apex is a tool for easy mixed-precision and distributed training in PyTorch; see https://github.com/NVIDIA/apex.

Results

We did not search for the best hyperparameters, and obtained the following results.

Overall results

Selecting the model by its best performance on the validation set, the overall performance is as follows:

Dataset         F1 score
training set    99.88
validation set  95.90
test set        94.62

Detailed results on the test set

Using the model that performed best on the validation set, we obtain the following per-entity-type results on the test set:

NE type  Precision  Recall  F1 score
PER      96.36      96.43   96.39
ORG      89.64      92.07   90.84
LOC      95.92      95.13   95.52
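
Entity-level metrics of this kind can be computed with chakki-works/seqeval (listed in the references); a minimal sketch with illustrative tag sequences:

    from seqeval.metrics import classification_report, f1_score

    # gold and predicted tag sequences, one list per sentence (illustrative values)
    y_true = [["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]]
    y_pred = [["B-PER", "I-PER", "O", "B-LOC", "O"]]

    print(f1_score(y_true, y_pred))               # overall entity-level F1
    print(classification_report(y_true, y_pred))  # per-type precision/recall/F1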

Usage

  1. Get BERT model for PyTorch

    There are two ways to get the pre-trained BERT model as a PyTorch dump for your experiments:

    • Direct download of the converted PyTorch version of the BERT model

      You can download the PyTorch dump I converted from the TensorFlow checkpoint from my Google Cloud Drive folder bert-base-chinese-pytorch, including the BERT configuration file bert_config.json, the model file pytorch_model.bin and the vocabulary file vocab.txt.

    • Convert the TensorFlow checkpoint to a PyTorch dump by yourself

      • Download Google's BERT-Base model for Chinese from BERT-Base, Chinese (Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters), and decompress it.

      • Execute the following command to convert the TensorFlow checkpoint to a PyTorch dump.

        # paths to the downloaded TensorFlow checkpoint and the target PyTorch directory
        export TF_BERT_BASE_DIR=/path/to/chinese_L-12_H-768_A-12
        export PT_BERT_BASE_DIR=/path/to/NER-BERT-pytorch/bert-base-chinese-pytorch
        
        # arguments: TF checkpoint, BERT config file, output path for the PyTorch dump
        pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
        	$TF_BERT_BASE_DIR/bert_model.ckpt \
        	$TF_BERT_BASE_DIR/bert_config.json \
        	$PT_BERT_BASE_DIR/pytorch_model.bin
        
      • Copy the BERT configuration file bert_config.json and the vocabulary file vocab.txt to the directory $PT_BERT_BASE_DIR.

        cp $TF_BERT_BASE_DIR/bert_config.json $PT_BERT_BASE_DIR/bert_config.json
        cp $TF_BERT_BASE_DIR/vocab.txt $PT_BERT_BASE_DIR/vocab.txt
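
      You can then sanity-check the dump by loading it with pytorch-pretrained-bert; a minimal sketch, assuming the directory layout above:

        from pytorch_pretrained_bert import BertModel, BertTokenizer

        # the directory must contain pytorch_model.bin, bert_config.json and vocab.txt
        model = BertModel.from_pretrained("bert-base-chinese-pytorch")
        tokenizer = BertTokenizer.from_pretrained("bert-base-chinese-pytorch")
        print(tokenizer.tokenize("欢迎使用"))  # BERT's Chinese vocab tokenizes per character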
        
  2. Build dataset and tags

    python build_msra_dataset_tags.py
    

    It extracts the sentences and tags from data/msra_train_bio and data/msra_test_bio, splits them into train/val/test sets, saves them in a format convenient for our model, and creates a file tags.txt containing the collection of tags. Conceptually, the extraction step amounts to the sketch below.
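
    A minimal sketch of the extraction (not the actual script; it assumes sentences are separated by blank lines, as in the format above):

    def read_bio(path):
        """Read a two-column BIO file into (characters, tags) pairs, one per sentence."""
        sentences, chars, tags = [], [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:  # a blank line ends the current sentence
                    if chars:
                        sentences.append((chars, tags))
                        chars, tags = [], []
                else:
                    char, tag = line.split()
                    chars.append(char)
                    tags.append(tag)
        if chars:  # flush the last sentence if the file has no trailing blank line
            sentences.append((chars, tags))
        return sentences

    train_sentences = read_bio("data/msra_train_bio")
    tag_set = sorted({t for _, ts in train_sentences for t in ts})  # basis of tags.txt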

  3. Set experimental hyperparameters

    We have created a base_model directory for you under the experiments directory. It contains a file params.json which sets the hyperparameters for the experiment. It looks like this:

    {
        "full_finetuning": true,
        "max_len": 180,

        "learning_rate": 3e-5,
        "weight_decay": 0.01,
        "clip_grad": 5
    }
    

    For every new experiment, you will need to create a new directory under experiments with a params.json file.
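
    The training and evaluation scripts read these values back from the JSON file; a minimal sketch of the pattern (not the repo's utility code):

    import json

    # load the experiment's hyperparameters from its params.json
    with open("experiments/base_model/params.json") as f:
        params = json.load(f)

    print(params["learning_rate"], params["max_len"])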

  4. Train and evaluate your experiment

    If you use the default parameters, just run

    python train.py
    

    Or specify parameters on the command line

    python train.py --data_dir data/msra --bert_model_dir bert-base-chinese-pytorch --model_dir experiments/base_model --multi_gpu
    

    It will instantiate a model and train it on the training set following the hyperparameters specified in params.json. It will also evaluate some metrics on the development set.
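
    For orientation, the core of such a fine-tuning step looks roughly like the sketch below (a simplification under assumed batch shapes, not the repo's train.py; num_labels and train_loader are placeholders):

    import torch
    from pytorch_pretrained_bert import BertForTokenClassification
    from pytorch_pretrained_bert.optimization import BertAdam

    # num_labels = number of tags in tags.txt (placeholder value here)
    model = BertForTokenClassification.from_pretrained(
        "bert-base-chinese-pytorch", num_labels=7)
    optimizer = BertAdam(model.parameters(), lr=3e-5)

    model.train()
    for input_ids, label_ids in train_loader:  # batches of token ids and tag ids (assumed)
        loss = model(input_ids, labels=label_ids)  # the model returns the loss when given labels
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)  # "clip_grad" in params.json
        optimizer.step()
        optimizer.zero_grad()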

  5. Evaluation on the test set

    Once you've run many experiments and selected your best model and hyperparameters based on the performance on the development set, you can finally evaluate the performance of your model on the test set.

    If you use the default parameters, just run

    python evaluate.py
    

    Or specify parameters on the command line

    python evaluate.py --data_dir data/msra --bert_model_dir bert-base-chinese-pytorch --model_dir experiments/base_model
    

References

  • Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) [paper]
  • google-research/bert [github]
  • huggingface/pytorch-pretrained-BERT [github]
  • NVIDIA/apex [github]
  • chakki-works/seqeval [github]