zliucr / CrossNER

License: MIT
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

Programming Languages

python, shell

Projects that are alternatives of or similar to CrossNER

Cluener2020
CLUENER2020: Chinese fine-grained named entity recognition
Stars: ✭ 689 (+691.95%)
Mutual labels:  named-entity-recognition, ner, sequence-labeling
huner
Named Entity Recognition for biomedical entities
Stars: ✭ 44 (-49.43%)
Mutual labels:  named-entity-recognition, corpora, ner
Ld Net
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
Stars: ✭ 148 (+70.11%)
Mutual labels:  named-entity-recognition, ner, sequence-labeling
Autoner
Learning Named Entity Tagger from Domain-Specific Dictionary
Stars: ✭ 357 (+310.34%)
Mutual labels:  named-entity-recognition, ner, sequence-labeling
Named entity recognition
Chinese named entity recognition (including implementations of several models: HMM, CRF, BiLSTM, BiLSTM+CRF)
Stars: ✭ 995 (+1043.68%)
Mutual labels:  named-entity-recognition, ner, sequence-labeling
Ncrfpp
NCRF++, a neural sequence labeling toolkit. Easy to use for any sequence labeling task (e.g., NER, POS tagging, segmentation). It includes character LSTM/CNN, word LSTM/CNN, and softmax/CRF components.
Stars: ✭ 1,767 (+1931.03%)
Mutual labels:  named-entity-recognition, ner, sequence-labeling
Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.
Stars: ✭ 2,235 (+2468.97%)
Mutual labels:  named-entity-recognition, ner, sequence-labeling
molminer
Python library and command-line tool for extracting compounds from scientific literature.
Stars: ✭ 38 (-56.32%)
Mutual labels:  named-entity-recognition, ner
Pytorch Bert Crf Ner
A Korean named entity recognizer built with KoBERT and CRF (BERT+CRF based named entity recognition model for Korean)
Stars: ✭ 236 (+171.26%)
Mutual labels:  named-entity-recognition, ner
Transferable-E2E-ABSA
Transferable End-to-End Aspect-based Sentiment Analysis with Selective Adversarial Learning (EMNLP'19)
Stars: ✭ 62 (-28.74%)
Mutual labels:  sequence-labeling, domain-adaptation
sequence labeling tf
Sequence Labeling in Tensorflow
Stars: ✭ 18 (-79.31%)
Mutual labels:  named-entity-recognition, sequence-labeling
Multi Task Nlp
multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
Stars: ✭ 221 (+154.02%)
Mutual labels:  named-entity-recognition, sequence-labeling
Ner Datasets
Datasets to train supervised classifiers for Named-Entity Recognition in different languages (Portuguese, German, Dutch, French, English)
Stars: ✭ 220 (+152.87%)
Mutual labels:  named-entity-recognition, ner
Bert ner
NER with BERT
Stars: ✭ 240 (+175.86%)
Mutual labels:  named-entity-recognition, ner
Neural sequence labeling
A TensorFlow implementation of a neural sequence labeling model, able to tackle sequence labeling tasks such as POS tagging, chunking, NER, and punctuation restoration.
Stars: ✭ 214 (+145.98%)
Mutual labels:  named-entity-recognition, sequence-labeling
Ner Bert Pytorch
PyTorch solution for the named entity recognition task using Google AI's pre-trained BERT model.
Stars: ✭ 249 (+186.21%)
Mutual labels:  named-entity-recognition, ner
Spacy Lookup
Named Entity Recognition based on dictionaries
Stars: ✭ 212 (+143.68%)
Mutual labels:  named-entity-recognition, ner
Bilstm Lan
Hierarchically-Refined Label Attention Network for Sequence Labeling
Stars: ✭ 241 (+177.01%)
Mutual labels:  named-entity-recognition, sequence-labeling
KoBERT-NER
NER Task with KoBERT (with Naver NLP Challenge dataset)
Stars: ✭ 76 (-12.64%)
Mutual labels:  named-entity-recognition, ner
neural name tagging
Code for "Reliability-aware Dynamic Feature Composition for Name Tagging" (ACL2019)
Stars: ✭ 39 (-55.17%)
Mutual labels:  named-entity-recognition, ner

CrossNER

License: MIT

NEW (2021/1/5): Fixed several annotation errors (thanks to Youliang Yuan for the help).

CrossNER: Evaluating Cross-Domain Named Entity Recognition (Accepted in AAAI-2021) [PDF]

CrossNER is a fully-labeled collection of named entity recognition (NER) data spanning five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence), with specialized entity categories for each domain. Additionally, CrossNER includes unlabeled domain-related corpora for the corresponding five domains. We hope that our collected dataset (CrossNER) will catalyze research in the NER domain adaptation area.

A quick overview of this paper is available on our blog. If you use the dataset in an academic paper, please consider citing the following paper.

@article{liu2020crossner,
      title={CrossNER: Evaluating Cross-Domain Named Entity Recognition}, 
      author={Zihan Liu and Yan Xu and Tiezheng Yu and Wenliang Dai and Ziwei Ji and Samuel Cahyawijaya and Andrea Madotto and Pascale Fung},
      year={2020},
      eprint={2012.04373},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

The CrossNER Dataset

Data Statistics and Entity Categories

Data statistics of unlabeled domain corpora, labeled NER samples and entity categories for each domain.

Data Examples

Data examples from the five collected domains. Each domain has its own specialized entity categories.

Domain Overlaps

Vocabulary overlaps between domains (%). “Reuters” denotes the Reuters News domain, “Science” denotes the natural science domain, and “Litera.” denotes the literature domain.

Download

Labeled NER data: Labeled NER data for the five target domains (Politics, Science, Music, Literature, and AI) and the source domain (Reuters News from the CoNLL-2003 shared task) can be found in the ner_data folder.

Unlabeled Corpora: Unlabeled domain-related corpora (domain-level, entity-level, task-level and integrated) for the five target domains can be downloaded here.

Dependencies

  • Install PyTorch (tested with PyTorch 1.2.0 and Python 3.6)
  • Install transformers (tested with transformers 3.0.2); see the example install command below
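
For reference, an environment matching the tested versions above could be set up along these lines (adjust the PyTorch build for your CUDA/CPU setup):

❱❱❱ pip install torch==1.2.0 transformers==3.0.2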

Domain-Adaptive Pre-Training (DAPT)

Configurations

  • --train_data_file: The file path of the pre-training corpus.
  • --output_dir: The output directory where the pre-trained model is saved.
  • --model_name_or_path: The model to continue pre-training from.
❱❱❱ python run_language_modeling.py --output_dir=politics_spanlevel_integrated --model_type=bert --model_name_or_path=bert-base-cased --do_train --train_data_file=corpus/politics_integrated.txt --mlm

This example performs span-level pre-training using the integrated corpus in the politics domain. The code is modified from run_language_modeling.py in huggingface transformers (3.0.2).
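
To make "span-level" concrete, here is a toy sketch of the idea only (the repository's actual modification of run_language_modeling.py may differ): instead of masking tokens independently as standard MLM does, whole contiguous spans are masked at once.

import random

def mask_spans(token_ids, mask_id, mask_prob=0.15, max_span_len=5):
    # Toy span-level masking: replace random contiguous spans with the [MASK] id
    # until roughly mask_prob of the tokens are masked.
    ids = list(token_ids)
    if not ids:
        return ids
    budget = max(1, int(len(ids) * mask_prob))
    masked = 0
    while masked < budget:
        span_len = random.randint(1, min(max_span_len, len(ids)))
        start = random.randrange(0, len(ids) - span_len + 1)
        for i in range(start, start + span_len):
            if ids[i] != mask_id:
                ids[i] = mask_id
                masked += 1
    return ids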

Baselines

Configurations

  • --tgt_dm: Target domain that the model needs to adapt to.
  • --conll: Using source domain data (News domain from CoNLL 2003) for pre-training.
  • --joint: Jointly train using source and target domain data.
  • --num_tag: Number of label types for the target domain (details are in src/dataloader.py; see the example after this list).
  • --ckpt: Checkpoint path to load the pre-trained model.
  • --emb_file: Word-level embeddings file path.
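
As a concrete illustration of how --num_tag is derived (the authoritative label sets live in src/dataloader.py, and the exact label strings in the data may differ from the names used here): with BIO tagging, each entity category contributes a B- and an I- label, plus a single O label. The politics domain has 9 entity categories, which gives the --num_tag 19 used in the commands below.

# Illustrative only -- see src/dataloader.py for the real label sets.
politics_entity_types = [
    "politician", "person", "organization", "politicalparty",
    "event", "election", "country", "location", "misc",
]
bio_labels = ["O"] + [f"{prefix}-{etype}"
                      for etype in politics_entity_types
                      for prefix in ("B", "I")]
num_tag = len(bio_labels)  # 2 * 9 + 1 = 19, matching --num_tag 19 below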

Directly Fine-tune

Directly fine-tune the pre-trained model (span-level + integrated corpus) on the target (politics) domain.

❱❱❱ python main.py --exp_name politics_directly_finetune --exp_id 1 --num_tag 19 --ckpt politics_spanlevel_integrated/pytorch_model.bin --tgt_dm politics --batch_size 16
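
Conceptually, --ckpt points the NER model at the weights produced by the DAPT step. A rough standalone equivalent using the transformers API (not the repository's own loading code in main.py) looks like this; run_language_modeling.py saves both config.json and pytorch_model.bin in its output directory, so it can be loaded directly:

from transformers import BertForTokenClassification

# Hypothetical sketch: reuse the domain-adaptively pre-trained encoder and put a
# freshly initialized token-classification head on top (19 politics BIO tags).
model = BertForTokenClassification.from_pretrained(
    "politics_spanlevel_integrated",  # output_dir of the DAPT step above
    num_labels=19,                    # --num_tag 19 for the politics domain
)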

Jointly Train

Initialize the model with the pre-trained model (span-level + integrated corpus). Then, jointly train the model with the source and target (politics) domain data.

❱❱❱ python main.py --exp_name politics_jointly_train --exp_id 1 --num_tag 19 --conll --joint --ckpt politics_spanlevel_integrated/pytorch_model.bin --tgt_dm politics
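
One simple way to realize this kind of joint training (a sketch of the general idea under assumed model, optimizer, and dataloader objects, not the repository's implementation) is to interleave source (CoNLL) and target (politics) batches:

from itertools import cycle

def joint_epoch(model, optimizer, source_loader, target_loader):
    # Alternate one source-domain batch and one target-domain batch per step;
    # the source loader is cycled, so the epoch length follows the target loader.
    model.train()
    for source_batch, target_batch in zip(cycle(source_loader), target_loader):
        for batch in (source_batch, target_batch):
            optimizer.zero_grad()
            loss = model(**batch)[0]  # assumes the loss is the first output when labels are given
            loss.backward()
            optimizer.step()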

Pre-train then Fine-tune

Initialize the model with the pre-trained model (span-level + integrated corpus). Then, after training on the source domain data, fine-tune it on the target (politics) domain.

❱❱❱ python main.py --exp_name politics_pretrain_then_finetune --exp_id 1 --num_tag 19 --conll --ckpt politics_spanlevel_integrated/pytorch_model.bin --tgt_dm politics --batch_size 16

BiLSTM-CRF (Lample et al. 2016)

Jointly train a BiLSTM-CRF (word + char level) on the source domain and the target (politics) domain. (We use glove.6B.300d.txt for word-level embeddings and torchtext.vocab.CharNGram() for character-level embeddings.)

❱❱❱ python main.py --exp_name politics_bilstm_wordchar --exp_id 1 --num_tag 19 --tgt_dm politics --bilstm --dropout 0.3 --lr 1e-3 --usechar --emb_dim 400
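
To illustrate where --emb_dim 400 comes from (an assumption based on the embeddings named above, not the repository's own loading code): a 300-dimensional GloVe word vector concatenated with a 100-dimensional CharNGram vector gives a 400-dimensional input per token.

import torch
from torchtext.vocab import GloVe, CharNGram

# Illustrative sketch: build a 400-d token representation from glove.6B.300d (300)
# plus CharNGram character n-gram vectors (100).
glove = GloVe(name="6B", dim=300)
charngram = CharNGram()

def embed_token(token):
    word_vec = glove[token]                 # shape (300,); zeros for out-of-vocabulary words
    char_vec = charngram[token].squeeze(0)  # shape (100,)
    return torch.cat([word_vec, char_vec])  # shape (400,), matching --emb_dim 400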

Coach (Liu et al. 2020)

Jointly train Coach (word + char level) on the source domain and the target (politics) domain.

❱❱❱ python main.py --exp_name politics_coach_wordchar --exp_id 1 --num_tag 3 --entity_enc_hidden_dim 200 --tgt_dm politics --coach --dropout 0.5 --lr 1e-4 --usechar --emb_dim 400

Other Notes

  • In the aforementioned baselines, we provide running commands for the politics target domain as an example. The running commands for other target domains can be found in the run.sh file.
