Multi-Cell_LSTM
Code for the ACL 2020 paper "Multi-Cell Compositional LSTM for NER Domain Adaptation".
Introduction
Cross-domain NER is a challenging yet practical problem. Entity mentions can differ greatly across domains, but the correlations between entity types are relatively stable. We investigate a multi-cell compositional LSTM structure for multi-task learning, modeling each entity type with a separate cell state. With the help of entity-typed units, cross-domain knowledge transfer can be made at the entity-type level. Theoretically, the resulting distinct feature distributions for each entity type make the model more powerful for cross-domain transfer. Empirically, experiments on four few-shot and zero-shot datasets show that our method significantly outperforms a series of multi-task learning methods and achieves the best results.
For more details, please refer to our paper:
Multi-Cell Compositional LSTM for NER Domain Adaptation
Overall Structure
The Entity Typed cells (ET cells) correspond to the source- and target-domain entity types (including O, the outside tag in NER).
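To make the idea concrete, the following is a minimal NumPy sketch of one recurrent step with a separate cell state per entity type. It is not the authors' implementation: the shared gates, the per-type candidate transforms `W_c`, and the attention vectors `v` used to compose the typed cells are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MultiCellLSTMStep:
    """Sketch of a multi-cell compositional LSTM step (simplified).

    Keeps one cell state per entity type (including O); the composed
    hidden state attends over the typed cells. All parameter shapes
    and the exact gating scheme are illustrative, not the paper's.
    """

    def __init__(self, input_dim, hidden_dim, num_types, seed=0):
        rng = np.random.default_rng(seed)
        d = input_dim + hidden_dim
        # shared input/forget/output gates, as in a standard LSTM
        self.W_i = rng.normal(0.0, 0.1, (hidden_dim, d))
        self.W_f = rng.normal(0.0, 0.1, (hidden_dim, d))
        self.W_o = rng.normal(0.0, 0.1, (hidden_dim, d))
        # one candidate-cell transform per entity type
        self.W_c = rng.normal(0.0, 0.1, (num_types, hidden_dim, d))
        # attention parameters for composing the typed cells
        self.v = rng.normal(0.0, 0.1, (num_types, hidden_dim))

    def step(self, x, h_prev, cells_prev):
        z = np.concatenate([x, h_prev])
        i = sigmoid(self.W_i @ z)
        f = sigmoid(self.W_f @ z)
        o = sigmoid(self.W_o @ z)
        # update every entity-typed cell with its own candidate state
        cells = np.stack([f * c_prev + i * np.tanh(W @ z)
                          for W, c_prev in zip(self.W_c, cells_prev)])
        # compose the typed cells into one cell state via attention
        alpha = softmax((self.v * cells).sum(axis=1))
        c = alpha @ cells  # (num_types,) @ (num_types, hidden) -> (hidden,)
        h = o * np.tanh(c)
        return h, cells, alpha
```

The attention weights `alpha` can be read as a soft entity-type distribution per token, which is what allows type-level knowledge transfer across domains.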
Requirements
Python 3
PyTorch 1.0+
allennlp 0.8.2 (Optional)
pytorch-pretrained-bert 0.6.1 (Optional)
Word Embeddings and Pretrained LM Weights
GloVe 100-dimension word vectors (download from Here with key ifyk)
PubMed 200-dimension word vectors (refer to Here; download from Here with key dr9k)
ELMo weights (download from Here with key a9h6)
BERT-base weights (download from Here with key gbn1)
BioBERT-base weights (download from Here with key zsep)
Datasets
Supervised Domain Adaptation (SDA):
CoNLL-2003 English NER data (in SDA/data/conll03_En)
Broad Twitter corpus (in SDA/data/broad_twitter_corpus, or download from Here with key 0yko)
BioNLP'13PC and BioNLP'13CG datasets
Twitter corpus (refer to Here; download from Here with key bn75)
Unsupervised Domain Adaptation (UDA):
CoNLL-2003 English NER data (in SDA/data/conll03_En).
CBS SciTech News (test set) (in UDA/data/tech/tech.test).
LM raw data
SciTech news domain raw data: download with key 4834 and put it in UDA/data/tech.
Entity dictionary used in UDA
The named entity dictionary was collected by Peng et al. and is provided in UDA/data/tech/conll2003_dict.
Usage
Train
Both SDA and UDA models can be trained with the following command:
python main.py --config train.config
The file train.config contains the dataset paths and model hyperparameters.
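The exact keys accepted by main.py are defined in the shipped config files; the fragment below only illustrates the kind of settings such a file holds, and every key and value in it is hypothetical:

```
### hypothetical train.config sketch -- consult the shipped file for real keys
train_dir=SDA/data/conll03_En/train.txt
dev_dir=SDA/data/conll03_En/dev.txt
test_dir=SDA/data/broad_twitter_corpus/test.txt
word_emb_dir=glove.6B.100d.txt
hidden_dim=200
learning_rate=0.015
iteration=100
```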
Decode
Both SDA and UDA models can be decoded with the following command:
python main.py --config decode.config
The file decode.config contains the dataset paths and the paths of the trained models.
For example, you can download our trained SDA models with key matp, unzip the two files .dset and .model, and put them into SDA/saved_models. Then the command above reproduces our reported results on the Broad Twitter corpus. The UDA models (download with key 2s6n) are decoded in the same way.
Cite:
If you use our code, please cite our paper as follows:
@inproceedings{jia-zhang-2020-multi,
title = "Multi-Cell Compositional {LSTM} for {NER} Domain Adaptation",
author = "Jia, Chen and Zhang, Yue",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.524",
pages = "5906--5917"
}