Multi-Cell_LSTM

Multi-Cell Compositional LSTM for NER Domain Adaptation, code for ACL 2020 paper.

Introduction

Cross-domain NER is a challenging yet practical problem. Entity mentions can differ greatly across domains, but the correlations between entity types are relatively stable. We investigate a multi-cell compositional LSTM structure for multi-task learning, modeling each entity type with a separate cell state. With the help of entity-typed units, cross-domain knowledge transfer can be made at the entity-type level. Theoretically, the resulting distinct feature distributions for each entity type make the model more powerful for cross-domain transfer. Empirically, experiments on four few-shot and zero-shot datasets show that our method significantly outperforms a series of multi-task learning methods and achieves the best results.

For more details, please refer to our paper:

Multi-Cell Compositional LSTM for NER Domain Adaptation

Overall Structure

(Figure: overall structure of the multi-cell compositional LSTM)

The entity-typed cells (ET cells) correspond to the source- and target-domain entity types (including O, which is used as the outside tag in NER).
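To make the structure concrete, below is a minimal PyTorch sketch of the idea (not the authors' exact formulation; MultiCellLSTMCell, et_cells and type_attn are illustrative names): each entity type owns its own candidate cell state, and a softmax over the types composes them into a single cell update.

import torch
import torch.nn as nn

class MultiCellLSTMCell(nn.Module):
    # Illustrative sketch of one multi-cell compositional LSTM step.
    def __init__(self, input_size, hidden_size, num_entity_types):
        super().__init__()
        # Standard input/forget/output gates, computed jointly.
        self.gates = nn.Linear(input_size + hidden_size, 3 * hidden_size)
        # One candidate cell transform per entity type (an "ET cell").
        self.et_cells = nn.ModuleList(
            nn.Linear(input_size + hidden_size, hidden_size)
            for _ in range(num_entity_types))
        # Attention deciding how much each ET cell contributes.
        self.type_attn = nn.Linear(input_size + hidden_size, num_entity_types)

    def forward(self, x, state):
        h, c = state                                  # each (batch, hidden)
        z = torch.cat([x, h], dim=-1)
        i, f, o = torch.sigmoid(self.gates(z)).chunk(3, dim=-1)
        # Per-type candidate cells: (batch, num_types, hidden).
        cand = torch.stack([torch.tanh(m(z)) for m in self.et_cells], dim=1)
        # Compositional weights over entity types: (batch, num_types, 1).
        w = torch.softmax(self.type_attn(z), dim=-1).unsqueeze(-1)
        c_comp = (w * cand).sum(dim=1)                # composed candidate
        c_new = f * c + i * c_comp
        h_new = o * torch.tanh(c_new)
        return h_new, (h_new, c_new)

# Example: five ET cells, e.g. PER/LOC/ORG/MISC plus O.
cell = MultiCellLSTMCell(100, 200, num_entity_types=5)
x, h, c = torch.randn(8, 100), torch.zeros(8, 200), torch.zeros(8, 200)
out, (h, c) = cell(x, (h, c))                         # out: (8, 200)

The per-type composition weights are what allow supervision at the entity-type level in both the source and target domains.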

Requirements

Python 3 
PyTorch 1.0+
allennlp 0.8.2 (Optional)
pytorch-pretrained-bert 0.6.1 (Optional)
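The two optional packages are only needed for the contextualized-embedding variants (allennlp for ELMo, pytorch-pretrained-bert for BERT). Assuming a pip-based environment, a typical setup with the versions listed above is:

pip install torch allennlp==0.8.2 pytorch-pretrained-bert==0.6.1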

Word Embeddings and Pretrained LM Weights

GloVe 100-dimensional word vectors (Download from Here with key ifyk)
PubMed 200-dimensional word vectors (Refer to Here) (Download from Here with key dr9k)
ELMo Weights (Download from Here with key a9h6)
BERT-base Weights (Download from Here with key gbn1)
BioBERT-base Weights (Download from Here with key zsep)

Datasets

Supervised Domain Adaptation (SDA):

CoNLL-2003 English NER data (In: SDA/data/conll03_En) (a format example follows this list)
Broad Twitter corpus (In: SDA/data/broad_twitter_corpus) (or download from Here with key 0yko)
BioNLP'13PC and BioNLP'13CG datasets
Twitter corpus (Refer to Here) (Download from Here with key bn75)
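All of the corpora above are token-level NER datasets. For reference, CoNLL-2003 uses a one-token-per-line column format (token, POS tag, chunk tag, entity tag) with blank lines between sentences, as in the fragment below; the column layouts of the other corpora may differ.

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O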

Unsupervised Domain Adaptation (UDA):

CoNLL-2003 English NER data (In: SDA/data/conll03_En).
CBS SciTech News (test set) (In: UDA/data/tech/tech.test).

LM raw data

SciTech news domain raw data: Download with key 4834 and put it in UDA/data/tech.

Entity dictionary used in UDA

The named entity dictionary was collected by Peng et al. and is located in UDA/data/tech/conll2003_dict.
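Such a dictionary can provide distant, entity-type-level supervision on raw target-domain text. Purely as an illustration (dict_match is a hypothetical helper; the repository's matching logic and dictionary file format may differ), a greedy longest-match lookup might assign BIO labels as follows:

def dict_match(tokens, entity_dict, max_len=5):
    # Greedily match token spans (longest first) against a dictionary
    # mapping lower-cased surface strings to entity types.
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n]).lower()
            if span in entity_dict:
                labels[i] = "B-" + entity_dict[span]
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + entity_dict[span]
                i += n
                break
        else:
            i += 1
    return labels

print(dict_match("He lives in New York".split(), {"new york": "LOC"}))
# -> ['O', 'O', 'O', 'B-LOC', 'I-LOC']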

Usage

Train

For both SDA and UDA, training can be run with the following command:

python main.py --config train.config

The file train.config contains the dataset paths and model hyperparameters.
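The exact keys are defined by the repository's config parser; purely as a hypothetical illustration (every path and key name below is invented, so consult the shipped train.config for the real ones), such a file typically pairs data paths with hyperparameters:

train_dir=SDA/data/conll03_En/train.txt
dev_dir=SDA/data/broad_twitter_corpus/dev.txt
word_emb_dir=embeddings/glove.100d.txt
hidden_dim=200
learning_rate=0.015
iteration=100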

Decode

For both SDA and UDA, decoding can be run with the following command:

python main.py --config decode.config

The file decode.config contains the dataset paths and the paths to trained models.
For example, you can download our trained SDA models with key matp, unzip the two files (.dset and .model), and put them into SDA/saved_models. The above command then reproduces our reported result on the Broad Twitter corpus. The UDA models (download with key 2s6n) are decoded in the same way.

Cite:

If you use our code, please cite our paper as follows:

@inproceedings{jia-zhang-2020-multi,
    title = "Multi-Cell Compositional {LSTM} for {NER} Domain Adaptation",
    author = "Jia, Chen  and  Zhang, Yue",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.524",
    pages = "5906--5917"
}