riedlma / sequence_tagging

License: Apache-2.0
Named Entity Recognition (LSTM + CRF + FastText) with models for [historic] German


Named Entity Recognition with Tensorflow

This repository contains a NER implementation using Tensorflow (based on a BiLSTM + CRF architecture with character embeddings) that builds on the implementation by Guillaume Genthial. We have modified this implementation, including its documentation. The major changes are the following:

  • converted the code from Python 2 to Python 3
  • extracted parameters from the source code into a single config file
  • created a new script for tagging new files
  • created a new script and modified the source code for simple transfer learning
  • added support for several embedding types (GloVe, fastText, word2vec)
  • added support to load all embeddings of a model
  • added support to dynamically load OOV embeddings during testing

Currently, we provide models for contemporary German and historic German texts, as well as for Italian and Dutch (see the download section below).

Task of Named Entity Recognition

The task of Named Entity Recognition (NER) is to detect entity mentions in text and to predict their type. Classical NER targets the identification of locations (LOC), persons (PER), organizations (ORG), and other entities (OTH). Here is an example:

John   lives in New   York
B-PER  O     O  B-LOC I-LOC

Machine Learning Model

The model is similar to Lample et al. and Ma and Hovy. A more detailed description can be found here.

  • concatenate final states of a bi-lstm on character embeddings to get a character-based representation of each word
  • concatenate this representation to a standard word vector representation (GloVe, Word2Vec, FastText here)
  • run a bi-lstm on each sentence to extract contextual representation of each word
  • decode with a linear chain CRF
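
For illustration, the following is a minimal sketch of this architecture using tf.keras. It is not the repository's actual (Tensorflow 1) implementation, and all vocabulary sizes and layer dimensions are illustrative:

import tensorflow as tf

# Illustrative sizes; the real values come from the data and the config file.
n_words, n_chars, n_tags = 20000, 100, 9
dim_word, dim_char = 300, 100
hidden_size_char, hidden_size_lstm = 100, 300

# Inputs: word ids per sentence, character ids per word.
word_ids = tf.keras.Input(shape=(None,), dtype="int32")       # (batch, T)
char_ids = tf.keras.Input(shape=(None, None), dtype="int32")  # (batch, T, C)

# Character BiLSTM: the concatenated final states form a character-based
# representation of each word.
char_emb = tf.keras.layers.Embedding(n_chars, dim_char)(char_ids)
char_repr = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden_size_char))
)(char_emb)                                                   # (batch, T, 2*hidden_size_char)

# Word embeddings; in the repository these are pre-trained vectors
# (GloVe, word2vec, or fastText) loaded from the trimmed embedding matrix.
word_emb = tf.keras.layers.Embedding(n_words, dim_word)(word_ids)

# Sentence-level BiLSTM over the concatenated representations.
x = tf.keras.layers.Concatenate()([word_emb, char_repr])
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(hidden_size_lstm, return_sequences=True))(x)

# Per-token tag scores; decoding uses a linear-chain CRF on top of these
# scores (e.g. crf_log_likelihood / crf_decode) instead of a softmax.
logits = tf.keras.layers.Dense(n_tags)(x)
model = tf.keras.Model([word_ids, char_ids], logits)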

Requirements

To run the Python code, you need Python 3 and the packages listed in the requirements file, which can be installed easily:

pip3 install -r requirements.txt

In addition, you need to build fastText manually, as described here, using the following commands:

git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip3 install .

Windows users might face problems installing the fastText package. One solution seems to be to install the "Visual C++ 2015 Build Tools".

Run an Existing Model

To run pre-computed models, you need to install the required Python packages and download the model and the embeddings. This can be done automatically with a Python script, as described here. However, the models and the embeddings can also be downloaded manually, as described here.

Here, we fully describe how to apply the best-performing GermEval model to a new file. First, we need to download the project, the model, and the embeddings:

git clone https://github.com/riedlma/sequence_tagging
cd sequence_tagging
python3 download_model_embeddings.py GermEval

Now, you can create a new file (called test.conll) that contains one token per line, e.g. with the following content:

Diese 
Beispiel
wurde
von
Martin
Riedl
in
Stuttgart 
erstellt
.

To start the entity tagging, run the following command:

python3 test.py model_transfer_learning_conll2003_germeval_emb_wiki/config test.conll 

The output should look as follows:

Diese diese	KNOWN	O
Beispiel beispiel	KNOWN	O
wurde wurde	KNOWN	O
von von	KNOWN	O
Martin martin	KNOWN	B-PER
Riedl riedl	KNOWN	I-PER
in in	KNOWN	O
Stuttgart stuttgart	KNOWN	B-LOC
erstellt erstellt	KNOWN	O
. .	KNOWN	O

The first column is the input word; the second column is the pre-processed word (here, lowercased). The third column contains a flag that indicates whether the word was seen during training (KNOWN) or not (UNKNOWN). If the input file contains gold labels, they are included as an additional column; otherwise, no label column is present. The last column contains the predicted tags.
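
To work with this output programmatically, the BIO tags can be collapsed into entity spans. Below is a small helper for this; it is not part of the repository, and the output file name is hypothetical:

def read_entities(path):
    """Collect (text, type) pairs from tagger output lines of the form
    'word preprocessed FLAG tag', using the BIO scheme of the last column."""
    entities, current, current_type = [], [], None
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            word, tag = parts[0], parts[-1]
            if tag.startswith("B-"):
                if current:
                    entities.append((" ".join(current), current_type))
                current, current_type = [word], tag[2:]
            elif tag.startswith("I-") and current:
                current.append(word)
            else:
                if current:
                    entities.append((" ".join(current), current_type))
                current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

print(read_entities("test.conll.tagged"))  # hypothetical output file
# e.g. [('Martin Riedl', 'PER'), ('Stuttgart', 'LOC')]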

Download Models and Embeddings

We provide the best-performing models for the following datasets:

Datasets

Name           Language  Description                                                               Webpage
CoNLL 2003     German    NER dataset based on newspaper texts                                      link
GermEval 2014  German    NER dataset based on Wikipedia                                            link
ONB            German    NER dataset based on texts of the Austrian National Library (1710-1873)   link
LFT            German    NER dataset based on texts of the Dr Friedrich Teßmann Library (1926)     link
ICAB-NER09     Italian   NER dataset for Italian                                                   link
CONLL2002-NL   Dutch     NER dataset for Dutch                                                     link

Most of the provided models are trained using transfer learning techniques (see the Transfer Learning column in the table below). The models and the embeddings can be downloaded manually or automatically.

Manual Download of Models

The models can be downloaded as described in the table. They should be stored directly in the project directory. Furthermore, they need to be uncompressed (tar xzvf *.tar.gz).

Optimized for        Trained on     Transfer Learning  Embeddings         Download
GermEval 2014        CoNLL 2003     GermEval 2014      German Wikipedia   link
CoNLL 2003 (German)  GermEval 2014  CoNLL 2003         German Wikipedia   link
ONB                  GermEval 2014  ONB                German Europeana   link
LFT                  GermEval 2014  LFT                German Wikipedia   link
ICAB-NER09           ICAB-NER09     none               Italian Wikipedia  link
CONLL2002-NL         CONLL2002-NL   none               Dutch newspaper    link

The embeddings are best stored in the embeddings folder inside the project directory. We provide the full embeddings (named Complete) and filtered embeddings, which only contain the vocabulary of the task data. These filtered embeddings have also been used to train the pre-computed models. The German Wikipedia model is provided by Facebook Research.

Name  Computed on        Dimensions  Complete  Filtered
Wiki  German Wikipedia   300         link      link
Euro  German Europeana   300         link      link
Wiki  Italian Wikipedia  300         link      link
Wiki  Dutch Wikipedia    300         link      link

Automatic Download of Models

Using the Python script download_model_embeddings.py, the models and the embeddings can be downloaded automatically. In addition, the files are placed at the recommended location and uncompressed. You can choose between several options:

~ user$ python3 download_model_embeddings.py 

No download option has been specified:
python download_model_embeddings.py options

Following download options are possible:
all                 download all models and embeddings
all_models          download all models
all_embed           download all embeddings
eval                download CoNLL 2003 evaluation script
GermEval            download best model and embeddings for GermEval
CONLL2003           download best model and embeddings for CONLL2003
ONB                 download best model and embeddings for ONB
LFT                 download best model and embeddings for LFT
ICAB-NER09-Italian  download best model and embeddings for ICAB-NER09-Italian
CONLL2002-NL        download best model and embeddings for CONLL2002-NL


Train a New Model

We describe how a new model can be trained, using as an example a model trained on the GermEval 2014 dataset with pre-computed word embeddings from the German Wikipedia. First, we need to download the training data. For training a model, we expect files with two columns: the first column contains the word and the second column the label. The pipeline below also removes comment lines, keeps only the token and label columns, and maps GermEval sub-labels ending in -deriv and -part to O:

mkdir -p corpora/GermEval
wget -O corpora/GermEval/NER-de-train.tsv  https://sites.google.com/site/germeval2014ner/data/NER-de-train.tsv
wget -O corpora/GermEval/NER-de-dev.tsv  https://sites.google.com/site/germeval2014ner/data/NER-de-dev.tsv
wget -O corpora/GermEval/NER-de-test.tsv  https://sites.google.com/site/germeval2014ner/data/NER-de-test.tsv
cat corpora/GermEval/NER-de-train.tsv  | grep -v "^[#]" | cut -f2,3 |  sed "s/[^ \t]\+\(deriv\|part\)$/O/g" > corpora/GermEval/NER-de-train.tsv.conv
cat corpora/GermEval/NER-de-test.tsv  | grep -v "^[#]" | cut -f2,3 |  sed "s/[^ \t]\+\(deriv\|part\)$/O/g" > corpora/GermEval/NER-de-test.tsv.conv
cat corpora/GermEval/NER-de-dev.tsv  | grep -v "^[#]" | cut -f2,3|  sed "s/[^ \t]\+\(deriv\|part\)$/O/g" > corpora/GermEval/NER-de-dev.tsv.conv
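
For readers who prefer Python, the following sketch performs the same conversion as the shell pipeline above (the shell commands remain the documented way):

import re

def convert(src, dst):
    """Drop comment lines, keep token and label columns, and map
    GermEval sub-labels ending in -deriv/-part to O."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 3:
                fout.write("\n")  # empty lines separate sentences
                continue
            token, label = cols[1], cols[2]
            if re.search(r"(deriv|part)$", label):
                label = "O"
            fout.write(f"{token}\t{label}\n")

for split in ("train", "dev", "test"):
    convert(f"corpora/GermEval/NER-de-{split}.tsv",
            f"corpora/GermEval/NER-de-{split}.tsv.conv")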

For the training, we use the German Wikipedia embeddings from the Facebook Research group. The embeddings can be quite large (above 10GB), especially as the files will be decompressed. These (and all other embeddings) can be downloaded with the following command:

python3 download_model_embeddings.py all_embed

If you want to train on a different language, you can also check whether pre-computed embeddings are available here. To compute new embeddings, you can follow the fastText manual.

Next, the configuration needs to be edited. First, we create a directory where the model will be stored:

mkdir model_germeval

Then, we create the configuration file. For this, we use the configuration template (config.template) and copy it to the model folder:

cp config.template model_germeval/config

At minimum, all parameters that have the value TODO need to be adjusted. For the current setup, we adjust the following parameters (a more detailed description of the configuration can be found here):

[PATH]
#path where the model will be written to, $PWD refers to the directory where the configuration file is located
dir_model_output = $PWD

...
filename_train = corpora/GermEval/NER-de-train.tsv.conv 
filename_dev =   corpora/GermEval/NER-de-dev.tsv.conv 
filename_test =  corpora/GermEval/NER-de-test.tsv.conv 
... 

[EMBEDDINGS]
# dimension of the words
dim_word = 300
# dimension of the characters
dim_char = 100
# path to the embeddings that are used
filename_embeddings = ./embeddings/wiki.de.bin
# path where the embeddings defined by train/dev/test are written to
filename_embeddings_trimmed = ${PATH:dir_model_output}/wiki.de.bin.trimmed.npz
...

Before we train the model, we use the build_data.py script to build a matrix of the embeddings for the vocabulary contained in the train/dev/test files. For training and testing, only these smaller embeddings (specified in the config with filename_embeddings_trimmed) are required. The larger ones (specified with filename_embeddings) can be deleted.

python3 build_data.py model_germeval/config
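
Conceptually, the trimmed file is just a matrix with one vector per vocabulary word. A sketch of this step (assumed behavior, not the repository's exact code; the key name "embeddings" and the file paths are illustrative):

import numpy as np
import fasttext

model = fasttext.load_model("embeddings/wiki.de.bin")
with open("model_germeval/words.txt", encoding="utf-8") as f:
    vocab = [line.strip() for line in f]

matrix = np.zeros((len(vocab), model.get_dimension()), dtype=np.float32)
for i, word in enumerate(vocab):
    # fastText composes vectors from subwords, so OOV words still get a vector
    matrix[i] = model.get_word_vector(word)

np.savez_compressed("model_germeval/wiki.de.bin.trimmed.npz", embeddings=matrix)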

If you apply the model to vocabulary other than the one contained in train/dev/test, the model will not have word representations for these words and will mainly rely on the character-based word embeddings. To prevent this, the easiest way is to add further files in CoNLL format as additional parameters to the build_data.py script:

python3 build_data.py model_germeval/config vocab1.conll vocab2.conll

After that step, the new model can be trained using the following command:

python3 train.py model_germeval/config

The model can then be applied, e.g. to the test file, as follows:

python3 test.py model_germeval/config corpora/GermEval/NER-de-test.tsv.conv

Transfer Learning

To perform transfer learning, you first need to train a model, e.g. on the GermEval data as described here. Be aware that the vocabulary and the tagsets of the transfer data must already be added when training the base model. If you want to perform transfer learning, you might want to copy the model directory first, as the further learning steps will otherwise replace the previous model; take care to adjust the dir_model_output value within the configuration file. The easiest way is to add the transfer files as additional parameters when building the vocabulary, e.g.:

python3 build_data.py model_germeval/config transfer_training.conll transfer_dev.conll test_transfer.conll

However, this step needs to be performed before training the model; if you have already trained a model, you would need to re-train it with the additional vocabulary. While no parameters are explicitly fitted to these words, their embeddings will be available to the model in this way.

After the model has been trained, the transfer learning step can be performed with the transfer_learning.py script, which expects the following parameters:

python transfer_learning.py configuration transfer_training.conll transfer_dev.conll

After the training, new text files in the domain of the transfer learning data can be tagged as described here.

Predict Labels for New Text

To test a model, the test.py script is used; it expects the configuration file of the model and the test file:

python3 test.py model_configuration test_file.conll

The test script has further parameters to process several test files, handle different input formats, and write output files directly. Calling the script with the -h argument shows them:

python3 test.py -h

usage: test.py [-h] [-i {SYSTEM,FILE}] [-o {SYSTEM,FILE}] [-of OUTPUT_FOLDER]
               [-f {CONLL,TEXT,TOKEN}]
               config_file [test_files [test_files ...]]

positional arguments:
  config_file
  test_files

optional arguments:
  -h, --help            show this help message and exit
  -i {SYSTEM,FILE}, --input {SYSTEM,FILE}
                        if FILE is selected the file has to be passed as
                        additional parameter. If SYSTEM is selected the input
                        will be read from the standard input stream.
  -o {SYSTEM,FILE}, --output {SYSTEM,FILE}
                        if FILE is selected an output folder needs to be
                        specified (-of). If SYSTEM is selected, the standard
                        output stream will be used for the output.
  -of OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
  -f {CONLL,TEXT,TOKEN}, --format {CONLL,TEXT,TOKEN}

It is possible to read the input that should be tagged from the standard input stream (-i SYSTEM) or from files. Furthermore, the output can be written either to the standard output stream or to files. If no parameter is specified, the output is written to the standard output stream; otherwise, it is written to a file with the name of the test file. To avoid overwriting existing files, we advise creating and specifying an output folder with the parameter -of. Currently, we support files in CoNLL format, in token format (words are tokenized and there is one sentence per line), and in plain text format (no tokenization), as described in the table below:

Format: CONLL
Description: In CoNLL format, we expect the token in the first column. All remaining columns will be ignored.
Example:
In O
Madrid B-LOC
befinden O
sich O
Hochschulen O
, O
Museen O
und O
Kultureinrichtungen O
. O

Format: TOKEN
Description: The text is already tokenized; tokens are separated by whitespace.
Example: In Madrid befinden sich Hochschulen , Museen und Kultureinrichtungen .

Format: TEXT
Description: The text is not tokenized.
Example: In Madrid befinden sich Hochschulen, Museen und Kultureinrichtungen.

For the plain text format, nltk is required. It can be installed as follows:

pip3 install nltk
python3
import nltk
nltk.download('punkt')
exit()
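
For illustration, this is roughly what the TEXT format does before tagging (a sketch; the script's exact pre-processing may differ): sentence splitting and tokenization with nltk's punkt models:

import nltk

text = "In Madrid befinden sich Hochschulen, Museen und Kultureinrichtungen."
for sentence in nltk.sent_tokenize(text, language="german"):
    print(" ".join(nltk.word_tokenize(sentence, language="german")))
# -> In Madrid befinden sich Hochschulen , Museen und Kultureinrichtungen .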

Server for Predicting Labels for New Text

If you want to use the NER tool as a service, you can start a web server that responds to queries. For this, you can specify a port (e.g. -p 10080) and a model configuration, e.g.:

python3 test_server.py -p 10080 model_configuration

The server processes two arguments: text expects the document for which named entity labels should be predicted, and the optional argument format specifies the input format (CONLL, TEXT, TOKEN). Further information about these formats is given here. The table below shows an example query for each format for the sentence "Die Hauptstadt von Spanien ist Madrid.":

Format Example
CONLL curl "localhost:10080?format=CONLL&text=Die%20O%0AHauptstadt%20O%0Avon%20O%0ASpanien%20O%0Aist%20O%0AMadrid%20O%0A.%20O%0A"
TOKEN curl "localhost:10080?format=TOKEN&text=Die%20Hauptstadt%20von%20Spanien%20ist%20Madrid%20."
TEXT curl "localhost:10080?format=TEXT&text=Die%20Hauptstadt%20von%20Spanien%20ist%20Madrid."
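
The server can of course also be queried from Python; a minimal example using the requests package (assuming the server from above runs on localhost:10080):

import requests

response = requests.get(
    "http://localhost:10080",
    params={
        "format": "TOKEN",
        "text": "Die Hauptstadt von Spanien ist Madrid .",
    },
)
print(response.text)  # the tagged tokens, analogous to the test.py output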

Parameters in the Configuration File

The configuration file is divided into three sections. The PATH section contains all variables that specify the locations of the model and the labeled data. The EMBEDDINGS section contains all parameters for the word embeddings, and the PARAM section contains all further parameters for machine learning and pre-processing.

[PATH]
#path where the model will be written to
dir_model_output = $PWD
dir_vocab_output = ${dir_model_output}
dir_model = ${dir_model_output}/model.weights/
path_log = ${dir_model_output}/test.log


filename_train = TODO
filename_dev =   TODO
filename_test =  TODO

# these are the output paths for the vocabulary, the 
# tagsets and the characters used in the train/dev/test set
filename_words = ${dir_vocab_output}/words.txt
filename_tags = ${dir_vocab_output}/tags.txt
filename_chars = ${dir_vocab_output}/chars.txt


[EMBEDDINGS]
# dimension of the words
dim_word = 300
# dimension of the characters
dim_char = 100
# path to the embeddings that are used 
filename_embeddings = TODO
# path where the embeddings defined by train/dev/test are written to
filename_embeddings_trimmed =  ${PATH:dir_model_output}/embeddings.npz 
# models can also be trained with random embeddings that are 
# adjusted during training
use_pretrained = True
# currently we support: fasttext, glove and w2v
embedding_type = fasttext
# if using embeddings larger than 2GB this option needs to be switched on
use_large_embeddings = False
# number of embeddings that are dynamically changed during testing
oov_size = 0


# here, several parameters of the machine learning and pre-processing
# can be changed
[PARAM]
lowercase = True
max_iter = None
train_embeddings = False
nepochs = 15
dropout = 0.5
batch_size = 20
lr_method = adam
lr = 0.001
lr_decay = 0.9
clip = -1
nepoch_no_imprv = 3
hidden_size_char = 100
hidden_size_lstm = 300
use_crf = True
use_chars = True
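
The ${option} and ${section:option} references in this template correspond to Python's configparser with ExtendedInterpolation. A minimal reading sketch (assumed, not the repository's actual loader):

from configparser import ConfigParser, ExtendedInterpolation

# Read the config; values are interpolated lazily on access.
config = ConfigParser(interpolation=ExtendedInterpolation())
config.read("model_germeval/config")

print(config["EMBEDDINGS"].getint("dim_word"))  # 300
print(config["PARAM"].getint("nepochs"), config["PARAM"].getfloat("dropout"))

Note that $PWD is not standard configparser syntax; as the comment in the template says, it refers to the directory containing the configuration file, so the loader must substitute it before values that reference dir_model_output are interpolated.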

Citation

If you use this model, cite the source code of Guillaume Genthial. If you use the German models and our extensions, you can cite our paper:

@inproceedings{riedl18:_named_entit_recog_shoot_german,
  title = {A Named Entity Recognition Shootout for {German}},
  author = {Riedl, Martin and Padó, Sebastian},
  booktitle = {Proceedings of Annual Meeting of the Association for Computational Linguistics},
  series={ACL 2018},
  address = {Melbourne, Australia},
  note = {To appear},
  year = 2018
}

License

This project is licensed under the terms of the Apache 2.0 license (ASL), as are Tensorflow and its derivatives. If used for research, a citation would be appreciated.
