
WING-NUS / Neural-ParsCit

Licence: other
Neuralized version of the Reference String Parser component of the ParsCit package.

Programming Languages

python
139335 projects - #7 most used programming language
perl
6916 projects
Dockerfile
14818 projects

Projects that are alternatives of or similar to Neural-ParsCit

NER BiLSTM CRF Chinese
BiLSTM_CRF Chinese named entity recognition
Stars: ✭ 46 (-24.59%)
Mutual labels:  bilstm-crf
naacl2019-select-pretraining-data-for-ner
BiLSTM-CRF model for NER
Stars: ✭ 15 (-75.41%)
Mutual labels:  bilstm-crf
scholarly
Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
Stars: ✭ 761 (+1147.54%)
Mutual labels:  scholarly-articles
BERT-BiLSTM-CRF
A Keras implementation of BERT-BiLSTM-CRF
Stars: ✭ 40 (-34.43%)
Mutual labels:  bilstm-crf
deepseg
Chinese word segmentation in TensorFlow 2.x
Stars: ✭ 23 (-62.3%)
Mutual labels:  bilstm-crf
NER-in-Chinese-Text
NLP Keras BiLSTM+CRF
Stars: ✭ 53 (-13.11%)
Mutual labels:  bilstm-crf
PDF-Resume-Information-Extraction
A write-up of a Tianchi competition entry. Extracts 18 fields from PDF resumes: name, date of birth, gender, phone number, highest education, place of origin, registered city/county of residence, political status, graduating institution, employer, job description, position, project name, project responsibility, degree, graduation date, employment dates, and project dates.
Stars: ✭ 64 (+4.92%)
Mutual labels:  bilstm-crf
BiLSTM-CRF-NER-PyTorch
This repo contains a PyTorch implementation of a BiLSTM-CRF model for named entity recognition task.
Stars: ✭ 109 (+78.69%)
Mutual labels:  bilstm-crf
xinlp
Java implementations of the algorithms from the later chapters of Li Hang's "Statistical Learning Methods": the box-and-balls EM algorithm (extended to GMM training), HMM word segmentation (including parameter training) and CRF segmentation (using a parameter model trained with CRF++), plus a BiLSTM+CRF implementation in TensorFlow and an XinAnalyzer wrapper for Lucene.
Stars: ✭ 21 (-65.57%)
Mutual labels:  bilstm-crf
sequence tagging
Named Entity Recognition (LSTM + CRF + FastText) with models for [historic] German
Stars: ✭ 25 (-59.02%)
Mutual labels:  bilstm-crf
ChineseNER
All about Chinese NER
Stars: ✭ 241 (+295.08%)
Mutual labels:  bilstm-crf
dimensions-api-lab
Research data analytics tutorials using the Dimensions Analytics API
Stars: ✭ 68 (+11.48%)
Mutual labels:  scholarly-metadata
Zh Ner Tf
A very simple BiLSTM-CRF model for Chinese Named Entity Recognition (TensorFlow)
Stars: ✭ 2,063 (+3281.97%)
Mutual labels:  bilstm-crf-model

Neural ParsCit

This is the official repository of Neural ParsCit, which is under active development at the National University of Singapore (NUS), Singapore.


Neural ParsCit is a citation string parser that parses reference strings into their component fields, such as Author, Journal, Location, and Date. It uses a Long Short-Term Memory (LSTM) network, a deep learning model chosen because it is designed for sequence labeling tasks such as ours. The input to the model is word embeddings, vector representations of words; we provide both word embeddings and character embeddings as input to the network.
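
As an illustration of the idea only (not the repository's actual code), each token's input vector can be formed by concatenating its word embedding with a character-level representation; the dimensions and the averaging step below are toy assumptions:

import numpy as np

# Toy dimensions; the real model's sizes are configurable via train.py.
word_dim, char_dim = 5, 3

word_embedding = np.random.rand(word_dim)                     # looked up from pre-trained vectors
char_embeddings = np.random.rand(len("ParsCit"), char_dim)    # one vector per character

# A simple character-level summary; the actual model learns this with an LSTM.
char_summary = char_embeddings.mean(axis=0)

# Concatenated representation fed to the sequence labeler.
token_input = np.concatenate([word_embedding, char_summary])
print(token_input.shape)  # (word_dim + char_dim,)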

Initial setup

To use the tagger, you need Python 2.7 (it works in Python 3 but is not fully supported) with NumPy, Theano, and Gensim installed. scikit-learn is needed for model evaluation if you are training a new model.

You can use environment variables to set the following:

  • MODEL_PATH: Path to the model's parameters
  • WB_PATH: Path to the word embeddings
  • TIMEOUT: Timeout for gunicorn when starting the Flask app. Increase this if the Flask app is unable to start because building the model takes too long. [Default: 60]
  • NUM_WORKERS: Number of workers which gunicorn spawns. [Default: 1]
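
For example, you might export these before starting the app (the paths and values below are illustrative, not shipped defaults):

export MODEL_PATH=models/neuralParsCit/   # path to the model's parameters (illustrative)
export WB_PATH=vectors_with_unk.bin       # path to the word embeddings (illustrative)
export TIMEOUT=120
export NUM_WORKERS=2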

Using virtualenv in Linux systems

virtualenv -ppython2.7 .venv
source .venv/bin/activate
pip install -r requirements/<env>.txt

Where <env> is one of prod, dev, or test.

Using Docker

  1. Build the image: docker build -t theano-gensim - < Dockerfile
  2. Run the repo mounted to the container: docker run -it -v $(pwd):/usr/src --name np theano-gensim:latest /bin/bash

Word Embeddings

The word embeddings do not come with this repository. You can obtain the word embeddings with <UNK> from the WING website. Please read the next section on the availability of <UNK> in word embeddings.

You will need to extract the contents of the word embedding archive (vectors_with_unk.tar.gz) into the root directory of this repository by running tar xfz vectors_with_unk.tar.gz.

Embeddings Without <UNK>

If the word embeddings provided do not have <UNK>, your instance will not benefit from lazy loading of the word vectors and, hence, from the reduced memory requirements.

Without <UNK>, up to 7.5 GB of memory is required, because the entire set of word vectors must be instantiated in memory to build the embedding matrix. With <UNK>, the requirement is much lower: at most 4.5 GB.
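
As a quick sanity check, the sketch below verifies whether <UNK> is present using Gensim. It assumes the extracted archive yields a word2vec-format binary named vectors_with_unk.bin; adjust the path and flags to match your files:

from gensim.models import KeyedVectors

# Load the pre-trained vectors; binary word2vec format is an assumption here.
vectors = KeyedVectors.load_word2vec_format('vectors_with_unk.bin', binary=True)

# Gensim 3.x exposes .vocab; in Gensim 4+ use key_to_index instead.
# If '<UNK>' is missing, the full embedding matrix must be built in memory (~7.5 GB).
print('<UNK>' in vectors.vocab)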

Parse citation strings

Command Line

The fastest way to use the parser is to run the state-of-the-art pre-trained model as follows:

./run.py --model_path models/neuralParsCit/ --pre_emb <vectors.bin> --run shell
./run.py --model_path models/neuralParsCit/ --pre_emb <vectors.bin> --run file -i input_file -o output_file

The script can run interactively, or input can be passed in a file. In an interactive session, reference strings are entered one at a time and the result is printed to standard output. If the file option is chosen, the input is read from the file specified by the -i option and the output is written to the file specified by the -o option; this way, multiple citation strings can be parsed in one run.
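
For example, a file-mode run might look like the following (the file names and the assumption of one reference string per line are illustrative, not a format this repository mandates):

./run.py --model_path models/neuralParsCit/ --pre_emb vectors_with_unk.bin --run file -i refs.txt -o refs_parsed.txt

where refs.txt contains lines such as:

Prasad, A., Kaur, M., Kan, M.-Y.: Neural ParsCit: a deep learning-based reference string parser. International Journal on Digital Libraries 19, 323-337 (2018).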

The state-of-the-art trained model is provided in the models folder and is named neuralParsCit. The binary file for the word embeddings is provided in the Docker image of the current version of Neural ParsCit. The hyperparameter discarded is the number of embeddings not used in our model; retained words have a frequency greater than 0 in the ACM citation literature from 1994 to 2014.

Using a Web Server

Note: This service is not Python 3 compatible due to Unicode handling.

The web server (a Flask app) provides a REST API.

To run the web server:

docker run --rm -it -p 8000:8000 -e TIMEOUT=60 -v $(pwd):/usr/src --name np theano-gensim:latest /bin/bash

Then, inside the container, start gunicorn:

gunicorn -b 0.0.0.0:8000 -w $NUM_WORKERS --timeout $TIMEOUT run_app:app

The REST API documentation can be found at http://localhost:8000/docs
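
A rough client-side sketch is shown below; the endpoint path and payload shape are hypothetical, so consult http://localhost:8000/docs for the actual routes and schema:

import requests

# Hypothetical route and field name; verify against the API docs at /docs.
response = requests.post(
    'http://localhost:8000/parscit/parse',
    json={'string': 'Prasad, A., Kaur, M., Kan, M.-Y.: Neural ParsCit. International Journal on Digital Libraries 19 (2018).'},
)
print(response.status_code)
print(response.text)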

Train a model

To train your own model, use the train.py script and provide the locations of the training, development, and testing sets:

./train.py --train train.txt --dev dev.txt --test test.txt

The training script will automatically name the model and store it in ./models/. There are many parameters you can tune (use of a CRF layer, dropout rate, embedding dimension, LSTM hidden layer size, etc.). To see all parameters, simply run:

./train.py --help

Input files for the training script must follow this format: each word of the citation string and its corresponding tag appear together on their own line, and citation strings are separated by blank lines.
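
A minimal illustration of this layout is given below; the tag names and whitespace separation are illustrative only, and the actual tag set should match the ParsCit training data. The next citation string would follow after a blank line:

Kan author
, author
M.-Y. author
Neural title
ParsCit title
. title
2018 date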

Details about the training data and the experiments can be found in the article below. The training data and the CRF baseline can be downloaded from https://github.com/knmnyn/ParsCit. Please consider citing the following publication if you use Neural ParsCit:

@article{animesh2018neuralparscit,
  title={Neural ParsCit: A Deep Learning Based Reference String Parser},
  author={Prasad, Animesh and Kaur, Manpreet and Kan, Min-Yen},
  journal={International Journal on Digital Libraries},
  volume={19},
  pages={323-337},
  year={2018},
  publisher={Springer},
  url={https://link.springer.com/article/10.1007/s00799-018-0242-1}
}