
nefujiangping / entity_recognition

Licence: other
Entity recognition codes for "2019 Datagrand Cup: Text Information Extraction Challenge"

Programming Languages

python

Projects that are alternatives to or similar to entity_recognition

BiLSTM-CRF-NER-PyTorch
This repo contains a PyTorch implementation of a BiLSTM-CRF model for named entity recognition task.
Stars: ✭ 109 (+319.23%)
Mutual labels:  crf
crf4j
A complete Java port of crfpp (CRF++)
Stars: ✭ 30 (+15.38%)
Mutual labels:  crf
Computer-Vision
Implementations of some computer vision problems
Stars: ✭ 25 (-3.85%)
Mutual labels:  crf
NLP-paper
🎨🎨 NLP (natural language processing) tutorials 🎨🎨 https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (-11.54%)
Mutual labels:  crf
keras-crf-layer
Implementation of CRF layer in Keras.
Stars: ✭ 76 (+192.31%)
Mutual labels:  crf
CIP
Basic exercises in Chinese information processing
Stars: ✭ 32 (+23.08%)
Mutual labels:  crf
fastai sequence tagging
Sequence tagging for NER with ULMFiT
Stars: ✭ 21 (-19.23%)
Mutual labels:  crf
grobid-quantities
GROBID extension for identifying and normalizing physical quantities.
Stars: ✭ 53 (+103.85%)
Mutual labels:  crf
crf-seg
crf-seg: a Chinese word segmentation tool for production environments; supports custom corpora and custom models, with a clean architecture and good segmentation quality. Written in Java.
Stars: ✭ 13 (-50%)
Mutual labels:  crf
StatNLP-Framework
C++-based implementation of the StatNLP framework
Stars: ✭ 17 (-34.62%)
Mutual labels:  crf
korean ner tagging challenge
KU_NERDY, 이동엽 and 임희석 (gold prize at the 2017 Korean Information Processing System Contest) - Conference on Hangul and Korean Information Processing
Stars: ✭ 30 (+15.38%)
Mutual labels:  crf
CRFasRNNLayer
Conditional Random Fields as Recurrent Neural Networks (Tensorflow)
Stars: ✭ 76 (+192.31%)
Mutual labels:  crf
Legal-Entity-Recognition
A Dataset of German Legal Documents for Named Entity Recognition
Stars: ✭ 98 (+276.92%)
Mutual labels:  crf
deepseg
Chinese word segmentation in tensorflow 2.x
Stars: ✭ 23 (-11.54%)
Mutual labels:  crf
giantgo-render
A fast form generator based on Vue 3 and Element Plus
Stars: ✭ 28 (+7.69%)
Mutual labels:  crf
Gumbel-CRF
Implementation of NeurIPS 20 paper: Latent Template Induction with Gumbel-CRFs
Stars: ✭ 51 (+96.15%)
Mutual labels:  crf
jcrfsuite
Java interface for CRFsuite: http://www.chokkan.org/software/crfsuite/
Stars: ✭ 44 (+69.23%)
Mutual labels:  crf
grobid-ner
A Named-Entity Recogniser based on Grobid.
Stars: ✭ 38 (+46.15%)
Mutual labels:  crf
lstm-crf-tagging
No description or website provided.
Stars: ✭ 13 (-50%)
Mutual labels:  crf
crfs-rs
Pure Rust port of CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
Stars: ✭ 22 (-15.38%)
Mutual labels:  crf

Models for Entity Recognition

Some Entity Recognition models for 2019 Datagrand Cup: Text Information Extraction Challenge.

Requirements

Components of Entity Recognition

Word Embedding

  • Static Word Embedding: word2vec, GloVe
  • Contextualized Word Representation: ELMo (_elmo), refer to Sec.
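Static embeddings assign one fixed vector per word regardless of context. A minimal sketch of loading such vectors from the common GloVe/word2vec text format ("word v1 v2 ... vN" per line) with an unknown-word fallback, assuming an illustrative 3-dimensional toy input rather than this repo's actual files:

```python
# Minimal loader for static word vectors in text format, with a
# zero-vector fallback for out-of-vocabulary words.

def load_static_vectors(lines, dim):
    """Parse GloVe/word2vec-style text lines into a dict of float vectors."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split()
        if len(parts) != dim + 1:
            continue  # skip malformed lines (or a word2vec header line)
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def lookup(vectors, word, dim):
    """Return the word's vector, falling back to a zero <UNK> vector."""
    return vectors.get(word, [0.0] * dim)

sample = ["the 0.1 0.2 0.3", "crf 0.4 0.5 0.6"]
vecs = load_static_vectors(sample, dim=3)
```

ELMo, by contrast, produces a different vector for the same word in different sentences, which is why the ELMo pipeline below dumps per-sentence representations rather than a single lookup table.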

Sentence Representation

  • BiLSTM
  • DGCNN

Inference

  • sequence labeling (sequence_labeling.py)
    • CRF
    • softmax
  • predict start/end indices of entities (_pointer)
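The two inference styles encode the same gold entity differently: sequence labeling assigns one tag per token, while the pointer approach marks only the start and end positions. A sketch on a toy 6-token sentence with one entity spanning tokens 2..4 (the example and entity type are illustrative, not from this repo):

```python
# Two target encodings for the same gold entity.

def to_bioes(length, start, end, etype):
    """Sequence-labeling target: one BIOES tag per token."""
    tags = ["O"] * length
    if start == end:
        tags[start] = "S-" + etype
    else:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
        tags[end] = "E-" + etype
    return tags

def to_pointer(length, start, end):
    """Pointer target: two binary vectors marking the entity start/end."""
    s = [1 if i == start else 0 for i in range(length)]
    e = [1 if i == end else 0 for i in range(length)]
    return s, e

bioes = to_bioes(6, 2, 4, "a")
starts, ends = to_pointer(6, 2, 4)
```

With sequence labeling, a CRF (or per-token softmax) scores whole tag sequences; with the pointer encoding, the model predicts the start and end vectors directly.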

Note

Combining the three components described above (2 word embeddings × 2 sentence encoders × 3 inference methods) yields 12 possible models in all. However, this repo implements only the following 6:

  • Static Word Embedding × (BiLSTM, DGCNN) × (CRF, softmax): sequence_labeling.py
  • (Static Word Embedding, ELMo) × BiLSTM × pointer: bilstm_pointer.py and bilstm_pointer_elmo.py

The other models can be implemented with only a few code changes.
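The component arithmetic above can be made concrete by enumerating the combinations (the component names below are shorthand labels, not identifiers from this repo's code):

```python
# Enumerating the 2 x 2 x 3 = 12 component combinations, and the
# 6 of them this repo implements.
from itertools import product

embeddings = ["static", "elmo"]
encoders = ["bilstm", "dgcnn"]
inference = ["crf", "softmax", "pointer"]

all_models = list(product(embeddings, encoders, inference))

# Static embeddings pair with both encoders and CRF/softmax;
# the pointer models use BiLSTM with either embedding.
implemented = [(emb, enc, inf) for emb, enc, inf in all_models
               if (emb == "static" and inf in ("crf", "softmax"))
               or (enc == "bilstm" and inf == "pointer")]
```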

How to run

  1. Prepare data:
    1. Download the official competition data to the data folder.
    2. Generate sequence-tagging train/dev/test data with bin/trans_data.py.
    3. Prepare the vocab and tag files:
      • vocab: word vocabulary, one word per line, in word word_count format
      • tag: BIOES NER tag list, one tag per line (with O on the first line)
    4. Continue with step 2 or step 3 below:
      • step 2 is for models using static word embeddings
      • step 3 is for the model using ELMo
  2. Run a model with static word embeddings, taking word2vec as an example:
    1. Train word2vec with bin/train_w2v.py.
    2. Modify config.py.
    3. Run python sequence_labeling.py [bilstm/dgcnn] [softmax/crf] or python bilstm_pointer.py (remember to modify config.model_name before a new run, or the old model will be overwritten).
  3. Or run the model with ELMo embeddings (the contextualized sentence representation of each train/dev/test sentence is dumped to a file first and then loaded during training and evaluation; ELMo is not run on the fly):
    1. Follow the instructions described here to get contextualized sentence representations for the train_full/dev/test data from pre-trained ELMo weights.
    2. Modify config.py.
    3. Run python bilstm_pointer_elmo.py.
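Step 1.3 can be sketched as follows, assuming tokenized training sentences and a list of entity types (the toy sentences and entity types a/b/c are illustrative, not taken from the competition data):

```python
# Building the vocab file ("word word_count" per line, most frequent
# first) and the BIOES tag file (O on the first line).
from collections import Counter

def build_vocab(tokenized_sentences):
    """Count words and render 'word word_count' lines."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    return ["%s %d" % (w, c) for w, c in counts.most_common()]

def build_tags(entity_types):
    """BIOES tag list with O on the first line, as the repo expects."""
    tags = ["O"]
    for t in entity_types:
        tags += ["B-" + t, "I-" + t, "E-" + t, "S-" + t]
    return tags

vocab_lines = build_vocab([["1", "2", "3"], ["2", "3", "3"]])
tag_lines = build_tags(["a", "b", "c"])
```

Each list would then be written out one entry per line to produce the vocab and tag files.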

How to train a pure token-level ELMo from scratch?

  • Just follow the official instructions described here.
  • Some notes:
    • To train a token-level language model, modify bin/train_elmo.py:
      from vocab = load_vocab(args.vocab_file, 50)
      to vocab = load_vocab(args.vocab_file, None)
    • Modify n_train_tokens.
    • Remove char_cnn from the options.
    • Modify lstm.dim/lstm.projection_dim as you wish.
    • Example setting: n_gpus=2, n_train_tokens=94114921, lstm['dim']=2048, projection_dim=256, n_epochs=10. Training took about 17 hours on 2 GTX 1080 Ti GPUs.
  • After finishing the last step of the instructions, you can refer to the script dump_token_level_bilm_embeddings.py to dump the contextualized sentence representations of your own dataset.
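The option edits above can be sketched against the options dict that bin/train_elmo.py builds. This is a hedged sketch: the field names follow the bilm-tf options format, the starting values are assumed defaults, and the final values mirror the example setting reported above.

```python
# Assumed default options (abridged) as built by bilm-tf's
# bin/train_elmo.py; only the fields touched by the notes are shown.
options = {
    "bidirectional": True,
    "char_cnn": {"...": "..."},   # present in the default char-level config
    "lstm": {"dim": 4096, "projection_dim": 512, "n_layers": 2},
    "n_epochs": 10,
    "n_train_tokens": 768648884,  # must be replaced with your corpus size
}

# Token-level model: drop the char CNN and resize the LSTM
# to the setting reported above.
options.pop("char_cnn", None)
options["lstm"]["dim"] = 2048
options["lstm"]["projection_dim"] = 256
options["n_train_tokens"] = 94114921
```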
