
nefujiangping / entity_recognition

Licence: other
Entity recognition codes for "2019 Datagrand Cup: Text Information Extraction Challenge"

Programming Languages

python

Projects that are alternatives to or similar to entity_recognition

BiLSTM-CRF-NER-PyTorch
This repo contains a PyTorch implementation of a BiLSTM-CRF model for named entity recognition task.
Stars: ✭ 109 (+319.23%)
Mutual labels:  crf
crf4j
A complete Java port of crfpp (CRF++)
Stars: ✭ 30 (+15.38%)
Mutual labels:  crf
Computer-Vision
Implementations of some computer vision problems
Stars: ✭ 25 (-3.85%)
Mutual labels:  crf
NLP-paper
🎨🎨 NLP (natural language processing) tutorials 🎨🎨 https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (-11.54%)
Mutual labels:  crf
keras-crf-layer
Implementation of CRF layer in Keras.
Stars: ✭ 76 (+192.31%)
Mutual labels:  crf
CIP
Basic exercises in Chinese information processing
Stars: ✭ 32 (+23.08%)
Mutual labels:  crf
fastai sequence tagging
Sequence tagging for NER with ULMFiT
Stars: ✭ 21 (-19.23%)
Mutual labels:  crf
grobid-quantities
GROBID extension for identifying and normalizing physical quantities.
Stars: ✭ 53 (+103.85%)
Mutual labels:  crf
crf-seg
crf-seg: a Chinese word segmentation tool for production environments; supports custom corpora and custom models, with a clean architecture and good segmentation quality. Written in Java.
Stars: ✭ 13 (-50%)
Mutual labels:  crf
StatNLP-Framework
C++-based implementation of the StatNLP framework
Stars: ✭ 17 (-34.62%)
Mutual labels:  crf
korean ner tagging challenge
KU_NERDY, 이동엽 and 임희석 (gold prize at the 2017 Korean Information Processing System Contest) - Conference on Hangul and Korean Information Processing
Stars: ✭ 30 (+15.38%)
Mutual labels:  crf
CRFasRNNLayer
Conditional Random Fields as Recurrent Neural Networks (Tensorflow)
Stars: ✭ 76 (+192.31%)
Mutual labels:  crf
Legal-Entity-Recognition
A Dataset of German Legal Documents for Named Entity Recognition
Stars: ✭ 98 (+276.92%)
Mutual labels:  crf
deepseg
Chinese word segmentation in tensorflow 2.x
Stars: ✭ 23 (-11.54%)
Mutual labels:  crf
giantgo-render
A fast form generator based on Vue 3 and Element Plus
Stars: ✭ 28 (+7.69%)
Mutual labels:  crf
Gumbel-CRF
Implementation of NeurIPS 20 paper: Latent Template Induction with Gumbel-CRFs
Stars: ✭ 51 (+96.15%)
Mutual labels:  crf
jcrfsuite
Java interface for CRFsuite: http://www.chokkan.org/software/crfsuite/
Stars: ✭ 44 (+69.23%)
Mutual labels:  crf
grobid-ner
A Named-Entity Recogniser based on Grobid.
Stars: ✭ 38 (+46.15%)
Mutual labels:  crf
lstm-crf-tagging
No description or website provided.
Stars: ✭ 13 (-50%)
Mutual labels:  crf
crfs-rs
Pure Rust port of CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
Stars: ✭ 22 (-15.38%)
Mutual labels:  crf

Models for Entity Recognition

Some Entity Recognition models for 2019 Datagrand Cup: Text Information Extraction Challenge.

Requirements

Components of Entity Recognition

Word Embedding

  • Static Word Embedding: word2vec, GloVe
  • Contextualized Word Representation: ELMo (_elmo), refer to Sec.
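Static embeddings assign one fixed vector per word regardless of context. A minimal sketch of loading such vectors from the common GloVe/word2vec text format ("word v1 v2 ... vN" per line) with an unknown-word fallback, assuming an illustrative 3-dimensional toy input rather than this repo's actual files:

```python
# Minimal loader for static word vectors in text format, with a
# zero-vector fallback for out-of-vocabulary words.

def load_static_vectors(lines, dim):
    """Parse GloVe/word2vec-style text lines into a dict of float vectors."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split()
        if len(parts) != dim + 1:
            continue  # skip malformed lines (or a word2vec header line)
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def lookup(vectors, word, dim):
    """Return the word's vector, falling back to a zero <UNK> vector."""
    return vectors.get(word, [0.0] * dim)

sample = ["the 0.1 0.2 0.3", "crf 0.4 0.5 0.6"]
vecs = load_static_vectors(sample, dim=3)
```

ELMo, by contrast, produces a different vector for the same word in different sentences, which is why the ELMo pipeline below dumps per-sentence representations rather than a single lookup table.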

Sentence Representation

  • BiLSTM
  • DGCNN

Inference

  • sequence labeling (sequence_labeling.py)
    • CRF
    • softmax
  • predict start/end indices of entities (_pointer)
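The two inference styles encode the same gold entity differently: sequence labeling assigns one tag per token, while the pointer approach marks only the start and end positions. A sketch on a toy 6-token sentence with one entity spanning tokens 2..4 (the example and entity type are illustrative, not from this repo):

```python
# Two target encodings for the same gold entity.

def to_bioes(length, start, end, etype):
    """Sequence-labeling target: one BIOES tag per token."""
    tags = ["O"] * length
    if start == end:
        tags[start] = "S-" + etype
    else:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
        tags[end] = "E-" + etype
    return tags

def to_pointer(length, start, end):
    """Pointer target: two binary vectors marking the entity start/end."""
    s = [1 if i == start else 0 for i in range(length)]
    e = [1 if i == end else 0 for i in range(length)]
    return s, e

bioes = to_bioes(6, 2, 4, "a")
starts, ends = to_pointer(6, 2, 4)
```

With sequence labeling, a CRF (or per-token softmax) scores whole tag sequences; with the pointer encoding, the model predicts the start and end vectors directly.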

Note

Combining the three components described above (2 word embeddings × 2 sentence encoders × 3 inference methods) yields 12 possible models in all. However, this repo implements only the following 6:

  • Static Word Embedding × (BiLSTM, DGCNN) × (CRF, softmax): sequence_labeling.py
  • (Static Word Embedding, ELMo) × BiLSTM × pointer: bilstm_pointer.py and bilstm_pointer_elmo.py

The other models can be implemented with only a few code changes.
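The component arithmetic above can be made concrete by enumerating the combinations (the component names below are shorthand labels, not identifiers from this repo's code):

```python
# Enumerating the 2 x 2 x 3 = 12 component combinations, and the
# 6 of them this repo implements.
from itertools import product

embeddings = ["static", "elmo"]
encoders = ["bilstm", "dgcnn"]
inference = ["crf", "softmax", "pointer"]

all_models = list(product(embeddings, encoders, inference))

# Static embeddings pair with both encoders and CRF/softmax;
# the pointer models use BiLSTM with either embedding.
implemented = [(emb, enc, inf) for emb, enc, inf in all_models
               if (emb == "static" and inf in ("crf", "softmax"))
               or (enc == "bilstm" and inf == "pointer")]
```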

How to run

  1. Prepare data:
    1. Download the official competition data to the data folder.
    2. Generate sequence-tagging train/dev/test data with bin/trans_data.py.
    3. Prepare the vocab and tag files:
      • vocab: word vocabulary, one word per line, in word word_count format
      • tag: BIOES NER tag list, one tag per line (with O on the first line)
    4. Continue with step 2 or step 3 below:
      • step 2 is for models using static word embeddings
      • step 3 is for the model using ELMo
  2. Run a model with static word embeddings, taking word2vec as an example:
    1. Train word2vec with bin/train_w2v.py.
    2. Modify config.py.
    3. Run python sequence_labeling.py [bilstm/dgcnn] [softmax/crf] or python bilstm_pointer.py (remember to modify config.model_name before a new run, or the old model will be overwritten).
  3. Or run the model with ELMo embeddings (the contextualized sentence representation of each train/dev/test sentence is dumped to a file first and then loaded during training and evaluation; ELMo is not run on the fly):
    1. Follow the instructions described here to get contextualized sentence representations for the train_full/dev/test data from pre-trained ELMo weights.
    2. Modify config.py.
    3. Run python bilstm_pointer_elmo.py.
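Step 1.3 can be sketched as follows, assuming tokenized training sentences and a list of entity types (the toy sentences and entity types a/b/c are illustrative, not taken from the competition data):

```python
# Building the vocab file ("word word_count" per line, most frequent
# first) and the BIOES tag file (O on the first line).
from collections import Counter

def build_vocab(tokenized_sentences):
    """Count words and render 'word word_count' lines."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    return ["%s %d" % (w, c) for w, c in counts.most_common()]

def build_tags(entity_types):
    """BIOES tag list with O on the first line, as the repo expects."""
    tags = ["O"]
    for t in entity_types:
        tags += ["B-" + t, "I-" + t, "E-" + t, "S-" + t]
    return tags

vocab_lines = build_vocab([["1", "2", "3"], ["2", "3", "3"]])
tag_lines = build_tags(["a", "b", "c"])
```

Each list would then be written out one entry per line to produce the vocab and tag files.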

How to train a pure token-level ELMo from scratch?

  • Just follow the official instructions described here.
  • Some notes:
    • To train a token-level language model, modify bin/train_elmo.py:
      from vocab = load_vocab(args.vocab_file, 50)
      to vocab = load_vocab(args.vocab_file, None)
    • Modify n_train_tokens.
    • Remove char_cnn from the options.
    • Modify lstm.dim/lstm.projection_dim as you wish.
    • Example setting: n_gpus=2, n_train_tokens=94114921, lstm['dim']=2048, projection_dim=256, n_epochs=10. Training took about 17 hours on 2 GTX 1080 Ti GPUs.
  • After finishing the last step of the instructions, you can refer to the script dump_token_level_bilm_embeddings.py to dump the contextualized sentence representations of your own dataset.
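The option edits above can be sketched against the options dict that bin/train_elmo.py builds. This is a hedged sketch: the field names follow the bilm-tf options format, the starting values are assumed defaults, and the final values mirror the example setting reported above.

```python
# Assumed default options (abridged) as built by bilm-tf's
# bin/train_elmo.py; only the fields touched by the notes are shown.
options = {
    "bidirectional": True,
    "char_cnn": {"...": "..."},   # present in the default char-level config
    "lstm": {"dim": 4096, "projection_dim": 512, "n_layers": 2},
    "n_epochs": 10,
    "n_train_tokens": 768648884,  # must be replaced with your corpus size
}

# Token-level model: drop the char CNN and resize the LSTM
# to the setting reported above.
options.pop("char_cnn", None)
options["lstm"]["dim"] = 2048
options["lstm"]["projection_dim"] = 256
options["n_train_tokens"] = 94114921
```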
