OnlpLab / NEMO

Licence: Apache-2.0 license
Neural Modeling for Named Entities and Morphology (Hebrew NER)

Programming Languages

python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects
Dockerfile
14818 projects

Projects that are alternatives of or similar to NEMO

Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.
Stars: ✭ 2,235 (+8840%)
Mutual labels:  ner
Ner Datasets
Datasets to train supervised classifiers for Named-Entity Recognition in different languages (Portuguese, German, Dutch, French, English)
Stars: ✭ 220 (+780%)
Mutual labels:  ner
Pytorch ner bilstm cnn crf
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, implemented in PyTorch
Stars: ✭ 249 (+896%)
Mutual labels:  ner
Persian Ner
A large tagged corpus for Persian named entity recognition
Stars: ✭ 183 (+632%)
Mutual labels:  ner
Monpa
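MONPA (罔拍) is a multi-task model that provides Traditional Chinese word segmentation, part-of-speech tagging, and named entity recognition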
Stars: ✭ 203 (+712%)
Mutual labels:  ner
Webstruct
NER toolkit for HTML data
Stars: ✭ 230 (+820%)
Mutual labels:  ner
Sequence tagging
Named Entity Recognition (LSTM + CRF) - Tensorflow
Stars: ✭ 1,889 (+7456%)
Mutual labels:  ner
Chinese Names Corpus
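Chinese names corpus. Name generator. Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, English names. Useful for Chinese word segmentation and person-name entity recognition.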
Stars: ✭ 3,053 (+12112%)
Mutual labels:  ner
Spacy Lookup
Named Entity Recognition based on dictionaries
Stars: ✭ 212 (+748%)
Mutual labels:  ner
Malaya
Natural Language Toolkit for bahasa Malaysia, https://malaya.readthedocs.io/
Stars: ✭ 239 (+856%)
Mutual labels:  ner
Marktool
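A web-based, general-purpose text annotation tool that supports large-scale entity annotation, relation annotation, event annotation, text classification, automatic annotation based on dictionary matching and regular-expression matching, and standard-name annotation for normalization. It also supports iterative text annotation and nested entity annotation. Annotation guidelines are customizable and, within the same task type, can be "created once and reused many times". Hierarchical entity sets expand the range of entity types, and a new, efficient annotation method improves user experience and annotation efficiency. In addition, the tool adds a review stage in which the results of multiple annotators can be checked for consistency and adjusted, improving the accuracy and reliability of the annotated corpus.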
Stars: ✭ 190 (+660%)
Mutual labels:  ner
Dataturks
ML data annotations made super easy for teams. Just upload data, add your team and build training/evaluation dataset in hours.
Stars: ✭ 200 (+700%)
Mutual labels:  ner
Pytorch Bert Crf Ner
A Korean named entity recognizer built with KoBERT and CRF (BERT+CRF based Named Entity Recognition model for Korean)
Stars: ✭ 236 (+844%)
Mutual labels:  ner
Bert Sklearn
a sklearn wrapper for Google's BERT model
Stars: ✭ 182 (+628%)
Mutual labels:  ner
Zh Ner Keras
details
Stars: ✭ 252 (+908%)
Mutual labels:  ner
Jiagu
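Jiagu deep learning natural language processing toolkit: knowledge graph relation extraction, Chinese word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, new word discovery, keywords, text summarization, text clustering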
Stars: ✭ 2,368 (+9372%)
Mutual labels:  ner
Nlp Tools
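😋 This project aims to implement Chinese word segmentation, part-of-speech tagging, and named entity recognition (NER) with Tensorflow, based on BiLSTM+CRF.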
Stars: ✭ 225 (+800%)
Mutual labels:  ner
KoBERT-NER
NER Task with KoBERT (with Naver NLP Challenge dataset)
Stars: ✭ 76 (+204%)
Mutual labels:  ner
Ner Bert Pytorch
PyTorch solution of named entity recognition task Using Google AI's pre-trained BERT model.
Stars: ✭ 249 (+896%)
Mutual labels:  ner
Bert ner
Ner with Bert
Stars: ✭ 240 (+860%)
Mutual labels:  ner

🐠🐠 NEMO2 - Neural Modeling for Named Entities and Morphology - Hebrew NER

Introduction

Code and models for neural modeling of Hebrew NER. Described in the TACL paper "Neural Modeling for Named Entities and Morphology (NEMO2)" along with extensive experiments on the different modeling scenarios provided in this repository.

Main Features

  1. Trained on the Hebrew NER and Morphology NEMO corpus of gold annotated Modern Hebrew news articles.
  2. Multiple modeling options to go from raw Hebrew text to morpheme and/or token-level NER boundaries.
  3. Neural model implementation of NCRF++
  4. bclm is used for reading and transforming morpho-syntactic information layers.

Setup

Prerequisites:

  1. Clone this NEMO repo: git clone https://github.com/OnlpLab/NEMO.git
  2. Enter the repo directory: cd NEMO
  3. Install the requirements, preferably in a virtual env: pip install -r requirements.txt
  4. Unpack model files: gunzip data/*.gz
  5. Install yap: https://github.com/OnlpLab/yap

To run API server

  1. In the YAP folder, run the YAP API server: ./yap api
  2. In the NEMO folder, run the NEMO API server: uvicorn api_main:app --port 8090

To run on file input (CLI): nemo.py

  1. Change YAP_PATH in config.py to the path of your local yap executable.

Setup Using Docker

  1. Run docker-compose up (pulling, building and/or startup will take a few minutes, depending on your bandwidth).
  2. That's it. You now have the NEMO API running and available at local port 8090.
    1. The YAP API docker is also running in the background; you can make it available by uncommenting the last two lines of docker-compose.yml.

Usage

API Usage

  1. Once the API server is up, check out the API documentation by opening (http://localhost:8090/docs) in your browser.
  2. You can find the available API endpoints and more usage examples in api_usage.ipynb.
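
As a quick way to see which endpoints a running server offers without opening the browser, the sketch below lists them from the OpenAPI schema. It assumes the server publishes its schema at /openapi.json (the FastAPI default; this README only mentions the /docs page), and the function names are illustrative rather than part of NEMO.

```python
import json
from urllib.request import urlopen

# Assumption: the schema lives at /openapi.json (FastAPI's default location).
SCHEMA_URL = "http://localhost:8090/openapi.json"

# Keys of an OpenAPI path item that denote HTTP operations.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def fetch_schema(url=SCHEMA_URL):
    """Download and parse the OpenAPI schema (needs a running NEMO API server)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def list_endpoints(schema):
    """Return sorted (METHOD, path) pairs found in an OpenAPI schema dict."""
    endpoints = []
    for path, path_item in schema.get("paths", {}).items():
        for key in path_item:
            # Skip non-operation keys such as "parameters" or "summary".
            if key in HTTP_METHODS:
                endpoints.append((key.upper(), path))
    return sorted(endpoints)
```

With the server up, `list_endpoints(fetch_schema())` returns the available routes; the request and response formats for each are documented in api_usage.ipynb.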

File Input Usage (CLI)

  1. All you need to do is run nemo.py with a specific command (scenario), on a text file of Hebrew sentences separated by a line-break.
  2. You can run a neural NER model directly, or choose a full end-to-end scenario that includes morphological segmentation and alignments (described fully in the next section). e.g.:
    • the run_ner_model command with the token-single model will tokenize sentences and run the token-single model:
      • python nemo.py run_ner_model token-single example.txt example_output.txt
    • the morph_hybrid command runs the end-to-end segmentation and NER pipeline which provided our best performing morpheme-level NER boundaries:
      • python nemo.py morph_hybrid example.txt example_output_MORPH.txt
  3. You can find outputs of different commands on the input in example.txt in: morph_hybrid_align_tokens, morph_hybrid, morph_yap, multi_align_hybrid, single
  4. For a full list of the available commands please consult the next section and the inline documentation at the end of nemo.py.
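
If you want to drive the CLI from a Python script (for example, to process several files), a thin subprocess wrapper is one option. This is a minimal sketch: the helper names are hypothetical, and the only invocation it relies on is the run_ner_model example shown above.

```python
import subprocess
import sys


def build_nemo_command(command, *args, script="nemo.py"):
    """Build the argument list for one nemo.py invocation."""
    return [sys.executable, script, command, *args]


def run_nemo(command, *args, repo_dir="."):
    """Run nemo.py from repo_dir; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_nemo_command(command, *args), cwd=repo_dir, check=True)
```

For instance, `run_nemo("run_ner_model", "token-single", "example.txt", "example_output.txt", repo_dir="NEMO")` mirrors the first command above, assuming the repo was cloned into a folder named NEMO.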

Models and Scenarios

Models are all standard Bi-LSTM-CRF with char encoding (LSTM/CNN) from NCRFpp, with pre-trained fastText embeddings. The models differ in:

  1. Input units: morphemes (morph) vs. tokens (token-*)
  2. Output label set: single sequence labels (token-single, e.g. B-ORG) vs. multi-labels (token-multi; atomic labels, e.g. O-ORG^B-ORG^I-ORG) that predict, in order, the labels for the morphemes the token is made of.
[Figures: token-based models; morpheme-based model]
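
To make the multi-label idea concrete, here is a small sketch that unpacks one token-multi label into per-morpheme labels. It assumes "^" is the separator, as in the example label above; the function names and the placeholder morphemes are illustrative, and the sketch does not try to handle a mismatch between the number of labels and the number of predicted morphemes.

```python
def split_multi_label(multi_label, sep="^"):
    """Split an atomic multi-label into its ordered per-morpheme labels."""
    return multi_label.split(sep)


def pair_with_morphemes(morphemes, multi_label):
    """Pair each morpheme of a token with its label, in order."""
    labels = split_multi_label(multi_label)
    if len(labels) != len(morphemes):
        # A real pipeline needs an alignment strategy here; this sketch does not.
        raise ValueError("number of labels and morphemes differ")
    return list(zip(morphemes, labels))


# Placeholder morpheme names; the label is the example used in this README.
pairs = pair_with_morphemes(["m1", "m2", "m3"], "O-ORG^B-ORG^I-ORG")
```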

Morphemes must be predicted. This is done by performing morphological disambiguation (MD). We offer two options to do so:

  1. Standard pipeline: MD using YAP. This is used in the morph_yap command, which runs our morph NER model on the output of YAP joint segmentation.
  2. Hybrid pipeline: MD using our best performing Hybrid approach, which uses the output of the token-multi model to reduce the MD option space. This is used in morph_hybrid, multi_align_hybrid and morph_hybrid_align_tokens. We will explain these scenarios next.
MD Approach | Commands
Standard (figure: Standard MD) | morph_yap
Hybrid (figure: Hybrid MD) | morph_hybrid, multi_align_hybrid, morph_hybrid_align_tokens

Finally, to get our desired output (tokens/morphemes), we can choose between different scenarios, some involving extra post-processing alignments:

  1. To get morpheme-level labels we have two options:
    • Run our morph NER model on predicted morphemes. Commands: morph_yap or morph_hybrid (better).
    • token-multi labels can be aligned with predicted morphemes to get morpheme-level boundaries. Command: multi_align_hybrid.
Scenario | Commands
Run morph NER on predicted morphemes | morph_yap, morph_hybrid
Multi predictions aligned with predicted morphemes | multi_align_hybrid
  2. To get token-level labels we have three options:
    • run_ner_model command with token-single model.
    • The predicted labels of the token-multi model can be mapped to token-single labels to get standard token-single output. The command multi_to_single does this end-to-end.
    • Morpheme-level output can be aligned back to token-level boundaries. Command: morph_hybrid_align_tokens (this achieved best token-level results in our experiments).
Scenario | Commands
Run token-single | run_ner_model token-single
Map token-multi to token-single | multi_to_single
Align morph NER with tokens | morph_hybrid_align_tokens
  • Note: while the morph_hybrid* scenarios offer the best performance, they are slightly less efficient since they require running both the morph and token-multi NER models (yap calls take up most of the runtime anyway, so the difference is not very significant).
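
The scenarios above can be summarized as a small lookup. The command names are the ones listed in this section; the dictionary and helper functions are only an illustrative convenience and are not part of nemo.py.

```python
# Commands grouped by the kind of output they produce (taken from the text above).
COMMANDS_BY_OUTPUT = {
    "morpheme": ["morph_yap", "morph_hybrid", "multi_align_hybrid"],
    "token": ["run_ner_model token-single", "multi_to_single", "morph_hybrid_align_tokens"],
}

# Scenarios reported above as giving the best results for each output level.
BEST_BY_OUTPUT = {
    "morpheme": "morph_hybrid",
    "token": "morph_hybrid_align_tokens",
}


def best_command(output_level):
    """Return the best-performing command for 'morpheme' or 'token' output."""
    if output_level not in BEST_BY_OUTPUT:
        raise ValueError(f"unknown output level: {output_level!r}")
    return BEST_BY_OUTPUT[output_level]


def uses_both_ner_models(command):
    """True for the morph_hybrid* scenarios, which run both morph and token-multi."""
    return command.startswith("morph_hybrid")
```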

Important Notes

  1. NCRFpp was great for our experiments on the NEMO corpus (which is fixed data, known in advance), but it comes with some caveats for real-life scenarios of arbitrary text:
    • fastText is not used on the fly to obtain vectors for OOV words (i.e. those that were not seen in our Wikipedia corpus). Instead, it is used as a regular embedding matrix. Hence the full generalization capacities of fastText, as shown in our experiments, are not available in the currently provided models, which will perform slightly worse than they could on arbitrary text. In our experiments we created such a matrix in advance with all the words in the NEMO corpus and used it during training. Information on training your own model with your own vocabulary is given in the next section.
    • If you do wish to replicate our reported results on the Hebrew treebank, download the *oov* models from here and extract to the data/ folder (they already appear in config.py).
  2. In the near future we plan to publish a cleaner end-to-end implementation, including use of our new AlephBERT pre-trained Transformer models.
  3. For archiving and reproducibility purposes, our original code used for experiments and analysis can be found in the following repos: https://github.com/cjer/NCRFpp, https://github.com/cjer/NER (beware - 2 years of Jupyter notebooks).
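
The OOV caveat in the first note can be illustrated with a toy comparison between a fixed embedding matrix and subword composition. This is a simplified sketch, not the NCRFpp or fastText implementation: the shared unknown-word vector, the n-gram dictionary (fastText hashes n-grams into buckets), and all names are assumptions made for illustration.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in fastText-style boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def lookup_fixed(word, matrix, unk="<unk>"):
    """Fixed matrix: every unseen word collapses to the same unknown vector."""
    return matrix.get(word, matrix[unk])


def lookup_subword(word, ngram_vectors, dim, n_min=3, n_max=6):
    """Subword composition: average the vectors of the word's known n-grams."""
    known = [ngram_vectors[g] for g in char_ngrams(word, n_min, n_max) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
```

With a fixed matrix, two different unseen words receive the same vector, whereas subword composition can still tell them apart through their n-grams.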

Training your own model

We provide template NCRF++ config files. These files already contain the hyperparameters we used in our training. To train your own model:

  1. Copy the config for the variant (token-multi, token-single, morph) you wish to use from the ncrf_train_configs folder.
  2. Change the parameter word_emb_dir to that of an embedding vectors file in standard word2vec textual format. You can use the fastText bin models we make available (in the next section) or any other embedding vectors of your choice.
  3. Run the following in your shell:
python ncrf_main.py --config <path_to_config> --device <gpu_device_number>
  4. For more information, please consult NCRF++ documentation.
  5. To evaluate your trained models, please consult the evaluation section.
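
Step 2 can also be scripted. The sketch below rewrites one key in the text of a config file, assuming the NCRF++ config format of one key=value pair per line with # comments; the helper name and the file names in the usage note are hypothetical.

```python
def set_config_value(config_text, key, value):
    """Return config_text with `key` set to `value` (appended if missing)."""
    lines = []
    replaced = False
    for line in config_text.splitlines():
        stripped = line.strip()
        is_setting = not stripped.startswith("#") and "=" in stripped
        if is_setting and stripped.split("=", 1)[0].strip() == key:
            lines.append(f"{key}={value}")
            replaced = True
        else:
            # Comments, blank lines and other settings are kept unchanged.
            lines.append(line)
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

A typical use would read your copied config file, call `set_config_value(text, "word_emb_dir", "my_vectors.vec")`, and write the result back before running ncrf_main.py.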

Morpheme and Word Embeddings

The word embeddings we trained and used in our models are available:

  1. Space-delimited tokens (traditional word embeddings): fastText (bin, text), GloVe, word2vec
  2. Morphemes: fastText (bin, text), GloVe, word2vec

These were trained on a 2013 Wiki dump corpus by Yoav Goldberg, which we re-tokenized and then re-parsed using YAP:

  1. Space-delimited tokens
  2. Morphemes, automatic YAP segmentation (using the morpheme FORM as the unit for embedding)
  3. CONLL files of full morpho-syntactic output of YAP

Evaluation

To evaluate your predictions against gold, use the ne_evaluate_mentions.py script. Evaluation looks for an exact match of string and entity category, but is slightly different from the standard CoNLL2003 evaluation commonly used for NER. The reason is that predicted segmentation differs from gold, so positional indexes of sequence labels cannot be used. What we do instead is extract multi-sets of entity mentions and use set operations to compute precision, recall and F1-score. You can find a more detailed discussion of evaluation in the NEMO2 paper.

To evaluate an output prediction file against a gold file use:

python ne_evaluate_mentions.py <path_to_gold_ner> <path_to_predicted_ner>

If you're within python, just call ne_evaluate_mentions.evaluate_files(...) with the same parameters.
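
The multi-set idea described above can be sketched as follows. This is a minimal illustration and not the code of ne_evaluate_mentions.py: it assumes mentions are (category, text) pairs read off BIO-style labels (E- and S- prefixes are also accepted), and the function names are hypothetical.

```python
from collections import Counter


def extract_mentions(labeled_units):
    """Collect (category, text) mentions from (form, label) pairs."""
    mentions = []
    forms, category = [], None

    def flush():
        nonlocal forms, category
        if forms:
            mentions.append((category, " ".join(forms)))
        forms, category = [], None

    for form, label in labeled_units:
        prefix, _, cat = label.partition("-")
        if prefix == "O" or not cat:
            flush()
            continue
        if prefix in ("B", "S") or cat != category:
            flush()  # a new mention starts on this unit
            category = cat
        forms.append(form)
        if prefix in ("E", "S"):
            flush()  # the mention ends on this unit
    flush()
    return mentions


def score_mentions(gold, predicted):
    """Precision, recall and F1 over multi-sets of mentions."""
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    # Multi-set intersection keeps the minimum count of each shared mention.
    true_pos = sum((gold_counts & pred_counts).values())
    precision = true_pos / sum(pred_counts.values()) if pred_counts else 0.0
    recall = true_pos / sum(gold_counts.values()) if gold_counts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Because only the mention strings and categories are compared, the score does not depend on gold and predicted sequences having the same segmentation or length.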

Ben-Mordecai Corpus

In our NEMO2 paper we also evaluate our models on the Ben-Mordecai Hebrew NER Corpus (BMC). The 3 random splits we used can be found here.

Citations

If you use any of the NEMO2 code, models, embeddings or the NEMO corpus, please cite the NEMO2 paper:

@article{10.1162/tacl_a_00404,
    author = {Bareket, Dan and Tsarfaty, Reut},
    title = "{Neural Modeling for Named Entities and Morphology (NEMO2)}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {9},
    pages = {909-928},
    year = {2021},
    month = {09},
    abstract = "{Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically rich languages (MRLs) pose a challenge to this basic formulation, as the boundaries of named entities do not necessarily coincide with token boundaries, rather, they respect morphological boundaries. To address NER in MRLs we then need to answer two fundamental questions, namely, what are the basic units to be labeled, and how can these units be detected and classified in realistic settings (i.e., where no gold morphology is available). We empirically investigate these questions on a novel NER benchmark, with parallel token-level and morpheme-level NER annotations, which we develop for Modern Hebrew, a morphologically rich-and-ambiguous language. Our results show that explicitly modeling morphological boundaries leads to improved NER performance, and that a novel hybrid architecture, in which NER precedes and prunes morphological decomposition, greatly outperforms the standard pipeline, where morphological decomposition strictly precedes NER, setting a new performance bar for both Hebrew NER and Hebrew morphological decomposition tasks.}",
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00404},
    url = {https://doi.org/10.1162/tacl\_a\_00404},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00404/1962472/tacl\_a\_00404.pdf},
}

If you use NEMO2's NER models, please also cite NCRF++:

@inproceedings{yang2018ncrf,  
 title={{NCRF}++: An Open-source Neural Sequence Labeling Toolkit},  
 author={Yang, Jie and Zhang, Yue},  
 booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
 Url = {http://aclweb.org/anthology/P18-4013},
 year={2018}  
}