juand-r / Entity Recognition Datasets

Licence: mit
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.


===============================
Datasets for Entity Recognition
===============================

This repository contains datasets from several domains annotated with a variety of entity types, useful for entity recognition and named entity recognition (NER) tasks.

Datasets for NER in English

.. |check| unicode:: 0x2714

The following table lists datasets for English-language entity recognition (NER datasets in other languages are listed further below). The data directory contains information on where to obtain the datasets that could not be shared due to licensing restrictions, as well as code to convert them, where necessary, to the CoNLL 2003 format.

============== =============== ======================= =============================== ==================================
Dataset        Domain          License                 Reference                       Availability
============== =============== ======================= =============================== ==================================
CoNLL 2003     News            DUA                     Sang and Meulder, 2003          `Easy <https://github.com/patverga/torch-ner-nlp-from-scratch/tree/master/data/conll2003/>`_ `to <https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003>`_ `find <https://github.com/glample/tagger/tree/master/dataset>`_
NIST-IEER      News            None                    NIST 1999 IE-ER                 `NLTK data <https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/ieer.zip>`_
MUC-6          News            LDC                     Grishman and Sundheim, 1996     `LDC 2003T13 <https://catalog.ldc.upenn.edu/LDC2003T13>`_
OntoNotes 5    Various         LDC                     Weischedel et al., 2013         `LDC 2013T19 <https://catalog.ldc.upenn.edu/LDC2013T19>`_
BBN            Various         LDC                     Weischedel and Brunstein, 2005  `LDC 2005T33 <https://catalog.ldc.upenn.edu/LDC2005T33>`_
GMB-1.0.0      Various         None                    Bos et al., 2017                `http://gmb.let.rug.nl/data.php <http://gmb.let.rug.nl/releases/gmb-1.0.0.zip>`_
GUM-3.1.0      Wiki            Several (\*2)           Zeldes, 2017                    |check| Included here
wikigold       Wikipedia       CC-BY 4.0               Balasuriya et al., 2009         |check| Included here
Ritter         Twitter         None                    Ritter et al., 2011             `No split <https://github.com/aritter/twitter_nlp/blob/master/data/annotated/ner.txt>`_, `Train/test/dev split <http://kimi.ml.cmu.edu/transfer/data.tar.gz>`_
BTC            Twitter         CC-BY 4.0               Derczynski et al., 2016         |check| Included here
WNUT17         Social media    CC-BY 4.0               Derczynski et al., 2017         |check| Included here
i2b2-2006      Medical         DUA                     Uzuner et al., 2007             `http://www.i2b2.org <https://www.i2b2.org/NLP/DataSets/Main.php>`_
i2b2-2014      Medical         DUA                     Stubbs et al., 2015             `http://www.i2b2.org <https://www.i2b2.org/NLP/DataSets/Main.php>`_
CADEC          Medical         CSIRO                   Karimi et al., 2015             http://data.csiro.au/
AnEM           Anatomical      CC-BY-SA 3.0            Ohta et al., 2012               |check| Included here
MITRestaurant  Queries         None                    Liu et al., 2013a               `http://groups.csail.mit.edu/sls/ <https://groups.csail.mit.edu/sls/downloads/restaurant/>`__
MITMovie       Queries         None                    Liu et al., 2013b               `http://groups.csail.mit.edu/sls/ <https://groups.csail.mit.edu/sls/downloads/movie/>`__
MalwareTextDB  Malware         None                    Lim et al., 2017                `http://www.statnlp.org/ <http://www.statnlp.org/research/re/MalwareTextDB-1.0.zip>`_
re3d           Defense         Several (\*1)           DSTL, 2017                      |check| Included here
SEC-filings    Finance         CC-BY 3.0               Alvarado et al., 2015           |check| Included here
Assembly       Robotics        X                       Costa et al., 2017              X
============== =============== ======================= =============================== ==================================
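
As a rough illustration of the CoNLL 2003 format mentioned above, the sketch below reads text with one token per line, whitespace-separated columns (token first, entity tag last), blank lines between sentences and ``-DOCSTART-`` lines between documents. The function names ``read_conll`` and ``bio_spans`` and the sample text are made up for this example and are not the repository's conversion code. Individual datasets may use a different column layout or tagging scheme (IOB1 vs. IOB2), so check each corpus before relying on it.

```python
from typing import Iterable, List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # (token, entity tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-2003-style lines into sentences of (token, tag) pairs.

    Assumption: the token is the first column and the entity tag the last.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and -DOCSTART- markers close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def bio_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) spans from B-X / I-X / O tags."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for token, tag in sentence:
        prefix, _, kind = tag.partition("-")
        # An I- tag of the same type extends the open span.
        if prefix == "I" and tokens and kind == label:
            tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if tokens:
            spans.append((" ".join(tokens), label))
        if prefix in ("B", "I"):
            tokens, label = [token], kind
        else:
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


# Made-up sample in the four-column layout: token, POS, chunk, entity tag.
SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

sentences = read_conll(SAMPLE.splitlines())
for sentence in sentences:
    print(bio_spans(sentence))
```

To read a file on disk, pass an open file handle instead of ``SAMPLE.splitlines()``; the path and encoding depend on the dataset.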

Licenses

Notes on licenses:

(1) re3d ("Relationship and Entity Extraction Evaluation Dataset") contains several datasets, with different licenses. These are:

  • CC-BY-SA 3.0 (Wikipedia dataset)
  • CC BY-NC 3.0 (BBC_Online dataset)
  • CC BY 3.0 AU (Australian_Department_of_Foreign_Affairs dataset)
  • public domain (US_State_Department dataset, CENTCOM dataset)
  • UK Open Government Licence v3.0 (UK_Government dataset)
  • Delegation_of_the_European_Union_to_Syria: see https://eeas.europa.eu/delegations/syria/8157/legal-notice_en

(2) GUM 3.1.0 comprises three datasets, with licenses CC-BY 3.0, CC-BY-SA 3.0 and CC-BY-NC-SA 3.0. The annotations are licensed under CC-BY 4.0.

More detailed license information for each dataset can be found in the corresponding subdirectory.

Later ...

Datasets for NER in other languages

Lexical Named Entity resources

Code-Switching

German

Dutch

Afrikaans

Spanish

Catalan

Galician

Basque

Portuguese

French

Italian

Romanian

Greek

Hungarian

Czech

Polish

Croatian

Slovak

Slovene

Ukrainian

Serbian

Bulgarian

  • BulTreeBank (BTB)

Icelandic

  • MIM-GOLD-NER (Ingólfsdóttir, Svanhvít Lilja, Sigurjón Þorsteinsson, and Hrafn Loftsson. "Towards High Accuracy Named Entity Recognition for Icelandic." Proceedings of the 22nd Nordic Conference on Computational Linguistics. 2019): http://www.malfong.is/index.php?pg=mim_gold_ner

Danish

Norwegian

Swedish

Finnish

Estonian

Latvian and Lithuanian

Turkish

Uyghur

  • Uyghur Named Entity Relation corpus: https://github.com/kaharjan/UyNeRel (Abiderexiti et al., Annotation Schemes for Constructing Uyghur Named Entity Relation Corpus. IALP 2016)

Armenian

Coptic

Amharic

Arabic

Persian

Urdu

Hindi

Bengali

Telugu

Marathi

Punjabi

Tamil

Malayalam

Oriya/Odia

Sinhala/Sinhalese

  • LORELEI (LDC2018E57)

Thai

Indonesian

Vietnamese

Japanese

Korean

Chinese

Russian

Yoruba

Swahili

isiNdebele

Xhosa

Zulu

Sepedi

Sesotho

Setswana

Siswati

Venda

Xitsonga

Latin

A long list can be found here: http://damien.nouvels.net/resourcesen/corpora.html

References

[Alvarado et al., 2015] Alvarado, Julio Cesar Salinas, Karin Verspoor, and Timothy Baldwin. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015, pp. 84-90. 2015. Accessed: August 2018.

[Balasuriya et al., 2009] Balasuriya, Dominic, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran. Named entity recognition in Wikipedia. In Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pp. 10-18. Association for Computational Linguistics, 2009.

[Bos et al., 2017] Bos, Johan, Valerio Basile, Kilian Evang, Noortje J. Venhuizen, and Johannes Bjerva. The Groningen meaning bank. In Handbook of linguistic annotation, pp. 463-496. Springer, Dordrecht, 2017.

[Derczynski et al., 2016] Derczynski, Leon, Kalina Bontcheva, and Ian Roberts. Broad twitter corpus: A diverse named entity recognition resource. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1169-1179. 2016. Available at: https://github.com/GateNLP/broad_twitter_corpus Accessed: August 2018.

[Derczynski et al., 2017] Leon Derczynski, Eric Nichols, Marieke van Erp, Nut Limsopatham (2017) Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition, in Proceedings of the 3rd Workshop on Noisy, User-generated Text. Available at: https://noisy-text.github.io/2017/emerging-rare-entities.html

[DSTL, 2017] Defence Science and Technology Laboratory. 2017. Relationship and Entity Extraction Evaluation Dataset. https://github.com/dstl/re3d. Accessed: January 2018.

[Grishman and Sundheim, 1996] Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: A brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.

[Karimi et al., 2015] Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. Cadec: A corpus of adverse drug event annotations. Journal of biomedical informatics, 55:73-81. Available at https://data.csiro.au Accessed: November 2017.

[Lim et al., 2017] Lim, Swee Kiat, Aldrian Obaja Muis, Wei Lu, and Chen Hui Ong. MalwareTextDB: A database for annotated malware articles. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1557-1567. 2017.

[Liu et al., 2013a] Jingjing Liu, Panupong Pasupat, Scott Cyphers, and Jim Glass. 2013. Asgard: A portable architecture for multilingual dialogue systems. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8386-8390. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/restaurant/ Accessed: January 2018

[Liu et al., 2013b] Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers, and Jim Glass. 2013. Query understanding enhanced by hierarchical parsing structures. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 72-77. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/movie/ We used the trivia10k13 portion. Accessed: January 2018

[NIST, 1999 IE-ER] NIST. 1999. Information Extraction - Entity Recognition Evaluation. http://www.nist.gov/speech/tests/ieer/er_99/er_99.htm. The newswire development test data only (included in the NLTK package).

[Ohta et al., 2012] Tomoko Ohta, Sampo Pyysalo, Jun'ichi Tsujii and Sophia Ananiadou. 2012. Open-domain Anatomical Entity Mention Detection. In Proceedings of ACL 2012 Workshop on Detecting Structure in Scholarly Discourse (DSSD), pp. 27-36. Available at: http://www.nactem.ac.uk/anatomy/ and https://github.com/openbiocorpora/anem Accessed: November 2017.

[Ritter et al., 2011] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524-1534, Edinburgh, Scotland, UK., July. Association for Computational Linguistics. Accessed January 2018.

[Sang and Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.

[Stubbs et al., 2015] Amber Stubbs and Ozlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of biomedical informatics, 58:S20-S29. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.

[Uzuner et al., 2007] Ozlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association, 14(5):550-563. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.

[Weischedel and Brunstein, 2005] Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia.

[Weischedel et al., 2013] Weischedel, Ralph, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue et al. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA (2013).

[Zeldes, 2017] Amir Zeldes. 2017. The GUM corpus: creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581-612. Available at https://github.com/amir-zeldes/gum/tree/master/coref/tsv/ Accessed: November 2017.
