
ZihanWangKi / Crossweigh

CrossWeigh: Training Named Entity Tagger from Imperfect Annotations


Projects that are alternatives of or similar to Crossweigh

Entity Recognition Datasets
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Stars: ✭ 891 (+879.12%)
Mutual labels:  datasets, named-entity-recognition
Ner Datasets
Datasets to train supervised classifiers for Named-Entity Recognition in different languages (Portuguese, German, Dutch, French, English)
Stars: ✭ 220 (+141.76%)
Mutual labels:  datasets, named-entity-recognition
Open Semantic Search Apps
Python/Django based web apps and web user interfaces for search, structure (metadata management like thesauri, ontologies, annotations and named entities) and data import (ETL, like text extraction, OCR and crawling filesystems or websites)
Stars: ✭ 55 (-39.56%)
Mutual labels:  named-entity-recognition
Gopup
Data interfaces: Baidu, Google, Toutiao, and Weibo indices; macroeconomic data; interest rates; currency exchange rates; Qianlima and unicorn companies; Xinwen Lianbo transcripts; movie box-office data; university lists; epidemic data…
Stars: ✭ 1,229 (+1250.55%)
Mutual labels:  datasets
Seq2annotation
A general sequence labeling library based on TensorFlow & PaddlePaddle (currently including BiLSTM+CRF, Stacked-BiLSTM+CRF, and IDCNN+CRF, with more algorithms being added) for sequence labeling tasks such as Chinese word segmentation (tokenization), part-of-speech (POS) tagging, and named entity recognition (NER).
Stars: ✭ 70 (-23.08%)
Mutual labels:  named-entity-recognition
Wongnai Corpus
Collection of Wongnai's datasets
Stars: ✭ 57 (-37.36%)
Mutual labels:  datasets
Nested Ner Tacl2020 Transformers
Implementation of Nested Named Entity Recognition using BERT
Stars: ✭ 76 (-16.48%)
Mutual labels:  named-entity-recognition
Iob2corpus
Japanese IOB2 tagged corpus for Named Entity Recognition.
Stars: ✭ 51 (-43.96%)
Mutual labels:  named-entity-recognition
Turkish Bert Nlp Pipeline
BERT-base NLP pipeline for Turkish: NER, sentiment analysis, question answering, etc.
Stars: ✭ 85 (-6.59%)
Mutual labels:  named-entity-recognition
Coco Annotator
✏️ Web-based image segmentation tool for object detection, localization, and keypoints
Stars: ✭ 1,138 (+1150.55%)
Mutual labels:  datasets
Atis dataset
The ATIS (Airline Travel Information System) Dataset
Stars: ✭ 81 (-10.99%)
Mutual labels:  datasets
Colour
Colour Science for Python
Stars: ✭ 1,131 (+1142.86%)
Mutual labels:  datasets
Torchcrf
An implementation of CRF (Conditional Random Fields) in PyTorch 1.0
Stars: ✭ 58 (-36.26%)
Mutual labels:  named-entity-recognition
Deepsequenceclassification
Deep neural network based model for sequence to sequence classification
Stars: ✭ 76 (-16.48%)
Mutual labels:  named-entity-recognition
Phonlp
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing (NAACL 2021)
Stars: ✭ 56 (-38.46%)
Mutual labels:  named-entity-recognition
Openml R
R package to interface with OpenML
Stars: ✭ 81 (-10.99%)
Mutual labels:  datasets
Tner
Language model fine-tuning on NER with an easy interface, and cross-domain evaluation. We released NER models fine-tuned on various domains via the Hugging Face model hub.
Stars: ✭ 54 (-40.66%)
Mutual labels:  named-entity-recognition
Wikipedia ner
📖 Labeled examples from wiki dumps in Python
Stars: ✭ 61 (-32.97%)
Mutual labels:  named-entity-recognition
Bert Bilstm Crf Pytorch
BERT-BiLSTM-CRF implemented in PyTorch for named entity recognition.
Stars: ✭ 71 (-21.98%)
Mutual labels:  named-entity-recognition
End To End Sequence Labeling Via Bi Directional Lstm Cnns Crf Tutorial
Tutorial for End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
Stars: ✭ 87 (-4.4%)
Mutual labels:  named-entity-recognition

CrossWeigh

CrossWeigh: Training Named Entity Tagger from Imperfect Annotations

Motivation

Label annotation mistakes made by human annotators pose two challenges for NER:

  • mistakes in the test set can interfere with evaluation and even lead to an inaccurate assessment of model performance.
  • mistakes in the training set can hurt NER model training.

We address these two problems by:

  • manually correcting the mistakes in the test set to form a cleaner benchmark.
  • developing the CrossWeigh framework to handle the mistakes in the training set.

CrossWeigh works with any NER algorithm that accepts weighted training instances. It is composed of two modules: 1) mistake estimation, where potential mistakes in the training data are identified through a cross-checking process, and 2) mistake reweighing, where the weights of those suspected mistakes are lowered when training the final NER model.
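
For intuition, the sketch below shows the shape of the reweighing step, assuming (as the paper describes) that a sentence's weight decays exponentially in the number of cross-checking iterations that flagged it; the function name and the ε value are illustrative, not taken from this repo's code.

```python
# Illustrative sketch of CrossWeigh-style mistake reweighing, not the
# repo's actual code. Assumption: each training sentence gets weight
# epsilon ** flag_count, where flag_count is how many cross-checking
# iterations flagged it as a potential annotation mistake, and
# epsilon < 1 (0.7 here is an assumed value) discounts flagged sentences.

def crossweigh_weights(flag_counts, epsilon=0.7):
    """Return one training weight per sentence, given per-sentence flag counts."""
    return [epsilon ** count for count in flag_counts]

# A sentence flagged in all 3 iterations drops to 0.7**3 ≈ 0.34, while a
# never-flagged sentence keeps its full weight of 1.0.
print(crossweigh_weights([0, 1, 3]))  # [1.0, 0.7, 0.343...]
```

Any NER trainer that multiplies each sentence's loss by such a per-instance weight can consume these values directly.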

Data

We formally name our corrected dataset CoNLL++.
/data/conllpp_test.txt is the manually corrected test set; there should be exactly 186 sentences that differ from the original test set.
/data/conllpp_train.txt and /data/conllpp_dev.txt are the original CoNLL03 train and development sets, taken from Named-Entity-Recognition-NER-Papers.
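
As a quick sanity check (not part of the repo), one can count the differing sentences. This sketch assumes standard CoNLL-style files, with one token per line and blank lines between sentences; the original test-set path is hypothetical, since only the corrected file ships here.

```python
# A quick sanity check, not part of the repo. Assumes standard CoNLL-style
# files: one token per line, blank lines between sentences.

def read_sentences(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                current.append(line)
            elif current:
                sentences.append(tuple(current))
                current = []
    if current:
        sentences.append(tuple(current))
    return sentences

original = read_sentences("data/conll03_test.txt")  # hypothetical path
corrected = read_sentences("data/conllpp_test.txt")
print(sum(a != b for a, b in zip(original, corrected)))  # expected: 186
```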

Scripts

split.py generates a k-fold entity-disjoint dataset from a list of datasets (usually both the train and development sets); see the sketch after this list.
flair_scripts/flair_ner.py trains a weighted version of flair.
collect.py collects all the predictions on the k held-out folds.
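
To make the entity-disjoint idea concrete, here is an illustrative sketch, not the actual split.py: unique entity surface forms are partitioned into k groups, and fold i holds out exactly the sentences containing group-i entities, so its training portion shares no entities with its held-out portion. All names and the data layout are assumptions for the sketch.

```python
# An illustration of the entity-disjoint idea, not the actual split.py.
# Assumed layout: `entities_of[s]` maps a sentence id to the set of
# entity surface forms that sentence contains.
import random

def entity_disjoint_folds(sentences, entities_of, k=10, seed=0):
    """Partition entities into k groups; fold i holds out the sentences
    containing group-i entities, so train/held-out folds share no entities."""
    all_entities = sorted({e for s in sentences for e in entities_of[s]})
    random.Random(seed).shuffle(all_entities)
    groups = [set(all_entities[i::k]) for i in range(k)]
    folds = []
    for group in groups:
        held_out = [s for s in sentences if entities_of[s] & group]
        train = [s for s in sentences if not (entities_of[s] & group)]
        folds.append((train, held_out))
    return folds
```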

Steps to reproduce

Make sure you are in a Python 3.6+ environment.
See example.sh to reproduce the results.
Using Flair (the non-pooled version), the final result should be around 93.19 F1 on the original test set and 94.18 F1 on the corrected test set. Using Flair without CrossWeigh gives around 92.9 F1.

Results

All results are averaged across 5 runs, and the standard deviation is reported.

| Model | w/o CrossWeigh (original) | w/ CrossWeigh (original) | w/o CrossWeigh (corrected) | w/ CrossWeigh (corrected) |
| --- | --- | --- | --- | --- |
| VanillaNER | 91.44 (±0.16) | 91.78 (±0.06) | 92.32 (±0.16) | 92.64 (±0.08) |
| Flair | 92.87 (±0.08) | 93.19 (±0.09) | 93.89 (±0.06) | 94.18 (±0.06) |
| Pooled-Flair | 93.14 (±0.14) | 93.43 (±0.06) | 94.13 (±0.11) | 94.28 (±0.05) |
| GCDT | 93.33 (±0.14) | 93.43 (±0.05) | 94.58 (±0.15) | 94.65 (±0.06) |
| LSTM-CRF | 90.64 (±0.23) | – | 91.47 (±0.15) | – |
| LSTM-CNNs-CRF | 90.65 (±0.57) | – | 91.87 (±0.50) | – |
| ELMo | 92.28 (±0.19) | – | 93.42 (±0.15) | – |

For all models, we use their suggested parameter settings.
For GCDT, we used the weights estimated from Pooled-Flair for efficiency purposes.

Citation

Please cite the following paper if you find our dataset or framework useful. Thanks!

Zihan Wang, Jingbo Shang, Liyuan Liu, Lihao Lu, Jiacheng Liu, and Jiawei Han. "CrossWeigh: Training Named Entity Tagger from Imperfect Annotations." arXiv preprint arXiv:1909.01441 (2019).

@article{wang2019cross,
  title={CrossWeigh: Training Named Entity Tagger from Imperfect Annotations},
  author={Wang, Zihan and Shang, Jingbo and Liu, Liyuan and Lu, Lihao and Liu, Jiacheng and Han, Jiawei},
  journal={arXiv preprint arXiv:1909.01441},
  year={2019}
}