thunlp / Few-NERD

Licence: Apache-2.0 license
Code and data of ACL 2021 paper "Few-NERD: A Few-shot Named Entity Recognition Dataset"

Few-NERD: Not Only a Few-shot NER Dataset

This is the source code of the ACL-IJCNLP 2021 paper: Few-NERD: A Few-shot Named Entity Recognition Dataset. Check out the website of Few-NERD.

************************************* Updates *************************************

  • 09/03/2022: We have added the training script for supervised training using a BERT tagger. Run bash data/download.sh supervised to download the data, and then run bash run_supervised.sh.

  • 01/09/2021: We have corrected the results of the supervised setting of Few-NERD on arXiv, thanks to the help of PedroMLF.

  • 19/08/2021: Important💥 To accompany the released episode data, we have updated the training script. Simply add --use_sampled_data when running train_demo.py to train and test on the released episode data.

  • 02/06/2021: To simplify training, we have released the data sampled by episode (see Get the Data for the download command). The files are named as follows: {train/dev/test}_{N}_{K}.jsonl. We sampled 20000, 1000 and 5000 episodes for train, dev and test, respectively.

  • 26/05/2021: The current Few-NERD (SUP) is sentence-level. We will soon release Few-NERD (SUP) 1.1, which is paragraph-level and contains more contextual information.

  • 11/06/2021: We have modified the word tokenization and we will soon update the latest results. We sincerely thank tingtingma and Chandan Akiti.

Overview

Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset. It contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens. Three benchmark tasks are built on it: one supervised task, Few-NERD (SUP), and two few-shot tasks, Few-NERD (INTRA) and Few-NERD (INTER).

The schema of Few-NERD is:

Few-NERD is manually annotated based on the context. For example, in the sentence "London is the fifth album by the British rock band…", the named entity London is labeled as Art-Music.

Requirements

Run the following command to install the remaining dependencies:

pip install -r requirements.txt

Few-NERD Dataset

Get the Data

  • Few-NERD contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens.
  • We have split the data into 3 training modes: supervised for the supervised setting, and inter and intra for the few-shot settings. Each mode contains three files: train.txt, dev.txt and test.txt. The supervised datasets are randomly split. The inter datasets are randomly split within each coarse type, i.e. each file contains all 8 coarse types but different fine-grained types. The intra datasets are randomly split by coarse type.
  • The split dataset is downloaded automatically once you run the model. If you want to download the data manually, run data/download.sh, and remember to add the parameter supervised/inter/intra to indicate the type of the dataset.

To obtain one of the three benchmark datasets of Few-NERD, run the bash file data/download.sh with the parameter supervised/inter/intra, as below:

bash data/download.sh supervised

To get the data sampled by episode, run

bash data/download.sh episode-data
unzip -d data/ data/episode-data.zip
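
Once the archive is unpacked, the episode files named {train/dev/test}_{N}_{K}.jsonl can be located and read line by line. The following is a minimal sketch rather than part of the repository: the helper names episode_file and read_episodes are hypothetical, the directory that holds the files after unzipping is passed in as root because its layout is not described here, and no assumption is made about the fields inside each JSON object.

```python
import json
from pathlib import Path


def episode_file(root, split, n_way, k_shot):
    """Build the path of an episode file named {split}_{N}_{K}.jsonl.

    `root` is whatever directory holds the unzipped files (an assumption;
    check where the archive actually unpacks on your machine).
    """
    if split not in {"train", "dev", "test"}:
        raise ValueError(f"unknown split: {split!r}")
    return Path(root) / f"{split}_{n_way}_{k_shot}.jsonl"


def read_episodes(path, limit=None):
    """Yield one decoded JSON object per non-empty line of a .jsonl file.

    `limit` caps the number of lines read (blank lines included), which is
    handy for peeking at a large file.
    """
    with open(path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if limit is not None and index >= limit:
                break
            line = line.strip()
            if line:
                yield json.loads(line)
```

Reading with limit=1 and printing the keys of the first object is a quick way to inspect the episode schema before writing a data loader against it.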

Data Format

The data are pre-processed into the typical NER format shown below, with one token and its label per line, separated by a tab (token\tlabel).

Between	O
1789	O
and	O
1793	O
he	O
sat	O
on	O
a	O
committee	O
reviewing	O
the	O
administrative	MISC-law
constitution	MISC-law
of	MISC-law
Galicia	MISC-law
to	O
little	O
effect	O
.	O
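
A file in this format can be read with a few lines of Python. The sketch below is illustrative and not the loader in util/data_loader.py; the function names are hypothetical. It assumes that sentences are separated by blank lines and that labels carry no B-/I- prefixes, so a run of adjacent tokens with the same non-O label is treated as one entity (two adjacent entities of the same type would therefore be merged).

```python
def read_sentences(lines):
    """Group token<TAB>label lines into sentences of (token, label) pairs.

    A blank line is assumed to end the current sentence.
    """
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        token, label = line.rsplit("\t", 1)
        sentence.append((token, label))
    if sentence:
        yield sentence


def entity_spans(sentence):
    """Return (start, end, label) spans, end exclusive, for non-O runs."""
    spans, start, current = [], 0, "O"
    # A trailing "O" sentinel closes an entity that ends the sentence.
    for index, (_, label) in enumerate(sentence + [("", "O")]):
        if label != current:
            if current != "O":
                spans.append((start, index, current))
            start, current = index, label
    return spans


def split_label(label):
    """Split a label such as 'MISC-law' into (coarse, fine); 'O' has no fine type."""
    if label == "O":
        return ("O", None)
    coarse, fine = label.split("-", 1)
    return (coarse, fine)
```

For the sentence above, entity_spans returns a single span covering the four tokens "administrative constitution of Galicia" with the label MISC-law.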

Structure

The structure of our project is:

--util
| -- framework.py
| -- data_loader.py
| -- viterbi.py             # viterbi decoder for structshot only
| -- word_encoder
| -- fewshotsampler.py

-- proto.py                 # prototypical model
-- nnshot.py                # nnshot model

-- train_demo.py            # main training script

Key Implementations

Sampler

As described in our paper, we design an N-way K~2K-shot sampling strategy. The implementation is at util/fewshotsampler.py.
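
The idea behind the K~2K constraint is that a sentence may contain several entities, so an exact K-shot set is hard to build; instead, sentences are added greedily until every one of the N classes has at least K and at most 2K entities. The following is a simplified sketch of that idea and not the code in util/fewshotsampler.py. The function name, the max_tries safeguard and the input representation (one dict of class counts per sentence) are assumptions made for illustration, and the N target classes are taken as given.

```python
import random


def sample_support(sentences, target_classes, k, rng, max_tries=10_000):
    """Greedily pick sentences until every target class has K..2K entities.

    sentences: list of dicts mapping class name -> entity count in that sentence.
    Returns (indices of chosen sentences, per-class entity totals).
    """
    totals = {c: 0 for c in target_classes}
    chosen = []
    tries = 0
    while any(count < k for count in totals.values()):
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not build a K~2K episode from this pool")
        index = rng.randrange(len(sentences))
        counts = sentences[index]
        if index in chosen or not counts:
            continue  # already used, or no entities at all
        if any(c not in totals for c in counts):
            continue  # contains a class outside the N target classes
        if any(totals[c] + n > 2 * k for c, n in counts.items()):
            continue  # would push some class above 2K
        if not any(totals[c] < k for c in counts):
            continue  # adds nothing to a class that still needs entities
        chosen.append(index)
        for c, n in counts.items():
            totals[c] += n
    return chosen, totals
```

Passing a seeded random.Random instance as rng keeps the sampled episodes reproducible.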

ProtoBERT

ProtoBERT (prototypical networks with a BERT encoder) is implemented in model/proto.py.
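
In a prototypical network, each class is represented by the mean of the support token embeddings that carry its label, and a query token is assigned to the class with the closest prototype. The sketch below shows only that classification rule on plain Python lists; it is not the code in model/proto.py, the function names are hypothetical, and the embeddings would in practice come from the BERT encoder. The use_dot switch mirrors the idea of the --dot option (dot product instead of L2 distance).

```python
def build_prototypes(support_embeddings, support_labels):
    """Average the support embeddings of each label into one prototype."""
    grouped = {}
    for embedding, label in zip(support_embeddings, support_labels):
        grouped.setdefault(label, []).append(embedding)
    prototypes = {}
    for label, vectors in grouped.items():
        dim = len(vectors[0])
        prototypes[label] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return prototypes


def similarity(query, prototype, use_dot=False):
    """Higher is closer: dot product, or negative squared L2 distance."""
    if use_dot:
        return sum(q * p for q, p in zip(query, prototype))
    return -sum((q - p) ** 2 for q, p in zip(query, prototype))


def classify(query, prototypes, use_dot=False):
    """Return the label whose prototype is most similar to the query."""
    return max(
        prototypes,
        key=lambda label: similarity(query, prototypes[label], use_dot),
    )
```

The two similarity functions can disagree: a prototype with a large norm tends to win under the dot product even when another prototype is nearer in L2 distance.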

NNShot & StructShot

NNShot with BERT is implemented in model/nnshot.py.
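
NNShot replaces the class prototype with a nearest-neighbour rule at the token level: the score of a class for a query token is determined by the closest support token of that class. The sketch below illustrates that rule on plain Python lists. It is not the code in model/nnshot.py, the function names are hypothetical, and squared L2 distance is assumed as the metric.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_scores(query, support_embeddings, support_labels):
    """Per class, the negative distance to its nearest support token."""
    best = {}
    for embedding, label in zip(support_embeddings, support_labels):
        distance = squared_l2(query, embedding)
        if label not in best or distance < best[label]:
            best[label] = distance
    return {label: -distance for label, distance in best.items()}


def classify_nn(query, support_embeddings, support_labels):
    """Return the label of the class with the nearest support token."""
    scores = nearest_neighbor_scores(query, support_embeddings, support_labels)
    return max(scores, key=scores.get)
```

Unlike a prototype, this rule is not pulled towards the class mean, so a class whose support tokens are spread far apart (as is often the case for O) does not gain an artificially central representative.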

StructShot is realized by adding an extra viterbi decoder in util/framework.py.

Note that the backbone BERT encoder we use for the StructShot model is not pre-trained on an NER task.
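
The decoding step that StructShot adds on top of the per-token scores can be pictured as a standard Viterbi search over tag sequences. The sketch below is a generic log-space Viterbi decoder, not the code in util/viterbi.py, and the function names are hypothetical. The temper_rows helper shows one plausible reading of what a parameter such as --tau does (raise each transition probability to the power tau, then renormalize each row); that reading is an assumption and should be checked against the repository.

```python
import math


def temper_rows(probabilities, tau):
    """Raise each probability to the power tau, then renormalize each row."""
    tempered = []
    for row in probabilities:
        powered = [p ** tau for p in row]
        total = sum(powered)
        tempered.append([p / total for p in powered])
    return tempered


def viterbi(emissions, transitions, start):
    """Return the highest-scoring tag path.

    emissions[t][j]: log-score of tag j at position t
    transitions[i][j]: log-score of moving from tag i to tag j
    start[j]: log-score of starting in tag j
    """
    n_tags = len(start)
    score = [start[j] + emissions[0][j] for j in range(n_tags)]
    backpointers = []
    for step in emissions[1:]:
        pointers, new_score = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + step[j])
        backpointers.append(pointers)
        score = new_score
    # Follow the back-pointers from the best final tag to recover the path.
    tag = max(range(n_tags), key=lambda j: score[j])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    path.reverse()
    return path
```

With transitions that strongly favour staying in the same tag, the decoder can override a weak per-token preference, which is the effect a structured decoder is meant to have over independent token-level predictions.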

How to Run

Run train_demo.py. The arguments are presented below. The default parameters are for the proto model on the inter mode dataset.

-- mode                 training mode, must be inter, intra, or supervised
-- trainN               N in train
-- N                    N in val and test
-- K                    K shot
-- Q                    Num of query per class
-- batch_size           batch size
-- train_iter           num of iters in training
-- val_iter             num of iters in validation
-- test_iter            num of iters in testing
-- val_step             run validation after every this many training iters
-- model                model name, must be proto, nnshot or structshot
-- max_length           max length of tokenized sentence
-- lr                   learning rate
-- weight_decay         weight decay
-- grad_iter            accumulate gradient every x iterations
-- load_ckpt            path to load model
-- save_ckpt            path to save model
-- fp16                 use nvidia apex fp16
-- only_test            no training process, only test
-- ckpt_name            checkpoint name
-- seed                 random seed
-- pretrain_ckpt        bert pre-trained checkpoint
-- dot                  use dot instead of L2 distance in distance calculation
-- use_sgd_for_bert     use SGD instead of AdamW for BERT.
# only for structshot
-- tau                  StructShot parameter to re-normalize the transition probabilities
  • For hyperparameter --tau in structshot, we use 0.32 in 1-shot setting, 0.318 for 5-way-5-shot setting, and 0.434 for 10-way-5-shot setting.

  • Taking the structshot model on the inter dataset as an example, the experiments can be run as follows.

5-way-1~5-shot

python3 train_demo.py  --mode inter \
--lr 1e-4 --batch_size 8 --trainN 5 --N 5 --K 1 --Q 1 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 64 --model structshot --tau 0.32

5-way-5~10-shot

python3 train_demo.py  --mode inter \
--lr 1e-4 --batch_size 1 --trainN 5 --N 5 --K 5 --Q 5 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 32 --model structshot --tau 0.318

10-way-1~5-shot

python3 train_demo.py  --mode inter \
--lr 1e-4 --batch_size 4 --trainN 10 --N 10 --K 1 --Q 1 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 64 --model structshot --tau 0.32

10-way-5~10-shot

python3 train_demo.py  --mode inter \
--lr 1e-4 --batch_size 1 --trainN 10 --N 10 --K 5 --Q 1 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 32 --model structshot --tau 0.434
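
When several settings are run in sequence, the four commands above can be generated from their parameters. The helper below is a convenience sketch and not part of the repository: the function names are hypothetical, the --tau values are the ones quoted above, and the fixed iteration counts and learning rate are copied from the example commands.

```python
import shlex


def structshot_tau(n_way, k_shot):
    """Return the --tau value quoted above for a given setting."""
    if k_shot == 1:
        return 0.32
    table = {(5, 5): 0.318, (10, 5): 0.434}
    if (n_way, k_shot) not in table:
        raise ValueError(f"no tau listed for {n_way}-way {k_shot}-shot")
    return table[(n_way, k_shot)]


def build_command(mode, n_way, k_shot, query, batch_size, max_length):
    """Assemble the train_demo.py argument list for a StructShot run."""
    return [
        "python3", "train_demo.py",
        "--mode", mode,
        "--lr", "1e-4",
        "--batch_size", str(batch_size),
        "--trainN", str(n_way),
        "--N", str(n_way),
        "--K", str(k_shot),
        "--Q", str(query),
        "--train_iter", "10000",
        "--val_iter", "500",
        "--test_iter", "5000",
        "--val_step", "1000",
        "--max_length", str(max_length),
        "--model", "structshot",
        "--tau", str(structshot_tau(n_way, k_shot)),
    ]


if __name__ == "__main__":
    # Print the first example command; pass the list to subprocess.run to execute it.
    print(shlex.join(build_command("inter", 5, 1, 1, 8, 64)))
```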

Citation

If you use Few-NERD in your work, please cite our paper:

@inproceedings{ding-etal-2021-nerd,
    title = "Few-{NERD}: A Few-shot Named Entity Recognition Dataset",
    author = "Ding, Ning  and
      Xu, Guangwei  and
      Chen, Yulin  and
      Wang, Xiaobin  and
      Han, Xu  and
      Xie, Pengjun  and
      Zheng, Haitao  and
      Liu, Zhiyuan",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.248",
    doi = "10.18653/v1/2021.acl-long.248",
    pages = "3198--3213",
}

License

The Few-NERD dataset is distributed under the CC BY-SA 4.0 license. The code is distributed under the Apache 2.0 license.

Connection

If you have any questions, feel free to contact
