
CLUEbenchmark / Cluener2020

CLUENER2020: Fine-Grained Named Entity Recognition for Chinese

Programming Languages

python


Labels

dataset chinese named-entity-recognition seq2seq ner sequence-labeling sequence-to-sequence

Projects that are alternatives or similar to Cluener2020

Kashgari

Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Stars: ✭ 2,235 (+224.38%)

Mutual labels: named-entity-recognition, seq2seq, ner, sequence-labeling

Named entity recognition

Chinese named entity recognition (with implementations of several models: HMM, CRF, BiLSTM, BiLSTM+CRF)

Stars: ✭ 995 (+44.41%)

Mutual labels: named-entity-recognition, ner, sequence-labeling

Ncrfpp

NCRF++, a Neural Sequence Labeling Toolkit. Easy use to any sequence labeling tasks (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.

Stars: ✭ 1,767 (+156.46%)

Mutual labels: named-entity-recognition, ner, sequence-labeling

Autoner

Learning Named Entity Tagger from Domain-Specific Dictionary

Stars: ✭ 357 (-48.19%)

Mutual labels: named-entity-recognition, ner, sequence-labeling

Chinesener

Chinese named entity recognition and entity extraction, TensorFlow, PyTorch, BiLSTM+CRF

Stars: ✭ 938 (+36.14%)

Mutual labels: chinese, named-entity-recognition, ner

Ld Net

Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling

Stars: ✭ 148 (-78.52%)

Mutual labels: named-entity-recognition, ner, sequence-labeling

Bond

BOND: BERT-Assisted Open-Domain Name Entity Recognition with Distant Supervision

Stars: ✭ 96 (-86.07%)

Mutual labels: dataset, named-entity-recognition, ner

CrossNER

CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

Stars: ✭ 87 (-87.37%)

Mutual labels: named-entity-recognition, ner, sequence-labeling

Vncorenlp

A Vietnamese natural language processing toolkit (NAACL 2018)

Stars: ✭ 354 (-48.62%)

Mutual labels: named-entity-recognition, ner

Spacy Streamlit

👑 spaCy building blocks and visualizers for Streamlit apps

Stars: ✭ 360 (-47.75%)

Mutual labels: named-entity-recognition, ner

Bert Multitask Learning

BERT for Multitask Learning

Stars: ✭ 380 (-44.85%)

Mutual labels: named-entity-recognition, ner

Bert Bilstm Crf Ner

Tensorflow solution of NER task Using BiLSTM-CRF model with Google BERT Fine-tuning And private Server services

Stars: ✭ 3,838 (+457.04%)

Mutual labels: named-entity-recognition, ner

Snips Nlu

Snips Python library to extract meaning from text

Stars: ✭ 3,583 (+420.03%)

Mutual labels: named-entity-recognition, ner

Pytorch Chatbot

Pytorch seq2seq chatbot

Stars: ✭ 336 (-51.23%)

Mutual labels: seq2seq, sequence-to-sequence

Neural sp

End-to-end ASR/LM implementation with PyTorch

Stars: ✭ 408 (-40.78%)

Mutual labels: seq2seq, sequence-to-sequence

Neuronlp2

Deep neural models for core NLP tasks (Pytorch version)

Stars: ✭ 397 (-42.38%)

Mutual labels: named-entity-recognition, sequence-labeling

Phobert

PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)

Stars: ✭ 332 (-51.81%)

Mutual labels: named-entity-recognition, ner

Tf Seq2seq

Sequence to sequence learning using TensorFlow.

Stars: ✭ 387 (-43.83%)

Mutual labels: seq2seq, sequence-to-sequence

Jionlp

A preprocessing toolkit for Chinese NLP tasks: accurate, efficient, with zero barrier to use

Stars: ✭ 449 (-34.83%)

Mutual labels: chinese, ner

Cluepretrainedmodels

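A collection of high-quality Chinese pre-trained models: state-of-the-art large models, the fastest small models, and specialized similarity models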

Stars: ✭ 493 (-28.45%)

Mutual labels: chinese, dataset


CLUENER: Fine-Grained Named Entity Recognition

For more details, see our technical report: https://arxiv.org/abs/2001.04351

Data categories:

The data has 10 label categories: address, book, company, game, government, movie, name, organization, position, and scene.

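Label definitions & annotation rules:

address: ** province, ** city, ** district, ** street, number **; ** road, ** subdistrict, ** village, etc. (also annotated when they appear on their own). Addresses are annotated as completely as possible, down to the finest level of detail.
book: novels, magazines, exercise books, textbooks, study guides, atlases, cookbooks, and any other kind of book that can be bought in a bookstore, including e-books.
company: ** Company, ** Group, ** Bank (except 央行, the central bank, and 中国人民银行, the People's Bank of China, both of which count as government bodies), e.g. 新东方 (New Oriental); also includes 新华网 (Xinhuanet), 中国军网 (China Military Online), etc.
game: common games. Note that some games are adapted from novels or TV series, so check the specific context to decide whether the mention really refers to a game.
government: covers both central and local administrative bodies. Central administrative bodies include the State Council, its constituent departments (the ministries, commissions, the People's Bank of China, and the National Audit Office), agencies directly under the State Council (such as the customs, taxation, industry and commerce, and environmental protection administrations), the military, etc.
movie: films, including documentaries shown in cinemas. If a film is adapted from a book, use the surrounding context to decide whether the mention is the film title or the book title.
name: generally personal names, including characters in novels (宋江, 武松, 郭靖), nicknames of characters in novels (及时雨, 花和尚), and aliases of well-known people, provided the alias can be mapped to a specific person.
organization: basketball teams, football teams, orchestras, clubs, etc.; also includes sects and gangs in novels, such as 少林寺, 丐帮, 铁掌帮, 武当, 峨眉.
position: historical titles such as 巡抚 (provincial governor), 知州 (prefect), 国师 (state preceptor); modern ones such as general manager, reporter, president, artist, collector.
scene: common tourist attractions such as 长沙公园 (Changsha Park), 深圳动物园 (Shenzhen Zoo), aquariums, botanical gardens, the Yellow River, the Yangtze River.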

Data download link:

Data download

Data distribution:

Training set: 10748
Validation set: 1343

Broken down by label category, the data is distributed as follows (note: every entity that occurs in a record is annotated; if a record contains two address entities, it counts as two toward the address total):
[Training set] label distribution:
address: 2829
book: 1131
company: 2897
game: 2325
government: 1797
movie: 1109
name: 3661
organization: 3075
position: 3052
scene: 1462

[Validation set] label distribution:
address: 364
book: 152
company: 366
game: 287
government: 244
movie: 150
name: 451
organization: 344
position: 425
scene: 199
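
The counting rule above (every occurrence of an entity counts once) can be sketched as follows. This is a minimal illustration, not the authors' statistics script: `count_mentions` is a hypothetical helper, the two sample records are made up, and the code assumes the file stores one JSON object per line with the `label` structure documented in the data fields section of this README.

```python
import json
from collections import Counter


def count_mentions(lines):
    """Count entity occurrences per category; every annotated span counts once."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for category, mentions in record.get("label", {}).items():
            for spans in mentions.values():
                counts[category] += len(spans)
    return counts


# Two made-up records for illustration (not taken from the dataset).
sample = [
    '{"text": "从北京到上海", "label": {"address": {"北京": [[1, 2]], "上海": [[4, 5]]}}}',
    '{"text": "张三在北京", "label": {"name": {"张三": [[0, 1]]}, "address": {"北京": [[3, 4]]}}}',
]
counts = count_mentions(sample)
print(dict(counts))
```

To run it on the real data, pass an open file handle instead of `sample`, for example `count_mentions(open("train.json", encoding="utf-8"))`, again assuming one record per line.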

Data fields:

Taking train.json as an example, the data has two columns, text and label. The text column holds the text, and the label column holds every entity in the text that belongs to one of the 10 categories.
For example:
  text: "北京勘察设计协会副会长兼秘书长周荫如"
  label: {"organization": {"北京勘察设计协会": [[0, 7]]}, "name": {"周荫如": [[15, 17]]}, "position": {"副会长": [[8, 10]], "秘书长": [[12, 14]]}}
  Here organization, name, and position are the entity categories.
  "organization": {"北京勘察设计协会": [[0, 7]]}: in the original text, "北京勘察设计协会" is an entity of category organization, with start_index 0 and end_index 7 (note: indices are counted from 0).
  "name": {"周荫如": [[15, 17]]}: in the original text, "周荫如" is an entity of category name, with start_index 15 and end_index 17.
  "position": {"副会长": [[8, 10]], "秘书长": [[12, 14]]}: in the original text, "副会长" is an entity of category position, with start_index 8 and end_index 10; "秘书长" is also an entity of category position, with start_index 12 and end_index 14.
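
As a quick check of the format, the example record above can be parsed and each span compared with the text it points to. This is a minimal sketch: `iter_entities` is a hypothetical helper, and storing each record as a single JSON object is an assumption about the file layout. The example indices imply that end_index is inclusive, so a Python slice needs `end + 1`.

```python
import json

# The example record from this README, written as one JSON object
# (the one-object-per-line layout is an assumption).
line = (
    '{"text": "北京勘察设计协会副会长兼秘书长周荫如", '
    '"label": {"organization": {"北京勘察设计协会": [[0, 7]]}, '
    '"name": {"周荫如": [[15, 17]]}, '
    '"position": {"副会长": [[8, 10]], "秘书长": [[12, 14]]}}}'
)


def iter_entities(record):
    """Yield (category, mention, start, end) for every annotated span."""
    for category, mentions in record.get("label", {}).items():
        for mention, spans in mentions.items():
            for start, end in spans:
                yield category, mention, start, end


record = json.loads(line)
entities = sorted(iter_entities(record), key=lambda e: e[2])
for category, mention, start, end in entities:
    # end is inclusive, so slice up to end + 1.
    assert record["text"][start:end + 1] == mention
    print(category, mention, start, end)
```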

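Data source:

This dataset was built by selecting part of THUCTC, the open-source text classification dataset from Tsinghua University, and annotating it with fine-grained named entities. The original data comes from Sina News RSS.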

Results comparison

Model	F1 (online leaderboard)
BERT-base	78.82
RoBERTa-wwm-large-ext	80.42
BiLSTM + CRF	70.00

Evaluation results per entity type (F1 score):

Entity	BiLSTM+CRF	BERT-base	RoBERTa-wwm-large-ext	Human Performance
Person Name	74.04	88.75	89.09	74.49
Organization	75.96	79.43	82.34	65.41
Position	70.16	78.89	79.62	55.38
Company	72.27	81.42	83.02	49.32
Address	45.50	60.89	62.63	43.04
Game	85.27	86.42	86.80	80.39
Government	77.25	87.03	88.17	79.27
Scene	52.42	65.10	70.49	51.85
Book	67.20	73.68	74.60	71.70
Movie	78.97	85.82	87.46	63.21
Overall	70.00	78.82	80.42	63.41
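
For readers who want to compute a per-category F1 on their own predictions, here is a minimal sketch of exact-match, span-level scoring. It is an assumption about the metric, not the official evaluation script: `span_set` and `f1_for_category` are hypothetical helpers, and a prediction counts as correct only when category, start, and end all match the gold annotation.

```python
def span_set(label):
    """Flatten a CLUENER-style label dict into a set of (category, start, end)."""
    return {
        (category, start, end)
        for category, mentions in label.items()
        for spans in mentions.values()
        for start, end in spans
    }


def f1_for_category(gold_labels, pred_labels, category):
    """Exact-match precision, recall, and F1 for one entity category."""
    gold, pred = set(), set()
    for i, (g, p) in enumerate(zip(gold_labels, pred_labels)):
        # Include the record index so identical spans in different
        # records are not merged.
        gold |= {(i, s, e) for c, s, e in span_set(g) if c == category}
        pred |= {(i, s, e) for c, s, e in span_set(p) if c == category}
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Gold labels from the README example; the predictions are made up.
gold_labels = [{"position": {"副会长": [[8, 10]], "秘书长": [[12, 14]]},
                "name": {"周荫如": [[15, 17]]}}]
pred_labels = [{"position": {"副会长": [[8, 10]]},
                "name": {"周荫如": [[15, 16]]}}]

print(f1_for_category(gold_labels, pred_labels, "position"))
print(f1_for_category(gold_labels, pred_labels, "name"))
```

In this toy case the position prediction finds one of two gold spans (precision 1.0, recall 0.5), while the name prediction has a wrong end index and therefore scores zero under exact matching.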

Baseline models (one-click run)

1. TensorFlow version, BERT family: tf_version (test, F1 80.42)

2. PyTorch baseline: pytorch_version (79.63)

3. BiLSTM+CRF baseline: bilstm+crf (test, F1 70.0)
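
Sequence-labeling models such as BiLSTM+CRF are usually trained on one tag per character, so the span annotations need to be converted first. The sketch below shows one common way to do that with BIO tags. It is illustrative only and not necessarily what the linked baselines do: `to_bio` is a hypothetical helper, and it assumes spans do not overlap and that end indices are inclusive.

```python
def to_bio(text, label):
    """Convert CLUENER-style span annotations into one BIO tag per character."""
    tags = ["O"] * len(text)
    for category, mentions in label.items():
        for spans in mentions.values():
            for start, end in spans:
                tags[start] = "B-" + category
                for i in range(start + 1, end + 1):  # end is inclusive
                    tags[i] = "I-" + category
    return tags


text = "北京勘察设计协会副会长兼秘书长周荫如"
label = {
    "organization": {"北京勘察设计协会": [[0, 7]]},
    "name": {"周荫如": [[15, 17]]},
    "position": {"副会长": [[8, 10]], "秘书长": [[12, 14]]},
}
tags = to_bio(text, label)
for char, tag in zip(text, tags):
    print(char, tag)
```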

QQ group for technical exchange and discussion: 836811304

Cite Us

If the contents of this repository help your research, please cite the following report in your publications: https://arxiv.org/abs/2001.04351

@article{xu2020cluener2020,
  title={CLUENER2020: Fine-grained Name Entity Recognition for Chinese},
  author={Xu, Liang and Dong, Qianqian and Yu, Cong and Tian, Yin and Liu, Weitang and Li, Lu and Zhang, Xuanwei},
  journal={arXiv preprint arXiv:2001.04351},
  year={2020}
 }
