Alexzsh / FDDC

Licence: Apache-2.0 license
Named Entity Recognition & Relation Extraction (named entity recognition and relation classification)

Programming Languages: HTML, Jupyter Notebook, Python


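Overall processing pipeline

process: the structure extraction, text preprocessing, and entity recognition stages are implemented so far.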

*.html

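dom: unstructured data can be extracted from each HTML file by walking its DOM tree. The DOM-level constraints are shown in the figure.

  • Images are ignored entirely.
  • Tables are not processed yet.
  • The title sub-attribute of type tags needs to be extracted.
  • hidden tags can help decide whether a table spans pages, which is useful for merging later.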
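
The rules above can be sketched with Python's standard `html.parser`. This is only an illustration: the class name `AnnouncementParser`, the sample markup, and the assumption that `type` and `title` sit as attributes on the same element are not taken from the project code, and the `hidden` rule is left out.

```python
from html.parser import HTMLParser


class AnnouncementParser(HTMLParser):
    """Collect text and `title` attributes while skipping tables (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []       # `title` values of elements that carry a `type` attribute
        self.texts = []        # text found outside tables
        self.table_depth = 0   # > 0 while inside a <table>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self.table_depth += 1      # tables are not processed yet
        elif tag == "img":
            return                     # images are ignored entirely
        elif self.table_depth == 0 and "type" in attrs and attrs.get("title"):
            self.titles.append(attrs["title"])

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.table_depth == 0:
            self.texts.append(text)


def extract(html):
    """Return (titles, texts) for one HTML document."""
    parser = AnnouncementParser()
    parser.feed(html)
    parser.close()
    return parser.titles, parser.texts


# Made-up markup, only to exercise the rules.
sample = (
    '<div type="paragraph" title="Section 1">'
    '<div type="content">Company A won the bid.</div>'
    '<img src="seal.png"><table><tr><td>skipped</td></tr></table>'
    '</div>'
)
titles, texts = extract(sample)
```

Here `titles` holds the one `title` value found and `texts` holds only the sentence outside the table.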

*.train

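Each file comes with the corresponding field values, which are used to back-annotate the text into BIO format. Primary keys are labelled only when they appear in the same sentence, and other fields are labelled only when they share a sentence with one of the primary keys.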

1153: 
    - 国家电网公司	
    - 青岛汉缆股份有限公司	
    - 国家电网公司输变电项目哈密南-郑州±800千伏特高压直流输电线路工程导线施工标段(二)导地线招标活动		
    - 169287975.4	
    - 169287975.4
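
A minimal sketch of this back-annotation step, under stated assumptions: the function name is hypothetical, `jiafang` and `yifang` are treated as the primary keys, "in the same sentence" is read as "all primary keys occur in the sentence", and only the first occurrence of each field value is labelled. The project's actual rules may differ.

```python
def bio_tag_sentence(sentence, fields, primary_keys=("jiafang", "yifang")):
    """Return one BIO tag per character of `sentence`.

    fields: mapping from entity type to its known surface string.
    primary_keys: types treated as primary keys (an assumption of this sketch).
    """
    tags = ["O"] * len(sentence)
    present = {t for t, v in fields.items() if v and v in sentence}

    # Primary keys are labelled only when all of them occur in this sentence.
    keys_ok = all(k in present for k in primary_keys)
    # Other fields are labelled only when at least one primary key co-occurs.
    any_key = any(k in present for k in primary_keys)

    # Longest values first, so a short value cannot claim part of a longer one.
    for etype in sorted(present, key=lambda t: (-len(fields[t]), t)):
        if etype in primary_keys and not keys_ok:
            continue
        if etype not in primary_keys and not any_key:
            continue
        start = sentence.find(fields[etype])
        end = start + len(fields[etype])
        if any(t != "O" for t in tags[start:end]):
            continue  # do not overwrite an overlapping annotation
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags


fields = {"jiafang": "国家电网公司", "yifang": "青岛汉缆股份有限公司"}
both = bio_tag_sentence("国家电网公司与青岛汉缆股份有限公司签订合同", fields)
only_one = bio_tag_sentence("国家电网公司发布公告", fields)
```

In the first sentence both parties are labelled; in the second, only one primary key is present, so every character stays `O`.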

NER log

BiLSTM+CRF

Network architecture

Network Hyperparameters

{
    "model_type": "bilstm", # model type
    "num_chars": 3404, # number of distinct characters
    "char_dim": 100, # embedding size for characters
    "num_tags": 13, # number of entity tags
    "seg_dim": 20, # embedding size for segmentation features
    "lstm_dim": 100, # number of LSTM hidden units
    "batch_size": 5, # training batch size
    "emb_file": "data\\vec.txt", # pre-trained embedding file
    "clip": 5.0, # gradient clipping threshold (against exploding gradients)
    "dropout_keep": 0.5,
    "optimizer": "adam",
    "lr": 0.001,
    "tag_schema": "iobes",
    "pre_emb": true, # whether to use pre-trained embeddings
    "zeros": false, # whether to replace digits with zero
    "lower": true # whether to lowercase the input
}
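
The configuration selects the `iobes` tag schema, while the training data is back-annotated in BIO format, so a conversion between the two is presumably applied somewhere in the pipeline. The sketch below shows one common way to do it; the function name is hypothetical and this is not necessarily the project's own code.

```python
def bio_to_iobes(tags):
    """Convert a BIO tag sequence to IOBES (S = single-token entity, E = entity end)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + etype   # does the same entity go on?
        if prefix == "B":
            out.append(("B-" if continues else "S-") + etype)
        elif prefix == "I":
            out.append(("I-" if continues else "E-") + etype)
        else:
            raise ValueError(f"unexpected tag: {tag}")
    return out


example = ["B-jiafang", "I-jiafang", "I-jiafang", "O", "B-yifang"]
converted = bio_to_iobes(example)
# converted == ["B-jiafang", "I-jiafang", "E-jiafang", "O", "S-yifang"]
```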

train.log

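  • First attempt
    • Fields are labelled without restriction. Because of hardware limits, each sentence is truncated, with the tail cut off right after the last field.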
2018-09-10 10:22:48,610 - log\train.log - INFO - iteration:42 step:190/210, NER loss: 2.333041
2018-09-10 10:25:02,419 - log\train.log - INFO - evaluate:dev
2018-09-10 10:25:23,958 - log\train.log - INFO - processed 463933 tokens with 1493 phrases; found: 1195 phrases; correct: 893.

2018-09-10 10:25:23,959 - log\train.log - INFO - accuracy:  97.17%; precision:  74.73%; recall:  59.81%; FB1:  66.44

2018-09-10 10:25:23,960 - log\train.log - INFO -            hetong: precision:  76.03%; recall:  62.16%; FB1:  68.40  121

2018-09-10 10:25:23,961 - log\train.log - INFO -           jiafang: precision:  70.96%; recall:  67.63%; FB1:  69.26  427

2018-09-10 10:25:23,963 - log\train.log - INFO -           xiangmu: precision:  59.83%; recall:  48.11%; FB1:  53.33  234

2018-09-10 10:25:23,963 - log\train.log - INFO -            yifang: precision:  86.68%; recall:  59.08%; FB1:  70.26  413

2018-09-10 10:25:23,972 - log\train.log - INFO - evaluate:test
2018-09-10 10:25:51,435 - log\train.log - INFO - processed 695432 tokens with 1456 phrases; found: 1435 phrases; correct: 1014.

2018-09-10 10:25:51,436 - log\train.log - INFO - accuracy:  98.14%; precision:  70.66%; recall:  69.64%; FB1:  70.15

2018-09-10 10:25:51,438 - log\train.log - INFO -            hetong: precision:  61.82%; recall:  54.84%; FB1:  58.12  110

2018-09-10 10:25:51,438 - log\train.log - INFO -           jiafang: precision:  64.65%; recall:  69.57%; FB1:  67.02  495

2018-09-10 10:25:51,439 - log\train.log - INFO -           xiangmu: precision:  46.12%; recall:  41.61%; FB1:  43.75  258

2018-09-10 10:25:51,439 - log\train.log - INFO -            yifang: precision:  88.64%; recall:  86.52%; FB1:  87.56  572

{
  'string': '美尚生态系观股份有限公司(以下简称“公司")于近日收到招标人江苏省无锡惠山经济开发区委员会发来的《中标通知书,通知书确认
公司为“无锡古庄生态农业科技园PPPI项目"(以下器称“本项目”)的中标人', 
  'entities': [
      {'word': '江苏省无锡惠山经济开发区委员会', 'start': 29, 'end': 47, 'type': 'jiafang'}, 
      {'word': '无锡古庄生态农业科技园PPPI项目', 'start': 66, 'end': 83, 'type': 'xiangmu'}
      ]
}
epoch 42
loss 2.33
test f1 70.15
dev f1 66.44

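Testing on this sentence shows that, with this truncation scheme, the model finds Party A (jiafang) and the project (xiangmu) but misses Party B (yifang).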
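
The overall scores in the evaluation lines follow directly from the phrase counts logged just before them. As a small worked check (the helper name `prf` is made up for this example), the dev figures of the first attempt can be reproduced from "1493 phrases; found: 1195 phrases; correct: 893":

```python
def prf(gold, found, correct):
    """Return (precision, recall, F1) in percent from phrase counts."""
    precision = 100.0 * correct / found if found else 0.0
    recall = 100.0 * correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Dev evaluation of the first attempt: 1493 gold phrases, 1195 found, 893 correct.
p, r, f = prf(gold=1493, found=1195, correct=893)
print(f"precision: {p:.2f}%; recall: {r:.2f}%; FB1: {f:.2f}")
# prints "precision: 74.73%; recall: 59.81%; FB1: 66.44"
```

Note that the `accuracy` column is token-level and dominated by `O` tags, which is why it stays near 97% even when phrase-level recall is below 60%.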

  • Second attempt
    • Fields are again labelled without restriction; this time ten O-tagged characters are kept before and after.
2018-09-11 03:50:25,074 - log\train.log - INFO - iteration:101 step:0/419, NER loss: 0.351078
processed 84351 tokens with 721 phrases; found: 731 phrases; correct: 602.

2018-09-11 03:50:28,917 - log\train.log - INFO - accuracy:  96.61%; precision:  82.35%; recall:  83.50%; FB1:  82.92

2018-09-11 03:50:28,917 - log\train.log - INFO -            hetong: precision:  94.55%; recall:  91.23%; FB1:  92.86  55

2018-09-11 03:50:28,917 - log\train.log - INFO -           jiafang: precision:  80.87%; recall:  77.18%; FB1:  78.98  230

2018-09-11 03:50:28,917 - log\train.log - INFO -           xiangmu: precision:  65.56%; recall:  75.00%; FB1:  69.96  151

2018-09-11 03:50:28,917 - log\train.log - INFO -            yifang: precision:  89.83%; recall:  91.07%; FB1:  90.44  295

2018-09-11 03:50:28,922 - log\train.log - INFO - evaluate:test
2018-09-11 03:50:36,830 - log\train.log - INFO - processed 178802 tokens with 1449 phrases; found: 1500 phrases; correct: 1208.

2018-09-11 03:50:36,835 - log\train.log - INFO - accuracy:  96.58%; precision:  80.53%; recall:  83.37%; FB1:  81.93

2018-09-11 03:50:36,835 - log\train.log - INFO -            hetong: precision:  88.00%; recall:  88.71%; FB1:  88.35  125

2018-09-11 03:50:36,835 - log\train.log - INFO -           jiafang: precision:  84.57%; recall:  83.66%; FB1:  84.11  460

2018-09-11 03:50:36,835 - log\train.log - INFO -           xiangmu: precision:  61.95%; recall:  71.64%; FB1:  66.44  318

2018-09-11 03:50:36,835 - log\train.log - INFO -            yifang: precision:  85.76%; recall:  87.52%; FB1:  86.63  597
{
    'string': '美尚生态系观股份有限公司(以下简称“公司")于近日收到招标人江苏省无锡惠山经济开发区委员会发来的《中标通知书,通知书确认 公
司为“无锡古庄生态农业科技园PPPI项目"(以下器称“本项目”)的中标人', 
    'entities': [
        {'word': '江苏省无锡惠山经济开发区委员会', 'start':30, 'end': 45, 'type': 'jiafang'}
        ]
}
epoch 101
loss 0.35
test f1 81.93
dev f1 82.92
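  • Compared with the previous attempt, the dev and test F1 scores are closer to each other, so the model is comparatively more robust.
  • Because training ran for more iterations, the loss is lower.
  • With this truncation scheme, however, only Party A (jiafang) is recognized.
  • Third attempt
    • Fields are labelled only when the primary keys appear in the same sentence; batch_size is reduced to 2, which makes training comparatively slow.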
2018-09-13 23:58:51,355 - log\train.log - INFO - iteration:44 step:830/840, NER loss: 1.698312
2018-09-13 23:59:37,254 - log\train.log - INFO - evaluate:dev
2018-09-14 00:00:10,623 - log\train.log - INFO - processed 819272 tokens with 2354 phrases; found: 2151 phrases; correct: 1770.

2018-09-14 00:00:10,624 - log\train.log - INFO - accuracy:  98.30%; precision:  82.29%; recall:  75.19%; FB1:  78.58

2018-09-14 00:00:10,625 - log\train.log - INFO -            hetong: precision:  85.09%; recall:  63.13%; FB1:  72.49  161

2018-09-14 00:00:10,626 - log\train.log - INFO -           jiafang: precision:  81.05%; recall:  80.70%; FB1:  80.87  686

2018-09-14 00:00:10,626 - log\train.log - INFO -           xiangmu: precision:  73.39%; recall:  68.23%; FB1:  70.72  436

2018-09-14 00:00:10,626 - log\train.log - INFO -            yifang: precision:  87.21%; recall:  77.32%; FB1:  81.97  868

2018-09-14 00:00:10,641 - log\train.log - INFO - evaluate:test
2018-09-14 00:01:02,448 - log\train.log - INFO - processed 1416367 tokens with 3188 phrases; found: 2953 phrases; correct: 2306.

2018-09-14 00:01:02,449 - log\train.log - INFO - accuracy:  98.42%; precision:  78.09%; recall:  72.33%; FB1:  75.10

2018-09-14 00:01:02,450 - log\train.log - INFO -            hetong: precision:  82.85%; recall:  62.86%; FB1:  71.48  239

2018-09-14 00:01:02,451 - log\train.log - INFO -           jiafang: precision:  78.76%; recall:  78.00%; FB1:  78.38  923

2018-09-14 00:01:02,451 - log\train.log - INFO -           xiangmu: precision:  64.81%; recall:  58.77%; FB1:  61.64  574

2018-09-14 00:01:02,452 - log\train.log - INFO -            yifang: precision:  82.91%; recall:  77.14%; FB1:  79.92  1217

2018-09-14 00:01:02,552 - log\train.log - INFO - new best test f1 score:75.100
2018-09-14 00:04:52,635 - log\train.log - INFO - iteration:45 step:40/840, NER loss: 1.249973

{
  'string': '美尚生态系观股份有限公司(以下简称“公司")于近日收到招标人江苏省无锡惠山经济开发区委员会发来的《中标通知书,通知书确认 公 司为“
无锡古庄生态农业科技园PPPI项目"(以下器称“本项目”)的中标人', 
  'entities': [
    {'word': '美尚生态系观股份有限公司', 'start': 0, 'end': 12, 'type': 'yifang'},
    {'word': '江苏省无锡惠山经济开发区委员会', 'start': 30, 'end': 45, 'type': 'jiafang'}, 
    {'word': '无锡古庄生态农业科技园PPPI项目', 'start': 66, 'end': 83, 'type': 'xiangmu'}
    ]
}
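  • This approach is slower but retains more context, and it recognizes all the entities.
  • Compared with the first attempt, after 44 iterations of training the loss falls to about 1.7 (1.698 in the log), and both test and dev F1 improve.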
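
For comparison, the two truncation schemes used in the first and second attempts can be sketched as follows. Both function names are hypothetical, and the sketch assumes the ten-character margin is measured from the first and last labelled character; the original scripts may define it differently.

```python
def cut_tail(chars, tags):
    """First attempt: truncate right after the last labelled character."""
    labelled = [i for i, t in enumerate(tags) if t != "O"]
    if not labelled:
        return chars, tags            # nothing labelled, keep as is
    end = labelled[-1] + 1
    return chars[:end], tags[:end]


def keep_window(chars, tags, margin=10):
    """Second attempt: keep the labelled span plus `margin` O-tagged characters per side."""
    labelled = [i for i, t in enumerate(tags) if t != "O"]
    if not labelled:
        return chars, tags
    start = max(0, labelled[0] - margin)
    end = min(len(chars), labelled[-1] + 1 + margin)
    return chars[start:end], tags[start:end]


# Toy input: 15 unlabelled characters, a 2-character entity, 15 more unlabelled ones.
chars = list("x" * 15 + "AB" + "y" * 15)
tags = ["O"] * 15 + ["B-jiafang", "I-jiafang"] + ["O"] * 15

t_chars, t_tags = cut_tail(chars, tags)      # 17 characters, ends on the entity
w_chars, w_tags = keep_window(chars, tags)   # 22 characters, entity in the middle
```

Both schemes drop context outside the retained span, which may be one reason the third attempt, which keeps whole sentences, recovers all three entities in the test sentence.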

idcnn: the model is currently stuck in a local optimum and the loss stays high; this is being worked on.
