houking-can / Ccks2019 Task5
CCKS 2019 Evaluation Task 5: information extraction from public company announcements, 3rd place
Stars: ✭ 87
Programming Languages
python
CCKS2019-Task5
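Introduction
PDF has become a standard for publishing electronic documents and disseminating digital information, and it is widely used in academic communication and for releasing all kinds of announcements. Extracting structured data from unstructured PDF documents is a major challenge in the knowledge graph field. This work uses the Acrobat DC SDK developed by Adobe to convert PDFs into another format, then extracts information from the resulting semi-structured intermediate files. Compared with existing open-source PDF parsing methods, the intermediate files exported by Acrobat preserve table and text-paragraph information more completely and accurately, and can serve information extraction tasks with different requirements.
In the CCKS 2019 public company announcement evaluation, our method ranked third overall. For this evaluation we converted the announcement files (PDF) to XML. For Task 1, we look up the Table tags to obtain all tables in the PDF, then use each table's context to determine its name and extract the tables that meet the conditions. For Task 2, we first extract all text paragraphs, run named entity recognition with a Bi-LSTM-CRF, and finally combine the results with rules to extract the information points.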
Task information
-
Training data
Baidu Netdisk link: https://pan.baidu.com/s/1ali_-IHCCrxlLBkMm0gmGA
Extraction code: y4t5
PDF content extraction system based on the Acrobat DC SDK
This part is a standalone component. Project address: https://github.com/houking-can/PDFConverter
Solution
Subtask 1: table extraction
After converting the PDF to XML, parse the XML directly to extract the tables.
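The table lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: only the Table tag is named in this README, so the TR, TH, TD and P tag names, the sample XML, and the helper names `extract_tables`, `cell_text` and `local` are assumptions made for the example. The text node that precedes each table is kept as a candidate for its name.

```python
import xml.etree.ElementTree as ET

# Assumed tag names for rows and cells; only "Table" is confirmed by the README.
ROW_TAG = "TR"
CELL_TAGS = ("TH", "TD")


def local(tag):
    """Strip a namespace prefix, e.g. '{ns}Table' -> 'Table'."""
    return tag.rsplit("}", 1)[-1]


def cell_text(node):
    """Join all text under a node and collapse runs of whitespace."""
    return " ".join("".join(node.itertext()).split())


def extract_tables(xml_text):
    """Return every table as rows of cell strings, plus the preceding text."""
    root = ET.fromstring(xml_text)
    tables = []
    for parent in root.iter():
        last_text = None  # text of the closest preceding sibling
        for child in parent:
            if local(child.tag) == "Table":
                rows = [
                    [cell_text(c) for c in tr if local(c.tag) in CELL_TAGS]
                    for tr in child.iter()
                    if local(tr.tag) == ROW_TAG
                ]
                tables.append({"context": last_text, "rows": rows})
            else:
                text = cell_text(child)
                if text:
                    last_text = text
    return tables


# Hypothetical sample input, not real converter output.
SAMPLE = """<Document>
  <P>Shareholding of directors</P>
  <Table>
    <TR><TH>Name</TH><TH>Shares</TH></TR>
    <TR><TD>Zhang San</TD><TD>1,000</TD></TR>
  </Table>
</Document>"""

if __name__ == "__main__":
    for table in extract_tables(SAMPLE):
        print(table["context"], table["rows"])
```

Matching the table name against the `context` string (for example with keywords) would then select the tables that meet the task's conditions. Rows of nested tables are merged into the outer table in this sketch.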
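Subtask 2: extraction of executive resignation information points
- Convert the PDF to XML, extract the text paragraphs, split them into sentences, and use the manually annotated JSON to back-label the data, producing BIO training data.
- Word vectors: this entry uses word vectors pre-trained on financial-domain text.
- Train the Bi-LSTM-CRF until convergence.
- Design templates based on trigger words such as 离职 (leaving a post), 辞职 (resignation) and 因...原因 (for ... reasons). First identify the sentences that contain information points, then run entity recognition on those sentences to extract the information points.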
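The data preparation and sentence filtering steps above might look roughly like the sketch below. It is an illustration under assumptions, not the project's code: the label names (`PER`, `REASON`), the shape of the annotation dictionary, the sentence-ending punctuation set and the function names are all hypothetical. The trigger pattern uses only the three trigger examples mentioned in this README.

```python
import re

# Trigger patterns taken from the examples in this README.
TRIGGER_RE = re.compile(r"离职|辞职|因.+?原因")

# Assumed set of sentence-ending punctuation.
SENTENCE_RE = re.compile(r"[^。!?;]+[。!?;]?")


def split_sentences(paragraph):
    """Split a paragraph into sentences at Chinese end-of-sentence marks."""
    return [s.strip() for s in SENTENCE_RE.findall(paragraph) if s.strip()]


def has_trigger(sentence):
    """True if the sentence contains one of the trigger patterns."""
    return TRIGGER_RE.search(sentence) is not None


def bio_tag(sentence, annotations):
    """Back-label a sentence character by character.

    `annotations` maps a (hypothetical) label to the annotated string;
    every occurrence of that string in the sentence is tagged B-/I-label.
    """
    tags = ["O"] * len(sentence)
    for label, value in annotations.items():
        if not value:
            continue
        start = sentence.find(value)
        while start != -1:
            end = start + len(value)
            # Do not overwrite characters that already carry a label.
            if all(t == "O" for t in tags[start:end]):
                tags[start] = "B-" + label
                for i in range(start + 1, end):
                    tags[i] = "I-" + label
            start = sentence.find(value, end)
    return list(zip(sentence, tags))


if __name__ == "__main__":
    text = ("公司董事会于近日收到张三先生的辞职报告。"
            "张三先生因个人原因辞去公司副总经理职务。"
            "公司将尽快聘任新的副总经理。")
    candidates = [s for s in split_sentences(text) if has_trigger(s)]
    print(candidates)
    print(bio_tag("张三因个人原因辞职。", {"PER": "张三", "REASON": "个人原因"}))
```

Plain string matching of this kind is simple but noisy, because a value can also match text that was never meant as an entity.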
Scores and rankings
Overall ranking across both tasks
Subtask 1
Subtask 2
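Notes
- In Subtask 1 (table extraction), a badly chosen PDF content-extraction parameter left 1 of the 10 test cases with empty output, which seriously hurt our result on this task. The F1 score could otherwise have reached about 0.96 (in theory up to 0.99 when spaces are ignored and 0 and 0.00 are treated as equal). The evaluation was run through a web API and the final result was tested only once, so there was no chance to fix it.
- In Subtask 2 (executive information extraction), the top two companies both obtained their BIO training data by manual annotation, and Shenzhen Securities (深圳证券) also used additional data. We labeled our data only with heuristic rules, which introduced a lot of noise. Most of the top-ranked teams used BERT, which brought a large improvement, whereas we used only pre-trained word vectors.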
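The lenient comparison mentioned in the first note (ignoring spaces and treating 0 and 0.00 as equal) could be expressed as a small normalization step before comparing cells. This is only a sketch of that idea; how the official scorer compares cells is not described here, and `normalize_cell` is a hypothetical helper.

```python
from decimal import Decimal, InvalidOperation


def normalize_cell(value):
    """Drop all whitespace and bring numeric strings to one canonical form."""
    text = "".join(value.split())
    try:
        number = Decimal(text)
    except InvalidOperation:
        return text  # not a plain number: compare as whitespace-free text
    # normalize() removes trailing zeros; "f" avoids exponent notation.
    return format(number.normalize(), "f")


if __name__ == "__main__":
    print(normalize_cell("0.00") == normalize_cell("0"))
    print(normalize_cell(" 张 三 "))
```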