
SunnyMarkLiu / les-military-mrc-rank7

Licence: other
LES Cup: Rank 7 solution for the 2nd National "Military Intelligent Machine Reading" Challenge

Programming Languages

Jupyter Notebook, Python, Shell

Projects that are alternatives of or similar to les-military-mrc-rank7

COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (-35.14%)
Mutual labels:  transformer, bert, roberta
Bertviz
Tool for visualizing attention in the Transformer model (BERT, GPT-2, Albert, XLNet, RoBERTa, CTRL, etc.)
Stars: ✭ 3,443 (+9205.41%)
Mutual labels:  transformer, bert, roberta
vietnamese-roberta
A Robustly Optimized BERT Pretraining Approach for Vietnamese
Stars: ✭ 22 (-40.54%)
Mutual labels:  transformer, bert, roberta
Filipino-Text-Benchmarks
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Stars: ✭ 22 (-40.54%)
Mutual labels:  transformer, bert
golgotha
Contextualised Embeddings and Language Modelling using BERT and Friends using R
Stars: ✭ 39 (+5.41%)
Mutual labels:  transformer, bert
are-16-heads-really-better-than-1
Code for the paper "Are Sixteen Heads Really Better than One?"
Stars: ✭ 128 (+245.95%)
Mutual labels:  transformer, bert
SIGIR2021 Conure
One Person, One Model, One World: Learning Continual User Representation without Forgetting
Stars: ✭ 23 (-37.84%)
Mutual labels:  transformer, bert
text-generation-transformer
text generation based on transformer
Stars: ✭ 36 (-2.7%)
Mutual labels:  transformer, bert
bert in a flask
A dockerized flask API, serving ALBERT and BERT predictions using TensorFlow 2.0.
Stars: ✭ 32 (-13.51%)
Mutual labels:  transformer, bert
Nlp Tutorial
Natural Language Processing Tutorial for Deep Learning Researchers
Stars: ✭ 9,895 (+26643.24%)
Mutual labels:  transformer, bert
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+150554.05%)
Mutual labels:  transformer, bert
transformer-models
Deep Learning Transformer models in MATLAB
Stars: ✭ 90 (+143.24%)
Mutual labels:  transformer, bert
Xpersona
XPersona: Evaluating Multilingual Personalized Chatbot
Stars: ✭ 54 (+45.95%)
Mutual labels:  transformer, bert
semantic-document-relations
Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles"
Stars: ✭ 21 (-43.24%)
Mutual labels:  transformer, bert
PDN
The official PyTorch implementation of "Pathfinder Discovery Networks for Neural Message Passing" (WebConf '21)
Stars: ✭ 44 (+18.92%)
Mutual labels:  transformer, bert
bert-as-a-service TFX
End-to-end pipeline with TFX to train and deploy a BERT model for sentiment analysis.
Stars: ✭ 32 (-13.51%)
Mutual labels:  transformer, bert
NLP-paper
🎨🎨 NLP (Natural Language Processing) tutorial 🎨🎨 https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (-37.84%)
Mutual labels:  transformer, bert
Kevinpro-NLP-demo
All the NLP you need here. Personal implementations of some fun NLP demos, currently covering PyTorch implementations of 13 NLP applications.
Stars: ✭ 117 (+216.22%)
Mutual labels:  transformer, bert
Bert Pytorch
Google AI 2018 BERT pytorch implementation
Stars: ✭ 4,642 (+12445.95%)
Mutual labels:  transformer, bert
cmrc2019
A Sentence Cloze Dataset for Chinese Machine Reading Comprehension (CMRC 2019)
Stars: ✭ 118 (+218.92%)
Mutual labels:  reading-comprehension, bert

les-military-mrc

LES Cup: Rank 7 solution (baseline) for the 2nd National "Military Intelligent Machine Reading" Challenge.

Architecture

The competition data has the following characteristics:

  • each question comes with five long documents that contain a fair amount of noise;
  • some questions require deep reasoning over bridge entities;
  • some questions have multiple answers, which may come from a single document or from several documents.

To address these issues, the team adopted the overall architecture shown in the figure below:

Text Preprocess

To simplify downstream model training, the dataset is converted into the DuReader format. Because the raw text contains a large amount of noise, the cleaning steps include the following (a minimal sketch follows the list):

  • removing (Unicode) blank/control characters such as \u200b, \x10, \f, and \r;
  • removing URLs and HTML tags;
  • collapsing runs of repeated characters such as ------ and .....;
  • removing advertisement text;
  • removing empty paragraphs and duplicated paragraphs.
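A minimal cleaning sketch along these lines (the rule set, regexes, and function names here are illustrative assumptions, not the repository's actual preprocessing code):

```python
import re

def clean_text(text):
    """Apply the noise-removal rules listed above to one raw string."""
    # Remove zero-width / control characters such as \u200b, \x10, \f, \r.
    text = re.sub(r"[\u200b\x10\f\r]", "", text)
    # Remove URLs and HTML tags.
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of repeated punctuation such as "------" or ".....".
    text = re.sub(r"([-._=~,])\1{2,}", r"\1", text)
    return text.strip()

def clean_paragraphs(paragraphs):
    """Drop empty paragraphs and exact duplicates while keeping order."""
    seen, cleaned = set(), []
    for para in map(clean_text, paragraphs):
        if para and para not in seen:
            seen.add(para)
            cleaned.append(para)
    return cleaned
```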

Paragraph Selection

Because the documents are long, and to keep the selected context as short as possible while preserving answer coverage, each document is cropped around the answer with a maximum length max_doc_len of 1024. Concretely (this method does no elaborate paragraph selection and is simplified to answer-centered cropping):

  • documents shorter than 1024 characters are kept in full;
  • if the document is longer than 1024 and the answer lies toward the left, keep the first 1024 characters;
  • if the document is longer than 1024 and the answer lies toward the right, keep the last 1024 characters;
  • otherwise, crop a 1024-character window roughly centered on the answer (the center point is randomized).

Note that when a document is long and the answer lies roughly in the middle, to avoid introducing a positional bias for the answer during truncation, the scheme randomizes the distance between the answer start index and the left boundary of the cropped window, as illustrated in the figure below and sketched in the code that follows:
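A sketch of this answer-centered cropping under stated assumptions (crop_document is a hypothetical helper, answer spans are given as character offsets, and the real repository code may differ):

```python
import random

MAX_DOC_LEN = 1024

def crop_document(doc, ans_start, ans_end, max_len=MAX_DOC_LEN):
    """Crop a long document to max_len characters while keeping the answer span.

    The window is roughly centered on the answer, with a randomized left margin
    so the answer-start index is not always at the same relative position.
    """
    if len(doc) <= max_len:
        return doc, ans_start, ans_end
    # Answer near the left edge: keep the first max_len characters.
    if ans_end <= max_len:
        return doc[:max_len], ans_start, ans_end
    # Answer near the right edge: keep the last max_len characters.
    if ans_start >= len(doc) - max_len:
        offset = len(doc) - max_len
        return doc[offset:], ans_start - offset, ans_end - offset
    # Otherwise crop around the answer; the random left margin avoids a
    # positional bias toward the exact window center.
    ans_len = ans_end - ans_start
    left_margin = random.randint(0, max_len - ans_len)
    start = ans_start - left_margin
    return doc[start:start + max_len], ans_start - start, ans_end - start
```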

Features

  • Use the jieba tokenizer to extract POS and keyword features for both questions and documents, and for each document character extract a doc_char_in_question feature indicating whether that character also appears in the question;
  • use the foolnltk toolkit to extract named entities (7 entity types in total) from questions and documents, and one-hot encode them; a small sketch follows.
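A sketch of the character-level features based on jieba (function names are illustrative; the foolnltk NER one-hot features would be built with the same per-character pattern):

```python
import jieba
import jieba.analyse
import jieba.posseg as pseg

def char_pos_tags(text):
    """Give each character the POS tag of the jieba word it belongs to."""
    tags = []
    for word, flag in pseg.cut(text):
        tags.extend([flag] * len(word))
    return tags

def char_keyword_flags(text, top_k=10):
    """Mark characters of words that are among the top-k TF-IDF keywords."""
    keywords = set(jieba.analyse.extract_tags(text, topK=top_k))
    flags = []
    for word in jieba.cut(text):
        flags.extend([int(word in keywords)] * len(word))
    return flags

def doc_char_in_question(document, question):
    """1 if the document character also appears in the question, else 0."""
    question_chars = set(question)
    return [int(ch in question_chars) for ch in document]
```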

Experiment

Teammates

Lucky Boys

License

This project is licensed under the terms of the MIT license.
