freefuiiismyname / G Reader

Model for the 2018 Machine Reading Comprehension Technology Competition. Ranked 6th on BLEU-4 and 14th on ROUGE-L among more than 1,000 teams from China and abroad. (No ensembling, no pretrained word embeddings, no dropout.)

Projects that are alternatives of or similar to G Reader

Cracking The Da Vinci Code With Google Interview Problems And Nlp In Python
A guide on how to crack combinatorics puzzles shown in The Da Vinci Code movie using CS fundamentals and NLP
Stars: ✭ 75 (-35.9%)
Mutual labels:  nlp-machine-learning
Wiki Split
One million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits.
Stars: ✭ 95 (-18.8%)
Mutual labels:  nlp-machine-learning
Textaugmentation Gpt2
Fine-tuned pre-trained GPT2 for custom topic specific text generation. Such system can be used for Text Augmentation.
Stars: ✭ 104 (-11.11%)
Mutual labels:  nlp-machine-learning
Summarus
Models for automatic abstractive summarization
Stars: ✭ 83 (-29.06%)
Mutual labels:  nlp-machine-learning
Datascience
It consists of examples, assignments discussed in data science course taken at algorithmica.
Stars: ✭ 92 (-21.37%)
Mutual labels:  nlp-machine-learning
Question Generation
Given a sentence automatically generate reading comprehension style factual questions from that sentence, such that the sentence contains answers to those questions.
Stars: ✭ 100 (-14.53%)
Mutual labels:  nlp-machine-learning
Intent classifier
Stars: ✭ 67 (-42.74%)
Mutual labels:  nlp-machine-learning
Bertqa Attention On Steroids
BertQA - Attention on Steroids
Stars: ✭ 112 (-4.27%)
Mutual labels:  nlp-machine-learning
Writeup Frontend
Beat Writer's Block with AI
Stars: ✭ 94 (-19.66%)
Mutual labels:  nlp-machine-learning
Repo 2016
R, Python and Mathematica Codes in Machine Learning, Deep Learning, Artificial Intelligence, NLP and Geolocation
Stars: ✭ 103 (-11.97%)
Mutual labels:  nlp-machine-learning
Text classification
Text Classification Algorithms: A Survey
Stars: ✭ 1,276 (+990.6%)
Mutual labels:  nlp-machine-learning
Doc2vec
📓 Long(er) text representation and classification using Doc2Vec embeddings
Stars: ✭ 92 (-21.37%)
Mutual labels:  nlp-machine-learning
Codesearchnet
Datasets, tools, and benchmarks for representation learning of code.
Stars: ✭ 1,378 (+1077.78%)
Mutual labels:  nlp-machine-learning
Russian news corpus
Corpus of stemmed/lemmatized (morphologically normalized) texts from Russian mass media
Stars: ✭ 76 (-35.04%)
Mutual labels:  nlp-machine-learning
Lemminflect
A python module for English lemmatization and inflection.
Stars: ✭ 105 (-10.26%)
Mutual labels:  nlp-machine-learning
Nlp Paper
Dialogue and speech within NLP: curated papers (with reading notes), model reproductions, and data processing (code in both TensorFlow and PyTorch versions)
Stars: ✭ 67 (-42.74%)
Mutual labels:  nlp-machine-learning
Monkeylearn
⛔️ ARCHIVED ⛔️ 🐒 R package for text analysis with Monkeylearn 🐒
Stars: ✭ 95 (-18.8%)
Mutual labels:  nlp-machine-learning
Lingo
package lingo provides the data structures and algorithms required for natural language processing
Stars: ✭ 113 (-3.42%)
Mutual labels:  nlp-machine-learning
Atnre
Adversarial Training for Neural Relation Extraction
Stars: ✭ 108 (-7.69%)
Mutual labels:  nlp-machine-learning
Mrc book
Code for the book Machine Reading Comprehension: Algorithms and Practice (《机器阅读理解:算法与实践》)
Stars: ✭ 102 (-12.82%)
Mutual labels:  nlp-machine-learning

G-Reader

Machine Reading Comprehension (MRC) is the task of having a machine read a passage of text and then answer questions about its content. The 2018 Machine Reading Comprehension Technology Competition was jointly organized by the Chinese Information Processing Society of China, the China Computer Federation, and Baidu, and used a large-scale Chinese reading-comprehension dataset provided by Baidu and drawn from real application scenarios.

Among more than 1,000 teams from China and abroad, this model ranked 6th on BLEU-4 and 14th on ROUGE-L. (No ensembling, no pretrained word embeddings, no dropout.)

Model Architecture

For a given question, it is very common for the document set to contain multiple valid answers. We consider it unreasonable to raise the probability of one span being the answer while simultaneously pushing down the probability of the other, equally valid, answer spans.

Our model therefore uses a two-stage structure: it first extracts a candidate answer from each document independently, and then selects the best answer from the resulting candidate set. This avoids the learning difficulty that multiple answers cause for a single end-to-end neural network. Concretely, we use BiDAF + Passage Self-Matching to extract an answer from each individual document, forming the candidate answer set, and then use an EM algorithm and an XGBoost decision-tree model to pick the best answer from that set.

(Figure: overall model architecture)

In other words, the model consists of the following two parts (a minimal pipeline sketch follows the list):

1. Candidate-answer extraction layer: BiDAF + Passage Self-Matching

2. Answer-selection layer: EM algorithm and XGBoost
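
Below is a minimal sketch of how the two stages could fit together. The `reader.extract` and `featurize` interfaces are illustrative assumptions, not the repository's actual API; only the overall structure (one candidate per document, then XGBoost re-ranking) follows the description above.

```python
from typing import Dict, List

import numpy as np
import xgboost as xgb


def extract_candidates(question: str, documents: List[str], reader) -> List[Dict]:
    """Stage 1: run the BiDAF + Passage Self-Matching reader on every document
    independently, producing one candidate answer span per document."""
    candidates = []
    for doc_id, doc in enumerate(documents):
        span, score = reader.extract(question, doc)  # illustrative reader interface
        candidates.append({"doc_id": doc_id, "answer": span, "reader_score": score})
    return candidates


def select_best(candidates: List[Dict], featurize, booster: xgb.Booster) -> Dict:
    """Stage 2: score each candidate with XGBoost over hand-crafted features
    (reader score, cross-candidate similarity, length, ...) and keep the best."""
    feats = np.array([featurize(c, candidates) for c in candidates], dtype=np.float32)
    scores = booster.predict(xgb.DMatrix(feats))
    return max(zip(candidates, scores), key=lambda pair: float(pair[1]))[0]
```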

About the Data

See the data download page on the competition's official website. The real-world dataset, drawn from Baidu Zhidao and Baidu Search, contains 300,000 questions in total: a 270,000-question training set, a 10,000-question development set, and a 20,000-question test set, released in four parts for participants to download.
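
For orientation, here is a minimal loader for the competition's JSON-lines data. The field names ("question", "documents", "paragraphs", "answers") follow the public DuReader release that the competition data is based on, and the file name is a placeholder; adjust both if your download differs.

```python
import json
from typing import Dict, Iterator


def read_dureader(path: str) -> Iterator[Dict]:
    """Yield one sample per line from a DuReader-style JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            yield {
                "question": sample["question"],
                # flatten each document's paragraph list into one passage list
                "passages": [p for doc in sample["documents"] for p in doc["paragraphs"]],
                "answers": sample.get("answers", []),  # absent in the test split
            }


# example usage: count questions in the (placeholder-named) training file
if __name__ == "__main__":
    print(sum(1 for _ in read_dureader("search.train.json")))
```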

The EM-algorithm component includes a TF-IDF model file built on the Baidu Zhidao subset; after downloading the Baidu Zhidao data files it can be run directly with Java (no Python implementation yet). Within the overall model (BiDAF extracts the candidate answers, XGBoost decides the final answer), it serves as feature expansion, exchanging information between the candidate answers.
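
The repository's EM component is a Java program, but the idea of exchanging information between candidate answers can be illustrated with a TF-IDF sketch (this is an assumed formulation for illustration, not the repository's exact feature): each candidate is scored by its average similarity to the other candidates, so spans that several documents agree on are rewarded. For Chinese text the vectorizer needs pre-segmented input or a word segmenter such as jieba.

```python
from typing import List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def cross_candidate_similarity(candidates: List[str]) -> List[float]:
    """For each candidate answer, return its mean TF-IDF cosine similarity to
    the other candidates; intended as one extra feature for the XGBoost layer."""
    if len(candidates) < 2:
        return [0.0] * len(candidates)
    # note: for Chinese answers, pre-segment the text or pass a tokenizer
    # (e.g. TfidfVectorizer(tokenizer=jieba.lcut)) so terms are real words
    tfidf = TfidfVectorizer().fit_transform(candidates)
    sim = cosine_similarity(tfidf)
    n = len(candidates)
    # average over the other candidates, excluding self-similarity (diagonal)
    return [float(sim[i].sum() - sim[i, i]) / (n - 1) for i in range(n)]
```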

Results

Final results:

(Figure: final competition scores)

Unsupervised EM algorithm results: (Figure: EM algorithm results)

Acknowledgements

This model was built by the G-scuter team from South China University of Technology. We thank 广州极天信息技术股份有限公司 (Guangzhou Jitian Information Technology) and the School of Software Engineering at South China University of Technology.
