awslabs / unsupervised-qa

License: Apache-2.0
Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering

Programming Languages

python

Projects that are alternatives of or similar to unsupervised-qa

explicit memory tracker
[ACL 2020] Explicit Memory Tracker with Coarse-to-Fine Reasoning for Conversational Machine Reading
Stars: ✭ 35 (-25.53%)
Mutual labels:  question-answering, question-generation
MLH-Quizzet
This is a smart Quiz Generator that generates a dynamic quiz from any uploaded text/PDF document using NLP. This can be used for self-analysis, question paper generation, and evaluation, thus reducing human effort.
Stars: ✭ 23 (-51.06%)
Mutual labels:  question-answering, question-generation
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+300%)
Mutual labels:  question-answering, question-generation
Awesome Deep Learning And Machine Learning Questions
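[Updated irregularly] A curated collection of valuable questions about deep learning, machine learning, reinforcement learning, and data science, gathered from sites such as Zhihu, Quora, Reddit, and Stack Exchange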
Stars: ✭ 203 (+331.91%)
Mutual labels:  question-answering
Forum
Love Laravel? Become a Jedi and help other Padawans
Stars: ✭ 233 (+395.74%)
Mutual labels:  question-answering
CS-DisMo
[ICCVW 2021] Rethinking Content and Style: Exploring Bias for Unsupervised Disentanglement
Stars: ✭ 20 (-57.45%)
Mutual labels:  unsupervised-learning
question-generation
Neural Models for Key Phrase Detection and Question Generation
Stars: ✭ 29 (-38.3%)
Mutual labels:  question-generation
Flowqa
Implementation of conversational QA model: FlowQA (with slight improvement)
Stars: ✭ 194 (+312.77%)
Mutual labels:  question-answering
FinBERT-QA
Financial Domain Question Answering with pre-trained BERT Language Model
Stars: ✭ 70 (+48.94%)
Mutual labels:  question-answering
esapp
An unsupervised Chinese word segmentation tool.
Stars: ✭ 13 (-72.34%)
Mutual labels:  unsupervised-learning
TA3N
[ICCV 2019 Oral] TA3N: https://github.com/cmhungsteve/TA3N (Most updated repo)
Stars: ✭ 45 (-4.26%)
Mutual labels:  unsupervised-learning
Dmn Tensorflow
Dynamic Memory Networks (https://arxiv.org/abs/1603.01417) in Tensorflow
Stars: ✭ 236 (+402.13%)
Mutual labels:  question-answering
DrFAQ
DrFAQ is a plug-and-play question answering NLP chatbot that can be generally applied to any organisation's text corpora.
Stars: ✭ 29 (-38.3%)
Mutual labels:  question-answering
Tensorflow Dsmm
Tensorflow implementations of various Deep Semantic Matching Models (DSMM).
Stars: ✭ 217 (+361.7%)
Mutual labels:  question-answering
Joint-Motion-Estimation-and-Segmentation
[MICCAI'18] Joint Learning of Motion Estimation and Segmentation for Cardiac MR Image Sequences
Stars: ✭ 45 (-4.26%)
Mutual labels:  unsupervised-learning
Kb Qa
Knowledge-base-driven Chinese question answering system (biLSTM)
Stars: ✭ 195 (+314.89%)
Mutual labels:  question-answering
VideoNavQA
An alternative EQA paradigm and informative benchmark + models (BMVC 2019, ViGIL 2019 spotlight)
Stars: ✭ 22 (-53.19%)
Mutual labels:  question-answering
Agriculture knowledgegraph
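Agricultural knowledge graph (AgriKG): information retrieval, named entity recognition, relation extraction, intelligent question answering, and decision support for the agriculture domain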
Stars: ✭ 2,957 (+6191.49%)
Mutual labels:  question-answering
Jack
Jack the Reader
Stars: ✭ 242 (+414.89%)
Mutual labels:  question-answering
cmrc2017
The First Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2017)
Stars: ✭ 90 (+91.49%)
Mutual labels:  question-answering

Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering

Code and synthetic data from our ACL 2020 paper

Abstract

Question Answering (QA) is in increasing demand as the amount of information available online and the desire for quick access to this content grows. A common approach to QA has been to fine-tune a pretrained language model on a task-specific labeled dataset. This paradigm, however, relies on scarce, and costly to obtain, large-scale human-labeled data. We propose an unsupervised approach to training QA models with generated pseudo-training data. We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance by allowing the model to learn more complex context-question relationships. Training a QA model on this data gives a relative improvement over a previous unsupervised model in F1 score on the SQuAD dataset by about 14%, and 20% when the answer is a named entity, achieving state-of-the-art performance on SQuAD for unsupervised QA.
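
The idea in the abstract can be illustrated with a small, self-contained sketch. It is not the repository's code: the lexical-overlap retrieval stands in for a real search-engine query, and the `WH_WORDS` mapping, the function names, and the "Wh + B + A + ?" template (the retrieved sentence is split as [A][answer][B]) are assumptions chosen for illustration. The templates and question words used in the paper may differ.

```python
import re

# Hypothetical mapping from named-entity type to a question word (illustration only).
WH_WORDS = {"PERSON": "Who", "DATE": "When", "GPE": "Where"}


def tokens(text):
    """Lowercased word tokens, used for a crude lexical-overlap score."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve_related(context_sentence, answer, corpus):
    """Return the corpus sentence that mentions the answer and shares the most
    words with the context sentence, excluding the context sentence itself.
    A toy stand-in for a search-engine query; returns None if nothing matches."""
    context_tokens = tokens(context_sentence)
    candidates = [s for s in corpus if answer in s and s != context_sentence]
    if not candidates:
        return None
    return max(candidates, key=lambda s: len(tokens(s) & context_tokens))


def apply_template(sentence, answer, entity_type):
    """Split the sentence as [A][answer][B] and emit 'Wh + B + A + ?'."""
    before, _, after = sentence.partition(answer)
    wh = WH_WORDS.get(entity_type, "What")
    fragment_a = before.strip().rstrip(",")
    fragment_b = after.strip().rstrip(".")
    body = " ".join(part for part in (fragment_b, fragment_a) if part)
    return f"{wh} {body}?"


context = "Marie Curie won the Nobel Prize in Physics in 1903."
corpus = [
    context,
    "Marie Curie shared the 1903 Nobel Prize in Physics with two colleagues.",
    "Paris is the capital of France.",
]
related = retrieve_related(context, "Marie Curie", corpus)
print(apply_template(related, "Marie Curie", "PERSON"))
# prints: Who shared the 1903 Nobel Prize in Physics with two colleagues?
```

The generated question is paired with the original context sentence rather than the retrieved one, so the QA model cannot answer by simple string matching between question and context.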

Synthetic data

The synthetic data generated for the publication is located under enwiki_synthetic/.

Requirements

  1. PySpark
  2. Elasticsearch 6

Instructions for generating retrieval-based synthetic data

Tokenize and perform NER:

spark-submit --master local[90] --driver-memory 200G spark_scripts/tokenize_and_ner_inputs.py --corpus=enwiki/clean/*/*.raw --output outputs/sent-tok-rollup

Then write the tokenized sentences to an Elasticsearch index. This step reads the AES_HOSTS environment variable.

spark-submit --master local[90] --driver-memory 4G spark_scripts/write_sentence_level_es_index.py --corpus=outputs/sent-tok-rollup/rollup/ --es-index uqa-es-index --output outputs/write-es
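
For orientation, the sketch below shows one way sentence-level documents could be shaped for bulk indexing. It is not taken from write_sentence_level_es_index.py. The comma-separated format assumed for AES_HOSTS, the `localhost:9200` fallback, the `sentence` mapping type, and the field names `doc_id`, `sent_id`, and `text` are all hypothetical. The action dictionaries follow the shape accepted by `elasticsearch.helpers.bulk` in the official Python client, which is deliberately not imported here so the example runs without a cluster.

```python
import os


def parse_hosts(raw):
    """Split a comma-separated host list, dropping blanks (format is assumed)."""
    return [h.strip() for h in raw.split(",") if h.strip()]


def sentence_actions(doc_id, sentences, index_name):
    """Yield one bulk-index action per sentence.

    Field names are hypothetical; "_type" is included because
    Elasticsearch 6 still expects a mapping type on each document.
    """
    for sent_id, text in enumerate(sentences):
        yield {
            "_index": index_name,
            "_type": "sentence",
            "_id": f"{doc_id}-{sent_id}",
            "_source": {"doc_id": doc_id, "sent_id": sent_id, "text": text},
        }


# Fall back to a local node when AES_HOSTS is unset (assumption for the demo).
hosts = parse_hosts(os.environ.get("AES_HOSTS", "localhost:9200"))
actions = list(
    sentence_actions("enwiki-1", ["First sentence.", "Second sentence."], "uqa-es-index")
)
print(len(hosts), "host(s),", len(actions), "action(s)")
```

Indexing one document per sentence keeps retrieval at sentence granularity, which matches the script name and the sentence-level question generation described in the abstract.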

Create the synthetic QA dataset:

spark-submit --master local[90] --driver-memory 300G spark_scripts/create_ds_synthetic_dataset.py --corpus=outputs/sent-tok-rollup/rollup/ --output outputs/synthetic-uqa-auxqs1awc1 --aux-qs=1 --aux-awc=1 --ulim-count=500000

Citation

You can cite our paper:

@inproceedings{fabbri-etal-2020-template,
    title = "Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering",
    author = "Fabbri, Alexander  and
      Ng, Patrick  and
      Wang, Zhiguo  and
      Nallapati, Ramesh  and
      Xiang, Bing",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.413",
    doi = "10.18653/v1/2020.acl-main.413",
    pages = "4508--4513",
    abstract = "Question Answering (QA) is in increasing demand as the amount of information available online and the desire for quick access to this content grows. A common approach to QA has been to fine-tune a pretrained language model on a task-specific labeled dataset. This paradigm, however, relies on scarce, and costly to obtain, large-scale human-labeled data. We propose an unsupervised approach to training QA models with generated pseudo-training data. We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance by allowing the model to learn more complex context-question relationships. Training a QA model on this data gives a relative improvement over a previous unsupervised model in F1 score on the SQuAD dataset by about 14{\%}, and 20{\%} when the answer is a named entity, achieving state-of-the-art performance on SQuAD for unsupervised QA.",
}

License

This project is licensed under the Apache-2.0 License.
