AMontgomerie / question_generator

License: MIT
An NLP system for generating reading comprehension questions

Programming Languages

python

Projects that are alternatives of or similar to question generator

text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+0%)
Mutual labels:  transformers, natural-language-generation, bert, question-generation
classy
classy is a simple-to-use library for building high-performance Machine Learning models in NLP.
Stars: ✭ 61 (-67.55%)
Mutual labels:  transformers, natural-language-generation, bert
Text-Summarization
Abstractive and Extractive Text summarization using Transformers.
Stars: ✭ 38 (-79.79%)
Mutual labels:  transformers, bert, t5
HugsVision
HugsVision is an easy-to-use HuggingFace wrapper for state-of-the-art computer vision
Stars: ✭ 154 (-18.09%)
Mutual labels:  transformers, bert
OpenDialog
An Open-Source Package for Chinese Open-domain Conversational Chatbots (a Chinese chit-chat dialogue system with one-click WeChat chatbot deployment)
Stars: ✭ 94 (-50%)
Mutual labels:  transformers, bert
bangla-bert
Bangla-Bert is a pretrained BERT model for the Bengali language
Stars: ✭ 41 (-78.19%)
Mutual labels:  transformers, bert
Text and Audio classification with Bert
Text Classification in Turkish Texts with Bert
Stars: ✭ 34 (-81.91%)
Mutual labels:  transformers, bert
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+1713.3%)
Mutual labels:  transformers, bert
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-88.3%)
Mutual labels:  transformers, bert
Fast Bert
Super easy library for BERT based NLP models
Stars: ✭ 1,678 (+792.55%)
Mutual labels:  transformers, bert
Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+1606.91%)
Mutual labels:  transformers, bert
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+1239.36%)
Mutual labels:  transformers, bert
Nlp Architect
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks
Stars: ✭ 2,768 (+1372.34%)
Mutual labels:  transformers, bert
erc
Emotion recognition in conversation
Stars: ✭ 34 (-81.91%)
Mutual labels:  transformers, bert
text2class
Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT
Stars: ✭ 15 (-92.02%)
Mutual labels:  transformers, bert
Tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Stars: ✭ 5,077 (+2600.53%)
Mutual labels:  transformers, bert
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
Stars: ✭ 738 (+292.55%)
Mutual labels:  bert, question-generation
robo-vln
Pytorch code for ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"
Stars: ✭ 34 (-81.91%)
Mutual labels:  transformers, bert
TorchBlocks
A PyTorch-based toolkit for natural language processing
Stars: ✭ 85 (-54.79%)
Mutual labels:  transformers, bert
Clue
Chinese Language Understanding Evaluation Benchmark (中文语言理解测评基准): datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+1189.89%)
Mutual labels:  transformers, bert

question_generator

Question Generator is an NLP system for generating reading comprehension-style questions from texts such as news articles or excerpts from books. The system is built using pretrained models from HuggingFace Transformers. There are two models: the question generator itself, and the QA evaluator, which ranks and filters the generated question-answer pairs based on their acceptability.

Update 2021/11/29

Updated training scripts

The training notebooks have been converted into training scripts. To run them:

python question_generator/training/qg_train.py
python question_generator/training/qa_eval_train.py

Hyperparameters can be changed using command-line arguments. See the scripts for the list of available arguments.
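For example, a hypothetical invocation (the flag names below are illustrative assumptions, not necessarily the scripts' real argument names):

python question_generator/training/qg_train.py --epochs 5 --learning_rate 1e-4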

Datasets uploaded to Huggingface Hub

The datasets have been uploaded to the Huggingface Hub.

Usage

The easiest way to generate some questions is to clone the GitHub repo and then run run_qg.py like this:

git clone https://github.com/amontgomerie/question_generator
cd question_generator
pip install -r requirements.txt -qq
python run_qg.py --text_file articles/twitter_hack.txt

This will generate 10 question-answer pairs of mixed style (full-sentence and multiple choice) based on the article specified in --text_file and print them to the console. For more information see the qg_commandline_example notebook.

The QuestionGenerator class can also be instantiated and used like this:

from questiongenerator import QuestionGenerator

text = open('articles/twitter_hack.txt').read()  # any input text
qg = QuestionGenerator()
qa_list = qg.generate(text, num_questions=10)

This will generate 10 questions of mixed style and return a list of dictionaries containing question-answer pairs. For multiple-choice questions, the answer field contains a list of dictionaries, each holding a candidate answer and a boolean indicating whether it is correct. The output can be printed easily using the print_qa() function. For more information, see the question_generation_example notebook.
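For example, assuming print_qa can be imported from the questiongenerator module alongside QuestionGenerator (an assumption based on the function name above):

from questiongenerator import print_qa

# qa_list is the list returned by qg.generate() above. For multiple-choice
# questions, the answer field is a list of {'answer': ..., 'correct': ...}
# dicts (an assumed key layout).
print_qa(qa_list)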

Choosing the number of questions

The desired number of questions can be passed as a command-line argument using --num_questions, or as an argument when calling qg.generate(text, num_questions=20). If the chosen number is too large, the model may not be able to generate enough questions. The maximum depends on the length of the input text, or more specifically on the number of sentences and named entities contained within it. Also note that since the QA Evaluator ranks the generated questions and returns the best ones first, output quality tends to drop as the number of requested questions grows.

Answer styles

The system can generate questions with full-sentence answers ('sentences'), questions with multiple-choice answers ('multiple_choice'), or a mix of both ('all'). The style can be selected with the --answer_style command-line argument or the answer_style argument of qg.generate().
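For example, reusing the qg instance and text from above to generate only multiple-choice questions:

qa_list = qg.generate(text, num_questions=10, answer_style='multiple_choice')

The same can be done from the command line:

python run_qg.py --text_file articles/twitter_hack.txt --answer_style multiple_choice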

Models

Question Generator

The question generator model takes a text as input and outputs a series of question-answer pairs. The answers are either full sentences or named entities extracted from the input text using spaCy. Named entities are used for multiple-choice answers, with the wrong answers drawn from other entities of the same type found in the text. Each question is generated by concatenating the extracted answer with the full text (up to a maximum of 512 tokens) as context, in the following format:

answer_token <extracted answer> context_token <context>

The concatenated string is then encoded and fed into the question generator model. The model architecture is t5-base. The pretrained model was finetuned as a sequence-to-sequence model on a dataset made up of several well-known QA datasets (SQuAD, RACE, CoQA, and MSMARCO). The datasets were restructured by concatenating the answer and context fields into the format shown above. The concatenated answer and context were used as the training input, and the question field became the target.

The datasets can be found here.
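A minimal sketch of the input construction and generation step described above (not the project's exact code; the checkpoint name is the model the author published on the Huggingface Hub, and the literal token strings follow the format shown above, so adjust both if they differ from the repo's):

from transformers import T5ForConditionalGeneration, T5Tokenizer

checkpoint = "iarfmoose/t5-base-question-generator"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

answer = "a spear-phishing attack"  # an extracted answer (example)
context = "Twitter said the incident was the result of a spear-phishing attack..."

# Concatenate the answer and context in the format shown above,
# truncating to the 512-token maximum.
input_text = f"answer_token {answer} context_token {context}"
inputs = tokenizer(input_text, max_length=512, truncation=True, return_tensors="pt")

output_ids = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))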

QA Evaluator

The QA evaluator takes a question-answer pair as input and outputs a value representing its prediction of whether the pair is valid. The model is bert-base-cased with a sequence classification head. The pretrained model was finetuned on the same data as the question generator model, but with the context removed. For 50% of the training examples, the question and answer were concatenated unchanged; for the other 50%, a corruption operation was applied (either swapping the answer for an unrelated one, or copying part of the question into the answer). The model was then trained to predict whether an input sequence was one of the original QA pairs or a corrupted one.
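A rough sketch of the corruption scheme described above (illustrative only, not the author's training code):

import random

def make_training_example(question, answer, unrelated_answers):
    """Return ((question, answer), label): 1 for a genuine pair, 0 for a corrupted one."""
    if random.random() < 0.5:
        return (question, answer), 1  # keep the genuine pair
    if random.random() < 0.5:
        # Corruption A: swap the answer for an unrelated one
        return (question, random.choice(unrelated_answers)), 0
    # Corruption B: copy part of the question into the answer
    words = question.split()
    return (question, " ".join(words[: max(1, len(words) // 2)])), 0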

The input to the QA evaluator follows the standard format for BertForSequenceClassification, with the question and answer as the two sequences:

[CLS] <question> [SEP] <answer> [SEP]
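A minimal sketch of scoring a pair in this format (not the project's exact code; the checkpoint name is the model the author published on the Huggingface Hub, an assumption to adjust if needed):

import torch
from transformers import BertForSequenceClassification, BertTokenizer

checkpoint = "iarfmoose/bert-base-cased-qa-evaluator"  # assumed checkpoint name
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertForSequenceClassification.from_pretrained(checkpoint)

question = "What caused the incident?"
answer = "a spear-phishing attack"

# Passing two sequences yields [CLS] <question> [SEP] <answer> [SEP]
inputs = tokenizer(question, answer, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)  # a higher "valid pair" logit means a more acceptable pair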