KristiyanVachev / Question Generation

Licence: gpl-3.0
Generating multiple choice questions from text using Machine Learning.

Question Generation

This project was originally intended for an AI course at Sofia University. During its execution, I was constrained on time and couldn't implement all the ideas I had, but I plan to continue working on it.

General idea

The idea is to generate multiple-choice questions from text by splitting this complex problem into simpler steps:

  • Identify keywords from the text and use them as answers to the questions.
  • Replace the answer in the sentence with a blank space and use the result as the base for the question.
  • Transform the sentence with the blank space into a more question-like sentence.
  • Generate distractors, words that are similar to the answer, as incorrect answers.

Question generation step by step gif

Installation

Creating a virtual environment (optional)

To avoid any conflicts with Python packages from other projects, it is good practice to create a virtual environment in which the packages will be installed. If you do not want to do this, you can skip the next commands and install the packages from the requirements.txt file directly.

Create a virtual environment:

python -m venv venv

Enter the virtual environment:

Windows:

. .\venv\Scripts\activate

Linux or macOS:

source venv/bin/activate

Register the virtual environment as an IPython kernel:

ipython kernel install --user --name=.venv

Install jupyter lab inside the venv:

pip install jupyterlab

Installing packages

pip install -r requirements.txt

Run jupyter

jupyter lab

Execution

Data Exploration

Before I could do anything, I wanted to understand more about how questions are made and what kind of words their answers are.

I used the SQuAD 1.0 dataset, which has about 100,000 questions generated from Wikipedia articles.

You can read about the insights I found in the Data Exploration Jupyter notebook.

Identifying answers

My assumption was that words from the text would be great answers for questions. All I needed to do was to decide which words, or short phrases, are good enough to become answers.

I decided to do a binary classification on each word from the text. spaCy really helped me with the word tagging.

Feature engineering

I pretty much needed to create the entire dataset for the binary classification. I extracted each non-stop word from the paragraphs of each question in the SQuAD dataset and added some features to it, such as:

  • Part of speech
  • Is it a Named entity
  • Are only alpha characters used
  • Shape - whether it's only alpha characters, digits, has punctuation (xxxx, dddd, Xxx X. Xxxx)
  • Word count

The label, isAnswer, records whether the word extracted from the paragraph is the same as, and in the same position as, the answer to the SQuAD question.
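As a rough illustration of such a dataset row, here is a minimal sketch that uses only the Python standard library. It covers the alpha-characters, shape and isAnswer columns; part of speech and named entities are left out because they need a tagger such as spaCy. The stop-word list, the `word_shape` helper and the `extract_features` function are assumptions made for this example, not the project's actual code, and spaCy's own shape attribute may differ in detail (for example in how it shortens long runs of the same character class).

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the project).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "with", "in", "to"}


def word_shape(word):
    """Map uppercase letters to X, lowercase letters to x, digits to d; keep other characters."""
    shape = []
    for ch in word:
        if ch.isdigit():
            shape.append("d")
        elif ch.isalpha():
            shape.append("X" if ch.isupper() else "x")
        else:
            shape.append(ch)
    return "".join(shape)


def extract_features(paragraph, answer, answer_start):
    """Build one feature row per non-stop word in the paragraph."""
    rows = []
    for match in re.finditer(r"\w+", paragraph):
        text = match.group()
        if text.lower() in STOP_WORDS:
            continue
        rows.append({
            "text": text,
            "is_alpha": text.isalpha(),
            "shape": word_shape(text),
            # Same word AND same character offset as the labelled answer.
            "isAnswer": text == answer and match.start() == answer_start,
        })
    return rows


paragraph = "Oxygen is a chemical element with symbol O and atomic number 8."
rows = extract_features(paragraph, answer="Oxygen", answer_start=0)
for row in rows:
    print(row)
```

Comparing the character offset as well as the text matters because the same word can appear several times in a paragraph while only one occurrence is the labelled answer.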

Some other features like TF-IDF score and cosine similarity to the title would be great, but I didn't have the time to add them.

Other than those, it's up to our imagination to create new features - maybe whether the word is at the start, middle or end of a sentence, information about the words surrounding it, and more. Before adding more features, though, it would be nice to have a metric to assess whether a feature is going to be useful or not.
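For reference, a TF-IDF score could be computed roughly as in the sketch below. It uses the plain textbook formula (term frequency times the log of inverse document frequency) on whitespace-split text. The `tf_idf` function and the three toy documents are assumptions for illustration only; library implementations typically add smoothing and normalisation, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(documents)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


docs = [
    "oxygen is a chemical element",
    "hydrogen is a chemical element",
    "paris is a city",
]
scores = tf_idf(docs)
for word, score in sorted(scores[0].items(), key=lambda pair: pair[1], reverse=True):
    print(f"{word}: {score:.3f}")
```

Words that occur in every document (here "is" and "a") score zero, while a word unique to one document scores highest, which is why the score could help separate informative answer candidates from filler words.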

Model training

I found the problem similar to spam filtering, where a common approach is to tag each word of an email as coming from either a spam or a non-spam email.

I used scikit-learn's Gaussian Naive Bayes algorithm to classify whether each word is an answer.

The results were surprisingly good - at a quick glance, the algorithm classified most of the words as answers. The ones it didn't were in fact unfit.

The cool thing about Naive Bayes is that you get the probability for each word. In the demo I've used that to order the words from the most likely answer to the least likely.
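A minimal sketch of that ranking step, assuming scikit-learn and NumPy are installed. The feature columns, the tiny training set and the candidate words are made up for illustration; the real notebooks train on features extracted from SQuAD.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy features per word: [is_named_entity, is_alpha, word_count]
X_train = np.array([
    [1, 1, 1],
    [1, 1, 2],
    [0, 1, 2],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 2],
    [1, 1, 1],
    [0, 1, 1],
])
# 1 = the word was an answer, 0 = it was not (made-up labels).
y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0])

model = GaussianNB()
model.fit(X_train, y_train)

candidates = ["Oxygen", "chemical element", "symbol"]
X_candidates = np.array([
    [1, 1, 1],
    [0, 1, 2],
    [0, 1, 1],
])

# predict_proba returns one column per class in model.classes_ order ([0, 1]),
# so column 1 holds the probability of the "is an answer" class.
answer_probs = model.predict_proba(X_candidates)[:, 1]

# Order the candidates from the most likely answer to the least likely.
ranked = sorted(zip(candidates, answer_probs), key=lambda pair: pair[1], reverse=True)
for word, prob in ranked:
    print(f"{word}: {prob:.2f}")
```

Sorting by probability instead of using the hard 0/1 prediction means a caller can simply take the top few candidates as answers.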

Creating questions

Another assumption I had was that the sentence containing an answer could easily be turned into a question. Just by placing a blank space in the position of the answer in the text, I get a "cloze" question (a sentence with a blank space for the missing word).

Answer: Oxygen

Question: _____ is a chemical element with symbol O and atomic number 8.
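The example above can be reproduced with a few lines of Python. This is only a sketch: the `make_cloze` function and its parameters are names chosen for illustration, not taken from the project's code.

```python
def make_cloze(sentence, answer, answer_start, blank="_____"):
    """Replace the answer span that starts at answer_start with a blank."""
    end = answer_start + len(answer)
    if sentence[answer_start:end] != answer:
        raise ValueError("answer not found at the given position")
    return sentence[:answer_start] + blank + sentence[end:]


sentence = "Oxygen is a chemical element with symbol O and atomic number 8."
question = make_cloze(sentence, "Oxygen", sentence.index("Oxygen"))
print(question)  # _____ is a chemical element with symbol O and atomic number 8.
```

Blanking by position rather than with a plain string replacement avoids removing other occurrences of the same text: if the answer were the symbol "O", a global replace would also hit the first letter of "Oxygen".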

I decided it wasn't worth it to transform the cloze question to a more question-looking sentence, but I imagine it could be done with a seq2seq neural network, similarly to the way text is translated from one language to another.

Generating incorrect answers

This part turned out really well.

For each answer I generate its most similar words using word embeddings and cosine similarity.
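The lookup can be sketched as follows. The three-dimensional vectors are invented for illustration (real word embeddings have hundreds of dimensions and come from a pretrained model), and `cosine_similarity` and `most_similar` are hypothetical helper names.

```python
import math

# Hypothetical 3-dimensional embeddings, made up for this example.
EMBEDDINGS = {
    "oxygen":   [0.9, 0.8, 0.1],
    "hydrogen": [0.8, 0.9, 0.2],
    "nitrogen": [0.9, 0.7, 0.2],
    "breathe":  [0.6, 0.2, 0.7],
    "paris":    [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings, top_n=3):
    """Return the top_n other words ranked by cosine similarity to word."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


for other, score in most_similar("oxygen", EMBEDDINGS):
    print(f"{other}: {score:.3f}")
```

With these toy vectors the other gases rank first, but an unrelated part of speech ("breathe") still makes the top three, which is the kind of unsuitable candidate that a later filtering step has to remove.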

Most similar words to oxygen

Most of the words are just fine and could easily be mistaken for the correct answer. But there are some which are obviously not appropriate.

Since I didn't have a dataset with incorrect answers I fell back on a more classical approach.

I removed the words that weren't the same part of speech or the same named entity as the answer, and added some more context from the question.
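A sketch of that filtering rule, with hand-written tags standing in for the output of a tagger such as spaCy. The tag values, the `TAGS` table and the `filter_distractors` function are assumptions for illustration, and the step of adding context from the question is not covered here.

```python
def filter_distractors(answer, candidates, tags):
    """Keep candidates whose part of speech and entity type match the answer's."""
    answer_tag = tags[answer]
    return [
        candidate
        for candidate in candidates
        if candidate != answer and tags.get(candidate) == answer_tag
    ]


# Hypothetical tags: word -> (part of speech, named-entity type or None).
TAGS = {
    "oxygen": ("NOUN", None),
    "hydrogen": ("NOUN", None),
    "nitrogen": ("NOUN", None),
    "breathe": ("VERB", None),
    "oxidize": ("VERB", None),
    "Paris": ("PROPN", "GPE"),
}

candidates = ["hydrogen", "breathe", "nitrogen", "oxidize"]
distractors = filter_distractors("oxygen", candidates, TAGS)
print(distractors)  # ['hydrogen', 'nitrogen']
```

Candidates that are missing from the tag table are dropped as well, since an untagged word cannot be shown to match the answer.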

I would like to find a dataset with multiple choice answers and see if I can create a ML model for generating better incorrect answers.

Results

After adding a Demo project, the generated questions aren't really fit to go into a classroom right away, but they aren't bad either.

The cool thing is the simplicity and modularity of the approach: you can find where it's doing badly (say, it's classifying verbs as answers) and plug a fix into it.

A complex neural network (like the ones in all the papers on the topic) would probably do better, especially in the age we're living in. But the great thing I found about this approach is that it works as a gateway for software engineers, with their software engineering mindset, to get into the field of AI and see meaningful results.

Future work (updated)

I find this topic quite interesting and with a lot of potential. I would probably continue working in this field.

I even enrolled in a Master's in Data Mining and will probably do some similar projects. I will link anything useful here.

I've already put some more time into finishing the project, but I would like to turn it into more of a tutorial about getting into the field of AI, while keeping it easy to extend with new custom features.

Updates

Update - 29.12.19: The repository has become pretty popular, so I added a new notebook (Demo.ipynb) that combines all the modules and generates questions for any text. I reordered the other notebooks and documented the code (a bit better).

Update - 09.03.21: Added a requirements.txt file with instructions for running in a virtual environment, and fixed a bug that raised ValueError: operands could not be broadcast together with shapes (230, 121) (83, )

I have also started working on my Master's thesis on a similar topic in Question Generation. If you are interested in the field or looking to improve upon this repo, you can check this great article by Adam Montgomerie or this repo by Patil Suraj, where they use transformers, which seem to be the current trend in NLP.
