
xwhan / ProQA

Licence: other
Progressively Pretrained Dense Corpus Index for Open-Domain QA and Information Retrieval

Programming Languages

python
shell

Projects that are alternatives of or similar to ProQA

cherche
📑 Neural Search
Stars: ✭ 196 (+345.45%)
Mutual labels:  information-retrieval, question-answering
COVID19-IRQA
No description or website provided.
Stars: ✭ 32 (-27.27%)
Mutual labels:  information-retrieval, question-answering
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+327.27%)
Mutual labels:  information-retrieval, question-answering
HAR
Code for WWW2019 paper "A Hierarchical Attention Retrieval Model for Healthcare Question Answering"
Stars: ✭ 22 (-50%)
Mutual labels:  information-retrieval, question-answering
Flexneuart
Flexible classic and NeurAl Retrieval Toolkit
Stars: ✭ 99 (+125%)
Mutual labels:  information-retrieval, question-answering
cdQA-ui
⛔ [NOT MAINTAINED] A web interface for cdQA and other question answering systems.
Stars: ✭ 19 (-56.82%)
Mutual labels:  information-retrieval, question-answering
Cdqa
⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
Stars: ✭ 500 (+1036.36%)
Mutual labels:  information-retrieval, question-answering
Awesome Neural Models For Semantic Match
A curated list of papers dedicated to neural text (semantic) matching.
Stars: ✭ 669 (+1420.45%)
Mutual labels:  information-retrieval, question-answering
Bert Vietnamese Question Answering
Vietnamese question answering system with BERT
Stars: ✭ 57 (+29.55%)
Mutual labels:  information-retrieval, question-answering
Knowledge Graphs
A collection of research on knowledge graphs
Stars: ✭ 845 (+1820.45%)
Mutual labels:  information-retrieval, question-answering
Dan Jurafsky Chris Manning Nlp
My solution to the Natural Language Processing course taught by Dan Jurafsky and Chris Manning in Winter 2012.
Stars: ✭ 124 (+181.82%)
Mutual labels:  information-retrieval, question-answering
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+7647.73%)
Mutual labels:  information-retrieval, question-answering
FinBERT-QA
Financial Domain Question Answering with pre-trained BERT Language Model
Stars: ✭ 70 (+59.09%)
Mutual labels:  information-retrieval, question-answering
expmrc
ExpMRC: Explainability Evaluation for Machine Reading Comprehension
Stars: ✭ 58 (+31.82%)
Mutual labels:  question-answering
lets-quiz
A quiz website for organizing online quizzes and tests. It's built using the Python/Django and Bootstrap4 frameworks. 🤖
Stars: ✭ 165 (+275%)
Mutual labels:  question-answering
perke
A keyphrase extractor for Persian
Stars: ✭ 60 (+36.36%)
Mutual labels:  information-retrieval
ADNC
Advanced Differentiable Neural Computer (ADNC) with application to bAbI task and CNN RC task.
Stars: ✭ 90 (+104.55%)
Mutual labels:  question-answering
CONVEX
As far as we know, CONVEX is the first unsupervised method for conversational question answering over knowledge graphs. A demo and our benchmark (and more) can be found at
Stars: ✭ 24 (-45.45%)
Mutual labels:  question-answering
LuceneTutorial
A simple tutorial of Lucene for LIS 501 Introduction to Text Mining students at the University of Wisconsin-Madison (Fall 2021).
Stars: ✭ 62 (+40.91%)
Mutual labels:  information-retrieval
ImageRetrieval
Content Based Image Retrieval Techniques (e.g. knn, svm using MatLab GUI)
Stars: ✭ 51 (+15.91%)
Mutual labels:  information-retrieval

ProQA

A resource-efficient method for pretraining a dense corpus index for open-domain QA and information retrieval. Given a question, you can use this code to retrieve relevant paragraphs from Wikipedia and extract answers.

1. Set up the environment

conda create -n proqa -y python=3.6.9 && conda activate proqa
pip install -r requirements.txt

If you want to use mixed-precision training and your GPUs support fp16, follow the instructions in the NVIDIA Apex repository to install Apex.
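Whether --fp16 will actually help depends on the GPU. As a quick sanity check (a sketch, assuming PyTorch is already installed from requirements.txt), you can query the CUDA compute capability; tensor-core fp16 speedups start at compute capability 7.0 (Volta):

import torch

# Mixed precision requires a CUDA GPU; tensor cores need compute capability >= 7.0
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    print("fp16 should give a speedup" if major >= 7 else "fp16 may give little or no speedup")
else:
    print("No CUDA device found; train without --fp16")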

2. Download data (including the corpus, paragraphs paired with the generated questions, etc.)

gdown https://drive.google.com/uc?id=17IMQ5zzfkCNsTZNJqZI5KveoIsaG2ZDt && unzip data.zip
cd data && gdown https://drive.google.com/uc?id=1T1SntmAZxJ6QfNBN39KbAHcMw0JR5MwL

The data folder includes the QA datasets as well as the paragraph database nq_paras.db, which can be read with sqlite3. If the command-line download fails, please download the file in your browser instead.
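As a small illustration (not part of the repository), you can inspect the database from Python's standard sqlite3 module without assuming any table names:

import sqlite3

# Open the paragraph database shipped in data.zip
conn = sqlite3.connect("data/nq_paras.db")
cur = conn.cursor()

# Discover the tables it contains rather than assuming a schema
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [row[0] for row in cur.fetchall()]
print("tables:", tables)

# Peek at the columns of the first table
if tables:
    cur.execute(f"SELECT * FROM {tables[0]} LIMIT 1")
    print("columns:", [d[0] for d in cur.description])
conn.close()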

3. Use the pretrained index and models

Download the pretrained models and data from Google Drive:

gdown https://drive.google.com/uc?id=1fDRHsLk5emLqHSMkkoockoHjRSOEBaZw && unzip pretrained_models.zip

Test the retrieval performance before QA finetuning

  • First, encode all the questions as embeddings (using the WebQuestions test set for this example):
cd retrieval
CUDA_VISIBLE_DEVICES=0 python get_embed.py \
    --do_predict \
    --predict_batch_size 512 \
    --bert_model_name bert-base-uncased \
    --fp16 \
    --predict_file ../data/WebQuestions-test.txt \
    --init_checkpoint ../pretrained_models/retriever.pt \
    --is_query_embed \
    --embed_save_path ../data/wq_test_query_embed.npy
  • Retrieve the top-k (k=80) paragraphs from the corpus and evaluate recall with simple string matching:
python eval_retrieval.py ../data/WebQuestions-test.txt ../pretrained_models/para_embed.npy ../data/wq_test_query_embed.npy ../data/nq_paras.db

The arguments are the dataset file, the dense corpus index, the question embeddings, and the paragraph database. The results should look like:

Top 80 Recall for 2032 QA pairs: 0.7839566929133859 ...
Top 5 Recall for 2032 QA pairs: 0.5196850393700787 ...
Top 10 Recall for 2032 QA pairs: 0.610236220472441 ...
Top 20 Recall for 2032 QA pairs: 0.687007874015748 ...
Top 50 Recall for 2032 QA pairs: 0.7554133858267716 ...
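Conceptually, the evaluation scores every question against the dense index by inner product, takes the top k paragraphs, and counts a hit when a retrieved paragraph contains one of the gold answer strings. The sketch below illustrates that idea with numpy; the contains_answer helper and how paragraph texts and answers are fetched are assumptions for illustration, not eval_retrieval.py's actual interface:

import numpy as np

para_embed = np.load("pretrained_models/para_embed.npy")   # (num_paras, dim) dense corpus index
query_embed = np.load("data/wq_test_query_embed.npy")      # (num_questions, dim)

k = 80
# Inner-product similarity; for a Wikipedia-scale index, score in batches instead
scores = query_embed @ para_embed.T
topk = np.argsort(-scores, axis=1)[:, :k]                  # indices of the k best paragraphs per question

def contains_answer(paragraph, answers):
    # "Simple string matching": case-insensitive substring test
    text = paragraph.lower()
    return any(ans.lower() in text for ans in answers)

# recall@k = fraction of questions where any of its top-k paragraphs
# (looked up in nq_paras.db) passes contains_answer against the gold answers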

4. Retriever pretraining

Use a single pretraining file:

  • Under the retrieval directory:
cd retrieval
./train_retriever_single.sh

This script uses the unclustered data for pretraining and saves a checkpoint under retrieval/logs/. After a certain number of updates, pause training and use the following steps to cluster the data and continue training.

Use clustered data for pretraining:

Generate paragraph clusters

  • Generate the paragraph embeddings using the checkpoint from last step:
mkdir encodings
CUDA_VISIBLE_DEVICES=0 python get_embed.py --do_predict --prefix eval-para \
    --predict_batch_size 300 \
    --bert_model_name bert-base-uncased \
    --fp16 \
    --predict_file ../data/retrieve_train.txt \
    --init_checkpoint ../pretrained_models/retriever.pt \
    --embed_save_path encodings/train_para_embed.npy \
    --eval-workers 32
  • Generate clusters using the paragraph embeddings:
python group_paras.py

Clustering hyperparameters, such as the number of clusters, can be found in group_paras.py.
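The concrete algorithm lives in group_paras.py; as one plausible sketch of the idea, the paragraph embeddings can be grouped with mini-batch k-means (scikit-learn is used here only for illustration, and num_clusters is a made-up value):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Embeddings produced by get_embed.py in the previous step
embeds = np.load("encodings/train_para_embed.npy")

num_clusters = 100  # hypothetical; see group_paras.py for the real setting
kmeans = MiniBatchKMeans(n_clusters=num_clusters, batch_size=10000, random_state=0)
labels = kmeans.fit_predict(embeds)

# Group paragraph indices by cluster so pretraining batches can be drawn cluster-locally
clusters = {c: np.where(labels == c)[0] for c in range(num_clusters)}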

Pretraining using clusters

  • Then run the retrieval script:
./train_retriever_cluster.sh

5. QA finetuning

  • Generate the dense paragraph index under the "retrieval" directory: ./get_para_embed.sh
  • Finetune the pretrained model on the QA dataset under the "qa" directory: ./train_dense_qa.sh