
xwhan / ProQA

Licence: other
Progressively Pretrained Dense Corpus Index for Open-Domain QA and Information Retrieval

Programming Languages

python
shell

Projects that are alternatives of or similar to ProQA

cherche
📑 Neural Search
Stars: ✭ 196 (+345.45%)
Mutual labels:  information-retrieval, question-answering
COVID19-IRQA
No description or website provided.
Stars: ✭ 32 (-27.27%)
Mutual labels:  information-retrieval, question-answering
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+327.27%)
Mutual labels:  information-retrieval, question-answering
HAR
Code for WWW2019 paper "A Hierarchical Attention Retrieval Model for Healthcare Question Answering"
Stars: ✭ 22 (-50%)
Mutual labels:  information-retrieval, question-answering
Flexneuart
Flexible classic and NeurAl Retrieval Toolkit
Stars: ✭ 99 (+125%)
Mutual labels:  information-retrieval, question-answering
cdQA-ui
⛔ [NOT MAINTAINED] A web interface for cdQA and other question answering systems.
Stars: ✭ 19 (-56.82%)
Mutual labels:  information-retrieval, question-answering
Cdqa
⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
Stars: ✭ 500 (+1036.36%)
Mutual labels:  information-retrieval, question-answering
Awesome Neural Models For Semantic Match
A curated list of papers dedicated to neural text (semantic) matching.
Stars: ✭ 669 (+1420.45%)
Mutual labels:  information-retrieval, question-answering
Bert Vietnamese Question Answering
Vietnamese question answering system with BERT
Stars: ✭ 57 (+29.55%)
Mutual labels:  information-retrieval, question-answering
Knowledge Graphs
A collection of research on knowledge graphs
Stars: ✭ 845 (+1820.45%)
Mutual labels:  information-retrieval, question-answering
Dan Jurafsky Chris Manning Nlp
My solution to the Natural Language Processing course taught by Dan Jurafsky and Chris Manning in Winter 2012.
Stars: ✭ 124 (+181.82%)
Mutual labels:  information-retrieval, question-answering
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+7647.73%)
Mutual labels:  information-retrieval, question-answering
FinBERT-QA
Financial Domain Question Answering with pre-trained BERT Language Model
Stars: ✭ 70 (+59.09%)
Mutual labels:  information-retrieval, question-answering
expmrc
ExpMRC: Explainability Evaluation for Machine Reading Comprehension
Stars: ✭ 58 (+31.82%)
Mutual labels:  question-answering
lets-quiz
A quiz website for organizing online quizzes and tests. It's built using the Python/Django and Bootstrap4 frameworks. 🤖
Stars: ✭ 165 (+275%)
Mutual labels:  question-answering
perke
A keyphrase extractor for Persian
Stars: ✭ 60 (+36.36%)
Mutual labels:  information-retrieval
ADNC
Advanced Differentiable Neural Computer (ADNC) with application to bAbI task and CNN RC task.
Stars: ✭ 90 (+104.55%)
Mutual labels:  question-answering
CONVEX
As far as we know, CONVEX is the first unsupervised method for conversational question answering over knowledge graphs. A demo and our benchmark (and more) can be found at
Stars: ✭ 24 (-45.45%)
Mutual labels:  question-answering
LuceneTutorial
A simple tutorial of Lucene for LIS 501 Introduction to Text Mining students at the University of Wisconsin-Madison (Fall 2021).
Stars: ✭ 62 (+40.91%)
Mutual labels:  information-retrieval
ImageRetrieval
Content Based Image Retrieval Techniques (e.g. knn, svm using MatLab GUI)
Stars: ✭ 51 (+15.91%)
Mutual labels:  information-retrieval

ProQA

A resource-efficient method for pretraining a dense corpus index for open-domain QA and information retrieval. Given a question, you can use this code to retrieve relevant paragraphs from Wikipedia and extract answers.

1. Set up the environment

conda create -n proqa -y python=3.6.9 && conda activate proqa
pip install -r requirements.txt

If you want to use mixed-precision training and your GPUs support fp16, follow the instructions in the NVIDIA Apex repository to install Apex.
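Whether --fp16 will actually help depends on the GPU. As a quick sanity check (a sketch, assuming PyTorch is already installed from requirements.txt), you can query the CUDA compute capability; tensor-core fp16 speedups start at compute capability 7.0 (Volta):

import torch

# Mixed precision requires a CUDA GPU; tensor cores need compute capability >= 7.0
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    print("fp16 should give a speedup" if major >= 7 else "fp16 may give little or no speedup")
else:
    print("No CUDA device found; train without --fp16")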

2. Download data (including the corpus, paragraphs paired with the generated questions, etc.)

gdown https://drive.google.com/uc?id=17IMQ5zzfkCNsTZNJqZI5KveoIsaG2ZDt && unzip data.zip
cd data && gdown https://drive.google.com/uc?id=1T1SntmAZxJ6QfNBN39KbAHcMw0JR5MwL

The data folder includes the QA datasets as well as the paragraph database nq_paras.db, which can be read with sqlite3. If the command-line download fails, please download the file in your browser instead.
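As a small illustration (not part of the repository), you can inspect the database from Python's standard sqlite3 module without assuming any table names:

import sqlite3

# Open the paragraph database shipped in data.zip
conn = sqlite3.connect("data/nq_paras.db")
cur = conn.cursor()

# Discover the tables it contains rather than assuming a schema
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [row[0] for row in cur.fetchall()]
print("tables:", tables)

# Peek at the columns of the first table
if tables:
    cur.execute(f"SELECT * FROM {tables[0]} LIMIT 1")
    print("columns:", [d[0] for d in cur.description])
conn.close()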

3. Use the pretrained index and models

Download the pretrained models and data from Google Drive:

gdown https://drive.google.com/uc?id=1fDRHsLk5emLqHSMkkoockoHjRSOEBaZw && unzip pretrained_models.zip

Test the retrieval performance before QA finetuning

  • First, encode all the questions as embeddings (using the WebQuestions test set for this example):
cd retrieval
CUDA_VISIBLE_DEVICES=0 python get_embed.py \
    --do_predict \
    --predict_batch_size 512 \
    --bert_model_name bert-base-uncased \
    --fp16 \
    --predict_file ../data/WebQuestions-test.txt \
    --init_checkpoint ../pretrained_models/retriever.pt \
    --is_query_embed \
    --embed_save_path ../data/wq_test_query_embed.npy
  • Retrieve the top-k (k=80) paragraphs from the corpus and evaluate recall with simple string matching:
python eval_retrieval.py ../data/WebQuestions-test.txt ../pretrained_models/para_embed.npy ../data/wq_test_query_embed.npy ../data/nq_paras.db

The arguments are the dataset file, the dense corpus index, the question embeddings, and the paragraph database. The results should look like:

Top 80 Recall for 2032 QA pairs: 0.7839566929133859 ...
Top 5 Recall for 2032 QA pairs: 0.5196850393700787 ...
Top 10 Recall for 2032 QA pairs: 0.610236220472441 ...
Top 20 Recall for 2032 QA pairs: 0.687007874015748 ...
Top 50 Recall for 2032 QA pairs: 0.7554133858267716 ...
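Conceptually, the evaluation scores every question against the dense index by inner product, takes the top k paragraphs, and counts a hit when a retrieved paragraph contains one of the gold answer strings. The sketch below illustrates that idea with numpy; the contains_answer helper and how paragraph texts and answers are fetched are assumptions for illustration, not eval_retrieval.py's actual interface:

import numpy as np

para_embed = np.load("pretrained_models/para_embed.npy")   # (num_paras, dim) dense corpus index
query_embed = np.load("data/wq_test_query_embed.npy")      # (num_questions, dim)

k = 80
# Inner-product similarity; for a Wikipedia-scale index, score in batches instead
scores = query_embed @ para_embed.T
topk = np.argsort(-scores, axis=1)[:, :k]                  # indices of the k best paragraphs per question

def contains_answer(paragraph, answers):
    # "Simple string matching": case-insensitive substring test
    text = paragraph.lower()
    return any(ans.lower() in text for ans in answers)

# recall@k = fraction of questions where any of its top-k paragraphs
# (looked up in nq_paras.db) passes contains_answer against the gold answers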

4. Retriever pretraining

Use a single pretraining file:

  • Under the retrieval directory:
cd retrieval
./train_retriever_single.sh

This script uses the unclustered data for pretraining and saves a checkpoint under retrieval/logs/. After a certain number of updates, pause training and use the following steps to cluster the data and continue training.

Use clustered data for pretraining:

Generate paragraph clusters

  • Generate the paragraph embeddings using the checkpoint from last step:
mkdir encodings
CUDA_VISIBLE_DEVICES=0 python get_embed.py --do_predict --prefix eval-para \
    --predict_batch_size 300 \
    --bert_model_name bert-base-uncased \
    --fp16 \
    --predict_file ../data/retrieve_train.txt \
    --init_checkpoint ../pretrained_models/retriever.pt \
    --embed_save_path encodings/train_para_embed.npy \
    --eval-workers 32
  • Generate clusters using the paragraph embeddings:
python group_paras.py

Clustering hyperparameters, such as the number of clusters, can be found in group_paras.py.
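The concrete algorithm lives in group_paras.py; as one plausible sketch of the idea, the paragraph embeddings can be grouped with mini-batch k-means (scikit-learn is used here only for illustration, and num_clusters is a made-up value):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Embeddings produced by get_embed.py in the previous step
embeds = np.load("encodings/train_para_embed.npy")

num_clusters = 100  # hypothetical; see group_paras.py for the real setting
kmeans = MiniBatchKMeans(n_clusters=num_clusters, batch_size=10000, random_state=0)
labels = kmeans.fit_predict(embeds)

# Group paragraph indices by cluster so pretraining batches can be drawn cluster-locally
clusters = {c: np.where(labels == c)[0] for c in range(num_clusters)}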

Pretraining using clusters

  • Then run the retrieval script:
./train_retriever_cluster.sh

5. QA finetuning

  • Generate the dense paragraph index under the "retrieval" directory: ./get_para_embed.sh
  • Finetune the pretrained model on the QA dataset under the "qa" directory: ./train_dense_qa.sh