
zh-zheng / BERT-QE

License: Apache-2.0
Code and resources for the paper "BERT-QE: Contextualized Query Expansion for Document Re-ranking".

Programming Languages

python
shell

Projects that are alternatives to or similar to BERT-QE

text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+337.21%)
Mutual labels:  information-retrieval, bert
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
Stars: ✭ 738 (+1616.28%)
Mutual labels:  information-retrieval, bert
FinBERT-QA
Financial Domain Question Answering with pre-trained BERT Language Model
Stars: ✭ 70 (+62.79%)
Mutual labels:  information-retrieval, bert
SWDM
SIGIR 2017: Embedding-based query expansion for weighted sequential dependence retrieval model
Stars: ✭ 35 (-18.6%)
Mutual labels:  information-retrieval, query-expansion
cdQA-ui
⛔ [NOT MAINTAINED] A web interface for cdQA and other question answering systems.
Stars: ✭ 19 (-55.81%)
Mutual labels:  information-retrieval, bert
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+7827.91%)
Mutual labels:  information-retrieval, bert
gpl
Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577
Stars: ✭ 216 (+402.33%)
Mutual labels:  information-retrieval, bert
awesome-pretrained-models-for-information-retrieval
A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a., pretraining for IR).
Stars: ✭ 278 (+546.51%)
Mutual labels:  information-retrieval
R-AT
Regularized Adversarial Training
Stars: ✭ 19 (-55.81%)
Mutual labels:  bert
solr
Apache Solr open-source search software
Stars: ✭ 651 (+1413.95%)
Mutual labels:  information-retrieval
BERTOverflow
A Pre-trained BERT on StackOverflow Corpus
Stars: ✭ 40 (-6.98%)
Mutual labels:  bert
rust-stemmers
A rust implementation of some popular snowball stemming algorithms
Stars: ✭ 85 (+97.67%)
Mutual labels:  information-retrieval
TriB-QA
We are serious about bragging.
Stars: ✭ 45 (+4.65%)
Mutual labels:  bert
BERT-chinese-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained BERT model for text classification.
Stars: ✭ 92 (+113.95%)
Mutual labels:  bert
COVID19-IRQA
No description or website provided.
Stars: ✭ 32 (-25.58%)
Mutual labels:  information-retrieval
LuceneTutorial
A simple tutorial of Lucene for LIS 501 Introduction to Text Mining students at the University of Wisconsin-Madison (Fall 2021).
Stars: ✭ 62 (+44.19%)
Mutual labels:  information-retrieval
ProQA
Progressively Pretrained Dense Corpus Index for Open-Domain QA and Information Retrieval
Stars: ✭ 44 (+2.33%)
Mutual labels:  information-retrieval
bert attn viz
Visualize BERT's self-attention layers on text classification tasks
Stars: ✭ 41 (-4.65%)
Mutual labels:  bert
netizenship
a commandline #OSINT tool to find the online presence of a username in popular social media websites like Facebook, Instagram, Twitter, etc.
Stars: ✭ 33 (-23.26%)
Mutual labels:  information-retrieval
TabFormer
Code & Data for "Tabular Transformers for Modeling Multivariate Time Series" (ICASSP, 2021)
Stars: ✭ 209 (+386.05%)
Mutual labels:  bert

BERT-QE

This repo contains the code and resources for the paper:

BERT-QE: Contextualized Query Expansion for Document Re-ranking. In Findings of ACL: EMNLP 2020.

Introduction

BERT-QE leverages the strength of BERT to select relevant document chunks for query expansion. The BERT-QE model consists of three phases, in which BERT models of different sizes can be used to balance effectiveness and efficiency. Some experimental results on Robust04 are listed below:

Model         FLOPs    P@20     NDCG@20   MAP
BERT-Large    1.00x    0.4769   0.5397    0.3743
BERT-QE-LLL   11.19x   0.4888   0.5533    0.3865
BERT-QE-LMT   1.03x    0.4839   0.5483    0.3765
BERT-QE-LLS   1.30x    0.4869   0.5501    0.3798
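
To make the three phases concrete, below is a minimal sketch of the re-ranking logic, not the repository's actual pipeline: bert_score(text_a, text_b) is a hypothetical stand-in for a fine-tuned BERT relevance scorer, and the chunk length m, the numbers of selected documents and chunks (k_d, k_c), and the interpolation weight alpha are illustrative placeholders rather than the settings used in the paper.

# Minimal sketch of the three BERT-QE phases (hypothetical helper and
# hyper-parameter values; see the paper and run.sh for the real pipeline).
import numpy as np

def bert_score(text_a: str, text_b: str) -> float:
    """Hypothetical relevance score of text_b for text_a from a fine-tuned BERT."""
    raise NotImplementedError

def bert_qe_rerank(query, docs, k_d=10, k_c=10, m=100, alpha=0.4):
    # Phase 1: re-rank the initial candidates with BERT.
    phase1 = sorted(docs, key=lambda d: bert_score(query, d), reverse=True)

    # Phase 2: split the top k_d documents into m-word chunks and keep the
    # k_c chunks that BERT scores highest against the query.
    chunks = []
    for doc in phase1[:k_d]:
        words = doc.split()
        chunks += [" ".join(words[i:i + m]) for i in range(0, len(words), m)]
    chunk_scores = {c: bert_score(query, c) for c in chunks}
    top_chunks = sorted(set(chunks), key=chunk_scores.get, reverse=True)[:k_c]

    # Phase 3: score each document against the selected chunks, weight the
    # chunks by a softmax over their query scores, and interpolate with the
    # original query-document score.
    weights = np.exp([chunk_scores[c] for c in top_chunks])
    weights /= weights.sum()

    def final_score(doc):
        qe = sum(w * bert_score(c, doc) for w, c in zip(weights, top_chunks))
        return (1 - alpha) * bert_score(query, doc) + alpha * qe

    return sorted(phase1, key=final_score, reverse=True)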

Requirements

We recommend installing Anaconda. Then install the required packages with conda:

conda install --yes --file requirements.txt

NOTE: the experiments in the paper were run on a TPU. Alternatively, you can use GPUs and install tensorflow-gpu (see requirements.txt).

Getting Started

In this repo, we provide instructions on how to run BERT-QE on the Robust04 and GOV2 datasets.

Data preparation

You need to obtain the Robust04 and GOV2 collections.

The relevant directory structure of Robust04:

disk4/
├── FR94
└── FT

disk5/
├── FBIS
└── LATIMES

The directory structure of GOV2:

gov2/
├── GX000
├── GX001
├── ...
└── GX272

Preprocess

To preprocess the datasets, you need to specify in config.py the root path of each collection and the output path where the processed data will be written, e.g. robust04_collection_path and robust04_output_path for Robust04. Since the collections are huge, you can choose to process only the documents that appear in the initial ranking.

For example, given an initial ranking Robust04_DPH_KL.res, extract all unique document ids by:

cut -d ' ' -f3 Robust04_DPH_KL.res | sort | uniq > robust04_docno_list.txt

Then assign its path to robust04_docno_list in config.py.
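
For orientation, the relevant entries in config.py could look roughly as follows; robust04_collection_path, robust04_output_path, and robust04_docno_list are the names mentioned above, while the GOV2 variable names and all paths are illustrative placeholders.

# config.py (excerpt) -- illustrative paths; adjust to your environment.
robust04_collection_path = "/data/collections/robust04"          # contains disk4/ and disk5/
robust04_output_path = "/data/processed/robust04"                # processed files are written here
robust04_docno_list = "/data/processed/robust04_docno_list.txt"  # document ids extracted above

# GOV2 names are assumed analogous; check config.py for the actual variables.
gov2_collection_path = "/data/collections/gov2"                  # contains GX000 ... GX272
gov2_output_path = "/data/processed/gov2"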

You can now preprocess Robust04 and GOV2 using robust04_preprocess.py and gov2_preprocess.py, respectively. Finally, you need to merge all the processed text files into a single file, which will be used as dataset_file in run.sh.

For Robust04:

cat ${robust04_output_path}/* > robust04_collection.txt

Since titles are available in Robust04, the output file has three columns: the document id, the title, and the document text.

For GOV2:

cat ${gov2_output_path}/*/*.txt > gov2_collection.txt

The output file has two columns: the document id and the document text.
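
As a quick sanity check on the merged files, a small reader like the one below can print the first few records; it assumes the columns are tab-separated, which you should verify against the output of the preprocessing scripts.

# Preview the merged collection files (column separator assumed to be a tab;
# verify against the output of robust04_preprocess.py / gov2_preprocess.py).
def preview(path, n=3, has_title=False):
    with open(path, encoding="utf-8") as f:
        for _, line in zip(range(n), f):
            parts = line.rstrip("\n").split("\t")
            if has_title:
                doc_id, title, text = parts[0], parts[1], " ".join(parts[2:])
                print(doc_id, "|", title, "|", text[:80])
            else:
                doc_id, text = parts[0], " ".join(parts[1:])
                print(doc_id, "|", text[:80])

preview("robust04_collection.txt", has_title=True)  # docid, title, text
preview("gov2_collection.txt")                      # docid, text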

Training and evaluation

We first need to fine-tune the BERT models of different sizes from the BERT repo on the MS MARCO collection. For details, please refer to dl4marco-bert. After fine-tuning the models, you should specify the paths of the fine-tuned BERT checkpoints and the config files (bert_config.json) in config.py. If you want to skip this step, you can refer to PARADE and dl4marco-bert (for BERT-Large) to download the trained checkpoints.

Then we continue to fine-tune the BERT models on the target dataset, i.e. Robust04 or GOV2, and select chunks to perform query expansion. You can download our cross-validation partitions from here and the TREC evaluation script from here, then set cv_folder_path and trec_eval_script_path in config.py. The last step is to fill in the configurations in run.sh (see its comments for instructions) and run

bash run.sh

The training and evaluation of BERT-QE will be conducted automatically!
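
For reference, the config.py entries mentioned in this section might look roughly like the following; cv_folder_path and trec_eval_script_path are the names used above, while the checkpoint and bert_config.json variable names shown here are only illustrative.

# config.py (excerpt) -- only cv_folder_path and trec_eval_script_path are
# names taken from the instructions above; the rest are illustrative.
msmarco_bert_large_checkpoint = "/path/to/msmarco/bert-large/model.ckpt"    # hypothetical name
msmarco_bert_large_config = "/path/to/msmarco/bert-large/bert_config.json"  # hypothetical name

cv_folder_path = "/path/to/cross-validation-partitions"
trec_eval_script_path = "/path/to/trec_eval"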

NOTE: if you plan to use BERT models of different sizes in the three phases (e.g. BERT-QE-LMT), you first need to fine-tune each of those models on the target dataset. Specifically, set first_model_size and run the code before line 69 (i.e. before "phase2") in run.sh once for each model.

Resources

We release the run files, the BERT models fine-tuned on the two collections, and the cross-validation partitions to help the community reproduce our results.

  • Fine-tuned BERT models (incl. bert_config.json)
Model          Robust04    GOV2
BERT-Large     Download    Download
BERT-Base      Download    Download
BERT-Medium    Download    Download
BERT-Small*    Download    Download
BERT-Tiny      Download    Download

* Note that BERT-Small here corresponds to BERT-Mini in the BERT repo; we renamed it for convenience of description in the paper.

Usage: taking BERT-Large fine-tuned on Robust04 as an example, first unzip all the fold-*.zip files, then rename the root folder from BERT-Large-Robust04 to large and put it in the directory ${main_path}/robust04/model/. Note that main_path is specified in run.sh.
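
If you prefer to script this step, a sketch along the following lines would do it, assuming the archives unpack into a common BERT-Large-Robust04/ root as described above; main_path must match the value in run.sh.

# Unpack the fold-*.zip archives and install the checkpoints as
# ${main_path}/robust04/model/large (archive layout assumed as described above).
import glob, os, shutil, zipfile

main_path = "/path/to/main"  # must match main_path in run.sh

for archive in glob.glob("fold-*.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(".")   # assumed to create/extend a BERT-Large-Robust04/ folder

dest = os.path.join(main_path, "robust04", "model", "large")
os.makedirs(os.path.dirname(dest), exist_ok=True)
shutil.move("BERT-Large-Robust04", dest)  # rename the root folder to "large"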

Citation

If you use our code or resources, please cite this paper:

@inproceedings{zheng-etal-2020-bert,
    title = "{BERT-QE}: {C}ontextualized {Q}uery {E}xpansion for {D}ocument {R}e-ranking",
    author = "Zheng, Zhi  and
      Hui, Kai  and
      He, Ben  and
      Han, Xianpei  and
      Sun, Le  and
      Yates, Andrew",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.424",
    pages = "4718--4728"
}

Acknowledgement

Some snippets of the code are borrowed from dl4marco-bert and NPRF.
