
jingtaozhan / DRhard

License: BSD-3-Clause
SIGIR'21: Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track.

Programming Languages

python
shell

Projects that are alternatives of or similar to DRhard

JPQ
CIKM'21: JPQ substantially improves the efficiency of Dense Retrieval with 30x compression ratio, 10x CPU speedup and 2x GPU speedup.
Stars: ✭ 39 (-58.06%)
Mutual labels:  information-retrieval, web-search
awesome-pretrained-models-for-information-retrieval
A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a., pretraining for IR).
Stars: ✭ 278 (+198.92%)
Mutual labels:  information-retrieval, web-search
netizenship
A command-line OSINT tool to find the online presence of a username in popular social media websites like Facebook, Instagram, Twitter, etc.
Stars: ✭ 33 (-64.52%)
Mutual labels:  information-retrieval
GNN-Recommender-Systems
An index of recommendation algorithms that are based on Graph Neural Networks.
Stars: ✭ 505 (+443.01%)
Mutual labels:  information-retrieval
MixGCF
MixGCF: An Improved Training Method for Graph Neural Network-based Recommender Systems, KDD2021
Stars: ✭ 73 (-21.51%)
Mutual labels:  information-retrieval
COVID19-IRQA
No description or website provided.
Stars: ✭ 32 (-65.59%)
Mutual labels:  information-retrieval
kex
Kex is a python library for unsupervised keyword extraction from a document, providing an easy interface and benchmarks on 15 public datasets.
Stars: ✭ 46 (-50.54%)
Mutual labels:  information-retrieval
LuceneTutorial
A simple tutorial of Lucene for LIS 501 Introduction to Text Mining students at the University of Wisconsin-Madison (Fall 2021).
Stars: ✭ 62 (-33.33%)
Mutual labels:  information-retrieval
tutorials
A tutorial series by Preferred.AI
Stars: ✭ 136 (+46.24%)
Mutual labels:  information-retrieval
EMNLP2020
This is official Pytorch code and datasets of the paper "Where Are the Facts? Searching for Fact-checked Information to Alleviate the Spread of Fake News", EMNLP 2020.
Stars: ✭ 55 (-40.86%)
Mutual labels:  information-retrieval
ml-nlp-services
Machine learning, deep learning, and natural language processing
Stars: ✭ 23 (-75.27%)
Mutual labels:  information-retrieval
BERT-QE
Code and resources for the paper "BERT-QE: Contextualized Query Expansion for Document Re-ranking".
Stars: ✭ 43 (-53.76%)
Mutual labels:  information-retrieval
rust-stemmers
A rust implementation of some popular snowball stemming algorithms
Stars: ✭ 85 (-8.6%)
Mutual labels:  information-retrieval
HAR
Code for WWW2019 paper "A Hierarchical Attention Retrieval Model for Healthcare Question Answering"
Stars: ✭ 22 (-76.34%)
Mutual labels:  information-retrieval
ml4ir
Machine Learning for Information Retrieval
Stars: ✭ 75 (-19.35%)
Mutual labels:  information-retrieval
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
Stars: ✭ 738 (+693.55%)
Mutual labels:  information-retrieval
naacl2018-fever
Fact Extraction and VERification baseline published in NAACL2018
Stars: ✭ 109 (+17.2%)
Mutual labels:  information-retrieval
3d model retriever
Experimenting with a newly published deep learning paper and how it can be used for content-based 3D model retrieval. (info retrieval for CAD)
Stars: ✭ 45 (-51.61%)
Mutual labels:  information-retrieval
BM25Transformer
(Python) transform a document-term matrix to an Okapi/BM25 representation
Stars: ✭ 50 (-46.24%)
Mutual labels:  information-retrieval
src
tools for fast reading of docs
Stars: ✭ 40 (-56.99%)
Mutual labels:  information-retrieval

Optimizing Dense Retrieval Model Training with Hard Negatives

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma

This repo provides code, retrieval results, and trained models for our SIGIR Full paper Optimizing Dense Retrieval Model Training with Hard Negatives. The previous version is Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently.

We achieve very impressive retrieval results on both passage and document retrieval benchmarks. The two proposed algorithms (STAR and ADORE) are very efficient. IMHO, they are well worth trying and will most likely improve your retriever's performance by a large margin.

The following figure shows the pros and cons of different training methods. You can train an effective Dense Retrieval model in three steps. First, warm up your model using random negatives or BM25 top negatives. Second, use our proposed STAR to train both the query encoder and document encoder. Third, use our proposed ADORE to train the query encoder.

Retrieval Results and Trained Models

Passage Retrieval   | Dev MRR@10 | Dev R@100 | Test NDCG@10 | Files
Inbatch-Neg         | 0.264      | 0.837     | 0.583        | Model
Rand-Neg            | 0.301      | 0.853     | 0.612        | Model
STAR                | 0.340      | 0.867     | 0.642        | Model, Train, Dev, TRECTest
ADORE (Inbatch-Neg) | 0.316      | 0.860     | 0.658        | Model
ADORE (Rand-Neg)    | 0.326      | 0.865     | 0.661        | Model
ADORE (STAR)        | 0.347      | 0.876     | 0.683        | Model, Train, Dev, TRECTest, Leaderboard

Doc Retrieval       | Dev MRR@100 | Dev R@100 | Test NDCG@10 | Files
Inbatch-Neg         | 0.320       | 0.864     | 0.544        | Model
Rand-Neg            | 0.330       | 0.859     | 0.572        | Model
STAR                | 0.390       | 0.867     | 0.605        | Model, Train, Dev, TRECTest
ADORE (Inbatch-Neg) | 0.362       | 0.884     | 0.580        | Model
ADORE (Rand-Neg)    | 0.361       | 0.885     | 0.585        | Model
ADORE (STAR)        | 0.405       | 0.919     | 0.628        | Model, Train, Dev, TRECTest, Leaderboard

If you want to use our first-stage leaderboard runs, contact me and I will send you the file.

If any links fail or the files are corrupted, please contact me or open an issue.

Requirements

torch
transformers
faiss-gpu 
tensorboard
boto3

Data Download

To download all the needed data, run:

bash download_data.sh

Data Preprocess

Run the following commands:

python preprocess.py --data_type 0; python preprocess.py --data_type 1

Note: We used the Transformers 2.x version to tokenize text when we conducted this research. However, in Transformers 3.x and 4.x, the RobertaTokenizer behaves differently. To support REPRODUCIBILITY, we copied the RobertaTokenizer source code from version 2.x into star_tokenizer.py. During preprocessing, we use from star_tokenizer import RobertaTokenizer instead of from transformers import RobertaTokenizer. You also need to do this if you use our trained models on other datasets.
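For example, a minimal sketch of the pinned-tokenizer usage (the sample text and max_length here are illustrative, assuming a roberta-base vocabulary):

# Use the 2.x-era tokenizer bundled with this repo so token ids match
# the ones the trained models were built on.
from star_tokenizer import RobertaTokenizer  # not: from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
token_ids = tokenizer.encode("what is dense retrieval", max_length=32)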

Inference

With our provided trained models, you can easily replicate our reported experimental results. Note that minor variance may be observed due to environment differences.

STAR

The following commands use the provided STAR model to compute query/passage embeddings and perform similarity search on the dev set. (You can pass the --faiss_gpus option to use GPUs for much faster similarity search.)

python ./star/inference.py --data_type passage --max_doc_length 256 --mode dev   
python ./star/inference.py --data_type doc --max_doc_length 512 --mode dev   
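Under the hood, this is an exact (brute-force) inner-product search over the saved embedding matrix. A minimal sketch of the idea, assuming 768-dimensional float32 embeddings stored row-wise in the memmap at the path used later in this README (the stand-in query embeddings are illustrative):

import faiss
import numpy as np

dim = 768  # assumed hidden size of the RoBERTa-base encoders
passages = np.memmap("./data/passage/evaluate/star/passages.memmap",
                     dtype=np.float32, mode="r").reshape(-1, dim)
queries = np.random.rand(5, dim).astype(np.float32)  # stand-in query embeddings

index = faiss.IndexFlatIP(dim)             # exact inner-product search
index.add(np.ascontiguousarray(passages))  # loads embeddings into the index
scores, pids = index.search(queries, 100)  # top-100 passage ids per query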

Run the following command to evaluate on the MSMARCO Passage dataset.

python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/star/dev.rank.tsv
Eval Started
#####################
MRR @10: 0.3404237731386721
QueriesRanked: 6980
#####################

Run the following command to evaluate on the MSMARCO Document dataset.

python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/evaluate/star/dev.rank.tsv 100
Eval Started
#####################
MRR @100: 0.3903422772218344
QueriesRanked: 5193
#####################
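For reference, MRR@k averages the reciprocal rank of the first relevant document over all queries, counting 0 for queries with no relevant document in the top k. A minimal sketch (not the official msmarco_eval.py implementation):

def mrr_at_k(rankings, qrels, k=10):
    # rankings: {qid: [docid, ...]} ordered best-first
    # qrels:    {qid: set of relevant docids}
    total = 0.0
    for qid, docids in rankings.items():
        for rank, docid in enumerate(docids[:k], start=1):
            if docid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(rankings)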

ADORE

ADORE computes only the query embeddings; the document embeddings are pre-computed by another DR model, such as STAR. The following commands use the provided ADORE(STAR) model to compute query embeddings and perform similarity search on the dev set. (You can pass the --faiss_gpus option to use GPUs for much faster similarity search.)

python ./adore/inference.py --model_dir ./data/passage/trained_models/adore-star --output_dir ./data/passage/evaluate/adore-star --preprocess_dir ./data/passage/preprocess --mode dev --dmemmap_path ./data/passage/evaluate/star/passages.memmap
python ./adore/inference.py --model_dir ./data/doc/trained_models/adore-star --output_dir ./data/doc/evaluate/adore-star --preprocess_dir ./data/doc/preprocess --mode dev --dmemmap_path ./data/doc/evaluate/star/passages.memmap

Evaluate the ADORE(STAR) model on the dev passage dataset:

python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/adore-star/dev.rank.tsv

You will get

Eval Started
#####################
MRR @10: 0.34660697230181425
QueriesRanked: 6980
#####################

Evaluate the ADORE(STAR) model on the dev document dataset:

python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/evaluate/adore-star/dev.rank.tsv 100

You will get

Eval Started
#####################
MRR @100: 0.4049777020859768
QueriesRanked: 5193
#####################

Convert QID/PID Back

Our data preprocessing assigns new ids to each query and document. Therefore, you may want to convert the ids back. We provide a script for this.

The following commands convert ADORE(STAR)'s ranking results on the dev passage dataset back to the official ids and then evaluate them.

python ./cvt_back.py --input_dir ./data/passage/evaluate/adore-star/ --preprocess_dir ./data/passage/preprocess --output_dir ./data/passage/official_runs/adore-star --mode dev --dataset passage
python ./msmarco_eval.py ./data/passage/dataset/qrels.dev.small.tsv ./data/passage/official_runs/adore-star/dev.rank.tsv

You will get

Eval Started
#####################
MRR @10: 0.34660697230181425
QueriesRanked: 6980
#####################
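Conceptually, the conversion is just a lookup from the contiguous preprocessed ids back to the official MSMARCO ids. A sketch of the idea (the mapping file names and formats here are hypothetical; cvt_back.py reads the real ones from the preprocess directory):

# Hypothetical mapping files: the official id on line i corresponds to
# reassigned id i from preprocessing.
qid_map = [line.strip() for line in open("qid_map.txt")]  # hypothetical file
pid_map = [line.strip() for line in open("pid_map.txt")]  # hypothetical file

with open("dev.rank.tsv") as fin, open("dev.rank.official.tsv", "w") as fout:
    for line in fin:
        qid, pid, rank = line.split()
        fout.write(f"{qid_map[int(qid)]}\t{pid_map[int(pid)]}\t{rank}\n")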

Train

In the following instructions, we show how to replicate our experimental results on the MSMARCO Passage Retrieval task.

STAR

We use the same warmup model as ANCE, the most competitive baseline, to enable a fair comparison. Please download it and extract it to ./data/passage/warmup.

Next, we use this warmup model to extract static hard negatives, which will be utilized by STAR.

python ./star/prepare_hardneg.py \
--data_type passage \
--max_query_length 32 \
--max_doc_length 256 \
--mode dev \
--topk 200

It will automatically use all available GPUs to retrieve documents. If the total available CUDA memory is less than 26GB (the index size), you can add --not_faiss_cuda to use the CPU for retrieval instead.
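In other words, the static hard negatives are simply the warmup model's top-ranked passages that are not labeled relevant. A sketch of that filtering step (function and variable names are illustrative):

def select_hard_negatives(retrieved, qrels, k=200):
    # retrieved: {qid: [pid, ...]} top passages from the warmup model
    # qrels:     {qid: set of positive pids}
    return {qid: [pid for pid in pids[:k] if pid not in qrels.get(qid, set())]
            for qid, pids in retrieved.items()}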

Run the following command to train the DR model with STAR. In our experiments, we use only one GPU for training.

python ./star/train.py --do_train \
    --max_query_length 24 \
    --max_doc_length 120 \
    --preprocess_dir ./data/passage/preprocess \
    --hardneg_path ./data/passage/warmup_retrieve/hard.json \
    --init_path ./data/passage/warmup \
    --output_dir ./data/passage/star_train/models \
    --logging_dir ./data/passage/star_train/log \
    --optimizer_str lamb \
    --learning_rate 1e-4 \
    --gradient_checkpointing --fp16

Although the script sets the number of training epochs to a very large value, training is likely to converge within 50k steps (about 1.5 days), and you can manually kill the process at that point. Using multiple GPUs should speed this up considerably, but requires some changes to the code.
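At a high level, STAR combines the static hard negatives with in-batch random negatives in one contrastive objective. A simplified sketch of that idea, assuming one hard negative per query (the actual training code differs in details such as the pairwise loss and the LAMB optimizer):

import torch
import torch.nn.functional as F

def star_style_loss(q, pos, hard_neg):
    # q, pos, hard_neg: [batch, dim] embeddings of queries, positive
    # passages, and static hard negatives.
    hard_scores = (q * hard_neg).sum(-1, keepdim=True)     # [batch, 1]
    inbatch_scores = q @ pos.t()  # other queries' positives act as random negatives
    scores = torch.cat([hard_scores, inbatch_scores], dim=1)
    labels = torch.arange(q.size(0), device=q.device) + 1  # column of own positive
    return F.cross_entropy(scores, labels)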

ADORE

Now we show how to use ADORE to finetune the query encoder. Here we use our provided STAR checkpoint as the fixed document encoder. You can also use another document encoder.

The passage embeddings from STAR should be located at ./data/passage/evaluate/star/passages.memmap. If not, follow the STAR inference procedure shown above.

python ./adore/train.py \
--metric_cut 200 \
--init_path ./data/passage/trained_models/star \
--pembed_path ./data/passage/evaluate/star/passages.memmap \
--model_save_dir ./data/passage/adore_train/models \
--log_dir ./data/passage/adore_train/log \
--preprocess_dir ./data/passage/preprocess \
--model_gpu_index 0 \
--faiss_gpu_index 1 2 3

The above command uses the first GPU for encoding and the second through fourth GPUs for dense retrieval. You can change the faiss_gpu_index values based on your available CUDA memory. For example, if you have a 32GB GPU, you can set both model_gpu_index and faiss_gpu_index to 0 because the CUDA memory is large enough. But if you only have 11GB GPUs, three GPUs are required for Faiss.

Empirically, ADORE significantly improves retrieval performance after training for only one epoch, which takes only about an hour when using GPUs to retrieve dynamic hard negatives.
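For intuition, each ADORE step retrieves the current top-k passages with the up-to-date query encoder and treats them as dynamic hard negatives; only the query encoder receives gradients. A simplified sketch of one step (a softmax loss stands in for the ranking objective used in the paper, and the passage embeddings are assumed to live on the same device as the queries):

import torch
import torch.nn.functional as F

def adore_step(query_encoder, optimizer, query_batch, pos_embeds,
               index, passage_embeds, k=200):
    # query_batch: tokenized queries; pos_embeds: [batch, dim] fixed embeddings
    # of each query's relevant passage; index: faiss index over the FIXED
    # passage embeddings; passage_embeds: the same matrix as a torch tensor.
    q = query_encoder(**query_batch)                     # [batch, dim]
    _, topk = index.search(q.detach().cpu().numpy(), k)  # dynamic hard negatives
    neg = passage_embeds[torch.as_tensor(topk, device=q.device)]  # [batch, k, dim]
    # Simplification: the positive may also appear among the retrieved top-k.
    pos_scores = (q * pos_embeds).sum(-1, keepdim=True)  # [batch, 1]
    neg_scores = torch.einsum("bd,bkd->bk", q, neg)      # [batch, k]
    scores = torch.cat([pos_scores, neg_scores], dim=1)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(scores, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()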
