All Projects → UKPLab → gpl

UKPLab / gpl

Licence: Apache-2.0 license
Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to gpl

text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (-12.96%)
Mutual labels:  information-retrieval, transformers, bert
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+1478.24%)
Mutual labels:  information-retrieval, transformers, bert
pqlite
⚡ A fast embedded library for approximate nearest neighbor search
Stars: ✭ 141 (-34.72%)
Mutual labels:  information-retrieval, vector-search
Clue
中文语言理解测评基准 Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+1022.69%)
Mutual labels:  transformers, bert
Nlp Architect
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks
Stars: ✭ 2,768 (+1181.48%)
Mutual labels:  transformers, bert
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-89.81%)
Mutual labels:  transformers, bert
Tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Stars: ✭ 5,077 (+2250.46%)
Mutual labels:  transformers, bert
Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+1385.65%)
Mutual labels:  transformers, bert
OpenDialog
An Open-Source Package for Chinese Open-domain Conversational Chatbot (中文闲聊对话系统,一键部署微信闲聊机器人)
Stars: ✭ 94 (-56.48%)
Mutual labels:  transformers, bert
BERT-QE
Code and resources for the paper "BERT-QE: Contextualized Query Expansion for Document Re-ranking".
Stars: ✭ 43 (-80.09%)
Mutual labels:  information-retrieval, bert
cherche
📑 Neural Search
Stars: ✭ 196 (-9.26%)
Mutual labels:  information-retrieval, vector-search
HugsVision
HugsVision is a easy to use huggingface wrapper for state-of-the-art computer vision
Stars: ✭ 154 (-28.7%)
Mutual labels:  transformers, bert
erc
Emotion recognition in conversation
Stars: ✭ 34 (-84.26%)
Mutual labels:  transformers, bert
Fast Bert
Super easy library for BERT based NLP models
Stars: ✭ 1,678 (+676.85%)
Mutual labels:  transformers, bert
bangla-bert
Bangla-Bert is a pretrained bert model for Bengali language
Stars: ✭ 41 (-81.02%)
Mutual labels:  transformers, bert
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+1065.74%)
Mutual labels:  transformers, bert
text2class
Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT
Stars: ✭ 15 (-93.06%)
Mutual labels:  transformers, bert
classy
classy is a simple-to-use library for building high-performance Machine Learning models in NLP.
Stars: ✭ 61 (-71.76%)
Mutual labels:  transformers, bert
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
Stars: ✭ 738 (+241.67%)
Mutual labels:  information-retrieval, bert
cdQA-ui
⛔ [NOT MAINTAINED] A web interface for cdQA and other question answering systems.
Stars: ✭ 19 (-91.2%)
Mutual labels:  information-retrieval, bert

Generative Pseudo Labeling (GPL)

GPL is an unsupervised domain adaptation method for training dense retrievers. It is based on query generation and pseudo labeling with powerful cross-encoders. To train a domain-adapted model, it needs only the unlabeled target corpus and can achieve significant improvement over zero-shot models.

For more information, checkout our publication:

For reproduction, please refer to this snapshot branch.

Installation

One can either install GPL via pip

pip install gpl

or via git clone

git clone https://github.com/UKPLab/gpl.git && cd gpl
pip install -e .

Meanwhile, please make sure the correct version of PyTorch has been installed according to your CUDA version.

Usage

GPL accepts data in the BeIR-format. For example, we can download the FiQA dataset hosted by BeIR:

wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip
unzip fiqa.zip
head -n 2 fiqa/corpus.jsonl  # One can check this data format. Actually GPL only need this `corpus.jsonl` as data input for training.

Then we can either use the python -m function to run GPL training directly:

export dataset="fiqa"
python -m gpl.train \
    --path_to_generated_data "generated/$dataset" \
    --base_ckpt "distilbert-base-uncased" \
    --gpl_score_function "dot" \
    --batch_size_gpl 32 \
    --gpl_steps 140000 \
    --new_size -1 \
    --queries_per_passage -1 \
    --output_dir "output/$dataset" \
    --evaluation_data "./$dataset" \
    --evaluation_output "evaluation/$dataset" \
    --generator "BeIR/query-gen-msmarco-t5-base-v1" \
    --retrievers "msmarco-distilbert-base-v3" "msmarco-MiniLM-L-6-v3" \
    --retriever_score_functions "cos_sim" "cos_sim" \
    --cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2" \
    --qgen_prefix "qgen" \
    --do_evaluation \
    # --use_amp   # Use this for efficient training if the machine supports AMP

# One can run `python -m gpl.train --help` for the information of all the arguments
# To reproduce the experiments in the paper, set `base_ckpt` to "GPL/msmarco-distilbert-margin-mse" (https://huggingface.co/GPL/msmarco-distilbert-margin-mse)

or import GPL's trainining method in a python script:

import gpl

dataset = 'fiqa'
gpl.train(
    path_to_generated_data=f"generated/{dataset}",
    base_ckpt="distilbert-base-uncased",  
    # base_ckpt='GPL/msmarco-distilbert-margin-mse',  
    # The starting checkpoint of the experiments in the paper
    gpl_score_function="dot",
    # Note that GPL uses MarginMSE loss, which works with dot-product
    batch_size_gpl=32,
    gpl_steps=140000,
    new_size=-1,
    # Resize the corpus to `new_size` (|corpus|) if needed. When set to None (by default), the |corpus| will be the full size. When set to -1, the |corpus| will be set automatically: If QPP * |corpus| <= 250K, |corpus| will be the full size; else QPP will be set 3 and |corpus| will be set to 250K / 3
    queries_per_passage=-1,
    # Number of Queries Per Passage (QPP) in the query generation step. When set to -1 (by default), the QPP will be chosen automatically: If QPP * |corpus| <= 250K, then QPP will be set to 250K / |corpus|; else QPP will be set 3 and |corpus| will be set to 250K / 3
    output_dir=f"output/{dataset}",
    evaluation_data=f"./{dataset}",
    evaluation_output=f"evaluation/{dataset}",
    generator="BeIR/query-gen-msmarco-t5-base-v1",
    retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],
    retriever_score_functions=["cos_sim", "cos_sim"],
    # Note that these two retriever model work with cosine-similarity
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
    # This prefix will appear as part of the (folder/file) names for query-generation results: For example, we will have "qgen-qrels/" and "qgen-queries.jsonl" by default.
    do_evaluation=True,
    # use_amp=True   # One can use this flag for enabling the efficient float16 precision
)

One can also refer to this toy example on Google Colab for better understanding how the code works.

How does GPL work?

The workflow of GPL is shown as follows:

  1. GPL first use a seq2seq (we use BeIR/query-gen-msmarco-t5-base-v1 by default) model to generate queries_per_passage queries for each passage in the unlabeled corpus. The query-passage pairs are viewed as positive examples for training.

    Result files (under path $path_to_generated_data): (1) ${qgen}-qrels/train.tsv, (2) ${qgen}-queries.jsonl and also (3) corpus.jsonl (copied from $evaluation_data/);

  2. Then, it runs negative mining with the generated queries as input on the target corpus. The mined passages will be viewed as negative examples for training. One can specify any dense retrievers (SBERT or Huggingface/transformers checkpoints, we use msmarco-distilbert-base-v3 + msmarco-MiniLM-L-6-v3 by default) or BM25 to the argument retrievers as the negative miner.

    Result file (under path $path_to_generated_data): hard-negatives.jsonl;

  3. Finally, it does pseudo labeling with the powerful cross-encoders (we use cross-encoder/ms-marco-MiniLM-L-6-v2 by default.) on the query-passage pairs that we have so far (for both positive and negative examples).

    Result file (under path $path_to_generated_data): gpl-training-data.tsv. It contains (gpl_steps * batch_size_gpl) tuples in total.

Up to now, we have the actual training data ready. One can look at sample-data/generated/fiqa for a quick example about the data format. The very last step is to apply the MarginMSE loss to teach the student retriever to mimic the margin scores, CE(query, positive) - CE(query, negative) labeled by the teacher model (Cross-Encoder, CE). And of course, the MarginMSE step is included in GPL and will be done automatically:). Note that MarginMSE works with dot-product and thus the final models trained with GPL works with dot-product.

PS: The --retrievers are for negative mining. They can be any dense retrievers trained on the general domain (e.g. MS MARCO) and do not need to be strong for the target task/domain. Please refer to the paper for more details (cf. Table 7).

Customized data

One can also replace/put the customized data for any intermediate step under the path $path_to_generated_data with the same name fashion. GPL will skip the intermediate steps by using these provided data.

As a typical workflow, one might only have the (English) unlabeld corpus and want a good model performing well for this corpus. To run GPL training under such condition, one just needs these steps:

  1. Prepare your corpus in the same format as the data sample;
  2. Put your corpus.jsonl under a folder, e.g. named as "generated" for data loading and data generation by GPL;
  3. Call gpl.train with the folder path as an input argument: (other arguments work as usual)
python -m gpl.train \
    --path_to_generated_data "generated" \
    --output_dir "output" \
    --new_size -1 \
    --queries_per_passage -1

Pre-trained checkpoints and generated data

Pre-trained checkpoints

We now release the pre-trained GPL models via the https://huggingface.co/GPL. There are currently five types of models:

  1. GPL/${dataset}-msmarco-distilbert-gpl: Model with training order of (1) MarginMSE on MSMARCO -> (2) GPL on ${dataset};
  2. GPL/${dataset}-tsdae-msmarco-distilbert-gpl: Model with training order of (1) TSDAE on ${dataset} -> (2) MarginMSE on MSMARCO -> (3) GPL on ${dataset};
  3. GPL/msmarco-distilbert-margin-mse: Model trained on MSMARCO with MarginMSE;
  4. GPL/${dataset}-tsdae-msmarco-distilbert-margin-mse: Model with training order of (1) TSDAE on ${dataset} -> (2) MarginMSE on MSMARCO;
  5. GPL/${dataset}-distilbert-tas-b-gpl-self_miner: Starting from the tas-b model, the models were trained with GPL on the target corpus ${dataset} with the base model itself as the negative miner (here noted as "self_miner").

Models 1. and 2. were actually trained on top of models 3. and 4. resp. All GPL models were trained the automatic setting of new_size and queries_per_passage (by setting them to -1). This automatic setting can keep the performance while being efficient. For more details, please refer to the section 4.1 in the paper.

Among these models, GPL/${dataset}-distilbert-tas-b-gpl-self_miner ones works the best on the BeIR benchmark:

For reproducing the results with the same package versions used in the experiments, please refer to the conda environment file, environment.yml.

Generated data

We now release the generated data used in the experiments of the GPL paper:

  1. The generated data for the main experiments on the 6 BeIR datasets: https://public.ukp.informatik.tu-darmstadt.de/kwang/gpl/generated-data/main/;
  2. The generated data for the experiments on the full 18 BeIR datasets: https://public.ukp.informatik.tu-darmstadt.de/kwang/gpl/generated-data/beir.

Please note that the 4 datasets of bioasq, robust04, trec-news and signal1m are only available after registration with the original official authorities. We only release the document IDs for these corpora with the file name corpus.doc_ids.txt. For more details, please refer to the BeIR repository.

Citation

If you use the code for evaluation, feel free to cite our publication GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval:

@article{wang2021gpl,
    title = "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval",
    author = "Kexin Wang and Nandan Thakur and Nils Reimers and Iryna Gurevych", 
    journal= "arXiv preprint arXiv:2112.07577",
    month = "4",
    year = "2021",
    url = "https://arxiv.org/abs/2112.07577",
}

Contact person and main contributor: Kexin Wang, [email protected]

https://www.ukp.tu-darmstadt.de/

https://www.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].