
sebastian-hofstaetter / neural-ranking-kd

License: Apache-2.0
Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation

Programming Languages

Jupyter Notebook
Python

Projects that are alternatives of or similar to neural-ranking-kd

BERTOverflow
A Pre-trained BERT on StackOverflow Corpus
Stars: ✭ 40 (-45.95%)
Mutual labels:  bert
R-AT
Regularized Adversarial Training
Stars: ✭ 19 (-74.32%)
Mutual labels:  bert
js-symbol-tree
Turn any collection of objects into its own efficient tree or linked list using Symbol
Stars: ✭ 86 (+16.22%)
Mutual labels:  efficiency
Flutter-StoryBoard
A Flutter-based application to showcase your app's custom widgets, making design review easy.
Stars: ✭ 20 (-72.97%)
Mutual labels:  efficiency
GEANet-BioMed-Event-Extraction
Code for the paper Biomedical Event Extraction with Hierarchical Knowledge Graphs
Stars: ✭ 52 (-29.73%)
Mutual labels:  bert
ExpBERT
Code for our ACL '20 paper "Representation Engineering with Natural Language Explanations"
Stars: ✭ 28 (-62.16%)
Mutual labels:  bert
AliceMind
ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab
Stars: ✭ 1,479 (+1898.65%)
Mutual labels:  bert
CheXbert
Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT
Stars: ✭ 51 (-31.08%)
Mutual labels:  bert
rasa-bert-finetune
BERT fine-tuning support for rasa-nlu
Stars: ✭ 46 (-37.84%)
Mutual labels:  bert
korpatbert
KorPatBERT, a Korean AI language model specialized for the patent domain
Stars: ✭ 48 (-35.14%)
Mutual labels:  bert
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
Stars: ✭ 738 (+897.3%)
Mutual labels:  bert
bert extension tf
BERT Extension in TensorFlow
Stars: ✭ 29 (-60.81%)
Mutual labels:  bert
TabFormer
Code & Data for "Tabular Transformers for Modeling Multivariate Time Series" (ICASSP, 2021)
Stars: ✭ 209 (+182.43%)
Mutual labels:  bert
bert attn viz
Visualize BERT's self-attention layers on text classification tasks
Stars: ✭ 41 (-44.59%)
Mutual labels:  bert
TriB-QA
We are serious about bragging
Stars: ✭ 45 (-39.19%)
Mutual labels:  bert
LAMB Optimizer TF
LAMB Optimizer for Large Batch Training (TensorFlow version)
Stars: ✭ 119 (+60.81%)
Mutual labels:  bert
OffsetGuided
Code for "Greedy Offset-Guided Keypoint Grouping for Human Pose Estimation"
Stars: ✭ 31 (-58.11%)
Mutual labels:  efficiency
BiaffineDependencyParsing
BERT + self-attention encoder; biaffine decoder; PyTorch implementation
Stars: ✭ 67 (-9.46%)
Mutual labels:  bert
BERT-QE
Code and resources for the paper "BERT-QE: Contextualized Query Expansion for Document Re-ranking".
Stars: ✭ 43 (-41.89%)
Mutual labels:  bert
BERT-chinese-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained BERT model for text classification.
Stars: ✭ 92 (+24.32%)
Mutual labels:  bert

Neural IR: Cross-Architecture Knowledge Distillation

Welcome 🙌 to the hub-repo of our paper:

Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation
Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan and Allan Hanbury

https://arxiv.org/abs/2010.02666

tl;dr We utilize an ensemble of BERTCAT models (the vanilla BERT passage re-ranking model) to teach & improve a range of other more efficient architectures for (re-)ranking with a Margin-MSE loss. We publish the teacher training files for everyone to use here 🎉 We are sure the community can do very cool things with these training files 😊

If you have any questions, suggestions, or want to collaborate please don't hesitate to get in contact with us via Twitter or mail to [email protected]

Figure: The knowledge distillation workflow; we provide the "Result Store" in this repo.

Please cite our work as:

@misc{hofstaetter2020_crossarchitecture_kd,
      title={Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation}, 
      author={Sebastian Hofst{\"a}tter and Sophia Althammer and Michael Schr{\"o}der and Mete Sertkan and Allan Hanbury},
      year={2020},
      eprint={2010.02666},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Pre-Trained Models

We provide fully trained 6-layer DistilBERT-based models (trained with Margin-MSE, using a 3-teacher BERTCAT ensemble (T2 in the paper) on MSMARCO-Passage) via the HuggingFace model hub.

If you have a specific request for a pre-trained model from the paper, please create an issue here :)
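
A minimal usage sketch (not an official snippet from this repo): scoring a query-passage pair with one of the distilled dot-product DistilBERT models via the HuggingFace transformers library. The model id shown and the [CLS] pooling are assumptions for illustration; please check the model cards on the hub for the exact ids and recommended usage.

import torch
from transformers import AutoTokenizer, AutoModel

# Example model id -- an assumption, verify on the HuggingFace hub.
model_id = "sebastian-hofstaetter/distilbert-dot-margin_mse-T2-msmarco"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)
encoder.eval()

def encode(text: str) -> torch.Tensor:
    # Encode one string and pool the [CLS] vector (assumed pooling strategy).
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=200)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    return hidden[:, 0, :]  # shape: [1, hidden_dim]

query_vec = encode("what is knowledge distillation")
passage_vec = encode("Knowledge distillation transfers knowledge from a large model to a smaller one.")
score = (query_vec * passage_vec).sum(dim=-1)  # dot-product relevance score
print(score.item())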

Teacher Training Files (MSMARCO-Passage)

We publish the training files without the text content, using only the ids from MSMARCO; for the text content, please download the files from the MSMARCO GitHub page and, if necessary, use the helper script (teacher_id_to_text.py) in this repo to expand the id files with the full text.

The teacher files (using the data from "Train Triples Small" with ~40 million triples) use the tab-separated format pos_score neg_score query_id pos_passage_id neg_passage_id (a short parsing sketch follows below) and are:

  • (T1) BERT-BaseCAT
  • (T2) Mean-Ensemble of BERT-BaseCAT + BERT-LargeCAT + ALBERT-LargeCAT

Both are available at Zenodo: https://zenodo.org/record/4068216
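
As a small parsing sketch (the file name is a placeholder), each line of these id files can be read and turned into the teacher score margin that Margin-MSE training uses as its target:

def read_teacher_file(path):
    # Yields (margin, query_id, pos_passage_id, neg_passage_id) tuples from a
    # tab-separated teacher file in the format described above.
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            pos_score, neg_score, query_id, pos_id, neg_id = line.rstrip("\n").split("\t")
            # The distillation target is the teacher's score margin between the
            # relevant (pos) and the sampled non-relevant (neg) passage.
            yield float(pos_score) - float(neg_score), query_id, pos_id, neg_id

# Example (placeholder file name):
# for margin, query_id, pos_id, neg_id in read_teacher_file("bertcat_ensemble_T2_ids.tsv"):
#     ...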

Source Code

The full source code for our paper is available as part of our matchmaker library: https://github.com/sebastian-hofstaetter/matchmaker

We provide getting-started guides for training teachers & students, as well as for a range of other possibilities surrounding the Margin-MSE loss.

Cross Architecture Knowledge Distillation

The latency of neural ranking models at query time is largely dependent on the architecture and on deliberate choices by their designers to trade off effectiveness for higher efficiency. This focus on low query latency makes a rising number of efficient ranking architectures feasible for production deployment. In machine learning, an increasingly common approach to closing the effectiveness gap of more efficient models is to apply knowledge distillation from a large teacher model to a smaller student model. We find that different ranking architectures tend to produce output scores in different magnitudes. Based on this finding, we propose a cross-architecture training procedure with a margin-focused loss (Margin-MSE) that adapts knowledge distillation to the varying score output distributions of different BERT and non-BERT ranking architectures. We apply the teachable information as additional fine-grained labels to existing training triples of the MSMARCO-Passage collection. We evaluate our procedure of distilling knowledge from state-of-the-art concatenated BERT models to four different efficient architectures (TK, ColBERT, PreTT, and a BERT CLS dot-product model). We show that, across the evaluated architectures, our Margin-MSE knowledge distillation significantly improves effectiveness without compromising efficiency.
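
To make the loss concrete, here is a minimal PyTorch sketch of a Margin-MSE objective as described above; it is an illustration under the stated formulation, not the exact implementation from the matchmaker library. The student only has to reproduce the teacher's margin between the relevant and the non-relevant passage, so the absolute score ranges of different architectures do not need to match.

import torch

mse = torch.nn.MSELoss()

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    # All arguments are relevance-score tensors of shape [batch_size].
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return mse(student_margin, teacher_margin)

# Toy example: random scores stand in for real model outputs.
batch_size = 4
student_pos = torch.randn(batch_size, requires_grad=True)  # student scores, relevant passages
student_neg = torch.randn(batch_size, requires_grad=True)  # student scores, non-relevant passages
teacher_pos = torch.randn(batch_size)                       # fixed teacher scores (from the files above)
teacher_neg = torch.randn(batch_size)

loss = margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg)
loss.backward()  # gradients flow only into the student scores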

Figure: Efficiency vs. effectiveness results from the paper, on two query sets and their main effectiveness metrics; T1 and T2 are the knowledge-distillation-trained models with BERT-BaseCAT (T1) and the ensemble (T2) as teachers.
