sebastian-hofstaetter / sigir19-neural-ir

Licence: Apache-2.0 license
Source code for: On the Effect of Low-Frequency Terms on Neural-IR Models, SIGIR'19

On the Effect of Low-Frequency Terms on Neural-IR Models

SIGIR’19, Sebastian Hofstätter, Navid Rekabsaz, Carsten Eickhoff, and Allan Hanbury

Low-frequency terms are a recurring challenge for information retrieval models; neural IR frameworks in particular struggle to adequately capture infrequently observed words. While these terms are often removed from neural models - mainly as a concession to efficiency demands - they traditionally play an important role in the performance of IR models. In this paper, we analyze the effects of low-frequency terms on the performance and robustness of neural IR models. We conduct controlled experiments on three recent neural IR models, trained on a large-scale passage retrieval collection. We evaluate the neural IR models with various vocabulary sizes for their respective word embeddings, considering different levels of constraints on the available GPU memory.

We observe that despite the significant benefits of using larger vocabularies, the performance gap between the vocabularies can be, to a great extent, mitigated by extensive tuning of a related parameter: the number of documents to re-rank. We further investigate the use of subword-token embedding models, and in particular FastText, for neural IR models. Our experiments show that using FastText brings slight improvements to the overall performance of the neural IR models in comparison to models trained on the full vocabulary, while the improvement becomes much more pronounced for queries containing low-frequency terms.
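
To make the "number of documents to re-rank" parameter concrete, here is a minimal sketch of how a re-ranking depth could be applied and evaluated. It is not taken from this repository: the function names, the toy scores, and the choice to keep the candidates below the depth in BM25 order are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def rerank_at_depth(bm25_ranking: List[str],
                    neural_scores: Dict[str, float],
                    depth: int) -> List[str]:
    """Re-rank only the top `depth` BM25 candidates by neural score.

    Assumption: candidates below the depth keep their BM25 order.
    """
    head = bm25_ranking[:depth]
    tail = bm25_ranking[depth:]
    # sorted() is stable, so ties keep their original BM25 order
    reranked_head = sorted(head, key=lambda doc_id: neural_scores[doc_id], reverse=True)
    return reranked_head + tail


def reciprocal_rank(ranking: List[str], relevant: Set[str], cutoff: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the cutoff, else 0."""
    for position, doc_id in enumerate(ranking[:cutoff], start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# Toy data (hypothetical): one query, five BM25 candidates, one relevant passage.
bm25_ranking = ["d1", "d2", "d3", "d4", "d5"]
neural_scores = {"d1": 0.2, "d2": 0.1, "d3": 0.9, "d4": 0.4, "d5": 0.95}
relevant = {"d3"}

rr_by_depth = {
    depth: reciprocal_rank(rerank_at_depth(bm25_ranking, neural_scores, depth), relevant)
    for depth in (1, 3, 5)
}
print(rr_by_depth)
```

In this toy case the reciprocal rank is best at depth 3 and drops again at depth 5, because a deeper candidate list lets a non-relevant passage with a high neural score move ahead of the relevant one. That is why the depth is worth tuning rather than simply maximizing.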

Get the full paper here: http://arxiv.org/abs/1904.12683

Please cite the paper:

@inproceedings{hofstaetter_sigir_2019,
    author = {Hofst{\"a}tter, Sebastian and Rekabsaz, Navid and Eickhoff, Carsten and Hanbury, Allan},
    title = {On the Effect of Low-Frequency Terms on Neural-IR Models},
    booktitle = {Proceedings of SIGIR},
    year = {2019},
    publisher = {ACM}
}

If you have any questions or suggestions, please feel free to open an issue or write an email to Sebastian (email in the paper). Of course, we are also open to future collaborations in the field of neural IR.

Implemented models

Thanks to all the original authors for their inspiring papers! - We re-implemented the following models:

  • KNRM: Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-End Neural Ad-hoc Ranking with Kernel Pooling. In Proc. of SIGIR.
  • Conv-KNRM: Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. 2018. Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search. In Proc. of WSDM.
  • MatchPyramid: Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text Matching as Image Recognition. In Proc. of AAAI.

We show that all three models work really well with the MS MARCO test collection - if implemented and tuned correctly.

Implementation: General ideas & setup

Requirements: PyTorch 1.0+ and AllenNLP

  • For re-ranking depth evaluation you need BM25 ranks (We recommend using Anserini to generate them)

  • train.py is the main trainer -> it uses a multiprocess batch generation pipeline

  • the multiprocess pipeline requires us to do some preprocessing:

    1. pre-tokenize the training & eval files (because we want the spacy tokenizer, but it is just too slow to run during training)
    2. split the files with preprocessing/generate_file_split.sh (so that each loader process gets its own file and does not need to coordinate)
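
The pre-tokenization idea above can be sketched as follows. This is an illustration only, not the code in this repository: `costly_tokenize` is a regular-expression stand-in for spacy (not spacy itself), and the tab-separated line layout is an assumption.

```python
import re
from typing import List


def costly_tokenize(text: str) -> List[str]:
    """Stand-in for a high-quality tokenizer such as spacy (NOT spacy itself):
    separates words from punctuation with a regular expression."""
    return re.findall(r"\w+|[^\w\s]", text)


def pretokenize_line(line: str) -> str:
    """Run once, offline: tokenize each tab-separated field and
    re-join the tokens with single spaces."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(" ".join(costly_tokenize(field)) for field in fields)


def fast_read_line(line: str) -> List[List[str]]:
    """Run at training time: a plain split on spaces recovers the same tokens."""
    return [field.split(" ") for field in line.rstrip("\n").split("\t")]


# Hypothetical two-field line: query text, then passage text.
raw = "what's neural IR?\tNeural IR re-ranks passages, e.g. with KNRM.\n"
stored = pretokenize_line(raw)    # what would be written to the tokenized file
tokens = fast_read_line(stored)   # what a loader process would read back
print(tokens[0])
```

The expensive tokenizer runs once per file, and every loader process afterwards only needs a cheap string split.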

How to train the models

  1. Get the MS MARCO re-ranking dataset & clone this repository
  2. Prepare the dataset for training (with costly spacy tokenizer)
    • Run python matchmaker/preprocessing/tokenize_files.py --in-file <path> --out-file <path> --reader-type <labeled_tuple or triple> to save a tokenized version of the training and evaluation files (we can then use the much faster space-only tokenizer when reading the files, while keeping the benefits of spacy's high-quality tokenization)
  3. Prepare the dataset for multiprocessing:
    • Use ./generate_file_split.sh once for training.tsv and once for top1000dev.tsv (the validation set)
    • You have to decide now on the number of data preparation processes you want to use for training and validation
    • You have to decide on the batch size
    • Run ./generate_file_split.sh <base_file> <n file chunks> <output_folder_and_prefix> <batch_size> for train + validation sets
    • Take the number of batches that are output at the end of the script and put them in your config .yaml
    • The number of processes for preprocessing depends on your local hardware: the loader processes need to generate batches faster than the GPU can compute results for them (validation is much faster than training, so it needs more processes)
  4. Create a new config .yaml in configs/ with all your local paths + batch counts for train and validation
    • The train and validation paths should be the output folder and prefix from step 3 with a star at the end (the paths will be globbed to get all files)
  5. Create an AllenNLP vocabulary with preprocessing/generate_vocab.py. Optionally, for re-ranking threshold evaluation, create new validation tuples that exactly match the BM25 results from Anserini with preprocessing/generate_validation_input_from_candidate_set.py
  6. Run train.py with python -W ignore train.py --run-name experiment1 --config-file configs/your_file.yaml (-W ignore suppresses the useless spacy import warnings that otherwise come up for every subprocess, and there are many of them)
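
The file split, the batch count, and the globbed paths from steps 3 and 4 fit together roughly as in the sketch below. It is written in Python for illustration and does not reproduce generate_file_split.sh: the round-robin distribution, the chunk file naming, and the way partial batches are counted per chunk are all assumptions.

```python
import glob
import math
import os
import tempfile


def split_file(base_file: str, n_chunks: int, output_prefix: str, batch_size: int) -> int:
    """Distribute lines round-robin over n_chunks files and return the total
    number of batches (hypothetical naming: <output_prefix><i>.tsv)."""
    outputs = [open(f"{output_prefix}{i}.tsv", "w", encoding="utf8") for i in range(n_chunks)]
    line_counts = [0] * n_chunks
    try:
        with open(base_file, encoding="utf8") as source:
            for index, line in enumerate(source):
                target = index % n_chunks
                outputs[target].write(line)
                line_counts[target] += 1
    finally:
        for handle in outputs:
            handle.close()
    # Assumption: each loader process batches its own chunk independently,
    # so a partial last batch is counted once per chunk.
    return sum(math.ceil(count / batch_size) for count in line_counts)


with tempfile.TemporaryDirectory() as folder:
    # Tiny stand-in for training.tsv with 10 lines
    base_file = os.path.join(folder, "training.tsv")
    with open(base_file, "w", encoding="utf8") as handle:
        handle.writelines(f"query {i}\tpassage {i}\n" for i in range(10))

    prefix = os.path.join(folder, "split_")
    n_batches = split_file(base_file, n_chunks=3, output_prefix=prefix, batch_size=2)

    # The config path would be the prefix plus a star, globbed to find all chunks
    chunk_files = sorted(glob.glob(prefix + "*"))
    chunk_names = [os.path.basename(path) for path in chunk_files]

print(n_batches, chunk_names)
```

With 10 lines, 3 chunks, and a batch size of 2, the chunks hold 4, 3, and 3 lines, which gives 2 + 2 + 2 = 6 batches. In the real setup, use the batch count printed by the script itself in your config .yaml, since it may count batches differently.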