AdeDZY / K Nrm

License: BSD-3-Clause
K-NRM: End-to-End Neural Ad-hoc Ranking with Kernel Pooling

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to K Nrm

Vtext
Simple NLP in Rust with Python bindings
Stars: ✭ 108 (-40.98%)
Mutual labels:  information-retrieval
Easyocr
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
Stars: ✭ 13,379 (+7210.93%)
Mutual labels:  information-retrieval
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+6874.32%)
Mutual labels:  information-retrieval
Scilla
🏴‍☠️ Information Gathering tool 🏴‍☠️ DNS / Subdomains / Ports / Directories enumeration
Stars: ✭ 116 (-36.61%)
Mutual labels:  information-retrieval
Foundry
The Cognitive Foundry is an open-source Java library for building intelligent systems using machine learning
Stars: ✭ 124 (-32.24%)
Mutual labels:  information-retrieval
Invoicenet
Deep neural network to extract intelligent information from invoice documents.
Stars: ✭ 1,886 (+930.6%)
Mutual labels:  information-retrieval
Sert
Semantic Entity Retrieval Toolkit
Stars: ✭ 100 (-45.36%)
Mutual labels:  information-retrieval
Books
Books worth spreading
Stars: ✭ 161 (-12.02%)
Mutual labels:  information-retrieval
Rated Ranking Evaluator
Search Quality Evaluation Tool for Apache Solr & Elasticsearch search-based infrastructures
Stars: ✭ 134 (-26.78%)
Mutual labels:  information-retrieval
Terrier Core
Terrier IR Platform
Stars: ✭ 156 (-14.75%)
Mutual labels:  information-retrieval
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+1762.84%)
Mutual labels:  information-retrieval
Dan Jurafsky Chris Manning Nlp
My solution to the Natural Language Processing course made by Dan Jurafsky, Chris Manning in Winter 2012.
Stars: ✭ 124 (-32.24%)
Mutual labels:  information-retrieval
Tutorial Utilizing Kg
Resources for Tutorial on "Utilizing Knowledge Graphs in Text-centric Information Retrieval"
Stars: ✭ 148 (-19.13%)
Mutual labels:  information-retrieval
Pytrec eval
pytrec_eval is an Information Retrieval evaluation tool for Python, based on the popular trec_eval.
Stars: ✭ 114 (-37.7%)
Mutual labels:  information-retrieval
Sf1r Lite
Search Formula-1——A distributed high performance massive data engine for enterprise/vertical search
Stars: ✭ 158 (-13.66%)
Mutual labels:  information-retrieval
Ds2i
A library of inverted index data structures
Stars: ✭ 104 (-43.17%)
Mutual labels:  information-retrieval
Entityduetneuralranking
Entity-Duet Neural Ranking Model
Stars: ✭ 137 (-25.14%)
Mutual labels:  information-retrieval
Ranking
Learning to Rank in TensorFlow
Stars: ✭ 2,362 (+1190.71%)
Mutual labels:  information-retrieval
Bm25
A Python implementation of the BM25 ranking function.
Stars: ✭ 159 (-13.11%)
Mutual labels:  information-retrieval
Pyserini
Python interface to the Anserini IR toolkit built on Lucene
Stars: ✭ 148 (-19.13%)
Mutual labels:  information-retrieval

K-NRM

This is the implementation of the Kernel-based Neural Ranking Model (K-NRM) from the paper End-to-End Neural Ad-hoc Ranking with Kernel Pooling.

If you use this code for your scientific work, please cite it as (BibTeX entry at the end of this README):

C. Xiong, Z. Dai, J. Callan, Z. Liu, and R. Power. End-to-end neural ad-hoc ranking with kernel pooling. 
In Proceedings of the 40th International ACM SIGIR Conference on Research & Development in Information Retrieval. 
ACM. 2017.

Requirements


  • TensorFlow 0.12
  • NumPy
  • traitlets

Coming soon: K-NRM with TensorFlow 1.0

Guide To Use


Configure: first, configure the model through the config file. Configurable parameters are listed in the Configurations section below.

sample.config
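For reference, a traitlets-style config file for this model might look like the sketch below. This is only a hedged illustration built from the parameter names documented in the Configurations section; the paths and the max_epochs value are placeholders, so consult the bundled sample.config for the authoritative format and defaults.

    # sketch of a traitlets-style config file (not the bundled sample.config);
    # numeric values are the documented defaults, paths are placeholders
    c = get_config()

    c.BaseNN.n_bins = 11                      # 1 exact-match kernel + 10 soft kernels
    c.BaseNN.embedding_size = 300
    c.BaseNN.max_q_len = 10
    c.BaseNN.max_d_len = 50
    c.BaseNN.vocabulary_size = 300000         # example value; match your term-id vocabulary
    c.BaseNN.batch_size = 16
    c.BaseNN.max_epochs = 10                  # no documented default; placeholder
    c.BaseNN.eval_frequency = 1000
    c.BaseNN.checkpoint_steps = 10000

    c.Knrm.lamb = 0.5                         # sigma = lamb * bin_size
    c.Knrm.learning_rate = 0.001
    c.Knrm.epsilon = 0.00001
    c.Knrm.emb_in = "./data/initial_embeddings.txt"   # placeholder path

    c.DataGenerator.max_q_len = 10
    c.DataGenerator.max_d_len = 50
    c.DataGenerator.vocabulary_size = 300000
    c.DataGenerator.min_score_diff = 0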

Training: pass the config file, training data, and validation data as

python ./knrm/model/model_knrm.py config-file\
    --train \
    --train_file: path to training data\
    --validation_file: path to validation data\
    --train_size: size of training data (number of training samples)\
    --checkpoint_dir: directory to store/load model checkpoints\
    --load_model: True or False. Start with a new model or continue training

sample-train.sh

Testing: pass the config file and testing data as

python ./knrm/model/model_knrm.py config-file\
    --test \
    --test_file: path to testing data\
    --test_size: size of testing data (number of testing samples)\
    --checkpoint_dir: directory to load trained model\
    --output_score_file: file to output document scores

Relevance scores will be written to output_score_file, one score per line, in the same order as test_file. We provide a script to convert the scores into TREC run format:

./knrm/tools/gen_trec_from_score.py
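The exact command-line interface of the bundled script is not documented here; the sketch below only illustrates what the conversion amounts to, assuming you keep a parallel tab-separated file of (query_id, doc_id) pairs, one per test line. The standard TREC run format is qid Q0 docid rank score run_name.

    # illustrative sketch only -- not the bundled gen_trec_from_score.py
    # id_file: one "query_id<TAB>doc_id" line per test sample (an assumed input)
    from collections import defaultdict

    def scores_to_trec(id_file, score_file, out_file, run_name="knrm"):
        runs = defaultdict(list)                       # query_id -> [(score, doc_id)]
        with open(id_file) as f_id, open(score_file) as f_score:
            for id_line, score in zip(f_id, f_score):
                qid, docid = id_line.strip().split("\t")
                runs[qid].append((float(score), docid))
        with open(out_file, "w") as out:
            for qid, docs in runs.items():
                # rank documents within each query by descending K-NRM score
                for rank, (score, docid) in enumerate(sorted(docs, reverse=True), 1):
                    out.write("%s Q0 %s %d %f %s\n" % (qid, docid, rank, score, run_name))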

Data Preparation


All queries and documents must be mapped into sequences of integer term ids. Term ids start at 1; -1 indicates an OOV term or non-existence. Term ids are separated by commas (,).

Training Data Format

Each training sample is a tuple of (query, positive document, negative document):

query \t positive_document \t negative_document \t score_difference

Example: 177,705,632 \t 177,705,632,-1,2452,6,98 \t 177,705,632,3,25,14,37,2,146,159,-1 \t 0.119048

If score_difference < 0, the data generator will swap the positive document and the negative document.

If score_difference < DataGenerator.min_score_diff, the training sample will be omitted.

We recommend shuffling the training samples to ease model convergence.
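To make the format concrete, the sketch below serializes one training sample into the line format described above; the helper names are ours, not part of the repository.

    # sketch: write one (query, positive doc, negative doc, score_diff) training line
    def to_ids(term_ids):
        # term ids start at 1; -1 marks OOV / missing terms
        return ",".join(str(t) for t in term_ids)

    def training_line(query_ids, pos_doc_ids, neg_doc_ids, score_diff):
        return "\t".join([to_ids(query_ids), to_ids(pos_doc_ids),
                          to_ids(neg_doc_ids), str(score_diff)])

    # reproduces the example above
    print(training_line([177, 705, 632],
                        [177, 705, 632, -1, 2452, 6, 98],
                        [177, 705, 632, 3, 25, 14, 37, 2, 146, 159, -1],
                        0.119048))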

Testing Data Format

Each testing sample is a tuple of (query, document)

query \t document

Example: 177,705,632 \t 177,705,632,-1,2452,6,98

Configurations


Model Configurations

  • BaseNN.n_bins: number of kernels (soft bins) (default: 11. One exact match kernel and 10 soft kernels)
  • Knrm.lamb: defines the Gaussian kernels' sigma value: sigma = lamb * bin_size (default: 0.5 -> sigma = 0.1; see the sketch after this list)
  • BaseNN.embedding_size: embedding dimension (default: 300)
  • BaseNN.max_q_len: max query length (default: 10)
  • BaseNN.max_d_len: max document length (default: 50)
  • DataGenerator.max_q_len: max query length. Should be the same as BaseNN.max_q_len (default: 10)
  • DataGenerator.max_d_len: max document length. Should be the same as BaseNN.max_d_len (default: 50)
  • BaseNN.vocabulary_size: vocabulary size.
  • DataGenerator.vocabulary_size: vocabulary size.
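To see how n_bins and lamb interact (referenced from the Knrm.lamb bullet), the sketch below computes kernel means and widths the way the paper describes: one exact-match kernel pinned at mu = 1 with a tiny sigma, and soft kernels spread evenly over the cosine-similarity range [-1, 1] with sigma = lamb * bin_size. Treat it as an illustration; the repository's exact values may differ slightly.

    # sketch: kernel centers (mu) and widths (sigma) for kernel pooling.
    # With n_bins = 11 and lamb = 0.5: bin_size = 2 / (11 - 1) = 0.2, so sigma = 0.1,
    # matching the documented default (lamb = 0.5 -> sigma = 0.1).
    def kernel_params(n_bins=11, lamb=0.5, exact_sigma=1e-3):
        bin_size = 2.0 / (n_bins - 1)      # cosine similarities live in [-1, 1]
        mus = [1.0]                        # exact-match kernel at mu = 1
        sigmas = [exact_sigma]             # near-delta kernel for exact matches
        for i in range(1, n_bins):
            mus.append(1.0 - bin_size / 2 - (i - 1) * bin_size)   # soft-kernel centers
            sigmas.append(lamb * bin_size)
        return mus, sigmas

    mus, sigmas = kernel_params()
    print(mus)      # [1.0, 0.9, 0.7, 0.5, ..., -0.9]
    print(sigmas)   # [0.001, 0.1, 0.1, ...]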

Data

  • Knrm.emb_in: initial embeddings
  • DataGenerator.min_score_diff: minimum score difference between positive and negative documents (default: 0)

Training Parameters

  • BaseNN.batch_size: batch size (default: 16)
  • BaseNN.max_epochs: max number of epochs to train
  • BaseNN.eval_frequency: evaluate the model on the validation set every this many steps (default: 1000)
  • BaseNN.checkpoint_steps: save a model checkpoint every this many steps (default: 10000)
  • Knrm.learning_rate: learning rate for the Adam optimizer (default: 0.001; see the sketch after this list)
  • Knrm.epsilon: epsilon for the Adam optimizer (default: 0.00001)
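For orientation, in TensorFlow 0.12 the last two parameters map onto the Adam optimizer roughly as in the sketch below (referenced from the Knrm.learning_rate bullet). This is not the repository's training code; loss here is just a stand-in for the model's pairwise ranking loss.

    import tensorflow as tf

    # sketch: how Knrm.learning_rate and Knrm.epsilon plug into Adam in TensorFlow 0.x
    w = tf.Variable(1.0)
    loss = tf.square(w - 3.0)              # stand-in for the pairwise ranking loss
    train_op = tf.train.AdamOptimizer(learning_rate=0.001,
                                      epsilon=0.00001).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)                 # one optimization step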

Efficiency

During training, it takes about 60ms to process one batch on a single-GPU machine with the following settings:

  • batch size: 16
  • max_q_len: 10
  • max_d_len: 50
  • vocabulary_size: 300K

A smaller vocabulary and shorter documents speed up training.

Click2Vec


We also provide the click2vec model as described in our paper.

  • ./knrm/click2vec/generate_click_term_pair.py: generate <query_term, clicked_title_term> pairs (see the sketch after this list)
  • ./knrm/click2vec/run_word2vec.sh: call Google's word2vec tool to train click2vec.
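The sketch below shows the kind of pair expansion we understand generate_click_term_pair.py to perform: every query term is paired with every term of the clicked title, and the emitted pairs can then be fed to word2vec as tiny two-word sentences. The input format assumed here (one tab-separated query / clicked-title pair per line on stdin) is our assumption, not the script's documented interface.

    # sketch: emit <query_term, clicked_title_term> pairs from (query, clicked title) lines
    # assumed input on stdin: one "query<TAB>clicked_title" per line
    import sys

    def click_term_pairs(lines):
        for line in lines:
            query, title = line.rstrip("\n").split("\t")
            for q_term in query.split():
                for t_term in title.split():
                    yield q_term, t_term

    if __name__ == "__main__":
        for q_term, t_term in click_term_pairs(sys.stdin):
            print("%s %s" % (q_term, t_term))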

Cite the paper


If you use this code for your scientific work, please cite it as:

C. Xiong, Z. Dai, J. Callan, Z. Liu, and R. Power. End-to-end neural ad-hoc ranking with kernel pooling. 
In Proceedings of the 40th International ACM SIGIR Conference on Research & Development in Information Retrieval. 
ACM. 2017.
@inproceedings{xiong2017neural,
  author          = {{Xiong}, Chenyan and {Dai}, Zhuyun and {Callan}, Jamie and {Liu}, Zhiyuan and {Power}, Russell},
  title           = "{End-to-End Neural Ad-hoc Ranking with Kernel Pooling}",
  booktitle       = {Proceedings of the 40th International ACM SIGIR Conference on Research \& Development in Information Retrieval},
  organization    = {ACM},
  year            = 2017,
}