WSDM-Adhoc-Document-Retrieval

This is our solution for WSDM DigSci 2020. We implemented a simple yet robust search pipeline that ranked 2nd on the validation set and 4th on the test set. We won the gold prize in the innovation track and the bronze prize in the dataset track. [Video] [Slides] [Report]

Related Project: KDD-Multimodalities-Recall

Features

  • An end-to-end system with zero feature engineering.
  • Performed data cleaning on the dataset according to self-designed saliency-based rules, removing redundant data that had an insignificant impact on results; this improved [email protected] by 3%.
  • Designed a novel early stopping strategy for reranking, based on the confidence score, which avoids up to 40% of unnecessary BERT inference computation.
  • Scores are stable (nearly the same) on the train_val, validation, and test sets.
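The confidence-based early stopping above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's exact code: the score function, the threshold value, and the fallback ordering for unscored candidates are all assumptions.

```python
# Sketch of confidence-based early stopping for reranking.
# `score` is a hypothetical stand-in for the BERT reranker; the
# threshold value is illustrative, not the one used in the solution.
def rerank_with_early_stop(query, candidates, score, threshold=0.95):
    """Score candidates in recall order; stop once one scores confidently."""
    scored = []
    for doc in candidates:
        s = score(query, doc)
        scored.append((s, doc))
        if s >= threshold:          # confident hit: skip remaining inference
            break
    # Scored docs are sorted by score; unscored ones keep their recall order.
    ranked = [d for _, d in sorted(scored, key=lambda t: -t[0])]
    ranked += candidates[len(scored):]
    return ranked
```

Because highly relevant documents tend to appear early in the BM25 recall order, the loop usually terminates long before exhausting the candidate list, which is where the inference savings come from.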

Our Pipeline

  1. Start Jupyter with jupyter lab or jupyter notebook.
  2. Clean the dataset: open 01WASH.ipynb and run all cells. This notebook cleans the dataset by removing description text that is not highly related to the query topic, and drops the NA rows in the candidate set. This reduces the candidate pool from 838,939 to 636,439 documents without sacrificing much recall rate.
  3. Recall: open 02RECALL.ipynb and run all cells. This notebook recalls documents with BM25, a scoring function that matches documents to queries through their shared important keywords. For faster calculation we adopted cupyx to run it on GPU, so the matrix multiplication of (valid_size, vocab_size) by (600k, vocab_size) can be done in 15 minutes on a single GPU card.
  4. Rerank: open 05BERT_ADARERANK.ipynb and run all cells. This notebook uses the finetuned BioBERT model to score every (query, document) pair. A novel early stopping strategy saves computation: when reranking the documents for a given query, if a document is scored with high confidence (above a threshold), the reranking process for that query stops early.
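The recall step's matrix formulation can be sketched with NumPy; swapping numpy for cupy gives the GPU version mentioned in step 3. The BM25 weighting details below (dense matrices, k1/b defaults, binary query vectors) are simplifying assumptions, not the notebook's exact implementation.

```python
import numpy as np  # replace with cupy for the GPU-accelerated variant

def bm25_matrix(tf, k1=1.5, b=0.75):
    """BM25 term weights for a dense (n_docs, vocab_size) term-frequency matrix."""
    n_docs, _ = tf.shape
    df = (tf > 0).sum(axis=0)                       # document frequency per term
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    doc_len = tf.sum(axis=1, keepdims=True)
    avg_len = doc_len.mean()
    w = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return w * idf                                  # (n_docs, vocab_size)

def bm25_scores(query_tf, doc_weights):
    """Score all queries against all docs with one matrix multiplication:
    binary query-term vectors (n_q, vocab) dot weighted docs transposed."""
    return (query_tf > 0).astype(doc_weights.dtype) @ doc_weights.T
```

Precomputing the weighted document matrix once reduces recall over the whole corpus to a single (valid_size, vocab_size) x (vocab_size, n_docs) product, which is exactly the shape of computation that a GPU handles well.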

How to finetune the BioBERT for Scoring:

As mentioned in step 4, we need to finetune BioBERT for the reranking task. Please refer to the notebooks 03BERT_PREPARE.ipynb and 04BERT_TRAIN.ipynb for the coding details. Some tips worth mentioning:

  1. Use pairwise BERT: when using BERT to score sentence pairs, we recommend taking the [CLS] vector as the output representation, followed by a single-layer neural network with dropout.
  2. Use RankNet loss: cross-entropy is not the best choice for the ranking problem, because it trains the scoring function toward inf or -inf. Such a loss benefits classification, but ranking does not need extreme scores; it needs discriminative scores, where a more relevant document simply scores higher. That is what the RankNet loss provides. Limited by GPU resources, our team could not implement the RankNet loss with BERT. Instead, we selected the finetuned models that performed well on the ranking task, which would be called underfitted models in a classification setting. This practice improved [email protected] by 0.03+ on the validation set.
  3. Use 512 tokens in training: for both the training and inference phases, a longer input means the model can capture more semantic information. In our tests, increasing the token length from 256 to 512 improved [email protected] by 0.02+.
  4. Upsample positive items: as in classification tasks, you can upsample the positive (query, doc) pairs or reweight them in the loss term.
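The RankNet objective in tip 2 compares a positive and a negative document for the same query and only asks that the relevant one outscore the other. A minimal NumPy sketch of the pairwise loss (the function name and score inputs are illustrative):

```python
import numpy as np

def ranknet_loss(s_pos, s_neg):
    """RankNet pairwise loss: -log sigmoid(s_pos - s_neg), averaged over pairs.
    The loss depends only on the score margin, so it never pushes scores
    toward +/- infinity the way pointwise cross-entropy does."""
    diff = np.asarray(s_pos) - np.asarray(s_neg)
    # log(1 + exp(-diff)), computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, -diff))
```

A zero margin gives a loss of log 2, and the loss decreases monotonically as the positive document pulls ahead of the negative one, which is exactly the "more related scores higher" behavior the tip describes.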

Members

  1. Chengxuan Ying, Dalian University of Technology (应承轩 大连理工大学)

  2. Chen Huo (server sponsor), WeChat (霍晨 微信)

DataLeak

We did not use any data-leak tricks, though we know a data leak exists.

Acknowledgment

Thanks to Yanming Shen, who provided an 8-GPU server for 4 days.

Links to Other Solutions

Reference

  1. Nogueira R, Cho K. Passage Re-ranking with BERT[J]. arXiv preprint arXiv:1901.04085, 2019.
  2. Burges C, Shaked T, Renshaw E, et al. Learning to rank using gradient descent[C]//Proceedings of the 22nd International Conference on Machine learning (ICML-05). 2005: 89-96.
  3. Severyn A, Moschitti A. Learning to rank short text pairs with convolutional deep neural networks[C]//Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM, 2015: 373-382.

Seeking Opportunities

I will graduate from Dalian University of Technology in the summer of 2021. If you can refer me to any company, please contact me at [email protected].
