WSDM-Adhoc-Document-Retrieval

This is our solution for WSDM DigSci 2020. We implemented a simple yet robust search pipeline that ranked 2nd on the validation set and 4th on the test set. We won the gold prize in the innovation track and the bronze prize in the dataset track. [Video] [Slides] [Report]

Related Project: KDD-Multimodalities-Recall

Features

  • An end-to-end system with zero feature engineering.
  • Performed data cleaning on the dataset according to self-designed saliency-based rules, removing redundant data that had an insignificant impact on results; this improved [email protected] by 3%.
  • Designed a novel early stopping strategy for reranking, based on the confidence score, which avoids up to 40% of unnecessary BERT inference computation.
  • Scores are stable (nearly the same) on the train_val, validation, and test sets.
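The confidence-based early stopping above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's exact code: the score function, the threshold value, and the fallback ordering for unscored candidates are all assumptions.

```python
# Sketch of confidence-based early stopping for reranking.
# `score` is a hypothetical stand-in for the BERT reranker; the
# threshold value is illustrative, not the one used in the solution.
def rerank_with_early_stop(query, candidates, score, threshold=0.95):
    """Score candidates in recall order; stop once one scores confidently."""
    scored = []
    for doc in candidates:
        s = score(query, doc)
        scored.append((s, doc))
        if s >= threshold:          # confident hit: skip remaining inference
            break
    # Scored docs are sorted by score; unscored ones keep their recall order.
    ranked = [d for _, d in sorted(scored, key=lambda t: -t[0])]
    ranked += candidates[len(scored):]
    return ranked
```

Because highly relevant documents tend to appear early in the BM25 recall order, the loop usually terminates long before exhausting the candidate list, which is where the inference savings come from.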

Our Pipeline

  1. Start Jupyter with jupyter lab or jupyter notebook.
  2. Clean the dataset: open 01WASH.ipynb and run all cells. This notebook cleans the dataset by removing description text that is not highly related to the query topic, and drops the NA rows in the candidate set. This reduces the candidate pool from 838,939 to 636,439 documents without sacrificing much recall rate.
  3. Recall: open 02RECALL.ipynb and run all cells. This notebook recalls documents with BM25, a scoring function that matches documents to queries through their shared important keywords. For faster calculation we adopted cupyx to run it on GPU, so the matrix multiplication of (valid_size, vocab_size) by (600k, vocab_size) can be done in 15 minutes on a single GPU card.
  4. Rerank: open 05BERT_ADARERANK.ipynb and run all cells. This notebook uses the finetuned BioBERT model to score every (query, document) pair. A novel early stopping strategy saves computation: when reranking the documents for a given query, if a document is scored with high confidence (above a threshold), the reranking process for that query stops early.
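The recall step's matrix formulation can be sketched with NumPy; swapping numpy for cupy gives the GPU version mentioned in step 3. The BM25 weighting details below (dense matrices, k1/b defaults, binary query vectors) are simplifying assumptions, not the notebook's exact implementation.

```python
import numpy as np  # replace with cupy for the GPU-accelerated variant

def bm25_matrix(tf, k1=1.5, b=0.75):
    """BM25 term weights for a dense (n_docs, vocab_size) term-frequency matrix."""
    n_docs, _ = tf.shape
    df = (tf > 0).sum(axis=0)                       # document frequency per term
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    doc_len = tf.sum(axis=1, keepdims=True)
    avg_len = doc_len.mean()
    w = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return w * idf                                  # (n_docs, vocab_size)

def bm25_scores(query_tf, doc_weights):
    """Score all queries against all docs with one matrix multiplication:
    binary query-term vectors (n_q, vocab) dot weighted docs transposed."""
    return (query_tf > 0).astype(doc_weights.dtype) @ doc_weights.T
```

Precomputing the weighted document matrix once reduces recall over the whole corpus to a single (valid_size, vocab_size) x (vocab_size, n_docs) product, which is exactly the shape of computation that a GPU handles well.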

How to finetune the BioBERT for Scoring:

As mentioned in step 4, we need to finetune BioBERT for the reranking task. Please refer to the notebooks 03BERT_PREPARE.ipynb and 04BERT_TRAIN.ipynb for the coding details. Some tips worth mentioning:

  1. Use pairwise BERT: when using BERT to score sentence pairs, we recommend taking the [CLS] vector as the output representation, followed by a single-layer neural network with dropout.
  2. Use RankNet loss: cross-entropy is not the best choice for the ranking problem, because it trains the scoring function toward inf or -inf. Such a loss benefits classification, but ranking does not need extreme scores; it needs discriminative scores, where a more relevant document simply scores higher. That is what the RankNet loss provides. Limited by GPU resources, our team could not implement the RankNet loss with BERT. Instead, we selected the finetuned models that performed well on the ranking task, which would be called underfitted models in a classification setting. This practice improved [email protected] by 0.03+ on the validation set.
  3. Use 512 tokens in training: for both the training and inference phases, a longer input means the model can capture more semantic information. In our tests, increasing the token length from 256 to 512 improved [email protected] by 0.02+.
  4. Upsample positive items: as in classification tasks, you can upsample the positive (query, doc) pairs or reweight them in the loss term.
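The RankNet objective in tip 2 compares a positive and a negative document for the same query and only asks that the relevant one outscore the other. A minimal NumPy sketch of the pairwise loss (the function name and score inputs are illustrative):

```python
import numpy as np

def ranknet_loss(s_pos, s_neg):
    """RankNet pairwise loss: -log sigmoid(s_pos - s_neg), averaged over pairs.
    The loss depends only on the score margin, so it never pushes scores
    toward +/- infinity the way pointwise cross-entropy does."""
    diff = np.asarray(s_pos) - np.asarray(s_neg)
    # log(1 + exp(-diff)), computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, -diff))
```

A zero margin gives a loss of log 2, and the loss decreases monotonically as the positive document pulls ahead of the negative one, which is exactly the "more related scores higher" behavior the tip describes.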

Members

  1. Chengxuan Ying, Dalian University of Technology (应承轩 大连理工大学)

  2. Chen Huo (server sponsor), WeChat (霍晨 微信)

DataLeak

We did not use any data-leak tricks, though we know a data leak exists.

Acknowledgment

Thanks to Yanming Shen, who provided an 8-GPU server for 4 days.

Links to Other Solutions

Reference

  1. Nogueira R, Cho K. Passage Re-ranking with BERT[J]. arXiv preprint arXiv:1901.04085, 2019.
  2. Burges C, Shaked T, Renshaw E, et al. Learning to rank using gradient descent[C]//Proceedings of the 22nd International Conference on Machine learning (ICML-05). 2005: 89-96.
  3. Severyn A, Moschitti A. Learning to rank short text pairs with convolutional deep neural networks[C]//Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM, 2015: 373-382.

Seeking Opportunities

I will graduate from Dalian University of Technology in the summer of 2021. If you can refer me to any company, please contact me at [email protected].
