supercoderhawk / wsdm-digg-2020

WSDM 2020 Workshop

https://biendata.com/competition/wsdm2020/

ID: @nlp-rabbit

Prerequisites

Python >= 3.6

Reproduce the result

Clone Code and Install Requirements

git clone https://github.com/supercoderhawk/wsdm-digg-2020
cd wsdm-digg-2020
pip3 install -r requirements.txt
python3 -m spacy download en

Set Up Elasticsearch

  1. Set up an Elasticsearch service; refer to link.

  2. Set ES_BASE_URL in constants.py to your configured Elasticsearch endpoint.
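
As a rough illustration of the setting above, the sketch below shows what such a constant might look like. The address http://localhost:9200 is only Elasticsearch's usual local default, and search_url and the index name "papers" are hypothetical names used for illustration; none of them is taken from the project's constants.py.

```python
# Sketch of a constants.py entry: point this at your own Elasticsearch endpoint.
# Assumption: a local single-node instance on the default port.
ES_BASE_URL = "http://localhost:9200"


def search_url(index: str, base_url: str = ES_BASE_URL) -> str:
    """Return the `_search` endpoint for an index (hypothetical helper)."""
    # Strip a trailing slash so the joined URL never contains "//".
    return f"{base_url.rstrip('/')}/{index}/_search"


print(search_url("papers"))
```

Whether the project expects a bare host or a URL that already includes an index is not stated here, so check how constants.py uses the value before copying this shape.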

Prepare Data

  1. Unzip the file and put all files under the data/ folder, then rename test.csv to test_release.csv.

  2. Download the model, unzip it, and put the files into the data/ folder.

  3. Execute bash scripts/prepare_data.sh in the project root folder to build the data for the next step.

Execute the retrieval process end to end

  • Execute bash scripts/run_end2end.sh in the project root folder.

Details

The script above consists of the following main parts:

  1. Query Elasticsearch to retrieve candidate papers.

    The core logic is in search/search.py, which is called by benchmark/benchmark.py.

  2. Rerank the candidates with BERT.

    The core logic is in reranking/predict.py; the model code is in reranking/plm_rerank.py.
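
The candidate retrieval step can be pictured as building an Elasticsearch search body from a list of keywords. The sketch below is a minimal illustration and not the code in search/search.py: build_match_query, the field name "abstract", and the default size are all assumptions. It relies only on the standard match query, which ORs the analyzed terms by default, and on BM25 being Elasticsearch's default similarity.

```python
from typing import Dict, List


def build_match_query(keywords: List[str], field: str = "abstract", size: int = 20) -> Dict:
    """Build a search body that matches any of the keywords in one field.

    `field` and `size` are hypothetical defaults, not taken from the project.
    """
    # A `match` query analyzes the text and ORs the resulting terms by default.
    # Hits are scored with BM25 unless the index configures another similarity.
    return {
        "size": size,
        "query": {"match": {field: " ".join(keywords)}},
    }


body = build_match_query(["graph neural network", "citation recommendation"], size=5)
print(body)
```

The returned dictionary would be sent as the JSON body of a request to the index's _search endpoint; the top hits then become the candidates passed to the reranker.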

Basic Algorithm Architecture

  1. Recall phase

    1. Keyword and keyphrase extraction

      1. Noun chunk extraction

      2. TextRank keyword extraction

      3. Candidate keyword filtering by part of speech (noun, proper noun, and adjective)

    2. BM25-based search (Elasticsearch)

  2. Rerank phase

    BERT-based reranking (SciBERT from AllenAI) with a single model and no ensembling.

    Training data is built from the first-stage (BM25) search results.

    The loss is a margin-based hinge loss, which is widely used in ranking scenarios.
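
Two pieces of this architecture are easy to sketch in plain Python: the part-of-speech filter on candidate keywords and the pairwise hinge loss. The code below is an illustration under stated assumptions, not the project's implementation. It assumes candidates are kept when tagged noun, proper noun, or adjective (Universal POS tags such as spaCy reports), that tokens arrive already tagged, and that the margin is 1.0. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: keep candidates tagged noun, proper noun, or adjective
# (Universal POS tags, the scheme spaCy exposes as `token.pos_`).
KEEP_POS = {"NOUN", "PROPN", "ADJ"}


def filter_candidates(tagged: Sequence[Tuple[str, str]]) -> List[str]:
    """Keep lowercased tokens with an allowed tag, de-duplicated in first-seen order."""
    seen = set()
    kept = []
    for text, pos in tagged:
        word = text.lower()
        if pos in KEEP_POS and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept


def pairwise_hinge_loss(pos_scores: Sequence[float], neg_scores: Sequence[float],
                        margin: float = 1.0) -> float:
    """Mean of max(0, margin - s_pos + s_neg) over (relevant, irrelevant) score pairs."""
    if len(pos_scores) != len(neg_scores) or len(pos_scores) == 0:
        raise ValueError("need equally sized, non-empty score lists")
    losses = [max(0.0, margin - p + n) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


tagged = [("Neural", "ADJ"), ("networks", "NOUN"), ("are", "AUX"),
          ("trained", "VERB"), ("on", "ADP"), ("ImageNet", "PROPN"),
          ("networks", "NOUN")]
keywords = filter_candidates(tagged)               # ["neural", "networks", "imagenet"]
loss = pairwise_hinge_loss([2.0, 0.5], [0.0, 1.0])  # pairs contribute 0.0 and 1.5
print(keywords, loss)
```

The loss is zero for a pair once the relevant paper outscores the irrelevant one by at least the margin, so training effort concentrates on pairs that are still misordered or too close. With a target of 1 and the same margin, this is the quantity that PyTorch's MarginRankingLoss averages.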

Train the Model

The only model that needs to be trained is the BERT-based reranking model.

# prepare training data for reranking
bash scripts/prepare_rerank.sh

# train the reranking model
bash scripts/train_rerank.sh

# predict the result
bash scripts/predict_rerank.sh

Others

  1. In this project, the abbreviation plm stands for Pretrained Language Model.

  2. Methods tried that were not effective:

    1. BERT-KNRM and BERT-ConvKNRM, from the paper "CEDR: Contextualized Embeddings for Document Ranking"; code in reranking/plm_knrm.py and reranking/plm_conv_knrm.py.

    2. A BERT-based sentence vectorization method, following the paper "Universal Sentence Encoder" (with the BERT CLS output replacing the vanilla Transformer trained from scratch); code in vectorization/plm_vectorization.py and vectorization/predict.py.

Related Paper

[1] Understanding the Behaviors of BERT in Ranking
