
zake7749 / WSDM-Cup-2019

License: Apache-2.0
[ACM-WSDM] 3rd place solution at WSDM Cup 2019, Fake News Classification on Kaggle.

Programming Languages

  • Jupyter Notebook
  • Python
  • Shell

Projects that are alternatives to or similar to WSDM-Cup-2019

text2class
Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT
Stars: ✭ 15 (-75.81%)
Mutual labels:  text-classification, bert, natural-language-understanding
Fill-the-GAP
[ACL-WS] 4th place solution to gendered pronoun resolution challenge on Kaggle
Stars: ✭ 13 (-79.03%)
Mutual labels:  natural-language-inference, bert, natural-language-understanding
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+9227.42%)
Mutual labels:  text-classification, natural-language-inference, natural-language-understanding
NSP-BERT
The code for our paper "NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction"
Stars: ✭ 166 (+167.74%)
Mutual labels:  text-classification, natural-language-inference, bert
TextFeatureSelection
Python library for feature selection on text features. It provides a filter method, a genetic algorithm, and TextFeatureSelectionEnsemble for improving text classification models.
Stars: ✭ 42 (-32.26%)
Mutual labels:  text-classification, natural-language-inference, natural-language-understanding
BERT-chinese-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained BERT model for text classification.
Stars: ✭ 92 (+48.39%)
Mutual labels:  text-classification, bert
watson-document-classifier
Augment IBM Watson Natural Language Understanding APIs with a configurable mechanism for text classification, using Watson Studio.
Stars: ✭ 41 (-33.87%)
Mutual labels:  text-classification, natural-language-understanding
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+269.35%)
Mutual labels:  text-classification, bert
bert nli
A Natural Language Inference (NLI) model based on Transformers (BERT and ALBERT)
Stars: ✭ 97 (+56.45%)
Mutual labels:  natural-language-inference, bert
GLUE-bert4keras
GLUE benchmark code based on bert4keras
Stars: ✭ 59 (-4.84%)
Mutual labels:  bert, natural-language-understanding
banglabert
This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" accepted in Findings of the Annual Conference of the North American Chap…
Stars: ✭ 186 (+200%)
Mutual labels:  natural-language-inference, bert
label-studio-transformers
Label data using HuggingFace's transformers and automatically get a prediction service
Stars: ✭ 117 (+88.71%)
Mutual labels:  bert, natural-language-understanding
bert extension tf
BERT Extension in TensorFlow
Stars: ✭ 29 (-53.23%)
Mutual labels:  bert, natural-language-understanding
ERNIE-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained ERNIE model for text classification.
Stars: ✭ 49 (-20.97%)
Mutual labels:  text-classification, bert
DiscEval
Discourse Based Evaluation of Language Understanding
Stars: ✭ 18 (-70.97%)
Mutual labels:  bert, natural-language-understanding
Kevinpro-NLP-demo
All NLP you Need Here. Personal implementations of some fun NLP demos; currently includes PyTorch implementations of 13 NLP applications.
Stars: ✭ 117 (+88.71%)
Mutual labels:  text-classification, bert
classifier multi label
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification
Stars: ✭ 127 (+104.84%)
Mutual labels:  text-classification, bert
classifier multi label seq2seq attention
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification, seq2seq, attention, beam search
Stars: ✭ 26 (-58.06%)
Mutual labels:  text-classification, bert
trove
Weakly supervised medical named entity classification
Stars: ✭ 55 (-11.29%)
Mutual labels:  text-classification, bert
Kashgari
Kashgari is a production-level NLP transfer-learning framework built on top of tf.keras for text labeling and text classification; it includes Word2Vec, BERT, and GPT2 language embeddings.
Stars: ✭ 2,235 (+3504.84%)
Mutual labels:  text-classification, bert

Fake News Detection

This is the 3rd place solution to WSDM Cup 2019, a challenge on fake news detection and sentence-pair modeling hosted at the ACM International Conference on Web Search and Data Mining (WSDM).

(Pipeline overview figure)

Documents

Reproduce our results

1. Setup

  1. Clone this project.

  2. Download the dataset from the corresponding competition on Kaggle and extract it under the directory zake7749/data/dataset.

|-- dataset
    |-- sample_submission.csv
    |-- test.csv
    `-- train.csv
  3. Prepare the embedding models

We use two open-source pretrained word embeddings in this competition: the Tencent AI Lab Chinese embedding (Tencent_AILab_ChineseEmbedding.txt) and the sgns.merge.bigram vectors from the Chinese Word Vectors project.

Put both files under the folder zake7749/data/wordvec/; a loading sketch follows the listing.

|-- wordvec
    |-- Tencent_AILab_ChineseEmbedding.txt
    `-- sgns.merge.bigram
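
For reference, here is a minimal loading sketch using gensim. This is not part of the original pipeline; the limit argument only caps the vocabulary to keep memory manageable while experimenting.

# Minimal sketch, assuming gensim >= 4 is installed.
# Both files are plain-text word2vec format.
from gensim.models import KeyedVectors

tencent = KeyedVectors.load_word2vec_format(
    "zake7749/data/wordvec/Tencent_AILab_ChineseEmbedding.txt",
    binary=False, limit=500_000)
sgns = KeyedVectors.load_word2vec_format(
    "zake7749/data/wordvec/sgns.merge.bigram",
    binary=False, limit=500_000)

print(tencent.vector_size, sgns.vector_size)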

2. Instructions

The notebooks are under the folder zake7749/code.

Pre-processing

  1. Execute Stage 1.1. Preprocessing-on-word-level.ipynb
  2. Execute Stage 1.2. Preprocessing-on-char-level.ipynb

These notebooks generate 8 cleaned datasets under zake7749/data/processed_dataset; a quick loading check follows the listing.

.
|-- engineered_chars_test.csv
|-- engineered_chars_train.csv
|-- engineered_words_test.csv
|-- engineered_words_train.csv
|-- processed_chars_test.csv
|-- processed_chars_train.csv
|-- processed_words_test.csv
`-- processed_words_train.csv
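
If you want a quick sanity check that preprocessing succeeded, a short snippet like the following can be used. It is not part of the original notebooks; the file-name pattern is taken from the listing above.

# Sketch: verify the processed files exist and load cleanly.
import glob
import pandas as pd

for path in sorted(glob.glob("zake7749/data/processed_dataset/*.csv")):
    df = pd.read_csv(path)
    print(path, df.shape)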

Train the char-level embedding

Execute Stage 1.3. Train-char-embeddings, which outputs 3 char embeddings under zake7749/data/wordvec/; a training sketch follows the listing.

|-- wordvec
    |-- Tencent_AILab_ChineseEmbedding.txt
    |-- fasttext-50-win3.vec
    |-- sgns.merge.bigram
    |-- zh-wordvec-50-cbow-windowsize50.vec
    `-- zh-wordvec-50-skipgram-windowsize7.vec
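
The sketch below shows how embeddings with matching names could be produced with gensim. The hyperparameters (dimension 50, CBOW window 50, skip-gram window 7, fastText window 3) are inferred from the file names, and the toy corpus is only a placeholder for the character-tokenized titles; this is not the notebook's exact code.

# Hedged sketch, assuming gensim >= 4.
from gensim.models import FastText, Word2Vec

# Placeholder corpus: in the real pipeline this would be every title from the
# char-level CSVs, split into individual characters.
char_sentences = [list("这是一个新闻标题"), list("另一个新闻标题")]

Word2Vec(char_sentences, vector_size=50, window=50, sg=0, min_count=1) \
    .wv.save_word2vec_format("zake7749/data/wordvec/zh-wordvec-50-cbow-windowsize50.vec")
Word2Vec(char_sentences, vector_size=50, window=7, sg=1, min_count=1) \
    .wv.save_word2vec_format("zake7749/data/wordvec/zh-wordvec-50-skipgram-windowsize7.vec")
FastText(char_sentences, vector_size=50, window=3, min_count=1) \
    .wv.save_word2vec_format("zake7749/data/wordvec/fasttext-50-win3.vec")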

Train the base models (LB 0.84 ~ 0.86)

  • Execute Stage 2. First-Level-with-char-level.ipynb
  • Execute Stage 2. First-Level-with-word-level.ipynb
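
For orientation, here is a rough sketch of the kind of shared-encoder sentence-pair network these notebooks train, assuming TensorFlow/Keras. The architecture and layer sizes are placeholders, not the notebooks' actual models.

from tensorflow.keras import Model, layers

MAX_LEN, VOCAB, DIM = 50, 50_000, 300        # placeholder hyperparameters

def build_pair_model() -> Model:
    t1 = layers.Input(shape=(MAX_LEN,), dtype="int32")
    t2 = layers.Input(shape=(MAX_LEN,), dtype="int32")
    embed = layers.Embedding(VOCAB, DIM)              # initialized from the pretrained vectors in practice
    encode = layers.Bidirectional(layers.LSTM(128))   # shared encoder for both titles
    v1, v2 = encode(embed(t1)), encode(embed(t2))
    merged = layers.concatenate(
        [v1, v2, layers.subtract([v1, v2]), layers.multiply([v1, v2])])
    hidden = layers.Dense(256, activation="relu")(merged)
    out = layers.Dense(3, activation="softmax")(hidden)  # agreed / disagreed / unrelated
    model = Model([t1, t2], out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model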

Ensemble the predictions of base models (LB 0.873)

  1. Execute Stage 3.1. First-level-ensemble-ridge-regression
  2. Execute Stage 3.2. First-level-ensemble-with-LGBM-each-side
  3. Execute Stage 3.3. First-level-ensemble-with-LGBM
  4. Execute Stage 3.4. First-level-ensemble-with-NN
  5. Execute Stage 3.5. Second-level-ensemble
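
These notebooks stack the out-of-fold predictions of the base models with several meta-learners (ridge regression, LightGBM, and a small NN). Below is a minimal illustration of the ridge variant; the array names are hypothetical stand-ins for the saved base-model outputs.

# Hedged stacking sketch: `oof_probs` / `test_probs` are lists of
# (n_samples, 3) probability arrays, one per base model; `train_labels`
# holds the integer class labels.
import numpy as np
from sklearn.linear_model import RidgeClassifier

def ridge_stack(oof_probs, test_probs, train_labels):
    X_train = np.hstack(oof_probs)      # (n_train, 3 * n_models)
    X_test = np.hstack(test_probs)      # (n_test,  3 * n_models)
    meta = RidgeClassifier(alpha=1.0)
    meta.fit(X_train, train_labels)
    return meta.decision_function(X_test)   # fed into the next ensembling stage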

Fine-tune the cls vector of BERT (LB 0.867)

  • Run script hanshan/bert/train_wsdm.sh
  • To produce a submission file at this stage, run zake7749/bert/data/probs_to_preds.py
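
Conceptually, probs_to_preds.py turns the per-class probabilities into one label per pair. A hedged sketch follows; the column names of the probability file and the submission are assumptions based on the competition format, not the script's exact code.

import pandas as pd

LABELS = ["agreed", "disagreed", "unrelated"]

def probs_to_submission(probs_csv: str, out_csv: str) -> None:
    probs = pd.read_csv(probs_csv)            # assumed columns: id, agreed, disagreed, unrelated
    preds = probs[LABELS].idxmax(axis=1)      # most likely class per row
    pd.DataFrame({"Id": probs["id"], "Category": preds}).to_csv(out_csv, index=False)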

Blend the predictions of ensemble NNs with BERT (LB 0.874)

  • Execute Stage 3.6. Bagging-with-BERT

Note: please change the path of sec_stacking_df to the corresponding file.
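
At its core, this blending step is a weighted average of the stacked-ensemble probabilities and the BERT probabilities. An illustrative sketch is shown below; the 0.5 weight is a placeholder, not the notebook's value.

import numpy as np

def blend(stacking_probs: np.ndarray, bert_probs: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Weighted average of two (n_samples, 3) probability matrices."""
    mix = w * stacking_probs + (1.0 - w) * bert_probs
    return mix / mix.sum(axis=1, keepdims=True)   # re-normalize each row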

Fine-tune the base models with noisy labels (LB 0.86 ~ 0.875)

  • Execute Stage 4.1. Fine-tune-word-level-models.ipynb
  • Execute Stage 4.2. Fine-tune-char-level-models.ipynb
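
The "noisy labels" here are pseudo labels: confidently predicted test pairs are added back to the training data before fine-tuning. A hedged illustration follows; the 0.9 threshold and column names are assumptions, not the notebooks' settings.

import numpy as np
import pandas as pd

LABELS = ["agreed", "disagreed", "unrelated"]

def add_pseudo_labels(train_df: pd.DataFrame, test_df: pd.DataFrame,
                      test_probs: np.ndarray, threshold: float = 0.9) -> pd.DataFrame:
    confident = test_probs.max(axis=1) >= threshold
    pseudo = test_df.loc[confident].copy()
    pseudo["label"] = [LABELS[i] for i in test_probs[confident].argmax(axis=1)]
    return pd.concat([train_df, pseudo], ignore_index=True)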

Fine-tune the cls vector of BERT with noisy labels (LB 0.880)

  • Run hanshan/prep_pseudo_labels.py
  • Run script hanshan/bert/train_wsdm_pl.sh

Ensemble the predictions of fine-tuned base models (LB 0.879)

  1. Execute Stage 5.1. First-level-fine-tuned-ensemble-ridge-regression.ipynb
  2. Execute Stage 5.2. First-level-fine-tuned-ensemble-withNN.ipynb
  3. Execute Stage 5.3. First-level-fine-tuned-ensemble-with-LGBM.ipynb
  4. Execute Stage 5.4. Second-level-fine-tuned-ensemble.ipynb
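
Stage 5 repeats the stacking recipe from Stage 3 on the fine-tuned base models. For completeness, here is a sketch of the LightGBM meta-learner variant; the hyperparameters are placeholders, not the notebooks' settings.

import numpy as np
from lightgbm import LGBMClassifier

def lgbm_stack(oof_probs, test_probs, train_labels):
    X_train, X_test = np.hstack(oof_probs), np.hstack(test_probs)
    meta = LGBMClassifier(n_estimators=200, learning_rate=0.05)
    meta.fit(X_train, train_labels)
    return meta.predict_proba(X_test)    # passed on to the second-level ensemble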

Final Blending with post-processing (LB 0.881)

  1. Execute Stage 9. High-Ground.ipynb
  2. Execute Stage 42. Final Answer.ipynb

The final prediction, final_answer.csv, is generated under the folder zake7749/data/high_ground/.
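
One post-processing idea that shows up in many solutions to this competition (not necessarily exactly what Stages 9 and 42 do) is to reuse known labels for test pairs that also occur in the training data. A hedged sketch, with assumed column names (title1_zh, title2_zh, label in train.csv; id in test.csv; Id, Category in the submission):

import pandas as pd

def copy_known_pairs(submission: pd.DataFrame, test_df: pd.DataFrame,
                     train_df: pd.DataFrame) -> pd.DataFrame:
    known = train_df.drop_duplicates(["title1_zh", "title2_zh"])
    merged = test_df.merge(known[["title1_zh", "title2_zh", "label"]],
                           on=["title1_zh", "title2_zh"], how="left")
    out = submission.merge(merged[["id", "label"]],
                           left_on="Id", right_on="id", how="left")
    out["Category"] = out["label"].fillna(out["Category"])   # keep model prediction where no match
    return out[["Id", "Category"]]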
