
zake7749 / WSDM-Cup-2019

License: Apache-2.0
[ACM-WSDM] 3rd place solution at WSDM Cup 2019, Fake News Classification on Kaggle.

Programming Languages

  • Jupyter Notebook
  • Python
  • Shell

Projects that are alternatives to or similar to WSDM-Cup-2019

text2class
Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT
Stars: ✭ 15 (-75.81%)
Mutual labels:  text-classification, bert, natural-language-understanding
Fill-the-GAP
[ACL-WS] 4th place solution to gendered pronoun resolution challenge on Kaggle
Stars: ✭ 13 (-79.03%)
Mutual labels:  natural-language-inference, bert, natural-language-understanding
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+9227.42%)
Mutual labels:  text-classification, natural-language-inference, natural-language-understanding
NSP-BERT
The code for our paper "NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction"
Stars: ✭ 166 (+167.74%)
Mutual labels:  text-classification, natural-language-inference, bert
TextFeatureSelection
Python library for feature selection on text features. It provides a filter method, a genetic algorithm, and TextFeatureSelectionEnsemble for improving text classification models.
Stars: ✭ 42 (-32.26%)
Mutual labels:  text-classification, natural-language-inference, natural-language-understanding
BERT-chinese-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained BERT model for text classification.
Stars: ✭ 92 (+48.39%)
Mutual labels:  text-classification, bert
watson-document-classifier
Augment IBM Watson Natural Language Understanding APIs with a configurable mechanism for text classification, using Watson Studio.
Stars: ✭ 41 (-33.87%)
Mutual labels:  text-classification, natural-language-understanding
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+269.35%)
Mutual labels:  text-classification, bert
bert nli
A Natural Language Inference (NLI) model based on Transformers (BERT and ALBERT)
Stars: ✭ 97 (+56.45%)
Mutual labels:  natural-language-inference, bert
GLUE-bert4keras
GLUE benchmark code based on bert4keras
Stars: ✭ 59 (-4.84%)
Mutual labels:  bert, natural-language-understanding
banglabert
This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" accepted in Findings of the Annual Conference of the North American Chap…
Stars: ✭ 186 (+200%)
Mutual labels:  natural-language-inference, bert
label-studio-transformers
Label data using HuggingFace's transformers and automatically get a prediction service
Stars: ✭ 117 (+88.71%)
Mutual labels:  bert, natural-language-understanding
bert extension tf
BERT Extension in TensorFlow
Stars: ✭ 29 (-53.23%)
Mutual labels:  bert, natural-language-understanding
ERNIE-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained ERNIE model for text classification.
Stars: ✭ 49 (-20.97%)
Mutual labels:  text-classification, bert
DiscEval
Discourse Based Evaluation of Language Understanding
Stars: ✭ 18 (-70.97%)
Mutual labels:  bert, natural-language-understanding
Kevinpro-NLP-demo
All NLP you Need Here. Personal implementations of some fun NLP demos; currently includes PyTorch implementations of 13 NLP applications.
Stars: ✭ 117 (+88.71%)
Mutual labels:  text-classification, bert
classifier multi label
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification
Stars: ✭ 127 (+104.84%)
Mutual labels:  text-classification, bert
classifier multi label seq2seq attention
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification, seq2seq, attention, beam search
Stars: ✭ 26 (-58.06%)
Mutual labels:  text-classification, bert
trove
Weakly supervised medical named entity classification
Stars: ✭ 55 (-11.29%)
Mutual labels:  text-classification, bert
Kashgari
Kashgari is a production-level NLP transfer-learning framework built on top of tf.keras for text labeling and text classification; it includes Word2Vec, BERT, and GPT2 language embeddings.
Stars: ✭ 2,235 (+3504.84%)
Mutual labels:  text-classification, bert

Fake News Detection

This is the 3rd place solution to WSDM Cup 2019, a challenge on fake news detection and sentence-pair modeling hosted at the ACM International Conference on Web Search and Data Mining (WSDM).

(Pipeline overview figure)

Documents

Reproduce our results

1. Setup

  1. Clone this project.

  2. Download the dataset from the corresponding competition on Kaggle and extract it under the directory zake7749/data/dataset.

|-- dataset
    |-- sample_submission.csv
    |-- test.csv
    `-- train.csv
  3. Prepare the embedding models

We use two open-source pretrained word embeddings in this competition: the Tencent AI Lab Chinese embedding (Tencent_AILab_ChineseEmbedding.txt) and the sgns.merge.bigram vectors from the Chinese Word Vectors project.

Put both files under the folder zake7749/data/wordvec/; a loading sketch follows the listing.

|-- wordvec
    |-- Tencent_AILab_ChineseEmbedding.txt
    `-- sgns.merge.bigram
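
For reference, here is a minimal loading sketch using gensim. This is not part of the original pipeline; the limit argument only caps the vocabulary to keep memory manageable while experimenting.

# Minimal sketch, assuming gensim >= 4 is installed.
# Both files are plain-text word2vec format.
from gensim.models import KeyedVectors

tencent = KeyedVectors.load_word2vec_format(
    "zake7749/data/wordvec/Tencent_AILab_ChineseEmbedding.txt",
    binary=False, limit=500_000)
sgns = KeyedVectors.load_word2vec_format(
    "zake7749/data/wordvec/sgns.merge.bigram",
    binary=False, limit=500_000)

print(tencent.vector_size, sgns.vector_size)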

2. Instructions

The notebooks are under the folder zake7749/code.

Pre-processing

  1. Execute Stage 1.1. Preprocessing-on-word-level.ipynb
  2. Execute Stage 1.2. Preprocessing-on-char-level.ipynb

These notebooks generate 8 cleaned datasets under zake7749/data/processed_dataset; a quick loading check follows the listing.

.
|-- engineered_chars_test.csv
|-- engineered_chars_train.csv
|-- engineered_words_test.csv
|-- engineered_words_train.csv
|-- processed_chars_test.csv
|-- processed_chars_train.csv
|-- processed_words_test.csv
`-- processed_words_train.csv
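
If you want a quick sanity check that preprocessing succeeded, a short snippet like the following can be used. It is not part of the original notebooks; the file-name pattern is taken from the listing above.

# Sketch: verify the processed files exist and load cleanly.
import glob
import pandas as pd

for path in sorted(glob.glob("zake7749/data/processed_dataset/*.csv")):
    df = pd.read_csv(path)
    print(path, df.shape)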

Train the char-level embedding

Execute Stage 1.3. Train-char-embeddings, which outputs 3 char embeddings under zake7749/data/wordvec/; a training sketch follows the listing.

|-- wordvec
    |-- Tencent_AILab_ChineseEmbedding.txt
    |-- fasttext-50-win3.vec
    |-- sgns.merge.bigram
    |-- zh-wordvec-50-cbow-windowsize50.vec
    `-- zh-wordvec-50-skipgram-windowsize7.vec
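
The sketch below shows how embeddings with matching names could be produced with gensim. The hyperparameters (dimension 50, CBOW window 50, skip-gram window 7, fastText window 3) are inferred from the file names, and the toy corpus is only a placeholder for the character-tokenized titles; this is not the notebook's exact code.

# Hedged sketch, assuming gensim >= 4.
from gensim.models import FastText, Word2Vec

# Placeholder corpus: in the real pipeline this would be every title from the
# char-level CSVs, split into individual characters.
char_sentences = [list("这是一个新闻标题"), list("另一个新闻标题")]

Word2Vec(char_sentences, vector_size=50, window=50, sg=0, min_count=1) \
    .wv.save_word2vec_format("zake7749/data/wordvec/zh-wordvec-50-cbow-windowsize50.vec")
Word2Vec(char_sentences, vector_size=50, window=7, sg=1, min_count=1) \
    .wv.save_word2vec_format("zake7749/data/wordvec/zh-wordvec-50-skipgram-windowsize7.vec")
FastText(char_sentences, vector_size=50, window=3, min_count=1) \
    .wv.save_word2vec_format("zake7749/data/wordvec/fasttext-50-win3.vec")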

Train the base models (LB 0.84 ~ 0.86)

  • Execute Stage 2. First-Level-with-char-level.ipynb
  • Execute Stage 2. First-Level-with-word-level.ipynb
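
For orientation, here is a rough sketch of the kind of shared-encoder sentence-pair network these notebooks train, assuming TensorFlow/Keras. The architecture and layer sizes are placeholders, not the notebooks' actual models.

from tensorflow.keras import Model, layers

MAX_LEN, VOCAB, DIM = 50, 50_000, 300        # placeholder hyperparameters

def build_pair_model() -> Model:
    t1 = layers.Input(shape=(MAX_LEN,), dtype="int32")
    t2 = layers.Input(shape=(MAX_LEN,), dtype="int32")
    embed = layers.Embedding(VOCAB, DIM)              # initialized from the pretrained vectors in practice
    encode = layers.Bidirectional(layers.LSTM(128))   # shared encoder for both titles
    v1, v2 = encode(embed(t1)), encode(embed(t2))
    merged = layers.concatenate(
        [v1, v2, layers.subtract([v1, v2]), layers.multiply([v1, v2])])
    hidden = layers.Dense(256, activation="relu")(merged)
    out = layers.Dense(3, activation="softmax")(hidden)  # agreed / disagreed / unrelated
    model = Model([t1, t2], out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model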

Ensemble the predictions of base models (LB 0.873)

  1. Execute Stage 3.1. First-level-ensemble-ridge-regression
  2. Execute Stage 3.2. First-level-ensemble-with-LGBM-each-side
  3. Execute Stage 3.3. First-level-ensemble-with-LGBM
  4. Execute Stage 3.4. First-level-ensemble-with-NN
  5. Execute Stage 3.5. Second-level-ensemble
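
These notebooks stack the out-of-fold predictions of the base models with several meta-learners (ridge regression, LightGBM, and a small NN). Below is a minimal illustration of the ridge variant; the array names are hypothetical stand-ins for the saved base-model outputs.

# Hedged stacking sketch: `oof_probs` / `test_probs` are lists of
# (n_samples, 3) probability arrays, one per base model; `train_labels`
# holds the integer class labels.
import numpy as np
from sklearn.linear_model import RidgeClassifier

def ridge_stack(oof_probs, test_probs, train_labels):
    X_train = np.hstack(oof_probs)      # (n_train, 3 * n_models)
    X_test = np.hstack(test_probs)      # (n_test,  3 * n_models)
    meta = RidgeClassifier(alpha=1.0)
    meta.fit(X_train, train_labels)
    return meta.decision_function(X_test)   # fed into the next ensembling stage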

Fine-tune the cls vector of BERT (LB 0.867)

  • Run script hanshan/bert/train_wsdm.sh
  • To produce a submission file at this stage, run zake7749/bert/data/probs_to_preds.py
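
Conceptually, probs_to_preds.py turns the per-class probabilities into one label per pair. A hedged sketch follows; the column names of the probability file and the submission are assumptions based on the competition format, not the script's exact code.

import pandas as pd

LABELS = ["agreed", "disagreed", "unrelated"]

def probs_to_submission(probs_csv: str, out_csv: str) -> None:
    probs = pd.read_csv(probs_csv)            # assumed columns: id, agreed, disagreed, unrelated
    preds = probs[LABELS].idxmax(axis=1)      # most likely class per row
    pd.DataFrame({"Id": probs["id"], "Category": preds}).to_csv(out_csv, index=False)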

Blend the predictions of ensemble NNs with BERT (LB 0.874)

  • Execute Stage 3.6. Bagging-with-BERT

Note: please change the path of sec_stacking_df to the corresponding file.
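
At its core, this blending step is a weighted average of the stacked-ensemble probabilities and the BERT probabilities. An illustrative sketch is shown below; the 0.5 weight is a placeholder, not the notebook's value.

import numpy as np

def blend(stacking_probs: np.ndarray, bert_probs: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Weighted average of two (n_samples, 3) probability matrices."""
    mix = w * stacking_probs + (1.0 - w) * bert_probs
    return mix / mix.sum(axis=1, keepdims=True)   # re-normalize each row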

Fine-tune the base models with noisy labels (LB 0.86 ~ 0.875)

  • Execute Stage 4.1. Fine-tune-word-level-models.ipynb
  • Execute Stage 4.2. Fine-tune-char-level-models.ipynb
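
The "noisy labels" here are pseudo labels: confidently predicted test pairs are added back to the training data before fine-tuning. A hedged illustration follows; the 0.9 threshold and column names are assumptions, not the notebooks' settings.

import numpy as np
import pandas as pd

LABELS = ["agreed", "disagreed", "unrelated"]

def add_pseudo_labels(train_df: pd.DataFrame, test_df: pd.DataFrame,
                      test_probs: np.ndarray, threshold: float = 0.9) -> pd.DataFrame:
    confident = test_probs.max(axis=1) >= threshold
    pseudo = test_df.loc[confident].copy()
    pseudo["label"] = [LABELS[i] for i in test_probs[confident].argmax(axis=1)]
    return pd.concat([train_df, pseudo], ignore_index=True)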

Fine-tune the cls vector of BERT with noisy labels (LB 0.880)

  • Run hanshan/prep_pseudo_labels.py
  • Run script hanshan/bert/train_wsdm_pl.sh

Ensemble the predictions of fine-tuned base models (LB 0.879)

  1. Execute Stage 5.1. First-level-fine-tuned-ensemble-ridge-regression.ipynb
  2. Execute Stage 5.2. First-level-fine-tuned-ensemble-withNN.ipynb
  3. Execute Stage 5.3. First-level-fine-tuned-ensemble-with-LGBM.ipynb
  4. Execute Stage 5.4. Second-level-fine-tuned-ensemble.ipynb
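
Stage 5 repeats the stacking recipe from Stage 3 on the fine-tuned base models. For completeness, here is a sketch of the LightGBM meta-learner variant; the hyperparameters are placeholders, not the notebooks' settings.

import numpy as np
from lightgbm import LGBMClassifier

def lgbm_stack(oof_probs, test_probs, train_labels):
    X_train, X_test = np.hstack(oof_probs), np.hstack(test_probs)
    meta = LGBMClassifier(n_estimators=200, learning_rate=0.05)
    meta.fit(X_train, train_labels)
    return meta.predict_proba(X_test)    # passed on to the second-level ensemble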

Final Blending with post-processing (LB 0.881)

  1. Execute Stage 9. High-Ground.ipynb
  2. Execute Stage 42. Final Answer.ipynb

The final prediction, final_answer.csv, is generated under the folder zake7749/data/high_ground/.
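
One post-processing idea that shows up in many solutions to this competition (not necessarily exactly what Stages 9 and 42 do) is to reuse known labels for test pairs that also occur in the training data. A hedged sketch, with assumed column names (title1_zh, title2_zh, label in train.csv; id in test.csv; Id, Category in the submission):

import pandas as pd

def copy_known_pairs(submission: pd.DataFrame, test_df: pd.DataFrame,
                     train_df: pd.DataFrame) -> pd.DataFrame:
    known = train_df.drop_duplicates(["title1_zh", "title2_zh"])
    merged = test_df.merge(known[["title1_zh", "title2_zh", "label"]],
                           on=["title1_zh", "title2_zh"], how="left")
    out = submission.merge(merged[["id", "label"]],
                           left_on="Id", right_on="id", how="left")
    out["Category"] = out["label"].fillna(out["Category"])   # keep model prediction where no match
    return out[["Id", "Category"]]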
