yumeng5 / WeSTClass

License: Apache-2.0
[CIKM 2018] Weakly-Supervised Neural Text Classification


WeSTClass

The source code used for Weakly-Supervised Neural Text Classification, published in CIKM 2018.

Requirements

Before running, you first need to install the required packages with the following command:

$ pip3 install -r requirements.txt

Python 3.6 is strongly recommended; older Python versions may lead to package incompatibility issues.

Quick Start

python main.py --dataset ${dataset} --sup_source ${sup_source} --model ${model}

where ${dataset} is the dataset name, ${sup_source} is the weak supervision type (one of ['labels', 'keywords', 'docs']), and ${model} is the neural model to use (one of ['cnn', 'rnn']).

An example run is provided in test.sh, which can be executed by

./test.sh

More advanced settings on training and hyperparameters are commented in main.py.

Inputs

The weak supervision sources ${sup_source} can come from any of the following:

  • Label surface names (labels); you need to provide class names for each class in ./${dataset}/classes.txt, where each line begins with the class id (starting from 0), followed by a colon, and then the class label surface name.
  • Class-related keywords (keywords); you need to provide class-related keywords for each class in ./${dataset}/keywords.txt, where each line begins with the class id (starting from 0), followed by a colon, and then the class-related keywords separated by commas.
  • Labeled documents (docs); you need to provide labeled document ids for each class in ./${dataset}/doc_id.txt, where each line begins with the class id (starting from 0), followed by a colon, and then document ids in the corpus (starting from 0) of the corresponding class separated by commas.

Examples are given under ./agnews/ and ./yelp/.
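As an illustration, the three input files might look like the fragments below (the class names, keywords, and document ids here are hypothetical; see ./agnews/ and ./yelp/ for real examples):

```
# ./${dataset}/classes.txt  -- label surface names
0:politics
1:sports

# ./${dataset}/keywords.txt  -- class-related keywords
0:government,election,president
1:basketball,football,game

# ./${dataset}/doc_id.txt  -- labeled document ids per class
0:1,4,7
1:0,2,9
```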

Outputs

The final results (document labels) will be written in ./${dataset}/out.txt, where each line is the class label id for the corresponding document.

Intermediate results (e.g. trained network weights, self-training logs) will be saved under ./results/${dataset}/${model}/.
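Since out.txt contains one class label id per line, it is straightforward to load the predictions for downstream analysis. A minimal sketch (the sample file contents and the hypothetical ground-truth labels below are for demonstration only):

```python
from collections import Counter

def read_predictions(path):
    """Read WeSTClass output: one integer class label id per line."""
    with open(path) as f:
        return [int(line.strip()) for line in f if line.strip()]

# Demo with a synthetic out.txt; in a real run it lives at ./${dataset}/out.txt
with open("out.txt", "w") as f:
    f.write("0\n1\n1\n0\n2\n")

preds = read_predictions("out.txt")
print(Counter(preds))  # predicted class distribution

# If ground-truth labels are available, accuracy is a simple comparison
truth = [0, 1, 0, 0, 2]  # hypothetical labels for the demo
accuracy = sum(p == t for p, t in zip(preds, truth)) / len(truth)
print(f"accuracy = {accuracy:.2f}")
```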

Running on a New Dataset

To execute the code on a new dataset, you need to

  1. Create a directory named ${dataset}.
  2. Put raw corpus (with or without true labels) under ./${dataset}.
  3. Modify the function read_file in load_data.py so that it returns a list of documents in variable data and the corresponding true labels in variable y (if ground-truth labels are not available, simply return y = None).
  4. Modify main.py to accept the new dataset; you need to add ${dataset} to argparse, and then specify parameter settings (e.g. update_interval, pretrain_epochs) for the new dataset.

You can always refer to the example datasets when adapting the code for a new dataset.
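A minimal read_file sketch for step 3, assuming the new corpus is a plain-text file with one document per line and an optional label file with one integer per line (the file names corpus.txt and labels.txt are hypothetical; adapt them to your corpus layout):

```python
import os

def read_file(dataset, with_evaluation=True):
    """Hypothetical read_file: documents in ./<dataset>/corpus.txt (one per
    line), optional labels in ./<dataset>/labels.txt (one integer per line)."""
    with open(os.path.join(dataset, "corpus.txt"), encoding="utf-8") as f:
        data = [line.strip() for line in f if line.strip()]
    y = None
    label_path = os.path.join(dataset, "labels.txt")
    if with_evaluation and os.path.exists(label_path):
        with open(label_path) as f:
            y = [int(line.strip()) for line in f if line.strip()]
        assert len(y) == len(data), "labels and documents must align"
    return data, y

# Tiny end-to-end demo of the sketch
os.makedirs("demo", exist_ok=True)
with open("demo/corpus.txt", "w", encoding="utf-8") as f:
    f.write("first document\nsecond document\n")
with open("demo/labels.txt", "w") as f:
    f.write("0\n1\n")

docs, labels = read_file("demo")
print(len(docs), labels)  # 2 [0, 1]
```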

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{meng2018weakly,
  title={Weakly-Supervised Neural Text Classification},
  author={Meng, Yu and Shen, Jiaming and Zhang, Chao and Han, Jiawei},
  booktitle={Proceedings of the 27th ACM International Conference on Information and Knowledge Management},
  pages={983--992},
  year={2018},
  organization={ACM}
}