
yumeng5 / WeSHClass

License: Apache-2.0
[AAAI 2019] Weakly-Supervised Hierarchical Text Classification

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives to or similar to WeSHClass

HiGitClass
HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories (ICDM'19)
Stars: ✭ 58 (-30.12%)
Mutual labels:  text-classification, weakly-supervised-learning, hierarchical-classification
trove
Weakly supervised medical named entity classification
Stars: ✭ 55 (-33.73%)
Mutual labels:  text-classification, weakly-supervised-learning
HiLAP
Code for paper "Hierarchical Text Classification with Reinforced Label Assignment" EMNLP 2019
Stars: ✭ 116 (+39.76%)
Mutual labels:  text-classification, hierarchical-classification
MetaCat
Minimally Supervised Categorization of Text with Metadata (SIGIR'20)
Stars: ✭ 52 (-37.35%)
Mutual labels:  text-classification, weakly-supervised-learning
WeSTClass
[CIKM 2018] Weakly-Supervised Neural Text Classification
Stars: ✭ 67 (-19.28%)
Mutual labels:  text-classification, weakly-supervised-learning
character-level-cnn
Keras implementation of Character-level CNN for Text Classification
Stars: ✭ 56 (-32.53%)
Mutual labels:  text-classification
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+175.9%)
Mutual labels:  text-classification
Very-deep-cnn-tensorflow
Very deep CNN for text classification
Stars: ✭ 18 (-78.31%)
Mutual labels:  text-classification
BERT-chinese-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained BERT model for text classification.
Stars: ✭ 92 (+10.84%)
Mutual labels:  text-classification
WSDEC
Weakly Supervised Dense Event Captioning in Videos, i.e. generating multiple sentence descriptions for a video in a weakly-supervised manner.
Stars: ✭ 95 (+14.46%)
Mutual labels:  weakly-supervised-learning
deepnlp
NLP practice projects from my early days
Stars: ✭ 11 (-86.75%)
Mutual labels:  text-classification
WS3D
Official version of 'Weakly Supervised 3D object detection from Lidar Point Cloud'(ECCV2020)
Stars: ✭ 104 (+25.3%)
Mutual labels:  weakly-supervised-learning
Nepali-News-Classifier
Text classification of Nepali-language documents. This mini project was done in partial fulfillment of the NLP course COMP 473.
Stars: ✭ 13 (-84.34%)
Mutual labels:  text-classification
text-classification-transformers
Easy text classification for everyone: BERT-based models via Hugging Face transformers (KR / EN)
Stars: ✭ 32 (-61.45%)
Mutual labels:  text-classification
RE2RNN
Source code for the EMNLP 2020 paper "Cold-Start and Interpretability: Turning Regular Expressions into Trainable Recurrent Neural Networks"
Stars: ✭ 96 (+15.66%)
Mutual labels:  text-classification
sarcasm-detection-for-sentiment-analysis
Sarcasm Detection for Sentiment Analysis
Stars: ✭ 21 (-74.7%)
Mutual labels:  text-classification
20-newsgroups text-classification
"20 newsgroups" dataset - Text Classification using Multinomial Naive Bayes in Python.
Stars: ✭ 41 (-50.6%)
Mutual labels:  text-classification
watson-document-classifier
Augment IBM Watson Natural Language Understanding APIs with a configurable mechanism for text classification, uses Watson Studio.
Stars: ✭ 41 (-50.6%)
Mutual labels:  text-classification
seededlda
Semi-supervised LDA for theory-driven text analysis
Stars: ✭ 46 (-44.58%)
Mutual labels:  text-classification
X-Transformer
X-Transformer: Taming Pretrained Transformers for eXtreme Multi-label Text Classification
Stars: ✭ 127 (+53.01%)
Mutual labels:  text-classification

WeSHClass

The source code used for Weakly-Supervised Hierarchical Text Classification, published in AAAI 2019.

Requirements

Before running, you first need to install the required packages with the following command:

$ pip3 install -r requirements.txt

Also, be sure to download the NLTK 'punkt' tokenizer models in Python:

import nltk
nltk.download('punkt')

Quick Start

python main.py --dataset ${dataset} --sup_source ${sup_source} --with_eval ${with_eval} --pseudo ${pseudo}

where:
  • ${dataset} is the dataset name;
  • ${sup_source} is the weak supervision type, one of ['keywords', 'docs'];
  • ${with_eval} is the evaluation type;
  • ${pseudo} is the pseudo-document generation method: bow uses the bag-of-words method introduced in the CIKM paper, while lstm uses the LSTM language model method introduced in the AAAI paper, which generates better-quality pseudo documents but requires much longer training time for the LSTM language model.
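For instance, assuming a dataset directory named nyt (a placeholder here; use one of the provided example datasets or your own) with keyword supervision and bag-of-words pseudo-document generation, an invocation could look like the following (the with_eval value shown is an assumption; check the argparse setup in main.py for the accepted options):

python main.py --dataset nyt --sup_source keywords --with_eval All --pseudo bow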

An example run is provided in test.sh, which can be executed by

./test.sh

More advanced settings on training and hyperparameters are commented in main.py.

Inputs

To run the algorithm, you need to provide the following files under the directory ./${dataset}:

  1. A corpus (dataset.txt) that contains all documents to be classified; each line in dataset.txt corresponds to one document.
  2. A class hierarchy (label_hier.txt) that indicates the parent-child relationships between classes (each class can have at most one parent class). The first class on each line is the parent class, followed by all of its children classes, with tab as the delimiter.
  3. Weak supervision sources (either of the following two) for each leaf class in the class hierarchy:
  • Class-related keywords (keywords.txt): each line begins with a leaf class name (which must match label_hier.txt), followed by a tab, and then the class-related keywords separated by spaces.
  • Labeled documents (doc_id.txt): each line begins with a leaf class name (which must match label_hier.txt), followed by a tab, and then the ids (starting from 0) of the corresponding documents in the corpus, separated by spaces.
  4. (Optional) Ground-truth labels to be used for evaluation (the provided labels will not be used for training). Set the evaluation type argument (--with_eval ${with_eval}) accordingly.
  • If ground-truth labels are available for all documents, put them in labels.txt, where the ith line is the class name (which must match label_hier.txt) of the ith document in dataset.txt.
  • If ground-truth labels are available for some but not all documents, put the partial labels in labels_sub.txt, where each line begins with a class name (which must match label_hier.txt), followed by a tab, and then the ids (starting from 0) of the corresponding documents in the corpus, separated by spaces.
  • If no ground-truth labels are available, no files are required.

Examples are given under the three dataset directories.
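For illustration, here is what these input files could look like for a small hypothetical two-level hierarchy (all class names, keywords, and document ids below are invented for this example; columns are tab-separated).

label_hier.txt (the first class on each line is the parent):

politics	elections	immigration
sports	tennis	soccer

keywords.txt (one line per leaf class):

elections	vote ballot campaign
immigration	visa border refugee
tennis	racket serve umpire
soccer	goal league midfielder

doc_id.txt (if using labeled documents instead; document ids start from 0):

elections	0 4 17
immigration	2 9
tennis	5 11 13
soccer	1 7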

Outputs

The final results (document labels) will be written to ./${dataset}/out.txt, where each line is the class label id for the corresponding document.

Intermediate results (e.g. trained network weights, self-training logs) will be saved under ./results/${dataset}/${sup_source}/.
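As a minimal sketch of consuming the output (the dataset directory name nyt is a placeholder, and the snippet assumes the label ids are integers, as the format above suggests):

with open('./nyt/out.txt') as f:
    pred_label_ids = [int(line.strip()) for line in f]
# pred_label_ids[i] is the predicted class label id of the ith document in dataset.txt
print(pred_label_ids[:10])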

Running on a New Dataset

To execute the code on a new dataset, you need to

  1. Create a directory named ${dataset}.
  2. Prepare input files (see the above Inputs section).
  3. Modify main.py to accept the new dataset: add ${dataset} to the argparse choices, and specify parameter settings (e.g. update_interval, pretrain_epochs) for the new dataset (a minimal sketch follows below).

You can always refer to the example datasets when adapting the code for a new dataset.
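As a minimal sketch of step 3 (the existing dataset names and the hyperparameter values below are assumptions; mirror the actual patterns already in main.py):

import argparse

parser = argparse.ArgumentParser()
# Register the new dataset alongside the existing choices (choice names assumed here)
parser.add_argument('--dataset', default='mydata',
                    choices=['nyt', 'arxiv', 'yelp', 'mydata'])
args = parser.parse_args()

# Per-dataset hyperparameter settings, following how main.py configures its datasets
if args.dataset == 'mydata':
    update_interval = 50    # assumed value; tune for your corpus
    pretrain_epochs = 20    # assumed value; tune for your corpus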

Citations

Please cite the following papers if you find the code helpful for your research.

@inproceedings{meng2018weakly,
  title={Weakly-Supervised Neural Text Classification},
  author={Meng, Yu and Shen, Jiaming and Zhang, Chao and Han, Jiawei},
  booktitle={Proceedings of the 27th ACM International Conference on Information and Knowledge Management},
  pages={983--992},
  year={2018},
  organization={ACM}
}

@inproceedings{meng2019weakly,
  title={Weakly-supervised hierarchical text classification},
  author={Meng, Yu and Shen, Jiaming and Zhang, Chao and Han, Jiawei},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={33},
  pages={6826--6833},
  year={2019}
}