
yumeng5 / Lotclass

Licence: apache-2.0
[EMNLP 2020] Text Classification Using Label Names Only: A Language Model Self-Training Approach

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Lotclass

Lightnlp
A deep learning framework for natural language processing based on PyTorch and torchtext.
Stars: ✭ 739 (+361.88%)
Mutual labels:  text-classification, language-model
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+43.13%)
Mutual labels:  text-classification, language-model
Vaaku2Vec
Language Modeling and Text Classification in Malayalam Language using ULMFiT
Stars: ✭ 68 (-57.5%)
Mutual labels:  text-classification, language-model
FNet-pytorch
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms
Stars: ✭ 204 (+27.5%)
Mutual labels:  text-classification, language-model
Sentiment analysis fine grain
Multi-label Classification with BERT; Fine Grained Sentiment Analysis from AI challenger
Stars: ✭ 546 (+241.25%)
Mutual labels:  text-classification, language-model
Bertweet
BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
Stars: ✭ 282 (+76.25%)
Mutual labels:  text-classification, language-model
Deep-NLP-Resources
Curated list of all NLP Resources
Stars: ✭ 65 (-59.37%)
Mutual labels:  text-classification, language-model
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+4060%)
Mutual labels:  text-classification, language-model
Bert language understanding
Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN
Stars: ✭ 933 (+483.13%)
Mutual labels:  text-classification, language-model
Text Classification Demos
Neural models for Text Classification in Tensorflow, such as cnn, dpcnn, fasttext, bert ...
Stars: ✭ 144 (-10%)
Mutual labels:  text-classification
Multi Label classification
transform multi-label classification as sentence pair task, with more training data and information
Stars: ✭ 151 (-5.62%)
Mutual labels:  text-classification
Uda pytorch
UDA(Unsupervised Data Augmentation) implemented by pytorch
Stars: ✭ 143 (-10.62%)
Mutual labels:  text-classification
Ld Net
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
Stars: ✭ 148 (-7.5%)
Mutual labels:  language-model
Speecht
An opensource speech-to-text software written in tensorflow
Stars: ✭ 152 (-5%)
Mutual labels:  language-model
Browsecloud
A web app to create and browse text visualizations for automated customer listening.
Stars: ✭ 143 (-10.62%)
Mutual labels:  text-classification
Vdcnn
Implementation of Very Deep Convolutional Neural Network for Text Classification
Stars: ✭ 158 (-1.25%)
Mutual labels:  text-classification
Monkeylearn Python
Official Python client for the MonkeyLearn API. Build and consume machine learning models for language processing from your Python apps.
Stars: ✭ 143 (-10.62%)
Mutual labels:  text-classification
Tupe
Transformer with Untied Positional Encoding (TUPE). Code of paper "Rethinking Positional Encoding in Language Pre-training". Improve existing models like BERT.
Stars: ✭ 143 (-10.62%)
Mutual labels:  language-model
Keras Xlnet
Implementation of XLNet that can load pretrained checkpoints
Stars: ✭ 159 (-0.62%)
Mutual labels:  language-model
F Lm
Language Modeling
Stars: ✭ 156 (-2.5%)
Mutual labels:  language-model

LOTClass

The source code used for Text Classification Using Label Names Only: A Language Model Self-Training Approach, published in EMNLP 2020.

Requirements

At least one GPU is required to run the code.

Before running, you need to first install the required packages by typing the following command:

$ pip3 install -r requirements.txt

Python 3.6 or above is strongly recommended; older Python versions might lead to package incompatibility issues.

Reproducing the Results

We provide four get_data.sh scripts under datasets for downloading the datasets used in the paper, and four training bash scripts (agnews.sh, dbpedia.sh, imdb.sh and amazon.sh) for running the model on them.
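
For example, assuming the dataset scripts live under datasets/<dataset>/ and the training scripts sit at the repository root (adjust the paths if your checkout differs), reproducing the AG News results could look like:

$ bash datasets/agnews/get_data.sh   # download the AG News dataset
$ bash agnews.sh                     # train and evaluate the model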

Note: Our model does not use training labels; we provide the training/test set ground truth labels only for completeness and evaluation.

The training bash scripts assume you have two 10GB GPUs. If you have a different number of GPUs, or GPUs with different memory sizes, refer to the next section for how to adjust the following command line arguments appropriately (while keeping the other arguments unchanged): train_batch_size, accum_steps, eval_batch_size and gpus.

Command Line Arguments

The meanings of the command line arguments will be displayed upon typing

python src/train.py -h

The following arguments directly affect the performance of the model and need to be set carefully:

  • train_batch_size, accum_steps, gpus: These three arguments should be set together. You need to make sure that the effective training batch size, calculated as train_batch_size * accum_steps * gpus, is around 128. For example, if you have 4 GPUs, you can set train_batch_size = 32, accum_steps = 1, gpus = 4; if you have 1 GPU, you can set train_batch_size = 32, accum_steps = 4, gpus = 1. If your GPUs have memory sizes other than 10GB, you might need to change train_batch_size while adjusting accum_steps and gpus at the same time to keep the effective training batch size around 128 (see the example invocation after this list).
  • eval_batch_size: This argument only affects the speed of the algorithm; use the largest evaluation batch size your GPUs can hold.
  • max_len: This argument controls the maximum length of documents fed into the model (longer documents will be truncated). Ideally, max_len should be set to the length of the longest document (max_len cannot be larger than 512 under the BERT architecture), but a larger max_len also consumes more GPU memory, resulting in a smaller batch size and longer training time. You can therefore trade model accuracy for faster training by reducing max_len.
  • mcp_epochs, self_train_epochs: These control how many epochs to train the model on the masked category prediction task and the self-training task, respectively. Setting mcp_epochs = 3, self_train_epochs = 1 is a good starting point for most datasets, but you may increase them if your dataset is small (fewer than 100,000 documents).

Other arguments can be kept at their default values.
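
As a concrete illustration, a single-GPU run keeping the effective training batch size at 32 * 4 * 1 = 128 might look like the sketch below. The flag names come from the list above, while --dataset_dir and the specific values are placeholders to adapt to your hardware and data (check python src/train.py -h for the exact interface):

$ python src/train.py --dataset_dir datasets/agnews/ \
      --train_batch_size 32 --accum_steps 4 --gpus 1 \
      --eval_batch_size 128 --max_len 200 \
      --mcp_epochs 3 --self_train_epochs 1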

Running on New Datasets

To run the code on a new dataset (a minimal shell sketch follows the list), you need to

  1. Create a directory named your_dataset under datasets.
  2. Prepare a text corpus train.txt (one document per line) under your_dataset for training the classification model (no document labels are needed).
  3. Prepare a label name file label_names.txt under your_dataset (each line contains the label name of one category; if multiple words make up the label name of a category, put them on the same line, separated by whitespace).
  4. (Optional) You can choose to provide a test corpus test.txt (one document per line) with ground truth labels test_labels.txt (each line contains an integer denoting the category index of the corresponding document; indices start from 0 and the order must be consistent with the category order in label_names.txt). If the test corpus is provided, the code will write classification results to out.txt under your_dataset once training is complete. If the ground truth labels of the test corpus are provided, test accuracy will be displayed during self-training, which is useful for hyperparameter tuning and model selection using a small test set.
  5. Run the code with appropriate command line arguments (we recommend creating a new bash script by referring to the four example scripts).
  6. The final trained classification model will be saved as final_model.pt under your_dataset.
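
A minimal shell sketch of steps 1-4 for a hypothetical dataset directory your_dataset (the documents and label names are illustrative only):

$ mkdir -p datasets/your_dataset
$ # step 2: unlabeled training corpus, one document per line
$ printf 'The plot was dull and predictable.\nA moving, beautifully shot film.\n' > datasets/your_dataset/train.txt
$ # step 3: one label name per line (multi-word names share a line)
$ printf 'bad\ngood\n' > datasets/your_dataset/label_names.txt
$ # step 4 (optional): test corpus and ground truth category indices
$ printf 'Worst movie of the year.\n' > datasets/your_dataset/test.txt
$ printf '0\n' > datasets/your_dataset/test_labels.txt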

Note: The code will cache intermediate data and model checkpoints as .pt files under your dataset directory for continued training. If you change your training corpus or label names and re-run the code, you will need to first delete all .pt files to prevent the code from loading old results.
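
For example, assuming the layout above:

$ rm datasets/your_dataset/*.pt   # clear cached data and checkpoints before retraining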

You can always refer to the example datasets when preparing your own datasets.

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{meng2020text,
  title={Text Classification Using Label Names Only: A Language Model Self-Training Approach},
  author={Meng, Yu and Zhang, Yunyi and Huang, Jiaxin and Xiong, Chenyan and Ji, Heng and Zhang, Chao and Han, Jiawei},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
  year={2020},
}