
yumeng5 / Lotclass

Licence: apache-2.0
[EMNLP 2020] Text Classification Using Label Names Only: A Language Model Self-Training Approach

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Lotclass

Lightnlp
A deep learning framework for natural language processing based on PyTorch and torchtext.
Stars: ✭ 739 (+361.88%)
Mutual labels:  text-classification, language-model
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+43.13%)
Mutual labels:  text-classification, language-model
Vaaku2Vec
Language Modeling and Text Classification in Malayalam Language using ULMFiT
Stars: ✭ 68 (-57.5%)
Mutual labels:  text-classification, language-model
FNet-pytorch
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms
Stars: ✭ 204 (+27.5%)
Mutual labels:  text-classification, language-model
Sentiment analysis fine grain
Multi-label Classification with BERT; Fine Grained Sentiment Analysis from AI challenger
Stars: ✭ 546 (+241.25%)
Mutual labels:  text-classification, language-model
Bertweet
BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
Stars: ✭ 282 (+76.25%)
Mutual labels:  text-classification, language-model
Deep-NLP-Resources
Curated list of all NLP Resources
Stars: ✭ 65 (-59.37%)
Mutual labels:  text-classification, language-model
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+4060%)
Mutual labels:  text-classification, language-model
Bert language understanding
Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN
Stars: ✭ 933 (+483.13%)
Mutual labels:  text-classification, language-model
Text Classification Demos
Neural models for Text Classification in Tensorflow, such as cnn, dpcnn, fasttext, bert ...
Stars: ✭ 144 (-10%)
Mutual labels:  text-classification
Multi Label classification
transform multi-label classification as sentence pair task, with more training data and information
Stars: ✭ 151 (-5.62%)
Mutual labels:  text-classification
Uda pytorch
UDA(Unsupervised Data Augmentation) implemented by pytorch
Stars: ✭ 143 (-10.62%)
Mutual labels:  text-classification
Ld Net
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
Stars: ✭ 148 (-7.5%)
Mutual labels:  language-model
Speecht
An opensource speech-to-text software written in tensorflow
Stars: ✭ 152 (-5%)
Mutual labels:  language-model
Browsecloud
A web app to create and browse text visualizations for automated customer listening.
Stars: ✭ 143 (-10.62%)
Mutual labels:  text-classification
Vdcnn
Implementation of Very Deep Convolutional Neural Network for Text Classification
Stars: ✭ 158 (-1.25%)
Mutual labels:  text-classification
Monkeylearn Python
Official Python client for the MonkeyLearn API. Build and consume machine learning models for language processing from your Python apps.
Stars: ✭ 143 (-10.62%)
Mutual labels:  text-classification
Tupe
Transformer with Untied Positional Encoding (TUPE). Code of paper "Rethinking Positional Encoding in Language Pre-training". Improve existing models like BERT.
Stars: ✭ 143 (-10.62%)
Mutual labels:  language-model
Keras Xlnet
Implementation of XLNet that can load pretrained checkpoints
Stars: ✭ 159 (-0.62%)
Mutual labels:  language-model
F Lm
Language Modeling
Stars: ✭ 156 (-2.5%)
Mutual labels:  language-model

LOTClass

The source code used for Text Classification Using Label Names Only: A Language Model Self-Training Approach, published in EMNLP 2020.

Requirements

At least one GPU is required to run the code.

Before running, you need to first install the required packages by typing the following command:

$ pip3 install -r requirements.txt

Python 3.6 or above is strongly recommended; older Python versions might lead to package incompatibility issues.

Reproducing the Results

We provide four get_data.sh scripts under datasets for downloading the datasets used in the paper, and four training bash scripts (agnews.sh, dbpedia.sh, imdb.sh and amazon.sh) for running the model on them.
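
For example, assuming the dataset scripts live under datasets/<dataset>/ and the training scripts sit at the repository root (adjust the paths if your checkout differs), reproducing the AG News results could look like:

$ bash datasets/agnews/get_data.sh   # download the AG News dataset
$ bash agnews.sh                     # train and evaluate the model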

Note: Our model does not use training labels; we provide the training/test set ground truth labels only for completeness and evaluation.

The training bash scripts assume you have two 10GB GPUs. If you have a different number of GPUs, or GPUs with different memory sizes, refer to the next section for how to adjust the following command line arguments appropriately (while keeping the other arguments unchanged): train_batch_size, accum_steps, eval_batch_size and gpus.

Command Line Arguments

The meanings of the command line arguments will be displayed upon typing

python src/train.py -h

The following arguments directly affect the performance of the model and need to be set carefully:

  • train_batch_size, accum_steps, gpus: These three arguments should be set together. You need to make sure that the effective training batch size, calculated as train_batch_size * accum_steps * gpus, is around 128. For example, if you have 4 GPUs, you can set train_batch_size = 32, accum_steps = 1, gpus = 4; if you have 1 GPU, you can set train_batch_size = 32, accum_steps = 4, gpus = 1. If your GPUs have memory sizes other than 10GB, you might need to change train_batch_size while adjusting accum_steps and gpus at the same time to keep the effective training batch size around 128 (see the example invocation after this list).
  • eval_batch_size: This argument only affects the speed of the algorithm; use the largest evaluation batch size your GPUs can hold.
  • max_len: This argument controls the maximum length of documents fed into the model (longer documents will be truncated). Ideally, max_len should be set to the length of the longest document (max_len cannot be larger than 512 under the BERT architecture), but a larger max_len also consumes more GPU memory, resulting in a smaller batch size and longer training time. You can therefore trade model accuracy for faster training by reducing max_len.
  • mcp_epochs, self_train_epochs: These control how many epochs to train the model on the masked category prediction task and the self-training task, respectively. Setting mcp_epochs = 3, self_train_epochs = 1 is a good starting point for most datasets, but you may increase them if your dataset is small (fewer than 100,000 documents).

Other arguments can be kept at their default values.
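
As a concrete illustration, a single-GPU run keeping the effective training batch size at 32 * 4 * 1 = 128 might look like the sketch below. The flag names come from the list above, while --dataset_dir and the specific values are placeholders to adapt to your hardware and data (check python src/train.py -h for the exact interface):

$ python src/train.py --dataset_dir datasets/agnews/ \
      --train_batch_size 32 --accum_steps 4 --gpus 1 \
      --eval_batch_size 128 --max_len 200 \
      --mcp_epochs 3 --self_train_epochs 1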

Running on New Datasets

To run the code on a new dataset (a minimal shell sketch follows the list), you need to

  1. Create a directory named your_dataset under datasets.
  2. Prepare a text corpus train.txt (one document per line) under your_dataset for training the classification model (no document labels are needed).
  3. Prepare a label name file label_names.txt under your_dataset (each line contains the label name of one category; if multiple words make up the label name of a category, put them on the same line, separated by whitespace).
  4. (Optional) You can choose to provide a test corpus test.txt (one document per line) with ground truth labels test_labels.txt (each line contains an integer denoting the category index of the corresponding document; indices start from 0 and the order must be consistent with the category order in label_names.txt). If the test corpus is provided, the code will write classification results to out.txt under your_dataset once training is complete. If the ground truth labels of the test corpus are provided, test accuracy will be displayed during self-training, which is useful for hyperparameter tuning and model selection using a small test set.
  5. Run the code with appropriate command line arguments (we recommend creating a new bash script by referring to the four example scripts).
  6. The final trained classification model will be saved as final_model.pt under your_dataset.
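
A minimal shell sketch of steps 1-4 for a hypothetical dataset directory your_dataset (the documents and label names are illustrative only):

$ mkdir -p datasets/your_dataset
$ # step 2: unlabeled training corpus, one document per line
$ printf 'The plot was dull and predictable.\nA moving, beautifully shot film.\n' > datasets/your_dataset/train.txt
$ # step 3: one label name per line (multi-word names share a line)
$ printf 'bad\ngood\n' > datasets/your_dataset/label_names.txt
$ # step 4 (optional): test corpus and ground truth category indices
$ printf 'Worst movie of the year.\n' > datasets/your_dataset/test.txt
$ printf '0\n' > datasets/your_dataset/test_labels.txt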

Note: The code will cache intermediate data and model checkpoints as .pt files under your dataset directory for continued training. If you change your training corpus or label names and re-run the code, you will need to first delete all .pt files to prevent the code from loading old results.
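
For example, assuming the layout above:

$ rm datasets/your_dataset/*.pt   # clear cached data and checkpoints before retraining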

You can always refer to the example datasets when preparing your own datasets.

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{meng2020text,
  title={Text Classification Using Label Names Only: A Language Model Self-Training Approach},
  author={Meng, Yu and Zhang, Yunyi and Huang, Jiaxin and Xiong, Chenyan and Ji, Heng and Zhang, Chao and Han, Jiawei},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
  year={2020},
}