
toriving / text-classification-transformers

License: Apache-2.0
Easy text classification for everyone: BERT-based models via Huggingface Transformers (KR / EN)

Programming Languages

python, shell

Projects that are alternatives of or similar to text-classification-transformers

Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-6.25%)
Mutual labels:  text-classification, transformers, huggingface-transformers
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence annotation toolkit. It supports multi-class and multi-label classification of Chinese long and short texts, as well as sequence annotation tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+371.88%)
Mutual labels:  text-classification, transformers
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+615.63%)
Mutual labels:  text-classification, transformers
TorchBlocks
A PyTorch-based toolkit for natural language processing
Stars: ✭ 85 (+165.63%)
Mutual labels:  text-classification, transformers
Ask2Transformers
A Framework for Textual Entailment based Zero Shot text classification
Stars: ✭ 102 (+218.75%)
Mutual labels:  text-classification, transformers
COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (-25%)
Mutual labels:  text-classification, transformers
small-text
Active Learning for Text Classification in Python
Stars: ✭ 241 (+653.13%)
Mutual labels:  text-classification, transformers
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+7768.75%)
Mutual labels:  text-classification, transformers
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-31.25%)
Mutual labels:  text-classification, transformers
text2class
Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT
Stars: ✭ 15 (-53.12%)
Mutual labels:  text-classification, transformers
X-Transformer
X-Transformer: Taming Pretrained Transformers for eXtreme Multi-label Text Classification
Stars: ✭ 127 (+296.88%)
Mutual labels:  text-classification, transformers
Simpletransformers
Transformers for Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
Stars: ✭ 2,881 (+8903.13%)
Mutual labels:  text-classification, transformers
Text and Audio classification with Bert
Text Classification in Turkish Texts with Bert
Stars: ✭ 34 (+6.25%)
Mutual labels:  text-classification, transformers
Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.
Stars: ✭ 2,235 (+6884.38%)
Mutual labels:  text-classification, bert-model
KoBERT-Transformers
KoBERT on 🤗 Huggingface Transformers 🤗 (with Bug Fixed)
Stars: ✭ 162 (+406.25%)
Mutual labels:  transformers, kobert
Chinese-Minority-PLM
CINO: Pre-trained Language Models for Chinese Minority (少数民族语言预训练模型)
Stars: ✭ 133 (+315.63%)
Mutual labels:  transformers
Introduction-to-Deep-Learning-and-Neural-Networks-Course
Code snippets and solutions for the Introduction to Deep Learning and Neural Networks Course hosted in educative.io
Stars: ✭ 33 (+3.13%)
Mutual labels:  transformers
BottleneckTransformers
Bottleneck Transformers for Visual Recognition
Stars: ✭ 231 (+621.88%)
Mutual labels:  transformers
rnn-text-classification-tf
Tensorflow implementation of Attention-based Bidirectional RNN text classification.
Stars: ✭ 26 (-18.75%)
Mutual labels:  text-classification
ginza-transformers
Use custom tokenizers in spacy-transformers
Stars: ✭ 15 (-53.12%)
Mutual labels:  transformers

Korean | English

Text-classification-transformers

Easy text classification for everyone

Text classification is one of the most common tasks in natural language processing and can be applied in many ways.

However, the given data must be preprocessed, and the model's data pipeline must be built to match that preprocessing.

The purpose of this repository is to make text classification easy to perform with Transformers (BERT-like) models once the classification data has been preprocessed into a specific structure.

It is implemented on top of Huggingface Transformers for quick and convenient use.

Data Preprocessing

Data must exist as train.csv, dev.csv, and test.csv in the data_in folder.

Each file must follow the label,text format.

If there are two labels in total, they are expressed as 0 and 1; if there are N labels, they should be expressed as 0 to N-1.

The Korean datasets nsmc and kornli and the English datasets sst2 and sst5 are included by default.

Each provided dataset can be preprocessed with the corresponding utils/${dataset}_preprocess.py script (nsmc, kornli, or sst).

For nsmc, a label,text dataset can be created as follows.

$ python utils/nsmc_preprocess.py

Number of dataset : 150000
Number of train dataset : 120000
Number of dev dataset : 30000

The results can be checked in train.csv, dev.csv, and test.csv in the data_in folder.

$ head data_in/train.csv

0,아 더빙.. 진짜 짜증나네요 목소리
1,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나
0,너무재밓었다그래서보는것을추천한다
0,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정
1,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다
...

Here, 0 represents the negative label and 1 the positive label.

The kornli, sst2, and sst5 data can be preprocessed in the same way.

In the case of kornli, the two sentences are joined with the [SEP] token during preprocessing, as in the examples below.

1,오두막집 문을 부수고 땅바닥에 쓰러졌어 [SEP] 나는 문을 박차고 들어가 쓰러졌다.
0,어른과 아이들을 위한 재미. [SEP] 외동아이들을 위한 재미.
2,"그래, 넌 학생이 맞아 [SEP] 넌 기계공 학생이지?"

For sst2 and sst5, the sst_preprocess.py file is used as follows.

$ python utils/sst_preprocess.py --task sst2

Number of data in data_in/sst2/stsa_binary_train.txt : 6920
Number of data in data_in/sst2/stsa_binary_dev.txt : 872
Number of data in data_in/sst2/stsa_binary_test.txt : 1821

$ python utils/sst_preprocess.py --task sst5

Number of data in data_in/sst5/stsa_fine_train.txt : 8544
Number of data in data_in/sst5/stsa_fine_dev.txt : 1101
Number of data in data_in/sst5/stsa_fine_test.txt : 2210

Datasets that are not provided by default should be converted to the label,text format and saved as train.csv, dev.csv, and test.csv in the data_in folder, for example as sketched below.
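
The following is a hedged sketch of such a conversion; the input file, its column names, and the label mapping are hypothetical, and only the label,text output format matters.

import csv

# Hypothetical input: a TSV file with columns "category" and "sentence".
label_map = {"negative": 0, "positive": 1}  # labels must be 0 .. N-1

with open("my_raw_data.tsv", encoding="utf-8") as src, \
     open("data_in/train.csv", "w", encoding="utf-8", newline="") as dst:
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.writer(dst)
    for row in reader:
        # Map each raw category name to its integer label and keep the raw text.
        writer.writerow([label_map[row["category"]], row["sentence"]])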

Model

Most of the models available on the Huggingface model hub are supported.

However, models that do not support AutoModelForSequenceClassification in Huggingface Transformers cannot be used with this repository.
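
One quick way to check whether a given checkpoint works with AutoModelForSequenceClassification is to try loading it directly with Huggingface Transformers; this check is a sketch and is not part of this repository.

from transformers import AutoModelForSequenceClassification

try:
    # Loading succeeds only if the architecture provides a sequence classification head.
    AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2
    )
    print("supported")
except (ValueError, OSError) as err:
    print("not supported:", err)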

Supported Huggingface models can be specified through the model_name_or_path argument, in the same way as with Huggingface Transformers.

Some models do not need the model_name_or_path argument and can be loaded more easily through the model argument.

The models are as follows.

MODEL = {
    "bert": "bert-base-multilingual-cased",
    "albert": "albert-base-v2",
    "bart": "facebook/bart-base",
    "camembert": "camembert-base",
    "distilbert": "distilbert-base-uncased",
    "electra": "google/electra-base-discriminator",
    "flaubert": "flaubert/flaubert_base_cased",
    "longformer": "allenai/longformer-base-4096",
    "mobilebert": "google/mobilebert-uncased",
    "roberta": "roberta-base",
    "kobert": "monologg/kobert",
    "koelectra": "monologg/koelectra-base-v2-discriminator",
    "distilkobert": "monologg/distilkobert",
    "kcbert": "beomi/kcbert-base"
}

For how to use the model argument, refer to the shell-script file.

The text classification task does not require token_type_embedding, and many models do not support it, so token_type_embedding is not used by the models in this repository.

Even without token_type_embedding, reasonable performance was obtained when classifying two sentences concatenated with the [SEP] token.

When reusing and testing a trained model, you must set model_name_or_path to the folder where the model and its files are stored.

For kobert and distilkobert, make sure to include kobert in the folder name when loading the trained model, as in the example below.
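
For example, a trained model saved under a hypothetical data_out/kobert-nsmc folder could be reloaded for prediction as follows; the paths and values are illustrative only.

$ python main.py \
        --do_predict \
        --output_dir data_out/kobert-nsmc \
        --data_dir data_in \
        --model_name_or_path data_out/kobert-nsmc \
        --task_name nsmc \
        --num_labels 2 \
        --max_seq_length 128 \
        --per_device_eval_batch_size 32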

Requirements

torch==1.6.0
torchvision==0.7.0
tensorboard==2.3.0
transformers==3.0.2
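
For example, the pinned versions can be installed with pip (assuming a Python 3 environment; the repository may also provide a requirements.txt):

$ pip install torch==1.6.0 torchvision==0.7.0 tensorboard==2.3.0 transformers==3.0.2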

Usage

$ python main.py \
        --do_train \
        --do_eval \
        --do_predict \
        --evaluate_during_training \
        --output_dir <save_path> \
        --data_dir <data_path> \
        --cache_dir <cache_save_path> \
        --overwrite_output_dir \
        --model <model_name> \
        --model_name_or_path <model_name_or_path> \
        --seed <seed> \
        --save_total_limit <num> \
        --learning_rate <learning_rate> \
        --per_device_train_batch_size <train_batch_size> \
        --per_device_eval_batch_size <eval_batch_size> \
        --num_train_epochs <epoch> \
        --max_seq_length <max_length> \
        --task_name <task_name> \
        --num_labels <num_labels> \
        --eval_steps <eval_steps> \
        --logging_steps <logging_steps> \
        --save_steps <save_steps> \
        --warmup_steps <warmup_steps> \
        --gradient_accumulation_steps <gradient_accumulation_steps>

One of model and model_name_or_path must be provided. num_labels must be adjusted to match the number of labels in the dataset, as in the example below.
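
For instance, a minimal training run on the bundled nsmc data using the model shortcut might look like the following; the values shown are illustrative, not recommended settings.

$ python main.py \
        --do_train \
        --do_eval \
        --do_predict \
        --output_dir data_out/kobert-nsmc \
        --data_dir data_in \
        --overwrite_output_dir \
        --model kobert \
        --num_labels 2 \
        --num_train_epochs 3 \
        --per_device_train_batch_size 32 \
        --max_seq_length 128 \
        --task_name nsmc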

Argument descriptions can be found in the Huggingface Transformers documentation or as follows.

$ python main.py -h

You can also check examples through the shell-script files provided by default.

An execution example can be found in the Google Colab example.

Result

The results below were obtained using the provided shell scripts; no hyper-parameter tuning was performed.

Better performance may be obtained with hyper-parameter tuning, so these numbers are for reference only.

For KoNLI, the two sentences were joined with the [SEP] token instead of using token_type_embedding.

Korean

Model                            NSMC    KoNLI
bert-base-multilingual-cased     0.8748  0.7666
kobert                           0.903   0.7992
koelectra-base-v2-discriminator  0.8976  0.8193
distilkobert                     0.8860  0.6886
kcbert-base                      0.901   0.7572

English

Model                         SST-2   SST-5
bert-base-multilingual-cased  0.8775  0.4945
bert-base-uncased             0.9231  0.5533
albert-base-v2                0.9192  0.5565
distilbert-base-uncased       0.9115  0.5298
mobilebert-uncased            0.9071  0.5416
roberta-base                  0.9450  0.5701
longformer-base-4096          0.9511  0.5760
bart-base                     0.9261  0.5606
electra-base-discriminator    0.9533  0.5868

Reference

Huggingface Transformers
Huggingface Models
KoBERT
KoBERT-Transformers
DistilKoBERT
KoELECTRA
KcBERT
NSMC
KorNLI
SST2, SST5
