
lonePatient / Bert Multi Label Text Classification

License: MIT
This repo contains a PyTorch implementation of a pretrained BERT model for multi-label text classification.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Bert Multi Label Text Classification

Text Cnn
CNN Chinese text classification with embedded Word2vec word vectors
Stars: ✭ 298 (-37%)
Mutual labels:  text-classification
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (-23.89%)
Mutual labels:  text-classification
Multi Class Text Classification Cnn
Classify Kaggle Consumer Finance Complaints into 11 classes. Build the model with a CNN (Convolutional Neural Network) and word embeddings on TensorFlow.
Stars: ✭ 410 (-13.32%)
Mutual labels:  text-classification
Gather Deployment
Gathers scalable tensorflow and infrastructure deployment
Stars: ✭ 326 (-31.08%)
Mutual labels:  text-classification
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (-24.31%)
Mutual labels:  text-classification
Bert Multitask Learning
BERT for Multitask Learning
Stars: ✭ 380 (-19.66%)
Mutual labels:  text-classification
Bert For Sequence Labeling And Text Classification
This is template code for using BERT for sequence labeling and text classification, in order to facilitate applying BERT to more tasks. Currently, the template code covers CoNLL-2003 named entity recognition, Snips slot filling, and intent prediction.
Stars: ✭ 293 (-38.05%)
Mutual labels:  text-classification
Tfclassifier
TensorFlow-based training and classification scripts for text, images, etc.
Stars: ✭ 441 (-6.77%)
Mutual labels:  text-classification
Nlp Projects
word2vec, sentence2vec, machine reading comprehension, dialog system, text classification, pretrained language model (i.e., XLNet, BERT, ELMo, GPT), sequence labeling, information retrieval, information extraction (i.e., entity, relation and event extraction), knowledge graph, text generation, network embedding
Stars: ✭ 360 (-23.89%)
Mutual labels:  text-classification
Whatlang Rs
Natural language detection library for Rust. Try demo online: https://www.greyblake.com/whatlang/
Stars: ✭ 400 (-15.43%)
Mutual labels:  text-classification
Text Classification Cnn Rnn
CNN-RNN Chinese text classification, based on TensorFlow
Stars: ✭ 3,613 (+663.85%)
Mutual labels:  text-classification
Artificial Adversary
🗣️ Tool to generate adversarial text examples and test machine learning models against them
Stars: ✭ 348 (-26.43%)
Mutual labels:  text-classification
Zhihu Text Classification
[2017 Zhihu Kanshan Cup, multi-label text classification] Team ye's solution (6th place)
Stars: ✭ 392 (-17.12%)
Mutual labels:  text-classification
Bert seq2seq
PyTorch implementation of BERT for seq2seq tasks using the UniLM scheme; it can now also handle automatic summarization, text classification, sentiment analysis, NER, POS tagging, and similar tasks, and supports GPT-2 for text continuation.
Stars: ✭ 298 (-37%)
Mutual labels:  text-classification
Keras Text
Text Classification Library in Keras
Stars: ✭ 421 (-10.99%)
Mutual labels:  text-classification
Textfooler
A Model for Natural Language Attack on Text Classification and Inference
Stars: ✭ 298 (-37%)
Mutual labels:  text-classification
Rmdl
RMDL: Random Multimodel Deep Learning for Classification
Stars: ✭ 375 (-20.72%)
Mutual labels:  text-classification
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+4546.51%)
Mutual labels:  text-classification
Sequence Semantic Embedding
Tools and recipes to train deep learning models and build services for NLP tasks such as text classification, semantic search ranking and recall fetching, cross-lingual information retrieval, and question answering.
Stars: ✭ 435 (-8.03%)
Mutual labels:  text-classification
Multi Label Text Classification
Multi-Label Text Classification Based on Neural Networks.
Stars: ✭ 397 (-16.07%)
Mutual labels:  text-classification

BERT multi-label text classification with PyTorch

This repo contains a PyTorch implementation of the pretrained BERT and XLNet models for multi-label text classification.
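
The core of the approach: BERT's pooled sentence representation feeds a linear layer with one output per label, trained with a sigmoid/binary cross-entropy objective so that each label is an independent yes/no decision. Below is a minimal sketch of that setup (illustrative only; the repo's actual model lives under pybert/model, and the class name here is an assumption):

# Minimal sketch of a BERT multi-label head (illustrative; not the repo's code).
import torch.nn as nn
from transformers import BertModel

class BertMultiLabel(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        # transformers 2.5.1 returns (sequence_output, pooled_output)
        _, pooled_output = self.bert(input_ids, attention_mask=attention_mask)
        return self.classifier(pooled_output)  # raw logits, one per label

# Each label is scored independently, so the loss is binary cross-entropy
# over sigmoid outputs rather than a single softmax.
loss_fn = nn.BCEWithLogitsLoss()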

Structure of the code

At the root of the project, you will see:

├── pybert
|  └── callback
|  |  └── lrscheduler.py  
|  |  └── trainingmonitor.py 
|  |  └── ...
|  └── config
|  |  └── basic_config.py # a configuration file for storing model parameters (see the hypothetical sketch below)
|  └── dataset   
|  └── io    
|  |  └── dataset.py  
|  |  └── data_transformer.py  
|  └── model
|  |  └── nn 
|  |  └── pretrain 
|  └── output # saves the model output
|  └── preprocessing # text preprocessing
|  └── train # used for training a model
|  |  └── trainer.py 
|  |  └── ...
|  └── common # a set of utility functions
├── run_bert.py
├── run_xlnet.py
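
For orientation, a config module like basic_config.py usually just centralizes paths and hyperparameters in one place. The sketch below is hypothetical; every key and value is an assumption, not the repo's actual settings:

# Hypothetical shape of pybert/config/basic_config.py (illustrative only;
# the real file defines the repo's actual paths and hyperparameters).
from pathlib import Path

BASE_DIR = Path("pybert")
config = {
    "raw_data_path": BASE_DIR / "dataset" / "train.csv",  # Kaggle data, see step 7 below
    "bert_model_dir": BASE_DIR / "pretrain" / "bert" / "base-uncased",
    "checkpoint_dir": BASE_DIR / "output" / "checkpoints",
    "batch_size": 16,
    "learning_rate": 2e-5,
    "epochs": 4,
}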

Dependencies

  • csv
  • tqdm
  • numpy
  • pickle
  • scikit-learn
  • PyTorch 1.1+
  • matplotlib
  • pandas
  • transformers==2.5.1
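
A quick sanity check that the environment matches this list (a sketch; only the PyTorch and transformers pins come from the list above):

# Verify the two pinned dependencies before running the scripts.
from packaging import version
import torch
import transformers

assert version.parse(torch.__version__) >= version.parse("1.1"), "PyTorch 1.1+ required"
assert transformers.__version__ == "2.5.1", "this repo expects transformers==2.5.1"
print("torch", torch.__version__, "| transformers", transformers.__version__)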

How to use the code

You need to download the pretrained BERT and XLNet models.

BERT: bert-base-uncased

XLNet: xlnet-base-cased

  1. Download the BERT pretrained model from S3

  2. Download the BERT config file from S3

  3. Download the BERT vocab file from S3

  4. Rename:

    • bert-base-uncased-pytorch_model.bin to pytorch_model.bin
    • bert-base-uncased-config.json to config.json
    • bert-base-uncased-vocab.txt to bert_vocab.txt
  5. Place the model, config, and vocab files into the /pybert/pretrain/bert/base-uncased directory (see the loading sketch after this list).

  6. Install pytorch-transformers from GitHub via pip.

  7. Download the Kaggle data and place it in pybert/dataset.

    • You can modify io.task_data.py to adapt it to your own data.
  8. Modify the configuration information in pybert/configs/basic_config.py (the data paths, etc.).

  9. Run python run_bert.py --do_data to preprocess the data.

  10. Run python run_bert.py --do_train --save_best --do_lower_case to fine-tune the BERT model.

  11. Run python run_bert.py --do_test --do_lower_case to predict on new data.
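
As a sanity check after steps 4-5, the renamed files should load with transformers. The sketch below is a hedged illustration, not the repo's own loader (that lives under pybert/model); num_labels=6 matches the six Kaggle toxic-comment labels reported below:

# Check that the renamed checkpoint, config, and vocab from steps 4-5
# line up with what transformers expects (a sketch, not the repo's loader).
from transformers import BertConfig, BertTokenizer, BertForSequenceClassification

BASE_DIR = "pybert/pretrain/bert/base-uncased"
config = BertConfig.from_json_file(f"{BASE_DIR}/config.json")
config.num_labels = 6  # six labels in the toxic-comment task below
tokenizer = BertTokenizer(vocab_file=f"{BASE_DIR}/bert_vocab.txt", do_lower_case=True)
model = BertForSequenceClassification.from_pretrained(f"{BASE_DIR}/pytorch_model.bin", config=config)
print(type(model).__name__, "loaded with", config.num_labels, "labels")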

Training

[training] 8511/8511 [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] - 0.8s/step - loss: 0.0640
Training result:
[2019-01-14 04:01:05]: bert-multi-label trainer.py[line:176] INFO  
Epoch: 2 - loss: 0.0338 - val_loss: 0.0373 - val_auc: 0.9922

Training figure

Result

---- train report every label -----
Label: toxic - auc: 0.9903
Label: severe_toxic - auc: 0.9913
Label: obscene - auc: 0.9951
Label: threat - auc: 0.9898
Label: insult - auc: 0.9911
Label: identity_hate - auc: 0.9910
---- valid report every label -----
Label: toxic - auc: 0.9892
Label: severe_toxic - auc: 0.9911
Label: obscene - auc: 0.9945
Label: threat - auc: 0.9955
Label: insult - auc: 0.9903
Label: identity_hate - auc: 0.9927
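
Per-label AUCs like those above can be reproduced with scikit-learn's roc_auc_score, one label column at a time. A sketch with placeholder arrays (only the label names come from the task; real use would pass the validation targets and the model's sigmoid outputs):

# Per-label AUC, as in the report above (placeholder data for illustration).
import numpy as np
from sklearn.metrics import roc_auc_score

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(64, len(labels)))  # stand-in for real targets
y_prob = rng.random(size=(64, len(labels)))          # stand-in for sigmoid outputs
for i, name in enumerate(labels):
    print(f"Label: {name} - auc: {roc_auc_score(y_true[:, i], y_prob[:, i]):.4f}")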

Tips

  • When converting the TensorFlow checkpoint into PyTorch, be sure to choose "bert_model.ckpt", not "bert_model.ckpt.index", as the input file. Otherwise, the model will learn nothing and give almost the same random outputs for any input, which means the true checkpoint was never loaded into the model
  • When using multiple GPUs, non-tensor calculations such as accuracy and f1_score are not supported by a DataParallel instance
  • As recommended by Jacob Devlin in the BERT paper (https://arxiv.org/pdf/1810.04805.pdf), the fine-tuning hyperparameters should be set as follows: batch size: 16 or 32; learning rate: 5e-5, 3e-5, or 2e-5; number of training epochs: 3 or 4
  • The pretrained model limits its input to 512 tokens, the maximum position-embedding length. Data flows into the model as: raw data -> WordPieces -> model. Since the WordPiece sequence is generally longer than the raw text, a safe maximum raw-text length is roughly 128-256 (see the sketch after these tips)
  • In our testing, fine-tuning all layers gave much better results than fine-tuning only the last classifier layer; the latter is effectively a feature-based approach
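
To make the length numbers in the tips concrete, the WordPiece expansion is easy to observe directly (a sketch using the stock bert-base-uncased tokenizer, not the repo's preprocessing pipeline):

# Show the raw text -> WordPieces expansion and truncation below the 512 cap
# (stock tokenizer; the repo's own pipeline lives under pybert/preprocessing).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "Multilabel classification assigns several tags to one comment."
tokens = tokenizer.tokenize(text)
print(len(text.split()), "words ->", len(tokens), "WordPieces")
ids = tokenizer.encode(text, max_length=256)  # stay well below the 512-token limit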