
lonePatient / Bert Multi Label Text Classification

License: MIT
This repo contains a PyTorch implementation of a pretrained BERT model for multi-label text classification.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Bert Multi Label Text Classification

Text Cnn
CNN Chinese text classification with embedded Word2vec word vectors
Stars: ✭ 298 (-37%)
Mutual labels:  text-classification
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (-23.89%)
Mutual labels:  text-classification
Multi Class Text Classification Cnn
Classify Kaggle Consumer Finance Complaints into 11 classes. Build the model with a CNN (Convolutional Neural Network) and word embeddings on TensorFlow.
Stars: ✭ 410 (-13.32%)
Mutual labels:  text-classification
Gather Deployment
Gathers scalable tensorflow and infrastructure deployment
Stars: ✭ 326 (-31.08%)
Mutual labels:  text-classification
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (-24.31%)
Mutual labels:  text-classification
Bert Multitask Learning
BERT for Multitask Learning
Stars: ✭ 380 (-19.66%)
Mutual labels:  text-classification
Bert For Sequence Labeling And Text Classification
This is template code for using BERT for sequence labeling and text classification, in order to facilitate applying BERT to more tasks. Currently, the template code covers CoNLL-2003 named entity recognition, Snips slot filling, and intent prediction.
Stars: ✭ 293 (-38.05%)
Mutual labels:  text-classification
Tfclassifier
TensorFlow-based training and classification scripts for text, images, etc.
Stars: ✭ 441 (-6.77%)
Mutual labels:  text-classification
Nlp Projects
word2vec, sentence2vec, machine reading comprehension, dialog system, text classification, pretrained language model (i.e., XLNet, BERT, ELMo, GPT), sequence labeling, information retrieval, information extraction (i.e., entity, relation and event extraction), knowledge graph, text generation, network embedding
Stars: ✭ 360 (-23.89%)
Mutual labels:  text-classification
Whatlang Rs
Natural language detection library for Rust. Try demo online: https://www.greyblake.com/whatlang/
Stars: ✭ 400 (-15.43%)
Mutual labels:  text-classification
Text Classification Cnn Rnn
CNN-RNN Chinese text classification, based on TensorFlow
Stars: ✭ 3,613 (+663.85%)
Mutual labels:  text-classification
Artificial Adversary
🗣️ Tool to generate adversarial text examples and test machine learning models against them
Stars: ✭ 348 (-26.43%)
Mutual labels:  text-classification
Zhihu Text Classification
[2017 Zhihu Kanshan Cup, multi-label text classification] Team ye's solution (6th place)
Stars: ✭ 392 (-17.12%)
Mutual labels:  text-classification
Bert seq2seq
PyTorch implementation of BERT for seq2seq tasks using the UniLM scheme; it can now also handle automatic summarization, text classification, sentiment analysis, NER, POS tagging, and similar tasks, and supports GPT-2 for text continuation.
Stars: ✭ 298 (-37%)
Mutual labels:  text-classification
Keras Text
Text Classification Library in Keras
Stars: ✭ 421 (-10.99%)
Mutual labels:  text-classification
Textfooler
A Model for Natural Language Attack on Text Classification and Inference
Stars: ✭ 298 (-37%)
Mutual labels:  text-classification
Rmdl
RMDL: Random Multimodel Deep Learning for Classification
Stars: ✭ 375 (-20.72%)
Mutual labels:  text-classification
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+4546.51%)
Mutual labels:  text-classification
Sequence Semantic Embedding
Tools and recipes to train deep learning models and build services for NLP tasks such as text classification, semantic search ranking and recall fetching, cross-lingual information retrieval, and question answering.
Stars: ✭ 435 (-8.03%)
Mutual labels:  text-classification
Multi Label Text Classification
Multi-Label Text Classification Based on Neural Networks.
Stars: ✭ 397 (-16.07%)
Mutual labels:  text-classification

BERT multi-label text classification with PyTorch

This repo contains a PyTorch implementation of the pretrained BERT and XLNet models for multi-label text classification.
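
The core of the approach: BERT's pooled sentence representation feeds a linear layer with one output per label, trained with a sigmoid/binary cross-entropy objective so that each label is an independent yes/no decision. Below is a minimal sketch of that setup (illustrative only; the repo's actual model lives under pybert/model, and the class name here is an assumption):

# Minimal sketch of a BERT multi-label head (illustrative; not the repo's code).
import torch.nn as nn
from transformers import BertModel

class BertMultiLabel(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        # transformers 2.5.1 returns (sequence_output, pooled_output)
        _, pooled_output = self.bert(input_ids, attention_mask=attention_mask)
        return self.classifier(pooled_output)  # raw logits, one per label

# Each label is scored independently, so the loss is binary cross-entropy
# over sigmoid outputs rather than a single softmax.
loss_fn = nn.BCEWithLogitsLoss()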

Structure of the code

At the root of the project, you will see:

├── pybert
|  └── callback
|  |  └── lrscheduler.py  
|  |  └── trainingmonitor.py 
|  |  └── ...
|  └── config
|  |  └── basic_config.py # a configuration file for storing model parameters (see the hypothetical sketch below)
|  └── dataset   
|  └── io    
|  |  └── dataset.py  
|  |  └── data_transformer.py  
|  └── model
|  |  └── nn 
|  |  └── pretrain 
|  └── output # saves the model output
|  └── preprocessing # text preprocessing
|  └── train # used for training a model
|  |  └── trainer.py 
|  |  └── ...
|  └── common # a set of utility functions
├── run_bert.py
├── run_xlnet.py
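
For orientation, a config module like basic_config.py usually just centralizes paths and hyperparameters in one place. The sketch below is hypothetical; every key and value is an assumption, not the repo's actual settings:

# Hypothetical shape of pybert/config/basic_config.py (illustrative only;
# the real file defines the repo's actual paths and hyperparameters).
from pathlib import Path

BASE_DIR = Path("pybert")
config = {
    "raw_data_path": BASE_DIR / "dataset" / "train.csv",  # Kaggle data, see step 7 below
    "bert_model_dir": BASE_DIR / "pretrain" / "bert" / "base-uncased",
    "checkpoint_dir": BASE_DIR / "output" / "checkpoints",
    "batch_size": 16,
    "learning_rate": 2e-5,
    "epochs": 4,
}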

Dependencies

  • csv
  • tqdm
  • numpy
  • pickle
  • scikit-learn
  • PyTorch 1.1+
  • matplotlib
  • pandas
  • transformers==2.5.1
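
A quick sanity check that the environment matches this list (a sketch; only the PyTorch and transformers pins come from the list above):

# Verify the two pinned dependencies before running the scripts.
from packaging import version
import torch
import transformers

assert version.parse(torch.__version__) >= version.parse("1.1"), "PyTorch 1.1+ required"
assert transformers.__version__ == "2.5.1", "this repo expects transformers==2.5.1"
print("torch", torch.__version__, "| transformers", transformers.__version__)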

How to use the code

You need to download the pretrained BERT and XLNet models.

BERT: bert-base-uncased

XLNet: xlnet-base-cased

  1. Download the BERT pretrained model from S3

  2. Download the BERT config file from S3

  3. Download the BERT vocab file from S3

  4. Rename:

    • bert-base-uncased-pytorch_model.bin to pytorch_model.bin
    • bert-base-uncased-config.json to config.json
    • bert-base-uncased-vocab.txt to bert_vocab.txt
  5. Place the model, config, and vocab files into the /pybert/pretrain/bert/base-uncased directory (see the loading sketch after this list).

  6. Install pytorch-transformers from GitHub via pip.

  7. Download the Kaggle data and place it in pybert/dataset.

    • You can modify io.task_data.py to adapt it to your own data.
  8. Modify the configuration information in pybert/configs/basic_config.py (the data paths, etc.).

  9. Run python run_bert.py --do_data to preprocess the data.

  10. Run python run_bert.py --do_train --save_best --do_lower_case to fine-tune the BERT model.

  11. Run python run_bert.py --do_test --do_lower_case to predict on new data.
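
As a sanity check after steps 4-5, the renamed files should load with transformers. The sketch below is a hedged illustration, not the repo's own loader (that lives under pybert/model); num_labels=6 matches the six Kaggle toxic-comment labels reported below:

# Check that the renamed checkpoint, config, and vocab from steps 4-5
# line up with what transformers expects (a sketch, not the repo's loader).
from transformers import BertConfig, BertTokenizer, BertForSequenceClassification

BASE_DIR = "pybert/pretrain/bert/base-uncased"
config = BertConfig.from_json_file(f"{BASE_DIR}/config.json")
config.num_labels = 6  # six labels in the toxic-comment task below
tokenizer = BertTokenizer(vocab_file=f"{BASE_DIR}/bert_vocab.txt", do_lower_case=True)
model = BertForSequenceClassification.from_pretrained(f"{BASE_DIR}/pytorch_model.bin", config=config)
print(type(model).__name__, "loaded with", config.num_labels, "labels")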

Training

[training] 8511/8511 [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] - 0.8s/step - loss: 0.0640
Training result:
[2019-01-14 04:01:05]: bert-multi-label trainer.py[line:176] INFO  
Epoch: 2 - loss: 0.0338 - val_loss: 0.0373 - val_auc: 0.9922

Training figure

Result

---- train report every label -----
Label: toxic - auc: 0.9903
Label: severe_toxic - auc: 0.9913
Label: obscene - auc: 0.9951
Label: threat - auc: 0.9898
Label: insult - auc: 0.9911
Label: identity_hate - auc: 0.9910
---- valid report every label -----
Label: toxic - auc: 0.9892
Label: severe_toxic - auc: 0.9911
Label: obscene - auc: 0.9945
Label: threat - auc: 0.9955
Label: insult - auc: 0.9903
Label: identity_hate - auc: 0.9927
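
Per-label AUCs like those above can be reproduced with scikit-learn's roc_auc_score, one label column at a time. A sketch with placeholder arrays (only the label names come from the task; real use would pass the validation targets and the model's sigmoid outputs):

# Per-label AUC, as in the report above (placeholder data for illustration).
import numpy as np
from sklearn.metrics import roc_auc_score

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(64, len(labels)))  # stand-in for real targets
y_prob = rng.random(size=(64, len(labels)))          # stand-in for sigmoid outputs
for i, name in enumerate(labels):
    print(f"Label: {name} - auc: {roc_auc_score(y_true[:, i], y_prob[:, i]):.4f}")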

Tips

  • When converting the TensorFlow checkpoint into PyTorch, be sure to choose "bert_model.ckpt", not "bert_model.ckpt.index", as the input file. Otherwise, the model will learn nothing and give almost the same random outputs for any input, which means the true checkpoint was never loaded into the model
  • When using multiple GPUs, non-tensor calculations such as accuracy and f1_score are not supported by a DataParallel instance
  • As recommended by Jacob Devlin in the BERT paper (https://arxiv.org/pdf/1810.04805.pdf), the fine-tuning hyperparameters should be set as follows: batch size: 16 or 32; learning rate: 5e-5, 3e-5, or 2e-5; number of training epochs: 3 or 4
  • The pretrained model limits its input to 512 tokens, the maximum position-embedding length. Data flows into the model as: raw data -> WordPieces -> model. Since the WordPiece sequence is generally longer than the raw text, a safe maximum raw-text length is roughly 128-256 (see the sketch after these tips)
  • In our testing, fine-tuning all layers gave much better results than fine-tuning only the last classifier layer; the latter is effectively a feature-based approach
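
To make the length numbers in the tips concrete, the WordPiece expansion is easy to observe directly (a sketch using the stock bert-base-uncased tokenizer, not the repo's preprocessing pipeline):

# Show the raw text -> WordPieces expansion and truncation below the 512 cap
# (stock tokenizer; the repo's own pipeline lives under pybert/preprocessing).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "Multilabel classification assigns several tags to one comment."
tokens = tokenizer.tokenize(text)
print(len(text.split()), "words ->", len(tokens), "WordPieces")
ids = tokenizer.encode(text, max_length=256)  # stay well below the 512-token limit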