
lonePatient / BERT-chinese-text-classification-pytorch

Licence: other
This repo contains a PyTorch implementation of a pretrained BERT model for text classification.

Programming Languages

python

Projects that are alternatives to or similar to BERT-chinese-text-classification-pytorch

Nlp chinese corpus
Large Scale Chinese Corpus for NLP: a large-scale Chinese corpus for natural language processing.
Stars: ✭ 6,656 (+7134.78%)
Mutual labels:  text-classification, chinese, bert
ERNIE-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained ERNIE model for text classification.
Stars: ✭ 49 (-46.74%)
Mutual labels:  text-classification, bert, chinese-text-classification
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of Chinese long and short text, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+64.13%)
Mutual labels:  text-classification, bert, chinese-text-classification
Eda nlp for chinese
An implementation of the EDA paper for Chinese corpora: an EDA data-augmentation tool for Chinese text, NLP data augmentation, and paper reading notes.
Stars: ✭ 660 (+617.39%)
Mutual labels:  text-classification, chinese
Text Classification Cnn Rnn
CNN and RNN based Chinese text classification, built on TensorFlow.
Stars: ✭ 3,613 (+3827.17%)
Mutual labels:  text-classification, chinese
Cluepretrainedmodels
A collection of high-quality Chinese pretrained models: state-of-the-art large models, the fastest small models, and dedicated similarity models.
Stars: ✭ 493 (+435.87%)
Mutual labels:  text-classification, chinese
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-76.09%)
Mutual labels:  text-classification, bert
Nlp xiaojiang
Natural language processing (NLP): the XiaoJiang bot (retrieval-based chitchat chatbot), BERT sentence vectors and similarity (Sentence Similarity), XLNet sentence vectors and similarity (text xlnet embedding), text classification, entity extraction (ner, bert+bilstm+crf), data augmentation (text augment, data enhance), synonym and paraphrase generation, sentence trunk extraction (mainpart), Chinese short-text similarity, text feature engineering, and a keras-http-service interface.
Stars: ✭ 954 (+936.96%)
Mutual labels:  text-classification, chinese
Lightnlp
A deep learning framework for natural language processing based on PyTorch and torchtext.
Stars: ✭ 739 (+703.26%)
Mutual labels:  text-classification, chinese
Cluedatasetsearch
Search across all Chinese NLP datasets, with commonly used English NLP datasets included.
Stars: ✭ 2,112 (+2195.65%)
Mutual labels:  text-classification, chinese
protonet-bert-text-classification
Fine-tune BERT for text classification on small datasets in a few-shot learning manner using ProtoNet.
Stars: ✭ 28 (-69.57%)
Mutual labels:  text-classification, bert
Chinese Text Classification
Chinese-Text-Classification: Chinese text classification implemented with a TensorFlow CNN (convolutional neural network). QQ group: 522785813, WeChat group QR code: http://www.tensorflownews.com/
Stars: ✭ 284 (+208.7%)
Mutual labels:  text-classification, chinese
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+2636.96%)
Mutual labels:  text-classification, bert
kwx
BERT, LDA, and TFIDF based keyword extraction in Python
Stars: ✭ 33 (-64.13%)
Mutual labels:  text-classification, bert
Cnn Text Classification Tf Chinese
CNN for Chinese Text Classification in Tensorflow
Stars: ✭ 237 (+157.61%)
Mutual labels:  text-classification, chinese
AiSpace
AiSpace: Better practices for deep learning model development and deployment for TensorFlow 2.0
Stars: ✭ 28 (-69.57%)
Mutual labels:  chinese, bert
Cnn Question Classification Keras
Chinese Question Classifier (Keras Implementation) on BQuLD
Stars: ✭ 28 (-69.57%)
Mutual labels:  text-classification, chinese
text2class
Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT
Stars: ✭ 15 (-83.7%)
Mutual labels:  text-classification, bert
Filipino-Text-Benchmarks
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Stars: ✭ 22 (-76.09%)
Mutual labels:  text-classification, bert
Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.
Stars: ✭ 2,235 (+2329.35%)
Mutual labels:  text-classification, bert

BERT Chinese text classification with PyTorch

This repo contains a PyTorch implementation of a pretrained BERT model for Chinese text classification.

Structure of the code

At the root of the project, you will see:

├── pybert
|  └── callback
|  |  └── lrscheduler.py  
|  |  └── trainingmonitor.py 
|  |  └── ...
|  └── config
|  |  └── base.py # a configuration file for storing model parameters
|  └── dataset   
|  └── io    
|  |  └── bert_processor.py
|  └── model
|  |  └── nn 
|  |  └── pretrain 
|  └── output # saves the output of the model
|  └── preprocessing # text preprocessing
|  └── train # used for training a model
|  |  └── trainer.py 
|  |  └── ...
|  └── utils # a set of utility functions
├── run_bert.py

Dependencies

  • csv
  • tqdm
  • numpy
  • pickle
  • scikit-learn
  • PyTorch 1.0
  • matplotlib
  • pytorch_transformers==1.1.0

How to use the code

You first need to download the pretrained Chinese BERT model:

  1. Download the BERT pretrained model weights from S3.
  2. Download the BERT config file from S3.
  3. Download the BERT vocab file from S3.
  4. Rename bert-base-chinese-pytorch_model.bin to pytorch_model.bin, bert-base-chinese-config.json to config.json, and bert-base-chinese-vocab.txt to vocab.txt.
  5. Place the model, config, and vocab files into the /pybert/pretrain/bert/base-uncased directory (a quick loading check is sketched after this list).
  6. pip install pytorch-transformers from GitHub.
  7. Prepare the dataset from BaiduNet (password: ruxu); you can modify io.bert_processor.py to adapt it to your own data.
  8. Modify the configuration information in pybert/config/base.py (the path of the data, ...).
  9. Run python run_bert.py --do_data to preprocess the data.
  10. Run python run_bert.py --do_train --save_best to fine-tune the BERT model.
  11. Run python run_bert.py --do_test --do_lower_case to predict on new data.
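
If the renamed files from steps 4-5 are in place, they should load directly with pytorch_transformers. The snippet below is a minimal sanity check, not part of this repo; the directory path is the one used in step 5.

# Minimal sanity check (not part of this repo): load the files placed in step 5.
from pytorch_transformers import BertTokenizer, BertModel

PRETRAIN_DIR = "pybert/pretrain/bert/base-uncased"  # contains pytorch_model.bin, config.json, vocab.txt

tokenizer = BertTokenizer.from_pretrained(PRETRAIN_DIR)  # reads vocab.txt
model = BertModel.from_pretrained(PRETRAIN_DIR)          # reads config.json + pytorch_model.bin

print(tokenizer.tokenize("今天天气不错"))  # WordPiece tokens for a short Chinese sentence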

Fine-tuning result

training

Epoch: 3 - loss: 0.0222 - acc: 0.9939 - f1: 0.9911 - val_loss: 0.0785 - val_acc: 0.9799 - val_f1: 0.9800

Classification report

label precision recall f1-score support
财经 0.97 0.96 0.96 1500
体育 1.00 1.00 1.00 1500
娱乐 0.99 0.99 0.99 1500
家居 0.99 0.99 0.99 1500
房产 0.96 0.97 0.96 1500
教育 0.98 0.97 0.97 1500
时尚 0.99 0.98 0.99 1500
时政 0.97 0.98 0.98 1500
游戏 1.00 0.99 0.99 1500
科技 0.96 0.97 0.97 1500
avg / total 0.98 0.98 0.98 15000

training figure

Tips

  • When converting the TensorFlow checkpoint to PyTorch, you are expected to choose "bert_model.ckpt", not "bert_model.ckpt.index", as the input file. Otherwise the model learns nothing and gives almost the same random outputs for any input, which means the true checkpoint was never loaded (a conversion sketch follows this list).
  • When using multiple GPUs, non-tensor calculations such as accuracy and f1_score are not supported by a DataParallel instance.
  • As recommended by Jacob Devlin in the paper https://arxiv.org/pdf/1810.04805.pdf, the fine-tuning hyperparameters are expected to be set as follows: batch_size: 16 or 32; learning_rate: 5e-5, 3e-5, or 2e-5; num_train_epochs: 3 or 4.
  • The pretrained model limits the input length to at most 512 tokens, the maximum position embedding size. Data flows into the model as: Raw_data -> WordPieces -> Model. Since the number of WordPieces is generally larger than the length of the raw data, a safe maximum length for the raw data is roughly 128-256 (see the truncation sketch after this list).
  • In our tests, fine-tuning all layers gave much better results than fine-tuning only the last classifier layer; the latter is essentially a feature-based approach.
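
The checkpoint-conversion tip can be illustrated with the sketch below, which mirrors the conversion logic shipped with pytorch_transformers (TensorFlow must be installed for this step). The chinese_L-12_H-768_A-12 paths are placeholders for wherever the Google TF checkpoint was extracted, not paths from this repo.

# Hedged conversion sketch: TF checkpoint -> pytorch_model.bin.
import torch
from pytorch_transformers import BertConfig, BertForPreTraining
from pytorch_transformers.modeling_bert import load_tf_weights_in_bert

# Pass the "bert_model.ckpt" prefix, NOT "bert_model.ckpt.index".
config = BertConfig.from_json_file("chinese_L-12_H-768_A-12/bert_config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, "chinese_L-12_H-768_A-12/bert_model.ckpt")
torch.save(model.state_dict(), "pytorch_model.bin")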
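
For the max-length tip, here is a hedged illustration of the Raw_data -> WordPieces step together with a common truncation safeguard; max_seq_len below is only an example value, not this repo's configured setting.

from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("pybert/pretrain/bert/base-uncased")

text = "今天天气不错,适合出去走走。" * 20   # a long raw input
pieces = tokenizer.tokenize(text)            # WordPiece count is usually >= the raw length
print(len(text), len(pieces))

max_seq_len = 256                            # example value only
pieces = pieces[: max_seq_len - 2]           # leave room for [CLS] and [SEP]
tokens = ["[CLS]"] + pieces + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)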