
SanghunYun / Uda_pytorch

Licence: apache-2.0
UDA(Unsupervised Data Augmentation) implemented by pytorch

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Uda_pytorch

Text rnn attention
Chinese text classification with RNN + attention and embedded Word2vec word vectors
Stars: ✭ 117 (-18.18%)
Mutual labels:  text-classification
Rcnn Text Classification
Tensorflow Implementation of "Recurrent Convolutional Neural Network for Text Classification" (AAAI 2015)
Stars: ✭ 127 (-11.19%)
Mutual labels:  text-classification
Bert serving
export bert model for serving
Stars: ✭ 138 (-3.5%)
Mutual labels:  text-classification
Classifier multi label textcnn
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification
Stars: ✭ 116 (-18.88%)
Mutual labels:  text-classification
Dan Jurafsky Chris Manning Nlp
My solution to the Natural Language Processing course made by Dan Jurafsky, Chris Manning in Winter 2012.
Stars: ✭ 124 (-13.29%)
Mutual labels:  text-classification
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (-11.19%)
Mutual labels:  text-classification
Pytorch Rnn Text Classification
Word Embedding + LSTM + FC
Stars: ✭ 112 (-21.68%)
Mutual labels:  text-classification
Onnxt5
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Stars: ✭ 143 (+0%)
Mutual labels:  text-classification
Cluedatasetsearch
Search all Chinese NLP datasets, with commonly used English NLP datasets included
Stars: ✭ 2,112 (+1376.92%)
Mutual labels:  text-classification
Hierarchical Multi Label Text Classification
The code for the CIKM'19 paper "Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach"
Stars: ✭ 133 (-6.99%)
Mutual labels:  text-classification
Context
ConText v4: Neural networks for text categorization
Stars: ✭ 120 (-16.08%)
Mutual labels:  text-classification
Python Stop Words
Get list of common stop words in various languages in Python
Stars: ✭ 122 (-14.69%)
Mutual labels:  text-classification
Textclassify with bert
Text classification using a BERT model; intended for industrial use
Stars: ✭ 128 (-10.49%)
Mutual labels:  text-classification
Bdci2017 Minglue
BDCI2017 - Let AI be the judge; 4th place in the finals (4/415) https://www.datafountain.cn/competitions/277/details
Stars: ✭ 118 (-17.48%)
Mutual labels:  text-classification
Document Classifier Lstm
A bidirectional LSTM with attention for multiclass/multilabel text classification.
Stars: ✭ 136 (-4.9%)
Mutual labels:  text-classification
Rnn Text Classification Tf
Tensorflow Implementation of Recurrent Neural Network (Vanilla, LSTM, GRU) for Text Classification
Stars: ✭ 114 (-20.28%)
Mutual labels:  text-classification
Ml Projects
ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python
Stars: ✭ 127 (-11.19%)
Mutual labels:  text-classification
Monkeylearn Python
Official Python client for the MonkeyLearn API. Build and consume machine learning models for language processing from your Python apps.
Stars: ✭ 143 (+0%)
Mutual labels:  text-classification
Parselawdocuments
Performs a series of analyses on collected legal documents, including automatic segmentation according to specifications, case similarity computation, case clustering, and legal provision recommendation (experiments currently target marriage-related cases and can be extended to other domains).
Stars: ✭ 138 (-3.5%)
Mutual labels:  text-classification
Nlp estimator tutorial
Educational material on using the TensorFlow Estimator framework for text classification
Stars: ✭ 131 (-8.39%)
Mutual labels:  text-classification

UDA(Unsupervised Data Augmentation) with BERT

This is a re-implementation of Google's UDA [paper][tensorflow] in PyTorch, built on Kakao Brain's Pytorchic BERT [pytorch].

Model   | UDA official | This repository
UDA (X) | 68%          |
UDA (O) | 90%          | 88.45%

(Max sequence length = 128, Train batch size = 8)

UDA

UDA (Unsupervised Data Augmentation) is a semi-supervised learning method that achieves SOTA results on a wide variety of language and vision tasks. With only 20 labeled examples, UDA outperforms the previous SOTA on IMDb, which was trained on 25,000 labeled examples (error rate: BERT = 4.51, UDA = 4.20).

  • Unsupervised Data Augmentation for Consistency Training (2019 Google Brain, Q Xie et al.)

- UDA with BERT

UDA works on top of BERT: it acts as an assistant to BERT during training, so in the picture above, model M is BERT.

- Loss

UDA consists of a supervised loss and an unsupervised loss. The supervised loss is the traditional cross-entropy loss, and the unsupervised loss is the KL-divergence between the model's outputs for an original example and for its augmented counterpart. In this project, I used the back-translation technique for augmentation.
The supervised loss and unsupervised loss are added to form a total loss, which is then minimized by gradient descent. Be careful: gradients do not flow through the original example's route; the model's weights are updated only through the labeled data and the augmented unlabeled data.
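
The sketch below illustrates how the two losses might be combined (a minimal illustration, not the exact code from train.py; the tensor names and the uda_coeff weight are assumptions):

    import torch.nn.functional as F

    def uda_loss(sup_logits, sup_labels, ori_logits, aug_logits, uda_coeff=1.0):
        """Sketch of the combined UDA objective: cross-entropy on the labeled
        batch plus a KL-divergence consistency term between the predictions for
        the original and back-translated (augmented) unlabeled examples."""
        # Supervised part: standard cross-entropy on labeled examples.
        sup_loss = F.cross_entropy(sup_logits, sup_labels)

        # Unsupervised part: KL(p_original || p_augmented). The original
        # predictions are detached so gradients flow only through the
        # augmented branch (and the supervised branch above).
        ori_prob = F.softmax(ori_logits.detach(), dim=-1)
        aug_log_prob = F.log_softmax(aug_logits, dim=-1)
        unsup_loss = F.kl_div(aug_log_prob, ori_prob, reduction='batchmean')

        # Total loss that is actually back-propagated.
        return sup_loss + uda_coeff * unsup_loss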

- TSA(Training Signal Annealing)

There is a large gap between the amount of unlabeled data and the amount of labeled data, so the model easily overfits the labeled data. The TSA technique therefore masks out labeled examples whose predicted probability for the correct class is greater than a threshold. The threshold is scheduled by a log, linear, or exponential function.
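
A minimal sketch of such a schedule and the corresponding mask (the schedule formulas follow the UDA paper; the function and variable names here are illustrative):

    import math
    import torch

    def tsa_threshold(schedule, global_step, total_steps, num_classes):
        """Training Signal Annealing threshold with the log / linear / exp
        schedules described in the UDA paper."""
        progress = global_step / total_steps
        if schedule == 'linear':
            alpha = progress
        elif schedule == 'log':
            alpha = 1.0 - math.exp(-progress * 5)
        elif schedule == 'exp':
            alpha = math.exp((progress - 1) * 5)
        else:
            raise ValueError(f'unknown TSA schedule: {schedule}')
        # Anneal from 1/K (chance level) up to 1.0 over the course of training.
        return alpha * (1 - 1 / num_classes) + 1 / num_classes

    def tsa_mask(sup_logits, sup_labels, threshold):
        """Boolean mask that is False for labeled examples the model is already
        too confident about (probability of the true class above the threshold)."""
        probs = torch.softmax(sup_logits, dim=-1)
        correct_prob = probs.gather(-1, sup_labels.unsqueeze(-1)).squeeze(-1)
        return correct_prob <= threshold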


- Sharpening Predictions

On its own, the KL-divergence loss between the original and augmented outputs can be too small, which causes the total loss to be dominated by the supervised loss. Sharpening-prediction techniques are therefore needed.

  • Confidence-based masking : Masking out examples that the current model is not confident about. Specifically, in each minibatch, the consistency loss term is computed only on examples whose highest probability among the classification categories is greater than a threshold.
  • Softmax temperature controlling : Used when computing the predictions on the original example. Specifically, the probability of the original example is computed as Softmax(l(x)/τ), where l(x) denotes the logits and τ is the temperature. A lower temperature corresponds to a sharper distribution.
    (UDA, 2019 Google Brain, Q Xie et al.)
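
A minimal sketch of how these two techniques could be applied to the consistency loss (the threshold and temperature values are placeholders, not the defaults used in this repository):

    import torch.nn.functional as F

    def sharpened_consistency_loss(ori_logits, aug_logits,
                                   confidence_threshold=0.8, temperature=0.4):
        """KL consistency loss with confidence-based masking and a softmax
        temperature applied to the original example's predictions."""
        # Softmax temperature controlling: sharpen the original predictions
        # by dividing the logits by tau (tau < 1 gives a sharper distribution).
        ori_prob = F.softmax(ori_logits.detach() / temperature, dim=-1)

        # Confidence-based masking: keep only examples whose highest predicted
        # probability (without temperature) exceeds the threshold.
        max_prob = F.softmax(ori_logits.detach(), dim=-1).max(dim=-1).values
        mask = (max_prob > confidence_threshold).float()

        aug_log_prob = F.log_softmax(aug_logits, dim=-1)
        per_example_kl = F.kl_div(aug_log_prob, ori_prob, reduction='none').sum(dim=-1)

        # Average the consistency loss only over the unmasked examples.
        return (per_example_kl * mask).sum() / mask.sum().clamp(min=1.0)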

Requirements

UDA : python > 3.6, fire, tqdm, tensorboardX, tensorflow, pytorch, pandas, numpy

Overview

  • download.sh : Downloads the pre-trained BERT model from Google's official BERT repository and the IMDb data file
  • load_data.py : Loads the supervised and unsupervised data
  • models.py : Model classes for a general transformer (from Pytorchic BERT's code)
  • main.py : Entry point covering the default BERT and UDA (TSA, sharpening) modes
  • train.py : A custom training class (Trainer class) adopted from Pytorchic BERT's code
  • utils
    • configuration.py : Sets up the configuration from a JSON file
    • checkpoint.py : Functions to load a model from a TensorFlow checkpoint file (from Pytorchic BERT's code)
    • optim.py : Optimizer (BERTAdam class) (from Pytorchic BERT's code)
    • tokenization.py : Tokenizers adopted from the original Google BERT code
    • utils.py : Custom utility functions adopted from Pytorchic BERT's code

Pre-works

- Download pre-trained BERT model and unzip IMDb data

First, you have to download the pre-trained BERT_base model from Google's BERT repository and unzip the IMDb data:

bash download.sh

After running it, you will have the pre-trained BERT_Base_Uncased model in the /BERT_Base_Uncased directory and the data in /data.

I use IMDb data that has already been pre-processed and augmented, extracted from the official UDA repository. If you want to use your own raw data, set need_prepro = True.

Example usage

This project is broadly divided into two parts: fine-tuning and evaluation.
Caution : Before running the code, you have to check and edit the config file.

  1. Fine-tuning
    You can choose the train mode (train or train_eval) in non-uda.json or uda.json (default : train_eval).

    • Non UDA fine-tuning

        python main.py \
            --cfg='config/non-uda.json' \
            --model_cfg='config/bert_base.json'
      
    • UDA fine-tuning

        python main.py \
            --cfg='config/uda.json' \
            --model_cfg='config/bert_base.json'
      
  2. Evaluation

  • The evaluation code dumps out a results file by default. You can change the dump option in main.py; there are two modes (real-time printing and writing a tsv file).

      python main.py \
          --cfg='config/eval.json' \
          --model_cfg='config/bert_base.json'
    

Acknowledgement

Thanks to the references of UDA and Pytorchic BERT, I was able to implement this code.

TODO

  1. It is known that further training (additional pre-training on the specific corpus, starting from an already pre-trained BERT) can improve performance. However, this repository does not include pre-training code, so it will be added. If you want to do further training now, you can use Pytorchic BERT's pretrain.py or any other BERT project.

  2. Korean dataset version
