
ymcui / LAMB_Optimizer_TF

License: Apache-2.0
LAMB Optimizer for Large Batch Training (TensorFlow version)

Programming Languages

python

Projects that are alternatives of or similar to LAMB Optimizer TF

artificial-neural-variability-for-deep-learning
The PyTorch Implementation of Variable Optimizers/ Neural Variable Risk Minimization proposed in our Neural Computation paper: Artificial Neural Variability for Deep Learning: On overfitting, Noise Memorization, and Catastrophic Forgetting.
Stars: ✭ 34 (-71.43%)
Mutual labels:  optimizer
neuro-comma
🇷🇺 Production-ready punctuation restoration model for the Russian language 🇷🇺
Stars: ✭ 46 (-61.34%)
Mutual labels:  bert
prediction gan
PyTorch Impl. of Prediction Optimizer (to stabilize GAN training)
Stars: ✭ 31 (-73.95%)
Mutual labels:  optimizer
ChineseNER
All about Chinese NER
Stars: ✭ 241 (+102.52%)
Mutual labels:  bert
wisdomify
A BERT-based reverse dictionary of Korean proverbs
Stars: ✭ 95 (-20.17%)
Mutual labels:  bert
Fill-the-GAP
[ACL-WS] 4th place solution to gendered pronoun resolution challenge on Kaggle
Stars: ✭ 13 (-89.08%)
Mutual labels:  bert
gpl
Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577
Stars: ✭ 216 (+81.51%)
Mutual labels:  bert
keras gradient noise
Add gradient noise to any Keras optimizer
Stars: ✭ 36 (-69.75%)
Mutual labels:  optimizer
TwinBert
pytorch implementation of the TwinBert paper
Stars: ✭ 36 (-69.75%)
Mutual labels:  bert
les-military-mrc-rank7
LES Cup: Rank 7 solution to the 2nd National "Military Intelligent Machine Reading" Challenge
Stars: ✭ 37 (-68.91%)
Mutual labels:  bert
Kaleido-BERT
(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.
Stars: ✭ 252 (+111.76%)
Mutual labels:  bert
cmrc2019
A Sentence Cloze Dataset for Chinese Machine Reading Comprehension (CMRC 2019)
Stars: ✭ 118 (-0.84%)
Mutual labels:  bert
NLPDataAugmentation
Chinese NLP Data Augmentation, BERT Contextual Augmentation
Stars: ✭ 94 (-21.01%)
Mutual labels:  bert
horoscope
horoscope is an optimizer inspector for DBMS.
Stars: ✭ 34 (-71.43%)
Mutual labels:  optimizer
ERNIE-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained ERNIE model for text classification.
Stars: ✭ 49 (-58.82%)
Mutual labels:  bert
Text Classification TF
Various text classification models implemented in TensorFlow, wrapped with a RESTful interface and ready for production use
Stars: ✭ 32 (-73.11%)
Mutual labels:  bert
XTR-Toolbox
🛠 Versatile tool to optimize Windows
Stars: ✭ 138 (+15.97%)
Mutual labels:  optimizer
AliceMind
ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab
Stars: ✭ 1,479 (+1142.86%)
Mutual labels:  bert
embedding study
Learning character embeddings from Chinese pre-trained models, and testing the Chinese performance of BERT and ELMo
Stars: ✭ 94 (-21.01%)
Mutual labels:  bert
sister
SImple SenTence EmbeddeR
Stars: ✭ 66 (-44.54%)
Mutual labels:  bert

LAMB Optimizer (TensorFlow)

This is a simple implementation of the LAMB optimizer, which was proposed in the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes".

The paper was previously titled "Reducing BERT Pre-Training Time from 3 Days to 76 Minutes".

Update: the official implementation of the LAMB optimizer is now available: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py

Notes

  • This is NOT an official implementation.
  • The LAMB optimizer changed slightly between arXiv v1 and v3.
  • We implement the v3 version (the latest version as of June 2019).
  • Some unclear parts (such as the scaling function) were clarified by consulting the original authors.

Algorithm

The LAMB optimizer was originally designed for large-batch training of neural networks, but as the authors indicate, it can also be used with small batch sizes.

[Figure: LAMB algorithm pseudocode (algorithm.png)]
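
For readers who prefer code to pseudocode, here is a rough NumPy sketch of a single LAMB update for one parameter tensor, following the v3 description of the algorithm. The bias correction and the scaling function (taken as the identity here) are assumptions on my part; the actual implementation in optimization.py may differ in these details.

```python
import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """Illustrative LAMB update for one parameter tensor.

    w: parameter, g: gradient, m/v: first/second moment estimates,
    t: 1-based step count. Returns the updated (w, m, v).
    """
    # Adam-style moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * (g * g)
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Adam-like update direction plus decoupled weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    # Layer-wise trust ratio ||w|| / ||update||, with the scaling
    # function taken as the identity (an assumption in this sketch).
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    w = w - lr * trust_ratio * update
    return w, m, v
```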

Usage

The implementation is based on the BERT repository, which uses AdamWeightDecayOptimizer (defined in optimization.py) for pre-training and fine-tuning.

  • Just use LAMBOptimizer as a regular optimizer in TensorFlow, similar to Adam or AdamWeightDecayOptimizer (see the sketch below).
  • Find the LAMB optimizer in optimization.py.
  • There is nothing special to tune other than the initial learning_rate.
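
Below is a minimal TensorFlow 1.x sketch of how LAMBOptimizer might be wired into a training graph. The constructor arguments shown (weight_decay_rate, beta_1, beta_2, epsilon, exclude_from_weight_decay) are assumed to mirror BERT's AdamWeightDecayOptimizer; check optimization.py in this repository for the exact signature.

```python
# Minimal TensorFlow 1.x sketch; constructor arguments are assumed to
# mirror BERT's AdamWeightDecayOptimizer.
import numpy as np
import tensorflow as tf
from optimization import LAMBOptimizer  # provided by this repository

# Toy model: a single dense layer, just to show the wiring.
x = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

optimizer = LAMBOptimizer(
    learning_rate=1e-3,                 # the main hyperparameter to tune
    weight_decay_rate=0.01,             # assumed default
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-6,
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)
train_op = optimizer.apply_gradients(list(zip(grads, tvars)))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Feed real MNIST batches here; random data keeps the sketch self-contained.
    batch_x = np.random.rand(32, 784).astype(np.float32)
    batch_y = np.random.randint(0, 10, size=32)
    sess.run(train_op, feed_dict={x: batch_x, labels: batch_y})
```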

Results on MNIST

  • I don't have a TPU Pod to test its scalability on BERT with large batches 😂, so I tested it on MNIST to verify its effectiveness.
  • All optimizers use an initial learning rate of 0.001 (default settings); the learning rate was NOT scaled with the batch size (scaling may bring further gains; a sketch of one common scaling heuristic is shown after this list, but testing it is left to you).
  • All experiments were run on an NVIDIA Tesla T4.
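
For reference, one common heuristic for large-batch training is to scale the learning rate with the square root of the batch size. The snippet below only illustrates that heuristic; it was NOT used in the experiments reported here, where the learning rate was fixed at 0.001.

```python
# Illustration only: square-root learning-rate scaling, a common heuristic
# for large-batch training. Not used in the experiments above.
import math

def scaled_lr(base_lr=1e-3, base_batch_size=64, batch_size=1024):
    return base_lr * math.sqrt(batch_size / base_batch_size)

print(scaled_lr(batch_size=16384))  # 0.001 * sqrt(256) = 0.016
```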

Here are the results for five classical neural networks (MLP, CNN, Bi-RNN, Bi-GRU, Bi-LSTM) trained with different optimizers (Adam, AdamW, LAMB). All numbers are MNIST test accuracy (%).

Only the results for batch sizes of 64, 128, 1024, and 16384 are listed here. For the full results, please see FULL_RESULTS.md.

Batch=64

| Optimizer | MLP   | CNN   | Bi-RNN | Bi-GRU | Bi-LSTM | Note                          |
|-----------|-------|-------|--------|--------|---------|-------------------------------|
| Adam      | 97.03 | 98.93 | 96.24  | 98.92  | 99.04   | Just ordinary Adam            |
| AdamW     | 97.11 | 99.01 | 96.50  | 99.11  | 99.04   | Used in BERT                  |
| LAMB      | 98.27 | 99.33 | 97.73  | 98.83  | 98.94   | New optimizer for large batch |

Batch=128

| Optimizer | MLP   | CNN   | Bi-RNN | Bi-GRU | Bi-LSTM | Note                          |
|-----------|-------|-------|--------|--------|---------|-------------------------------|
| Adam      | 96.38 | 98.76 | 97.73  | 99.08  | 99.09   | Just ordinary Adam            |
| AdamW     | 96.57 | 98.72 | 98.05  | 98.96  | 99.00   | Used in BERT                  |
| LAMB      | 97.90 | 99.20 | 98.04  | 98.87  | 98.76   | New optimizer for large batch |

Batch=1024

| Optimizer | MLP   | CNN   | Bi-RNN | Bi-GRU | Bi-LSTM | Note                          |
|-----------|-------|-------|--------|--------|---------|-------------------------------|
| Adam      | 93.05 | 97.92 | 98.10  | 98.94  | 98.67   | Just ordinary Adam            |
| AdamW     | 93.67 | 98.00 | 98.19  | 98.86  | 98.82   | Used in BERT                  |
| LAMB      | 97.68 | 98.82 | 98.27  | 98.61  | 98.47   | New optimizer for large batch |

Batch=16384

| Optimizer | MLP   | CNN   | Bi-RNN | Bi-GRU | Bi-LSTM | Note                          |
|-----------|-------|-------|--------|--------|---------|-------------------------------|
| Adam      | 88.46 | 95.06 | 95.98  | 97.81  | 97.74   | Just ordinary Adam            |
| AdamW     | 91.46 | 96.57 | 96.34  | 98.45  | 98.39   | Used in BERT                  |
| LAMB      | 93.23 | 97.89 | 93.76  | 87.60  | 80.36   | New optimizer for large batch |

Several Conclusions

Note: the following conclusions are drawn only from the results above.

  • LAMB outperforms Adam and AdamW most of the time and gives consistent results across different batch sizes.
  • LAMB shows a clear advantage over Adam and AdamW at large batch sizes, demonstrating its excellent scalability.
  • LAMB fails to outperform Adam and AdamW on the more complex RNN-based models, regardless of batch size.

Reproducibility

Check mnist_tensorflow.ipynb for details.

Note: GPU/TPU runs will not reproduce exactly the same results even with a fixed random seed.

References

  • Y. You et al., "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes", arXiv:1904.00962.

Issues

For help or issues, please submit a GitHub issue.
