
LiyuanLucasLiu / Transformer Clinic

Licence: apache-2.0
Understanding the Difficulty of Training Transformers

Programming Languages

python

Projects that are alternatives to or similar to Transformer Clinic

Joeynmt
Minimalist NMT for educational purposes
Stars: ✭ 420 (+134.64%)
Mutual labels:  nmt, transformer
Quality-Estimation1
Machine translation subtask: translation quality estimation, reproducing the results of Alibaba's WMT 2018 paper
Stars: ✭ 19 (-89.39%)
Mutual labels:  transformer, nmt
Nmt Keras
Neural Machine Translation with Keras
Stars: ✭ 501 (+179.89%)
Mutual labels:  nmt, transformer
pynmt
a simple and complete pytorch implementation of neural machine translation system
Stars: ✭ 13 (-92.74%)
Mutual labels:  transformer, nmt
Njunmt Tf
An open-source neural machine translation system developed by Natural Language Processing Group, Nanjing University.
Stars: ✭ 97 (-45.81%)
Mutual labels:  nmt, transformer
Onnxt5
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Stars: ✭ 143 (-20.11%)
Mutual labels:  transformer
Nlp pytorch project
Embedding, NMT, Text_Classification, Text_Generation, NER etc.
Stars: ✭ 153 (-14.53%)
Mutual labels:  nmt
Subword Nmt
Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Stars: ✭ 1,819 (+916.2%)
Mutual labels:  nmt
Transformer In Generating Dialogue
An Implementation of 'Attention is all you need' with Chinese Corpus
Stars: ✭ 121 (-32.4%)
Mutual labels:  transformer
Transformers.jl
Julia Implementation of Transformer models
Stars: ✭ 173 (-3.35%)
Mutual labels:  transformer
Eeg Dl
A Deep Learning library for EEG Tasks (Signals) Classification, based on TensorFlow.
Stars: ✭ 165 (-7.82%)
Mutual labels:  transformer
Routing Transformer
Fully featured implementation of Routing Transformer
Stars: ✭ 149 (-16.76%)
Mutual labels:  transformer
Tupe
Transformer with Untied Positional Encoding (TUPE). Code of paper "Rethinking Positional Encoding in Language Pre-training". Improve existing models like BERT.
Stars: ✭ 143 (-20.11%)
Mutual labels:  transformer
Hrnet Semantic Segmentation
The OCR approach is rephrased as Segmentation Transformer: https://arxiv.org/abs/1909.11065. This is an official implementation of semantic segmentation for HRNet. https://arxiv.org/abs/1908.07919
Stars: ✭ 2,369 (+1223.46%)
Mutual labels:  transformer
Nlp research
NLP research: TensorFlow-based NLP deep learning projects covering four tasks: text classification, sentence matching, sequence labeling, and text generation
Stars: ✭ 141 (-21.23%)
Mutual labels:  transformer
Gpt 2 Tensorflow2.0
OpenAI GPT2 pre-training and sequence prediction implementation in Tensorflow 2.0
Stars: ✭ 172 (-3.91%)
Mutual labels:  transformer
Nmtpy
nmtpy is a Python framework based on dl4mt-tutorial to experiment with Neural Machine Translation pipelines.
Stars: ✭ 127 (-29.05%)
Mutual labels:  nmt
Bigbird
Transformers for Longer Sequences
Stars: ✭ 146 (-18.44%)
Mutual labels:  transformer
Effective transformer
Running BERT without Padding
Stars: ✭ 169 (-5.59%)
Mutual labels:  transformer
Transformer Pytorch
Transformer implementation in PyTorch.
Stars: ✭ 149 (-16.76%)
Mutual labels:  transformer


Admin

Understanding the Difficulty of Training Transformers

Guided by our analyses, we propose Adaptive Model Initialization (Admin), which successfully stabilizes previously-diverged Transformer training and achieves better performance, without introducing additional hyper-parameters. Admin is adapted for better half-precision stability and can be reparameterized into the original Transformer.

We are in an early-release beta. Expect some adventures and rough edges.

Table of Contents

Introduction
Quick Start Guide
Citation

Introduction

What complicates Transformer training?

In our study, we go beyond gradient vanishing and identify an amplification effect that substantially influences Transformer training. Specifically, for each layer in a multi-layer Transformer, heavy dependency on its residual branch makes training unstable, yet light dependency leads to sub-optimal performance.

Dependency and Amplification Effect

Our analysis starts from the observation that Pre-LN is more robust than Post-LN, whereas Post-LN typically leads to better performance. As shown in Figure 1, we find that these two variants have different layer dependency patterns.

With further exploration, we find that for an N-layer residual network, after updating its parameters W to W*, the change in its output is proportional to its dependency on the residual branches.
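
Written schematically (this is only a restatement of the sentence above in our own notation, not the paper's exact theorem): for an N-layer network F with input x, parameters W updated to W*, and \beta_i denoting layer i's dependency on its residual branch,

    \| F(x, W^{*}) - F(x, W) \| \;\propto\; \sum_{i=1}^{N} \beta_i

so the more the layers rely on their residual branches, the larger the output shift caused by a parameter update.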

Intuitively, since a larger output change indicates a less smooth loss surface, heavy dependency complicates training. Motivated by this, we propose Admin (adaptive model initialization), which starts training from an area with a smoother loss surface. More details can be found in our paper.
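
To make this concrete, below is a minimal PyTorch sketch of an Admin-style residual connection, assuming the formulation described above: the shortcut is rescaled by a per-dimension parameter omega, which a profiling pass initializes from the variance accumulated by the preceding residual branches so that no single branch dominates the layer output early in training. The names AdminResidual and profile_init are illustrative, not the repository's API.

import torch
import torch.nn as nn

class AdminResidual(nn.Module):
    """Illustrative Admin-style residual connection: LayerNorm(omega * x + f(x))."""

    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer                        # attention or feed-forward block
        self.omega = nn.Parameter(torch.ones(d_model))  # per-dimension shortcut scale
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.omega * x + self.sublayer(x))

    @torch.no_grad()
    def profile_init(self, accumulated_variance: float) -> None:
        # Hypothetical profiling step: set omega from the output variance accumulated
        # by the preceding residual branches, following the idea described in the paper.
        self.omega.fill_(accumulated_variance ** 0.5)

Because omega only rescales the shortcut, it can be absorbed into adjacent weights after training, which is consistent with the reparameterization into the original Transformer mentioned above.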

Quick Start Guide

Our implementation is based on the fairseq package (Python 3.6 and PyTorch 1.5/1.6 are recommended). It can be installed with:

git clone https://github.com/LiyuanLucasLiu/Transforemr-Clinic.git
cd fairseq
pip install --editable .

Guidance for reproducing our results is available in the repository.

Specifically, our implementation requires a two-stage procedure: first set --init-type adaptive-profiling and run the profiling stage on a single GPU, then set --init-type adaptive and start training.
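
As a rough sketch of what such a run looks like (the --init-type values come from this repository; the dataset path, architecture, and omitted hyper-parameters are placeholders, and the exact commands are given in the reproduction guides):

# Stage 1: profiling pass on a single GPU
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt14_en_de \
    --arch transformer_wmt_en_de --init-type adaptive-profiling

# Stage 2: full training with the profiled Admin initialization
fairseq-train data-bin/wmt14_en_de \
    --arch transformer_wmt_en_de --init-type adaptive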

Citation

Please cite the following papers if you find our model useful. Thanks!

Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han (2020). Understanding the Difficulty of Training Transformers. Proc. 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP'20).

@inproceedings{liu2020admin,
  title={Understanding the Difficulty of Training Transformers},
  author = {Liu, Liyuan and Liu, Xiaodong and Gao, Jianfeng and Chen, Weizhu and Han, Jiawei},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)},
  year={2020}
}

Xiaodong Liu, Kevin Duh, Liyuan Liu, and Jianfeng Gao (2020). Very Deep Transformers for Neural Machine Translation. arXiv preprint arXiv:2008.07772.

@article{liu_deep_2020,
 author = {Liu, Xiaodong and Duh, Kevin and Liu, Liyuan and Gao, Jianfeng},
 journal = {arXiv preprint arXiv:2008.07772},
 title = {Very Deep Transformers for Neural Machine Translation},
 year = {2020}
}