
guolinke / Tupe

License: MIT
Transformer with Untied Positional Encoding (TUPE). Code for the paper "Rethinking Positional Encoding in Language Pre-training". Improves existing models like BERT.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Tupe

FNet-pytorch
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms
Stars: ✭ 204 (+42.66%)
Mutual labels:  transformer, language-model
Nlp Paper
NLP Paper
Stars: ✭ 484 (+238.46%)
Mutual labels:  language-model, transformer
MinTL
MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems
Stars: ✭ 61 (-57.34%)
Mutual labels:  transformer, language-model
Gpt Scrolls
A collaborative collection of open-source safe GPT-3 prompts that work well
Stars: ✭ 195 (+36.36%)
Mutual labels:  language-model, transformer
Gpt2
PyTorch Implementation of OpenAI GPT-2
Stars: ✭ 64 (-55.24%)
Mutual labels:  language-model, transformer
Highway-Transformer
[ACL‘20] Highway Transformer: A Gated Transformer.
Stars: ✭ 26 (-81.82%)
Mutual labels:  transformer, language-model
Bert Pytorch
Google AI 2018 BERT pytorch implementation
Stars: ✭ 4,642 (+3146.15%)
Mutual labels:  language-model, transformer
Relational Rnn Pytorch
An implementation of DeepMind's Relational Recurrent Neural Networks in PyTorch.
Stars: ✭ 236 (+65.03%)
Mutual labels:  language-model, transformer
Vietnamese Electra
Electra pre-trained model using Vietnamese corpus
Stars: ✭ 55 (-61.54%)
Mutual labels:  language-model, transformer
Gpt2 French
GPT-2 French demo
Stars: ✭ 47 (-67.13%)
Mutual labels:  language-model, transformer
Neural sp
End-to-end ASR/LM implementation with PyTorch
Stars: ✭ 408 (+185.31%)
Mutual labels:  language-model, transformer
Pytorch Openai Transformer Lm
🐥A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI
Stars: ✭ 1,268 (+786.71%)
Mutual labels:  language-model, transformer
Awesome Bert Nlp
A curated list of NLP resources focused on BERT, attention mechanism, Transformer networks, and transfer learning.
Stars: ✭ 567 (+296.5%)
Mutual labels:  language-model, transformer
Indonesian Language Models
Indonesian Language Models and its Usage
Stars: ✭ 64 (-55.24%)
Mutual labels:  language-model, transformer
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+38880.42%)
Mutual labels:  language-model, transformer
Bertqa Attention On Steroids
BertQA - Attention on Steroids
Stars: ✭ 112 (-21.68%)
Mutual labels:  transformer
Transformer In Generating Dialogue
An Implementation of 'Attention is all you need' with Chinese Corpus
Stars: ✭ 121 (-15.38%)
Mutual labels:  transformer
Overlappredator
[CVPR 2021, Oral] PREDATOR: Registration of 3D Point Clouds with Low Overlap.
Stars: ✭ 106 (-25.87%)
Mutual labels:  transformer
Getlang
Natural language detection package in pure Go
Stars: ✭ 110 (-23.08%)
Mutual labels:  language-model
Electra
Pre-trained Chinese ELECTRA model, pre-trained with adversarial learning
Stars: ✭ 132 (-7.69%)
Mutual labels:  language-model

TUPE (Transformer with Untied Positional Encoding)

Implementation for the paper Rethinking Positional Encoding in Language Pre-training.

Brief Introduction

This repo demonstrates TUPE (Transformer with Untied Positional Encoding). Details of the algorithm can be found in our paper. TUPE outperforms other baselines on the GLUE benchmark by a large margin. In particular, it can reach a higher score than the baselines while using only 30% of the pre-training computational cost.

Due to limited computational resources, we use the most widely used pre-training model, BERT-Base, for verification. However, please note that our method can be applied to larger (and stronger) Transformer-based models, such as RoBERTa, ELECTRA, and UniLM, and further improve them. Since the modification is simple, you can easily apply TUPE to your own models.
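
At a high level, TUPE computes the word-to-word and position-to-position correlations with separate (untied) projections and adds them in the attention logits. The following is a minimal single-head sketch of that idea, using hypothetical names (Wq, Wk, Uq, Uk) and omitting the [CLS]-untying described in the paper; it is not the actual code in fairseq/modules/multihead_attention.py.

import math
import torch

def untied_attention_logits(x, pos_emb, Wq, Wk, Uq, Uk, rel_bias=None):
    """Single-head attention logits with untied word and position terms (sketch).

    x:        (seq_len, d) token representations
    pos_emb:  (seq_len, d) absolute positional embeddings, shared across layers
    Wq, Wk:   (d, d) projections for the word part
    Uq, Uk:   (d, d) separate projections for the positional part
    rel_bias: optional (seq_len, seq_len) relative-position bias (the TUPE-R variant)
    """
    d = x.size(-1)
    word_term = (x @ Wq) @ (x @ Wk).transpose(-1, -2)             # word-to-word correlation
    pos_term = (pos_emb @ Uq) @ (pos_emb @ Uk).transpose(-1, -2)  # position-to-position correlation
    logits = (word_term + pos_term) / math.sqrt(2 * d)            # both terms share the sqrt(2d) scaling
    if rel_bias is not None:                                      # TUPE-R: add a relative-position bias
        logits = logits + rel_bias
    return logits

seq_len, d = 8, 64
x, p = torch.randn(seq_len, d), torch.randn(seq_len, d)
Wq, Wk, Uq, Uk = (torch.randn(d, d) for _ in range(4))
print(untied_attention_logits(x, p, Wq, Wk, Uq, Uk).shape)        # torch.Size([8, 8])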

Our implementation is based on fairseq, with several changes:

  1. Updated fairseq/modules/transformer_sentence_encoder.py and fairseq/modules/multihead_attention.py to implement untied positional encoding.
  2. Made some other minor changes to support --max-epoch together with --warmup-ratio in fine-tuning, instead of setting different --total-num-update and --warmup-updates values for each task (see the sketch after this list).
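
As an illustration of the second change, the warmup schedule can be derived from the epoch count and a ratio instead of hand-setting per-task update counts. The names below are hypothetical; they are not fairseq options or functions from this repo.

def schedule_from_ratio(max_epoch, updates_per_epoch, warmup_ratio=0.06):
    # Derive the schedule endpoints from a ratio instead of hard-coded
    # per-task numbers (illustrative helper only).
    total_updates = max_epoch * updates_per_epoch
    warmup_updates = int(total_updates * warmup_ratio)
    return total_updates, warmup_updates

# e.g. 10 epochs at 250 updates per epoch with a 6% warmup -> (2500, 150)
print(schedule_from_ratio(10, 250))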

Requirements and Installation

For more details, see fairseq. Briefly, you will need:

  • PyTorch
  • Python version >= 3.5
  • NVIDIA's apex library with the --cuda_ext installation option, for mixed-precision training (an optional availability check is sketched after this list)
  • You may need NCCL for multi-node distributed training
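
If you plan to train with --fp16, you can optionally verify that apex was built with its CUDA extensions before launching a long job. The module names below follow the apex README and may differ across apex versions.

# Optional sanity check for apex's fused CUDA extensions (assumption: `amp_C`
# is built when apex is installed with --cuda_ext; adjust for your apex version).
try:
    from apex import amp  # noqa: F401
    import amp_C          # noqa: F401
    print("apex with CUDA extensions is available")
except ImportError as err:
    print("apex CUDA extensions not found:", err)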

Installing from source

To install TUPE from source and develop locally:

git clone https://github.com/guolinke/TUPE
cd TUPE
pip install --editable .
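
A quick way to confirm the editable install is picked up (assuming the package keeps the upstream name fairseq and exposes __version__ as upstream fairseq does):

import fairseq
print(fairseq.__version__)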

Getting Started

Data Pre-Processing

The pre-processing relies on mosesdecoder; you can run the following commands to pull it in as a submodule.

cd TUPE
git submodule update --init

Pretraining Data

Refer to the steps in preprocess/pretrain/process.sh.

Downstream Data

Refer to the steps in preprocess/glue/process.sh.

Pre-Training

DATA_DIR=./path_to_your_data/
SAVE_DIR=./your_own_save_path/
TOTAL_UPDATES=1000000
WARMUP_UPDATES=10000
PEAK_LR=0.0001
MAX_POSITIONS=512
MAX_SENTENCES=16
UPDATE_FREQ=1
SEED=your_seed
python train.py $DATA_DIR --fp16 --num-workers 16 --ddp-backend=c10d \
    --task masked_lm --criterion masked_lm --arch bert_base \
    --sample-break-mode complete --tokens-per-sample $MAX_POSITIONS \
    --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --clip-norm 1.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ --seed $SEED \
    --mask-prob 0.15 \
    --embedding-normalize \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 100 \
    --keep-updates-list 100000 300000 600000 1000000 \
    --save-interval-updates 25000 --keep-interval-updates 3 --no-epoch-checkpoints --skip-invalid-size-inputs-valid-test \
    --save-dir $SAVE_DIR --rel-pos

The above settings are for 16 V100 GPUs, giving an effective batch size of 256 (n_gpu * MAX_SENTENCES * UPDATE_FREQ). You may need to change MAX_SENTENCES or UPDATE_FREQ to fit your environment, as shown below. To disable relative positions, remove --rel-pos.
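
The batch-size arithmetic is a simple product of three factors; the snippet below (plain Python, illustrative names) makes the adjustment for a different GPU count explicit.

# Effective batch size = number of GPUs * MAX_SENTENCES * UPDATE_FREQ.
def effective_batch_size(n_gpu, max_sentences, update_freq):
    return n_gpu * max_sentences * update_freq

assert effective_batch_size(16, 16, 1) == 256  # the reference 16-GPU setting above
assert effective_batch_size(8, 16, 2) == 256   # 8 GPUs: double UPDATE_FREQ to keep 256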

Fine-Tuning

DATA_DIR=./path_to_your_downstream_data
SAVE_DIR=./path_to_your_save_dir
BERT_MODEL_PATH=./path_to_your_checkpoint
BATCH_SIZE=32
N_EPOCH=10     # 5 for MNLI, QNLI, QQP
SEED=your_seed
WARMUP_RATIO=0.06
N_CLASSES=2     # 3 for MNLI, 1 for STS-B
LR=0.00005     # search from 2e-5, 3e-5, 4e-5, 5e-5
METRIC=accuracy     # mcc for CoLA, pearson for STS-B

python train.py $DATA_DIR --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
    --restore-file $BERT_MODEL_PATH \
    --max-positions 512 \
    --max-sentences $BATCH_SIZE \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch bert_base \
    --criterion sentence_prediction \
    --num-classes $N_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-06 \
    --clip-norm 1.0 --validate-interval-updates 2 \
    --lr-scheduler polynomial_decay --lr $LR --warmup-ratio $WARMUP_RATIO \
    --max-epoch $N_EPOCH --seed $SEED --save-dir $SAVE_DIR --no-progress-bar --log-interval 100 --no-epoch-checkpoints --no-last-checkpoints --no-best-checkpoints \
    --find-unused-parameters --skip-invalid-size-inputs-valid-test --truncate-sequence --embedding-normalize \
    --tensorboard-logdir . \
    --best-checkpoint-metric $METRIC --maximize-best-checkpoint-metric --rel-pos

To speed up fine-tuning, we set N_EPOCH=5 for MNLI, QNLI, and QQP, and N_EPOCH=10 for the other tasks. For MNLI, we set N_CLASSES=3 and add --valid-subset valid,valid1 to evaluate MNLI-m/-mm together. STS-B is a regression task, so we set N_CLASSES=1 and additionally use --regression-target and METRIC=pearson. For CoLA, we set METRIC=mcc.

LR is searched over {2e-5, 3e-5, 4e-5, 5e-5}; each LR is run with 5 different seeds, and the median over seeds is taken as the result for that LR. The result of the best LR is then reported, as sketched below.
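
The selection procedure can be summarized as follows; the scores in results are made-up placeholders, used only to show the median-over-seeds step.

import statistics

# Hypothetical dev-set scores: learning rate -> results from 5 seeds (placeholders).
results = {
    2e-5: [84.1, 84.3, 83.9, 84.0, 84.2],
    3e-5: [84.5, 84.2, 84.6, 84.4, 84.3],
    4e-5: [84.0, 84.4, 84.1, 84.2, 83.8],
    5e-5: [83.7, 83.9, 84.0, 83.6, 83.8],
}
median_per_lr = {lr: statistics.median(scores) for lr, scores in results.items()}
best_lr = max(median_per_lr, key=median_per_lr.get)
print(f"best LR: {best_lr}, median score over seeds: {median_per_lr[best_lr]}")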

NOTE: If your pre-trained model used --rel-pos, you should also set --rel-pos during fine-tuning; otherwise, remove it.

We also release the checkpoint of TUPE-R (trained with --rel-pos) for reproducibility.

Reference

You can cite our paper with:

@inproceedings{ke2021rethinking,
  title={Rethinking Positional Encoding in Language Pre-training},
  author={Guolin Ke and Di He and Tie-Yan Liu},
  booktitle={International Conference on Learning Representations},
  year={2021},
  url={https://openreview.net/forum?id=09-528y2Fgf}
}