nawnoes / pytorch-gpt-x

Licence: other
Implementation of an autoregressive language model using an improved Transformer and DeepSpeed pipeline parallelism.

Programming Languages

python
shell

Projects that are alternatives of or similar to pytorch-gpt-x

FasterTransformer
Transformer related optimization, including BERT, GPT
Stars: ✭ 1,571 (+7380.95%)
Mutual labels:  transformer, gpt
TabFormer
Code & Data for "Tabular Transformers for Modeling Multivariate Time Series" (ICASSP, 2021)
Stars: ✭ 209 (+895.24%)
Mutual labels:  transformer, gpt
libai
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
Stars: ✭ 284 (+1252.38%)
Mutual labels:  transformer, pipeline-parallelism
NLP-paper
🎨🎨 NLP natural language processing tutorial 🎨🎨 https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (+9.52%)
Mutual labels:  transformer, gpt
wenet
Production First and Production Ready End-to-End Speech Recognition Toolkit
Stars: ✭ 2,384 (+11252.38%)
Mutual labels:  transformer
Context-Transformer
Context-Transformer: Tackling Object Confusion for Few-Shot Detection, AAAI 2020
Stars: ✭ 89 (+323.81%)
Mutual labels:  transformer
set-transformer
A neural network architecture for prediction on sets
Stars: ✭ 18 (-14.29%)
Mutual labels:  transformer
Kevinpro-NLP-demo
All the NLP you need here. Personal implementations of some fun NLP demos; currently includes PyTorch implementations of 13 NLP applications.
Stars: ✭ 117 (+457.14%)
Mutual labels:  transformer
YOLOS
You Only Look at One Sequence (NeurIPS 2021)
Stars: ✭ 612 (+2814.29%)
Mutual labels:  transformer
zero
Zero -- A neural machine translation system
Stars: ✭ 121 (+476.19%)
Mutual labels:  transformer
MusicTransformer-Pytorch
MusicTransformer written for MaestroV2 using the Pytorch framework for music generation
Stars: ✭ 106 (+404.76%)
Mutual labels:  transformer
AdaSpeech
AdaSpeech: Adaptive Text to Speech for Custom Voice
Stars: ✭ 108 (+414.29%)
Mutual labels:  transformer
gpt-j-api
API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend
Stars: ✭ 248 (+1080.95%)
Mutual labels:  gpt
Transformer-Transducer
PyTorch implementation of "Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss" (ICASSP 2020)
Stars: ✭ 61 (+190.48%)
Mutual labels:  transformer
transformer-tensorflow2.0
transformer in tensorflow 2.0
Stars: ✭ 53 (+152.38%)
Mutual labels:  transformer
paccmann proteomics
PaccMann models for protein language modeling
Stars: ✭ 28 (+33.33%)
Mutual labels:  transformer
A-Personal-Arch-Installation-Guide
A Personal Arch Installation Guide In Case of Amnesia
Stars: ✭ 58 (+176.19%)
Mutual labels:  gpt
alpine-linux-scripts
Alpine Linux Setup Scripts
Stars: ✭ 38 (+80.95%)
Mutual labels:  gpt
dodrio
Exploring attention weights in transformer-based models with linguistic knowledge.
Stars: ✭ 233 (+1009.52%)
Mutual labels:  transformer
wxml-transformer
Converts WeChat Mini Program WXML code into a JS object or an HTML fragment
Stars: ✭ 18 (-14.29%)
Mutual labels:  transformer

GPT-X

Implementation of an autoregressive language model (like GPT) using an improved Transformer and DeepSpeed pipeline parallelism.

Improved Transformer

The Transformer used in this repository attempts to improve on the vanilla Transformer with the additional modules listed below.

Name                        | Description                                                                                 | Link
ReZero                      | ReZero Is All You Need                                                                      | link
Explicit Sparse Transformer | Concentrated Attention Through Explicit Selection                                           | link
Macaron Architecture        | Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View  | link
RealFormer                  | Residual Attention                                                                          | link
ALiBi Position Embedding    | Effective relative positional encoding                                                      |
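
The decoder blocks in this repository combine several of these ideas; the partitioning log further down lists them as ReZeroSparseTopkDecoder layers. As a rough illustration only, and not the repository's actual code (class names, constructor arguments, and defaults here are assumptions), the sketch below isolates two of the ingredients: a ReZero residual gate (a learnable scalar initialized to zero replaces LayerNorm around the sublayer) and Explicit Sparse Transformer-style top-k masking of the attention scores.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReZeroSparseTopkSelfAttention(nn.Module):
    """Illustrative sketch only: ReZero residual + top-k sparse causal self-attention."""

    def __init__(self, d_model: int, n_heads: int, topk: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head, self.topk = n_heads, d_model // n_heads, topk
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # ReZero: the residual branch is scaled by a scalar initialized to 0,
        # so every block starts as the identity and no LayerNorm is required.
        self.resweight = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5

        # Causal mask for autoregressive decoding.
        causal = torch.ones(t, t, device=x.device).triu(1).bool()
        scores = scores.masked_fill(causal, float("-inf"))

        # Explicit Sparse Transformer: keep only the top-k scores per query and
        # mask everything else out before the softmax.
        kth = scores.topk(min(self.topk, t), dim=-1).values[..., -1:]
        scores = scores.masked_fill(scores < kth, float("-inf"))

        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        # ReZero residual: x + alpha * sublayer(x), with alpha starting at 0.
        return x + self.resweight * self.out(out)

ALiBi, the Macaron feed-forward layout, and RealFormer residual attention are omitted here to keep the sketch short.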

Model Description

model_name | n_params | n_layer | d_model | n_heads | vocab_size | max_seq_len | learning_rate
GPT-X 1B   | 1B       | 20      | 2048    | 16      | 22000      | 1024        | 2.0 x 10^-4
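
The same hyperparameters as a plain Python mapping, in case you want to wire them into a model constructor (the key names are assumptions chosen to mirror the table, not necessarily the repository's spelling):

# GPT-X 1B hyperparameters from the table above; key names are assumptions.
gpt_x_1b = dict(
    n_layer=20,
    d_model=2048,
    n_heads=16,
    vocab_size=22000,
    max_seq_len=1024,
    learning_rate=2.0e-4,  # 2.0 x 10^-4
)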

DeepSpeed

DeepSpeed is a deep learning training optimization library, providing the means to train massive, billion-parameter models at scale.
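
A minimal sketch of handing a PyTorch model to DeepSpeed. The config values below are illustrative (they mirror the micro_batches=4, micro_batch_size=1 setting in the pipeline log further down) and may differ from the repository's actual config; the tiny Linear model is only a stand-in.

import torch
import torch.nn as nn
import deepspeed

ds_config = {
    # 1 sample per GPU per micro-batch, accumulated 4 times (cf. the pipeline log below).
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 2.0e-4}},
}

model = nn.Linear(2048, 2048)  # stand-in for the real GPT-X model

# deepspeed.initialize wraps the model in a training engine that owns the
# optimizer, gradient accumulation, and (if configured) mixed precision.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(1, 2048, device=engine.device)
loss = engine(x).pow(2).mean()  # dummy loss, just to show the training step
engine.backward(loss)           # DeepSpeed handles scaling and accumulation
engine.step()                   # optimizer step (+ LR schedule, if configured)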

Pipeline Parallelism

You can train the 1B GPT-X model on 2 V100 GPUs (16GB) using DeepSpeed pipeline parallelism.
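
A rough sketch of what such a two-stage pipeline setup can look like with DeepSpeed's PipelineModule. The layer sequence mirrors the partitioning log below (Embedding, 20 decoder blocks, LayerNorm, Linear), but the construction here is an assumption, not the repository's code: the decoder class reuses the illustrative block from the earlier sketch as a stand-in for ReZeroSparseTopkDecoder, so its constructor arguments are guesses.

import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

vocab_size, d_model, n_heads, n_layers = 22000, 2048, 16, 20

layers = (
    [LayerSpec(nn.Embedding, vocab_size, d_model)]
    # ReZeroSparseTopkSelfAttention: the illustrative block from the earlier sketch,
    # standing in for the repository's ReZeroSparseTopkDecoder.
    + [LayerSpec(ReZeroSparseTopkSelfAttention, d_model, n_heads) for _ in range(n_layers)]
    + [LayerSpec(nn.LayerNorm, d_model),
       LayerSpec(nn.Linear, d_model, vocab_size)]
)

# num_stages=2 splits the layer list across the two V100s; partitioning by
# "parameters" balances the split by parameter count, as in the log below.
model = PipelineModule(
    layers=layers,
    num_stages=2,
    partition_method="parameters",
    loss_fn=nn.CrossEntropyLoss(),  # the log reports cross_entropy as the loss
)

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,  # e.g. the config from the previous sketch
)

A script built this way is normally started with the DeepSpeed launcher, for example deepspeed --num_gpus 2 train.py (the script name is hypothetical), and the engine is then driven with engine.train_batch().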

GPU Usage

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   42C    P0    44W / 250W |  16076MiB / 16130MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   45C    P0   168W / 250W |  16060MiB / 16130MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     29525      C   /home/ubuntu/anaconda3/bin/python          16065MiB |
|    1     29528      C   /home/ubuntu/anaconda3/bin/python          16049MiB |
+-----------------------------------------------------------------------------+

Pipeline Parallelism Log

[2021-12-31 12:24:20,042] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=4 micro_batch_size=1
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=1 STAGE=1 LAYERS=12 [11, 23) STAGE_PARAMS=548560916 (548.561M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=11 [0, 11) STAGE_PARAMS=550653972 (550.654M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:08,793] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=11
     0: Embedding
     1: ReZeroSparseTopkDecoder
     2: ReZeroSparseTopkDecoder
     3: ReZeroSparseTopkDecoder
     4: ReZeroSparseTopkDecoder
     5: ReZeroSparseTopkDecoder
     6: ReZeroSparseTopkDecoder
     7: ReZeroSparseTopkDecoder
     8: ReZeroSparseTopkDecoder
     9: ReZeroSparseTopkDecoder
    10: ReZeroSparseTopkDecoder
stage=1 layers=12
    11: ReZeroSparseTopkDecoder
    12: ReZeroSparseTopkDecoder
    13: ReZeroSparseTopkDecoder
    14: ReZeroSparseTopkDecoder
    15: ReZeroSparseTopkDecoder
    16: ReZeroSparseTopkDecoder
    17: ReZeroSparseTopkDecoder
    18: ReZeroSparseTopkDecoder
    19: ReZeroSparseTopkDecoder
    20: ReZeroSparseTopkDecoder
    21: LayerNorm
    22: Linear
  loss: cross_entropy

TODO

  • ReZero
  • RealFormer, Residual Attention
  • Macaron architecture
  • Macaron architecture - layer scale 0.5
  • Explicit Sparse Transformer
  • PyTorch Lightning
  • DeepSpeed training on a single GPU
  • Apply wandb
  • DeepSpeed pipeline parallel training on 2 V100 GPUs with 16GB memory

Parameter For Few-shot

GPT-3 has 175B parameters, and model size matters for few-shot learning. In this repository, I try to pretrain a language model as large as possible on 2 V100 GPUs.

GPT-3 Config

model_name | n_params | n_layer | d_model | n_heads | d_head | batch_size | learning_rate
GPT-3 175B | 175B     | 96      | 12288   | 96      | 128    | 3.2M       | 0.6 x 10^-4
GPT-3 13B  | 13B      | 40      | 5140    | 40      | 128    | 2M         | 1.0 x 10^-4
GPT-3 6.7B | 6.7B     | 32      | 4096    | 32      | 128    | 2M         | 1.2 x 10^-4
GPT-3 2.7B | 2.7B     | 32      | 2560    | 32      | 80     | 1M         | 1.6 x 10^-4
GPT-3 1.3B | 1.3B     | 24      | 2048    | 24      | 128    | 1M         | 2.0 x 10^-4

Issue

  • AttributeError: module 'deepspeed' has no attribute 'zero': reinstall DeepSpeed

  • UserWarning: CUDA initialization: The NVIDIA driver on your system is too old: reinstall PyTorch built for your CUDA version. My solution (V100 GPU, CUDA 10.1) is the command below; a quick sanity check is sketched after this list.

    pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
  • Can't find CUDA_HOME path: reinstall CUDA
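
For the driver/CUDA mismatch above, a quick way to check which CUDA build PyTorch was installed with (standard PyTorch calls, not specific to this repository):

import torch

print("torch version:  ", torch.__version__)         # e.g. 1.7.1+cu101
print("built for CUDA: ", torch.version.cuda)        # CUDA version the wheel targets
print("CUDA available: ", torch.cuda.is_available()) # False often means a driver/toolkit mismatch
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))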

References

Transformer

DeepSpeed

ReZero

Explicit Sparse Transformer

Macaron Architecture

RealFormer Residual Attention

Pipeline Parallelism
