
suvojit-0x55aa / mixed-precision-pytorch

License: WTFPL
Training with FP16 weights in PyTorch

Programming Languages

python

Projects that are alternatives of or similar to mixed-precision-pytorch

verificarlo
A tool for debugging and assessing floating point precision and reproducibility.
Stars: ✭ 51 (-29.17%)
Mutual labels:  precision, mixed-precision
NeuroEvolution-Flappy-Bird
A comparison between humans, neuroevolution and multilayer perceptrons playing Flappy Bird implemented in Python
Stars: ✭ 17 (-76.39%)
Mutual labels:  artificial-neural-networks
Android-SGTextView
A TextView with text stroke, gradient and shadow
Stars: ✭ 18 (-75%)
Mutual labels:  gradient
LimitlessUI
An awesome C# UI library that greatly reduces the limits on how your application looks
Stars: ✭ 41 (-43.06%)
Mutual labels:  gradient
tua-body-scroll-lock
🔐 Body scroll locking that just works with everything
Stars: ✭ 304 (+322.22%)
Mutual labels:  overflow
KitcheNette
KitcheNette: Predicting and Recommending Food Ingredient Pairings using Siamese Neural Networks
Stars: ✭ 52 (-27.78%)
Mutual labels:  artificial-neural-networks
Sequence-to-Sequence-Learning-of-Financial-Time-Series-in-Algorithmic-Trading
My bachelor's thesis—analyzing the application of LSTM-based RNNs on financial markets. 🤓
Stars: ✭ 64 (-11.11%)
Mutual labels:  artificial-neural-networks
Gradientable
Gradiention Protocol in iOS
Stars: ✭ 26 (-63.89%)
Mutual labels:  gradient
SimpleTypes
The universal PHP library to convert any values and measures (money, weight, currency converter, length, etc.).
Stars: ✭ 56 (-22.22%)
Mutual labels:  weight
Vision2018
The GeniSys TASS Devices & Applications use Siamese Neural Networks and Triplet Loss to classify known and unknown faces.
Stars: ✭ 17 (-76.39%)
Mutual labels:  artificial-neural-networks
artificial neural networks
A collection of Methods and Models for various architectures of Artificial Neural Networks
Stars: ✭ 40 (-44.44%)
Mutual labels:  artificial-neural-networks
sweetconfirm.js
👌 A useful zero-dependency, less than 434 bytes (gzipped), pure JavaScript & CSS solution for dropping annoying pop-ups that confirm form submission in your web apps.
Stars: ✭ 34 (-52.78%)
Mutual labels:  gradient
dl-relu
Deep Learning using Rectified Linear Units (ReLU)
Stars: ✭ 20 (-72.22%)
Mutual labels:  artificial-neural-networks
GradientProgressView
A simple progress bar widget
Stars: ✭ 15 (-79.17%)
Mutual labels:  gradient
siamese-BERT-fake-news-detection-LIAR
Triple Branch BERT Siamese Network for fake news classification on LIAR-PLUS dataset in PyTorch
Stars: ✭ 96 (+33.33%)
Mutual labels:  artificial-neural-networks
modelhub
A collection of deep learning models with a unified API.
Stars: ✭ 59 (-18.06%)
Mutual labels:  artificial-neural-networks
SKTextureGradient
A SpriteKit SKTexture Gradient
Stars: ✭ 27 (-62.5%)
Mutual labels:  gradient
SaferIntegers.jl
These integer types use checked arithmetic; otherwise they behave like the system integer types.
Stars: ✭ 46 (-36.11%)
Mutual labels:  overflow
autodiff
A .NET library that provides fast, accurate and automatic differentiation (computes derivative / gradient) of mathematical functions.
Stars: ✭ 69 (-4.17%)
Mutual labels:  gradient
GradientProgress
A gradient progress bar (UIProgressView).
Stars: ✭ 38 (-47.22%)
Mutual labels:  gradient

Mixed Precision Training in PyTorch


Training in FP16, that is, in half precision, results in slightly faster training on NVIDIA cards that support half-precision ops. The memory requirements of the model's weights are also almost halved, since we store the weights in a 16-bit format instead of 32 bits.

Training in half precision has its own caveats, however. The problems encountered in half-precision training are:

  • Imprecise weight update
  • Gradients underflow
  • Reductions overflow

Below is a discussion on how to deal with these problems.

FP16 Basics

The IEEE-754 floating point standard states that, given a floating point number X with 2^E <= abs(X) < 2^(E+1), the distance from X to the next largest representable floating point number (epsilon) is:

  • epsilon = 2^(E-52) [For a 64-bit float (double precision)]
  • epsilon = 2^(E-23) [For a 32-bit float (single precision)]
  • epsilon = 2^(E-10) [For a 16-bit float (half precision)]

The above equations allow us to compute the following (a quick numeric check appears after the list):

  • For half precision...

    If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^10. Any larger than this and the distance between floating point numbers is greater than 0.5.

    If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 1. Any larger than this and the distance between floating point numbers is greater than 0.0005.

  • For single precision...

    If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^23. Any larger than this and the distance between floating point numbers is greater than 0.5.

    If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 2^13. Any larger than this and the distance between floating point numbers is greater than 0.0005.

  • For double precision...

    If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^52. Any larger than this and the distance between floating point numbers is greater than 0.5.

    If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 2^42. Any larger than this and the distance between floating point numbers is greater than 0.0005.
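
A quick numeric check of these spacings (a minimal sketch using NumPy; any recent version will do):

    import numpy as np

    # machine epsilon: spacing between 1.0 and the next representable value
    print(np.finfo(np.float16).eps)   # 0.000977  = 2**-10
    print(np.finfo(np.float32).eps)   # 1.19e-07  = 2**-23

    # at magnitude 2**11 the FP16 spacing is 2, so adding 1 is lost entirely
    print(np.float16(2048) + np.float16(1))   # 2048.0
    print(np.float32(2048) + np.float32(1))   # 2049.0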

Imprecise Weight Update

Thus, while training our network, we need that added precision, since our weights go through many small updates. For example, 1 + 0.0001 results in:

  • 1.0001 in FP32
  • but in FP16 it will be 1

What that means is that we risk underflow (attempting to represent numbers so small they clamp to zero) and overflow (numbers so large they become infinities or NaNs). With underflow, our network never learns anything, and with overflow, it learns garbage.
To overcome this we keep an "FP32 master copy": a copy of our FP16 model weights in FP32. We use these master parameters for the weight update and then copy them back into our model. Likewise, the gradients computed in the model are copied into the master copy before each update.
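
A minimal sketch of this master-copy bookkeeping is shown below; the helper names are illustrative, not necessarily the ones used in train.py:

    import torch

    def prep_param_lists(model):
        # FP16 parameters that the model computes with
        model_params = [p for p in model.parameters() if p.requires_grad]
        # FP32 "master" copies that the optimizer actually updates
        master_params = [p.detach().clone().float() for p in model_params]
        for p in master_params:
            p.requires_grad = True
        return model_params, master_params

    def model_grads_to_master_grads(model_params, master_params):
        # copy the FP16 gradients into the FP32 master copy
        for model_p, master_p in zip(model_params, master_params):
            if master_p.grad is None:
                master_p.grad = torch.zeros_like(master_p)
            master_p.grad.copy_(model_p.grad)

    def master_params_to_model_params(model_params, master_params):
        # copy the updated FP32 weights back into the FP16 model
        with torch.no_grad():
            for model_p, master_p in zip(model_params, master_params):
                model_p.copy_(master_p)

The optimizer is then constructed over master_params, so the weight update itself happens in FP32.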

Gradients Underflow

Gradients are sometimes not representable in FP16, which leads to the gradient underflow problem. A way to deal with this is to shift the gradients into a range representable by half-precision floats. This can be done by multiplying the loss by a large number such as 2^7, which shifts the gradients computed during loss.backward() into the FP16-representable range. When we then copy these gradients to the FP32 master copy, we scale them back down by dividing by the same scaling factor.
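
The sketch below shows how loss scaling slots into a single training step, reusing the hypothetical helpers from the previous sketch; model, criterion, optimizer and the data are assumed to come from the surrounding training loop, and the optimizer is assumed to hold master_params:

    def fp16_train_step(model, criterion, optimizer,
                        model_params, master_params,
                        inputs, targets, scale=2 ** 7):
        # forward pass in FP16, loss computed in FP32
        loss = criterion(model(inputs.half()).float(), targets)

        model.zero_grad()
        (loss * scale).backward()          # shift gradients into the FP16 range

        # move gradients to the FP32 master copy and undo the scaling
        model_grads_to_master_grads(model_params, master_params)
        for p in master_params:
            p.grad.div_(scale)

        optimizer.step()                   # FP32 weight update on the master params
        master_params_to_model_params(model_params, master_params)
        return loss.item()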

Reduction Overflow

Another caveat of half precision is that large reductions may overflow. For example, consider two tensors:

  • a = torch.Tensor(4094).fill_(4.0).cuda()
  • b = torch.Tensor(4095).fill_(4.0).cuda()

If we do a.sum() and b.sum(), the results are 16376 and 16380 respectively, as expected, in single precision. But the same ops in half precision give 16376 and 16384 respectively. To overcome this problem we perform reduction ops, such as BatchNorm and the loss calculation, in FP32.
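
This can be reproduced directly (assuming a CUDA device; the exact accumulation behaviour can vary slightly between PyTorch versions):

    import torch

    a = torch.Tensor(4094).fill_(4.0).cuda()
    b = torch.Tensor(4095).fill_(4.0).cuda()

    print(a.sum().item(), b.sum().item())                # 16376.0 16380.0  (FP32)
    print(a.half().sum().item(), b.half().sum().item())  # 16376.0 16384.0  (FP16)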

All these problems have been kept in mind so that we can successfully train with FP16 weights. The implementation of the above ideas can be found in train.py.
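
For example, keeping BatchNorm in FP32 while the rest of the model runs in FP16 can be done with a small recursive helper like the one below (an illustrative sketch; see train.py for the actual implementation):

    import torch.nn as nn

    def bn_convert_float(module):
        # flip BatchNorm layers back to FP32 so their reductions stay accurate
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.float()
        for child in module.children():
            bn_convert_float(child)
        return module

    # convert the model to FP16 first, then restore BatchNorm to FP32
    # model = bn_convert_float(model.half())

The loss is handled similarly by casting the network output to FP32 before the criterion, as in the training-step sketch above.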

Usage Instructions

python main.py [-h] [--lr LR] [--steps STEPS] [--gpu] [--fp16] [--loss_scaling] [--model MODEL]

PyTorch (FP16) CIFAR10 Training

optional arguments:
  -h, --help            Show this help message and exit
  --lr LR               Learning Rate
  --steps STEPS, -n STEPS
                        No of Steps
  --gpu, -p             Train on GPU
  --fp16                Train with FP16 weights
  --loss_scaling, -s    Scale FP16 losses
  --model MODEL, -m MODEL
                        Name of Network

To run in FP32 mode, use:
python main.py -n 200 -p --model resnet50

To train with FP16 weights, use:
python main.py -n 200 -p --fp16 -s --model resnet50
The -s flag enables loss scaling.

Results

Training ResNet50 on a single P100 (Pascal) GPU with a batch size of 128 over 200 epochs gave the following results.

                FP32      Mixed Precision
  Time/Epoch    1m32s     1m15s
  Storage       90 MB     46 MB
  Accuracy      94.50%    94.43%

Training ResNet50 on 4x Tesla K80 GPUs with a batch size of 512 over 200 epochs.

                FP32      Mixed Precision
  Time/Epoch    1m24s     1m17s
  Storage       90 MB     46 MB
  Accuracy      94.634%   94.922%

Training ResNet50 on 4x Tesla P100 GPUs with a batch size of 512 over 200 epochs.

                FP32        Mixed Precision
  Time/Epoch    26s224ms    23s359ms
  Storage       90 MB       46 MB
  Accuracy      94.51%      94.78%

Training ResNet50 on a single Volta V100 GPU with a batch size of 128 over 200 epochs.

                FP32        Mixed Precision
  Time/Epoch    47s112ms    25s601ms
  Storage       90 MB       46 MB
  Accuracy      94.87%      94.65%

Training ResNet50 on 4x Volta V100 GPUs with a batch size of 512 over 200 epochs.

                FP32        Mixed Precision
  Time/Epoch    17s841ms    12s833ms
  Storage       90 MB       46 MB
  Accuracy      94.38%      94.60%

Speedup of the row setup with respect to the column setup is summarized in the following table.

               1xP100:FP32   1xP100:FP16   4xP100:FP32   4xP100:FP16   1xV100:FP32   1xV100:FP16   4xV100:FP32
  1xP100:FP16  22.67 %       0.0 %         0.0 %         0.0 %         0.0 %         0.0 %         0.0 %
  4xP100:FP32  250.82 %      186.0 %       0.0 %         0.0 %         79.65 %       0.0 %         0.0 %
  4xP100:FP16  293.85 %      221.08 %      12.27 %       0.0 %         101.69 %      9.6 %         0.0 %
  1xV100:FP32  95.28 %       59.2 %        0.0 %         0.0 %         0.0 %         0.0 %         0.0 %
  1xV100:FP16  259.36 %      192.96 %      2.43 %        0.0 %         84.02 %       0.0 %         0.0 %
  4xV100:FP32  415.67 %      320.38 %      46.99 %       30.93 %       164.07 %      43.5 %        0.0 %
  4xV100:FP16  616.9 %       484.43 %      104.35 %      82.02 %       267.12 %      99.49 %       39.02 %

TODO
  • Test with all nets.
  • Test models on Volta GPUs.
  • Test runtimes on multi GPU setup.

Further Explorations:

Convenience:

NVIDIA provides the apex library, which handles all the caveats of training in mixed precision. It also provides an API for multiprocess distributed training with NCCL, as well as SyncBatchNorm, which reduces statistics across processes during multiprocess distributed data-parallel training.
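
As an illustration, the classic apex.amp entry points look roughly like this (assuming an existing model, optimizer and loss from a standard training loop; the exact behaviour depends on the installed apex version):

    from apex import amp

    # wrap the model/optimizer pair; "O1" selects mixed precision with dynamic loss scaling
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    # inside the training loop, scale the loss before calling backward()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()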


Thanks:

The project heavily borrows from @kuangliu's project pytorch-cifar. The models have been borrowed directly from that repository with minimal changes, so thanks to @kuangliu for maintaining such an awesome project.
