
bindog / pytorch-model-parallel

Licence: other
A memory-balanced and communication-efficient model parallel implementation of a FullyConnected layer with CrossEntropyLoss in PyTorch

Programming Languages

Python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to pytorch-model-parallel

PLSC
Paddle Large Scale Classification Tools; supports ArcFace, CosFace, PartialFC, and Data Parallel + Model Parallel. Models include ResNet, ViT, DeiT, FaceViT.
Stars: ✭ 113 (+52.7%)
Mutual labels:  distributed-training, model-parallel
ReID-PCB RPP
Beyond Part Models: Person Retrieval with Refined Part Pooling
Stars: ✭ 70 (-5.41%)
Mutual labels:  re-id
Paddle
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle (飞桨) core framework: high-performance single-machine and distributed training for deep learning & machine learning, plus cross-platform deployment)
Stars: ✭ 17,232 (+23186.49%)
Mutual labels:  distributed-training
libai
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
Stars: ✭ 284 (+283.78%)
Mutual labels:  distributed-training
chop
Round matrix elements to lower precision in MATLAB
Stars: ✭ 21 (-71.62%)
Mutual labels:  half-precision
sagemaker-xgboost-container
This is the Docker container based on the open source framework XGBoost (https://xgboost.readthedocs.io/en/latest/) that allows customers to use their own XGBoost scripts in SageMaker.
Stars: ✭ 93 (+25.68%)
Mutual labels:  distributed-training
Adanet
Fast and flexible AutoML with learning guarantees.
Stars: ✭ 3,340 (+4413.51%)
Mutual labels:  distributed-training
dynamic-training-with-apache-mxnet-on-aws
Dynamic training with Apache MXNet reduces cost and time for training deep neural networks by leveraging AWS cloud elasticity and scale. The system reduces training cost and time by dynamically updating the training cluster size during training, with minimal impact on model training accuracy.
Stars: ✭ 51 (-31.08%)
Mutual labels:  distributed-training
torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
Stars: ✭ 165 (+122.97%)
Mutual labels:  distributed-training
horovod-ansible
Create Horovod cluster easily using Ansible
Stars: ✭ 22 (-70.27%)
Mutual labels:  distributed-training
pinpoint-node-agent
Pinpoint Node.js agent
Stars: ✭ 45 (-39.19%)
Mutual labels:  distributed-training
caffe-android-opencl-fp16
Optimised Caffe with OpenCL support for less powerful devices such as mobile phones
Stars: ✭ 17 (-77.03%)
Mutual labels:  half-precision
cublasHgemm-P100
Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs using cublasHgemm
Stars: ✭ 35 (-52.7%)
Mutual labels:  half-precision
HiCMD
[CVPR2020] Hi-CMD: Hierarchical Cross-Modality Disentanglement for Visible-Infrared Person Re-Identification
Stars: ✭ 64 (-13.51%)
Mutual labels:  re-id
MetaBIN
[CVPR2021] Meta Batch-Instance Normalization for Generalizable Person Re-Identification
Stars: ✭ 58 (-21.62%)
Mutual labels:  re-id
Pytorch Image Models
PyTorch image models, scripts, pretrained weights -- ResNet, ResNeXT, EfficientNet, EfficientNetV2, NFNet, Vision Transformer, MixNet, MobileNet-V3/V2, RegNet, DPN, CSPNet, and more
Stars: ✭ 15,232 (+20483.78%)
Mutual labels:  distributed-training
basecls
A codebase & model zoo for pretrained backbone based on MegEngine.
Stars: ✭ 29 (-60.81%)
Mutual labels:  distributed-training
torchshard
TorchShard: Slicing a PyTorch Tensor Into Parallel Shards.
Stars: ✭ 267 (+260.81%)
Mutual labels:  model-parallel
HandyRL
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
Stars: ✭ 228 (+208.11%)
Mutual labels:  distributed-training
DistributedDeepLearning
Tutorials on running distributed deep learning on Batch AI
Stars: ✭ 23 (-68.92%)
Mutual labels:  distributed-training

English version

A memory-balanced model parallel implementation (PyTorch-based, supporting mixed-precision and distributed training)

Why use model parallelism? Isn't brute-force data parallelism good enough?

In face recognition and re-id, some private datasets have label counts in the millions, tens of millions, or even hundreds of millions. At that scale the parameters of the fc layer alone are enough to fill GPU memory, forcing a small batch_size, slow training, and poor results.
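
For a rough sense of scale (the 512-dimensional feature size and the FP32 + momentum-SGD assumptions below are illustrative, not taken from this repository), the fc weight matrix for one million classes already costs several GB per card before a single activation is stored:

# Back-of-envelope estimate of fc-layer memory under assumed numbers (FP32 training).
num_classes = 1_000_000          # assumed label count
feat_dim = 512                   # assumed embedding dimension
bytes_per_param = 4              # FP32

weight_gb = num_classes * feat_dim * bytes_per_param / 1024 ** 3    # ~1.9 GB
grad_gb = weight_gb                                                 # gradient buffer
momentum_gb = weight_gb                                             # SGD momentum state
print(f"fc layer alone: ~{weight_gb + grad_gb + momentum_gb:.1f} GB")   # ~5.7 GB

At ten million labels the same estimate is roughly ten times larger, before counting the batch's logits, which is why the fc layer alone can fill a card.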

I know how to model-parallelize an fc layer; can't I just write it like this?

import torch
import torch.nn as nn


class FullyConnected(nn.Module):
    def __init__(self, in_dim, out_dim, num_gpu, model_parallel=False):
        super(FullyConnected, self).__init__()
        self.num_gpu = num_gpu
        self.model_parallel = model_parallel
        if model_parallel:
            # Split the out_dim classes across GPUs, spreading the remainder
            # over the first (out_dim % num_gpu) cards.
            self.fc_chunks = nn.ModuleList()
            for i in range(num_gpu):
                _class_num = out_dim // num_gpu
                if i < (out_dim % num_gpu):
                    _class_num += 1
                self.fc_chunks.append(
                    nn.Linear(in_dim, _class_num, bias=False).cuda(i)
                )
        else:
            self.classifier = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x):
        if self.model_parallel:
            x_list = []
            for i in range(self.num_gpu):
                # Compute each chunk's logits on its own GPU ...
                _x = self.fc_chunks[i](x.cuda(i))
                # ... then move them back to GPU 0 so they can be concatenated.
                x_list.append(_x.cuda(0))
            x = torch.cat(x_list, dim=1)
            return x
        else:
            return self.classifier(x)

A similar implementation can also be seen in this PyTorch-based face recognition project.
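
For illustration, a minimal way to exercise the class above (the batch size, dimensions, label count and GPU count here are arbitrary placeholders, not values from the project):

# Hypothetical smoke test for the FullyConnected class shown above.
num_classes = 1_000_000
fc = FullyConnected(in_dim=512, out_dim=num_classes, num_gpu=4, model_parallel=True)

features = torch.randn(256, 512).cuda(0)                   # a batch of embeddings on GPU 0
logits = fc(features)                                      # (256, num_classes), gathered on GPU 0
labels = torch.randint(0, num_classes, (256,)).cuda(0)
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()

Note that the full (256, num_classes) logit matrix and the loss live entirely on GPU 0, which is exactly the imbalance described next.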

This approach partially solves the problem, but it introduces a new one: unbalanced GPU memory usage. Because the chunked results still have to be concatenated back onto GPU 0, and the loss is still computed on GPU 0, the memory usage and compute load on GPU 0 are significantly higher than on the other cards. Constrained by this, a large batch_size is still out of reach.

Does this project solve these problems?

Not only does it solve them, it also extends to more scenarios: it supports the margin losses commonly used in face and re-id training, as well as mixed-precision and distributed training.

A few modest advantages:

  • Memory and compute load are shared sensibly across all cards, so you can use a very large batch_size and enjoy training a lot more
  • Only small modifications are needed to adapt mainstream margin losses such as ArcFace, SphereFace, CosFace, and AM-softmax
  • Under the same settings, training accuracy is unaffected (the correctness of the result is guaranteed by a mathematical derivation)
  • In some cases training even gets faster, thanks to the reduced communication overhead in the optimized CrossEntropyLoss computation (see the sketch after this list)
  • Mixed-precision and distributed training are supported
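
As a concrete illustration of the underlying idea, here is a single-process, multi-GPU sketch of the math only (it is not the repository's actual implementation, and the chunk/offset bookkeeping is deliberately simplified). Softmax cross-entropy is assembled from per-chunk statistics, so only a few values per sample ever leave each card instead of the full (batch, num_classes) logit matrix:

import torch

def model_parallel_cross_entropy(logit_chunks, labels, class_offsets):
    # logit_chunks : list of (B, C_g) tensors, chunk g living on GPU g
    # labels       : (B,) LongTensor of global class ids, on GPU 0
    # class_offsets: global index of the first class in each chunk
    device0 = labels.device

    # 1) global per-sample max, for a numerically stable log-sum-exp
    local_max = [chunk.max(dim=1).values.to(device0) for chunk in logit_chunks]
    global_max = torch.stack(local_max, dim=0).max(dim=0).values              # (B,)

    # 2) global per-sample sum of exp(z - max), accumulated chunk by chunk
    sum_exp = torch.zeros_like(global_max)
    for chunk in logit_chunks:
        shifted = chunk - global_max.to(chunk.device).unsqueeze(1)
        sum_exp = sum_exp + shifted.exp().sum(dim=1).to(device0)

    # 3) target logit: every label falls into exactly one chunk
    target_logit = torch.zeros_like(global_max)
    for offset, chunk in zip(class_offsets, logit_chunks):
        in_chunk = (labels >= offset) & (labels < offset + chunk.size(1))
        local_labels = (labels - offset).clamp(0, chunk.size(1) - 1).to(chunk.device)
        picked = chunk.gather(1, local_labels.unsqueeze(1)).squeeze(1).to(device0)
        target_logit = target_logit + picked * in_chunk.float()

    # cross-entropy: -z_y + max + log(sum exp(z - max)), averaged over the batch
    return (-target_logit + global_max + sum_exp.log()).mean()

In the actual distributed setting, the per-chunk maxima and exp-sums would presumably be exchanged with collective operations (e.g. all-reduce) rather than funnelled through GPU 0, which is where the communication saving over shipping the whole logit matrix comes from.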

How do I use it?

First, confirm whether you actually need model parallelism:

  • Is the label count of your dataset in the millions or above?
  • Is the last layer of your model an fc layer trained with CrossEntropyLoss?
  • Do you have enough GPUs? (at least 4 to 8 cards)

If the answer to all of the above is yes, you can consider model parallelism. However, because model parallelism requires hacking the model and the optimizer (and it gets more involved in the distributed setting), for now you have to port it into your own project yourself.

  • For ordinary and mixed-precision training, see the master branch
  • For distributed training, see the dist branch, which is still a work in progress

What about other frameworks?

The underlying principle is the same everywhere. Other frameworks such as MXNet even provide a kvstore, which is friendlier for distributed training.
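
For instance (a toy sketch following the standard MXNet kvstore usage pattern; the key, shape, and 'local' store type here are placeholders, and a real cluster would create a 'dist_sync' or 'dist_async' store via the MXNet launcher), sharded parameters or gradients can be kept in a key-value store and pushed/pulled explicitly:

import mxnet as mx

# Minimal kvstore round trip: init a key, push an update, pull the result back.
kv = mx.kv.create('local')
shape = (2, 3)

kv.init(3, mx.nd.ones(shape))        # register key 3 with an initial value
kv.push(3, mx.nd.ones(shape) * 2)    # send an update (e.g. a gradient shard)
out = mx.nd.zeros(shape)
kv.pull(3, out=out)                  # fetch the (aggregated) value
print(out.asnumpy())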

Related blog posts
