
bindog / pytorch-model-parallel

Licence: other
A memory-balanced and communication-efficient model parallel implementation of a FullyConnected layer with CrossEntropyLoss in PyTorch

Programming Languages

Python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to pytorch-model-parallel

PLSC
Paddle Large Scale Classification Tools; supports ArcFace, CosFace, PartialFC, and Data Parallel + Model Parallel. Models include ResNet, ViT, DeiT, FaceViT.
Stars: ✭ 113 (+52.7%)
Mutual labels:  distributed-training, model-parallel
ReID-PCB RPP
Beyond Part Models: Person Retrieval with Refined Part Pooling
Stars: ✭ 70 (-5.41%)
Mutual labels:  re-id
Paddle
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle (飞桨) core framework: high-performance single-machine and distributed training for deep learning & machine learning, plus cross-platform deployment)
Stars: ✭ 17,232 (+23186.49%)
Mutual labels:  distributed-training
libai
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
Stars: ✭ 284 (+283.78%)
Mutual labels:  distributed-training
chop
Round matrix elements to lower precision in MATLAB
Stars: ✭ 21 (-71.62%)
Mutual labels:  half-precision
sagemaker-xgboost-container
This is the Docker container based on the open source framework XGBoost (https://xgboost.readthedocs.io/en/latest/) that allows customers to use their own XGBoost scripts in SageMaker.
Stars: ✭ 93 (+25.68%)
Mutual labels:  distributed-training
Adanet
Fast and flexible AutoML with learning guarantees.
Stars: ✭ 3,340 (+4413.51%)
Mutual labels:  distributed-training
dynamic-training-with-apache-mxnet-on-aws
Dynamic training with Apache MXNet reduces cost and time for training deep neural networks by leveraging AWS cloud elasticity and scale. The system reduces training cost and time by dynamically updating the training cluster size during training, with minimal impact on model training accuracy.
Stars: ✭ 51 (-31.08%)
Mutual labels:  distributed-training
torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
Stars: ✭ 165 (+122.97%)
Mutual labels:  distributed-training
horovod-ansible
Create Horovod cluster easily using Ansible
Stars: ✭ 22 (-70.27%)
Mutual labels:  distributed-training
pinpoint-node-agent
Pinpoint Node.js agent
Stars: ✭ 45 (-39.19%)
Mutual labels:  distributed-training
caffe-android-opencl-fp16
Optimised Caffe with OpenCL support for less powerful devices such as mobile phones
Stars: ✭ 17 (-77.03%)
Mutual labels:  half-precision
cublasHgemm-P100
Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs using cublasHgemm
Stars: ✭ 35 (-52.7%)
Mutual labels:  half-precision
HiCMD
[CVPR2020] Hi-CMD: Hierarchical Cross-Modality Disentanglement for Visible-Infrared Person Re-Identification
Stars: ✭ 64 (-13.51%)
Mutual labels:  re-id
MetaBIN
[CVPR2021] Meta Batch-Instance Normalization for Generalizable Person Re-Identification
Stars: ✭ 58 (-21.62%)
Mutual labels:  re-id
Pytorch Image Models
PyTorch image models, scripts, pretrained weights -- ResNet, ResNeXT, EfficientNet, EfficientNetV2, NFNet, Vision Transformer, MixNet, MobileNet-V3/V2, RegNet, DPN, CSPNet, and more
Stars: ✭ 15,232 (+20483.78%)
Mutual labels:  distributed-training
basecls
A codebase & model zoo for pretrained backbone based on MegEngine.
Stars: ✭ 29 (-60.81%)
Mutual labels:  distributed-training
torchshard
TorchShard: Slicing a PyTorch Tensor Into Parallel Shards.
Stars: ✭ 267 (+260.81%)
Mutual labels:  model-parallel
HandyRL
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
Stars: ✭ 228 (+208.11%)
Mutual labels:  distributed-training
DistributedDeepLearning
Tutorials on running distributed deep learning on Batch AI
Stars: ✭ 23 (-68.92%)
Mutual labels:  distributed-training

English version

A memory-balanced model parallel implementation (PyTorch-based, supporting mixed-precision and distributed training)

Why use model parallelism? Isn't brute-force data parallelism good enough?

In face recognition and re-id, some private datasets have label counts in the millions, tens of millions, or even hundreds of millions. At that scale the parameters of the fc layer alone are enough to fill GPU memory, forcing a small batch_size, slow training, and poor results.
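
For a rough sense of scale (the 512-dimensional feature size and the FP32 + momentum-SGD assumptions below are illustrative, not taken from this repository), the fc weight matrix for one million classes already costs several GB per card before a single activation is stored:

# Back-of-envelope estimate of fc-layer memory under assumed numbers (FP32 training).
num_classes = 1_000_000          # assumed label count
feat_dim = 512                   # assumed embedding dimension
bytes_per_param = 4              # FP32

weight_gb = num_classes * feat_dim * bytes_per_param / 1024 ** 3    # ~1.9 GB
grad_gb = weight_gb                                                 # gradient buffer
momentum_gb = weight_gb                                             # SGD momentum state
print(f"fc layer alone: ~{weight_gb + grad_gb + momentum_gb:.1f} GB")   # ~5.7 GB

At ten million labels the same estimate is roughly ten times larger, before counting the batch's logits, which is why the fc layer alone can fill a card.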

I know how to model-parallelize an fc layer; can't I just write it like this?

import torch
import torch.nn as nn


class FullyConnected(nn.Module):
    def __init__(self, in_dim, out_dim, num_gpu, model_parallel=False):
        super(FullyConnected, self).__init__()
        self.num_gpu = num_gpu
        self.model_parallel = model_parallel
        if model_parallel:
            # Split the out_dim classes across GPUs, spreading the remainder
            # over the first (out_dim % num_gpu) cards.
            self.fc_chunks = nn.ModuleList()
            for i in range(num_gpu):
                _class_num = out_dim // num_gpu
                if i < (out_dim % num_gpu):
                    _class_num += 1
                self.fc_chunks.append(
                    nn.Linear(in_dim, _class_num, bias=False).cuda(i)
                )
        else:
            self.classifier = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x):
        if self.model_parallel:
            x_list = []
            for i in range(self.num_gpu):
                # Compute each chunk's logits on its own GPU ...
                _x = self.fc_chunks[i](x.cuda(i))
                # ... then move them back to GPU 0 so they can be concatenated.
                x_list.append(_x.cuda(0))
            x = torch.cat(x_list, dim=1)
            return x
        else:
            return self.classifier(x)

A similar implementation can also be seen in this PyTorch-based face recognition project.
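
For illustration, a minimal way to exercise the class above (the batch size, dimensions, label count and GPU count here are arbitrary placeholders, not values from the project):

# Hypothetical smoke test for the FullyConnected class shown above.
num_classes = 1_000_000
fc = FullyConnected(in_dim=512, out_dim=num_classes, num_gpu=4, model_parallel=True)

features = torch.randn(256, 512).cuda(0)                   # a batch of embeddings on GPU 0
logits = fc(features)                                      # (256, num_classes), gathered on GPU 0
labels = torch.randint(0, num_classes, (256,)).cuda(0)
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()

Note that the full (256, num_classes) logit matrix and the loss live entirely on GPU 0, which is exactly the imbalance described next.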

This approach partially solves the problem, but it introduces a new one: unbalanced GPU memory usage. Because the chunked results still have to be concatenated back onto GPU 0, and the loss is still computed on GPU 0, the memory usage and compute load on GPU 0 are significantly higher than on the other cards. Constrained by this, a large batch_size is still out of reach.

Does this project solve these problems?

Not only does it solve them, it also extends to more scenarios: it supports the margin losses commonly used in face and re-id training, as well as mixed-precision and distributed training.

A few modest advantages:

  • Memory and compute load are shared sensibly across all cards, so you can use a very large batch_size and enjoy training a lot more
  • Only small modifications are needed to adapt mainstream margin losses such as ArcFace, SphereFace, CosFace, and AM-softmax
  • Under the same settings, training accuracy is unaffected (the correctness of the result is guaranteed by a mathematical derivation)
  • In some cases training even gets faster, thanks to the reduced communication overhead in the optimized CrossEntropyLoss computation (see the sketch after this list)
  • Mixed-precision and distributed training are supported
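
As a concrete illustration of the underlying idea, here is a single-process, multi-GPU sketch of the math only (it is not the repository's actual implementation, and the chunk/offset bookkeeping is deliberately simplified). Softmax cross-entropy is assembled from per-chunk statistics, so only a few values per sample ever leave each card instead of the full (batch, num_classes) logit matrix:

import torch

def model_parallel_cross_entropy(logit_chunks, labels, class_offsets):
    # logit_chunks : list of (B, C_g) tensors, chunk g living on GPU g
    # labels       : (B,) LongTensor of global class ids, on GPU 0
    # class_offsets: global index of the first class in each chunk
    device0 = labels.device

    # 1) global per-sample max, for a numerically stable log-sum-exp
    local_max = [chunk.max(dim=1).values.to(device0) for chunk in logit_chunks]
    global_max = torch.stack(local_max, dim=0).max(dim=0).values              # (B,)

    # 2) global per-sample sum of exp(z - max), accumulated chunk by chunk
    sum_exp = torch.zeros_like(global_max)
    for chunk in logit_chunks:
        shifted = chunk - global_max.to(chunk.device).unsqueeze(1)
        sum_exp = sum_exp + shifted.exp().sum(dim=1).to(device0)

    # 3) target logit: every label falls into exactly one chunk
    target_logit = torch.zeros_like(global_max)
    for offset, chunk in zip(class_offsets, logit_chunks):
        in_chunk = (labels >= offset) & (labels < offset + chunk.size(1))
        local_labels = (labels - offset).clamp(0, chunk.size(1) - 1).to(chunk.device)
        picked = chunk.gather(1, local_labels.unsqueeze(1)).squeeze(1).to(device0)
        target_logit = target_logit + picked * in_chunk.float()

    # cross-entropy: -z_y + max + log(sum exp(z - max)), averaged over the batch
    return (-target_logit + global_max + sum_exp.log()).mean()

In the actual distributed setting, the per-chunk maxima and exp-sums would presumably be exchanged with collective operations (e.g. all-reduce) rather than funnelled through GPU 0, which is where the communication saving over shipping the whole logit matrix comes from.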

How do I use it?

First, confirm whether you actually need model parallelism:

  • Is the label count of your dataset in the millions or above?
  • Is the last layer of your model an fc layer trained with CrossEntropyLoss?
  • Do you have enough GPUs? (at least 4 to 8 cards)

If the answer to all of the above is yes, you can consider model parallelism. However, because model parallelism requires hacking the model and the optimizer (and it gets more involved in the distributed setting), for now you have to port it into your own project yourself.

  • For ordinary and mixed-precision training, see the master branch
  • For distributed training, see the dist branch, which is still a work in progress

What about other frameworks?

The underlying principle is the same everywhere. Other frameworks such as MXNet even provide a kvstore, which is friendlier for distributed training.
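
For instance (a toy sketch following the standard MXNet kvstore usage pattern; the key, shape, and 'local' store type here are placeholders, and a real cluster would create a 'dist_sync' or 'dist_async' store via the MXNet launcher), sharded parameters or gradients can be kept in a key-value store and pushed/pulled explicitly:

import mxnet as mx

# Minimal kvstore round trip: init a key, push an update, pull the result back.
kv = mx.kv.create('local')
shape = (2, 3)

kv.init(3, mx.nd.ones(shape))        # register key 3 with an initial value
kv.push(3, mx.nd.ones(shape) * 2)    # send an update (e.g. a gradient shard)
out = mx.nd.zeros(shape)
kv.pull(3, out=out)                  # fetch the (aggregated) value
print(out.asnumpy())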

Related blog posts
