
ucla-vision / Parle


Projects that are alternatives to or similar to Parle

Remit
RabbitMQ-backed microservices supporting RPC, pubsub, automatic service discovery and scaling with no code changes.
Stars: ✭ 24 (-36.84%)
Mutual labels:  distributed
Tla Rust
writing correct lock-free and distributed stateful systems in Rust, assisted by TLA+
Stars: ✭ 880 (+2215.79%)
Mutual labels:  distributed
Xxl Job Dotnet
xxl-job is a lightweight distributed task scheduling framework, and this package provide a dotnet executor client for it
Stars: ✭ 31 (-18.42%)
Mutual labels:  distributed
Distributed game server
A distributed game server (java, gameserver, distributed, vert.x, game server)
Stars: ✭ 8 (-78.95%)
Mutual labels:  distributed
Disec
Distributed Image Search Engine Crawler
Stars: ✭ 11 (-71.05%)
Mutual labels:  distributed
Distributed
Stars: ✭ 913 (+2302.63%)
Mutual labels:  distributed
Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Stars: ✭ 921 (+2323.68%)
Mutual labels:  distributed
Knowledge Distillation Pytorch
A PyTorch implementation for exploring deep and shallow knowledge distillation (KD) experiments with flexibility
Stars: ✭ 986 (+2494.74%)
Mutual labels:  cifar10
Awesome Microservices Netcore
💎 A collection of awesome training series, articles, videos, books, courses, sample projects, and tools for Microservices in .NET Core
Stars: ✭ 865 (+2176.32%)
Mutual labels:  distributed
Je
A distributed job execution engine for the execution of batch jobs, workflows, remediations and more.
Stars: ✭ 30 (-21.05%)
Mutual labels:  distributed
Bingo
A high-performance, high-concurrency distributed framework developed in Golang.
Stars: ✭ 9 (-76.32%)
Mutual labels:  distributed
Resnet
Tensorflow ResNet implementation on cifar10
Stars: ✭ 10 (-73.68%)
Mutual labels:  cifar10
Randwire tensorflow
tensorflow implementation of Exploring Randomly Wired Neural Networks for Image Recognition
Stars: ✭ 29 (-23.68%)
Mutual labels:  cifar10
Subnode.org
SubNode: Social Media App
Stars: ✭ 25 (-34.21%)
Mutual labels:  distributed
Weidentity
A blockchain-based distributed identity solution compliant with the W3C DID and Verifiable Credential specifications
Stars: ✭ 972 (+2457.89%)
Mutual labels:  distributed
Theano Xnor Net
Theano implementation of XNOR-Net
Stars: ✭ 23 (-39.47%)
Mutual labels:  cifar10
Autooffload.jl
Automatic GPU, TPU, FPGA, Xeon Phi, Multithreaded, Distributed, etc. offloading for scientific machine learning (SciML) and differential equations
Stars: ✭ 21 (-44.74%)
Mutual labels:  distributed
Crypto Dht
Blockchain over DHT in GO
Stars: ✭ 38 (+0%)
Mutual labels:  distributed
Relativistic Average Gan Keras
The implementation of Relativistic average GAN with Keras
Stars: ✭ 36 (-5.26%)
Mutual labels:  cifar10
Lethean Vpn
Lethean Virtual Private Network (VPN)
Stars: ✭ 29 (-23.68%)
Mutual labels:  distributed

Parle: parallelizing stochastic gradient descent

This is the code for Parle: parallelizing stochastic gradient descent. We demonstrate an algorithm for parallel training of deep neural networks that trains multiple copies of the same network in parallel, called "replicas", with a special coupling on their weights. This yields significantly better generalization than a single network, as well as 2-5x faster convergence than a data-parallel implementation of SGD for a single network.

High-performance multi-GPU version coming soon.

We have two versions, both written in PyTorch:

  • A parallel version that uses MPI (mpi4py) to synchronize the weights.
  • A more efficient version that runs on a single machine with multiple GPUs; here the weights are synchronized explicitly using inter-GPU messages.

In both cases, we construct an optimizer class that initializes the requisite buffers on the different GPUs and handles all the updates after each mini-batch. As examples, we have provided code for the MNIST and CIFAR-10 datasets with two prototypical networks, LeNet and All-CNN, respectively. The MNIST and CIFAR-10/100 datasets are downloaded and pre-processed (stored in the proc folder) the first time Parle is run.
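
The optimizer itself is not reproduced here, but the overall update structure can be sketched roughly as follows. This is a minimal, single-process sketch based on the description above: the function and variable names are illustrative, the gamma/rho defaults are placeholders, and the GPU placement, buffer management and other details of the actual optimizer are omitted.

    # Minimal single-process sketch of the Parle-style update structure.
    # Names are illustrative; the actual optimizer in this repository also
    # handles GPU placement, buffers and other details omitted here.
    import copy
    import torch
    import torch.nn.functional as F

    def parle_sketch(model_fn, loaders, n=3, L=25, gamma=1e-2, rho=1e-2,
                     lr=0.1, epochs=1):
        # loaders: a list of n iterables yielding (x, y) mini-batches, one per
        # replica; the gamma/rho defaults are illustrative only.
        replicas = [model_fn() for _ in range(n)]      # n copies of the same network
        master = copy.deepcopy(replicas[0])            # averaged ("master") weights
        ckpts = [copy.deepcopy(r) for r in replicas]   # weights at the last synchronization
        opts = [torch.optim.SGD(r.parameters(), lr=lr, momentum=0.9, nesterov=True)
                for r in replicas]

        for _ in range(epochs):
            for k, batches in enumerate(zip(*loaders)):
                for r, opt, ckpt, (x, y) in zip(replicas, opts, ckpts, batches):
                    opt.zero_grad()
                    F.cross_entropy(r(x), y).backward()
                    with torch.no_grad():
                        # proximal pulls: gamma towards the checkpoint, rho towards the master
                        for p, pc, pm in zip(r.parameters(), ckpt.parameters(),
                                             master.parameters()):
                            p.grad.add_(gamma * (p - pc) + rho * (p - pm))
                    opt.step()
                if (k + 1) % L == 0:                   # synchronize every L updates
                    with torch.no_grad():
                        for pm, *ps in zip(master.parameters(),
                                           *[r.parameters() for r in replicas]):
                            pm.copy_(torch.stack([p.detach() for p in ps]).mean(0))
                        for ckpt in ckpts:             # reset checkpoints to the new master
                            ckpt.load_state_dict(master.state_dict())
        return master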

Instructions for running the code

The MPI version works well for small experiments and prototyping, while the second version is a better choice for larger networks, e.g., the wide residual networks used in the paper.

Parle is very insensitive to hyper-parameters. A description of some of the parameters and the intuition behind them follows.

  • the learning rate lr is set to be the same as SGD, along with the same drop schedule. It is advisable to train with SGD for a few epochs and then use the same lr for Parle.
  • gamma controls how far successive gradient updates on each replica are allowed to go from the previous checkpoint, i.e., the last instant when weights were synchronized with the master. This is the same as the step-size in proximal point iteration.
  • rho controls how far each replica moves from the master. The weights of the master are the average of the weights of all the replicas while each replica gets pulled towards this average with a force that is proportional to rho.
  • L is the number of gradient updates performed on each replica (worker) before synchronizing the weights with the master. You can safely fix this to 25. Alternatively, you can set this to L = gamma x lr, which has the advantage of being slightly faster towards the end of training.
  • Proximal point iteration is insensitive to both gamma and rho, and the above code uses a default decaying schedule for both, which should typically work. In particular, we set gamma = rho = 100*(1 - 1/(2 nb))^(k/L), where nb is the number of mini-batches per epoch and k is the current iteration number; L is the number of weight updates per synchronization, as above. A small sketch of this schedule is given after this list.
  • n is the number of replicas. The code distributes these replicas across all available GPUs. For the MPI version, the number of replicas is determined by the number of MPI processes launched. In general, the larger n is, the better Parle works. Each replica can itself be data-parallel and use multiple GPUs.
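
The decaying schedule mentioned above can be written out explicitly. The sketch below follows the form and the scale of 100 stated in this README; treat them as assumptions and check the code for the constants it actually uses.

    # Sketch of the default decaying schedule for gamma and rho described above.
    # The scale of 100 and the exact form follow the text of this README; check
    # the code for the constants it actually uses.
    def coupling_schedule(k, nb, L=25, scale=100.0):
        """Return gamma = rho at iteration k, given nb mini-batches per epoch
        and L gradient updates between synchronizations."""
        return scale * (1.0 - 1.0 / (2.0 * nb)) ** (k / L)

    # Example: CIFAR-10 with batch size 128 -> about 391 mini-batches per epoch.
    nb = 391
    for epoch in (0, 5, 10, 20):
        print(f"epoch {epoch:2d}: gamma = rho = {coupling_schedule(epoch * nb, nb):.2f}")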

The number of epochs B for Parle is typically much smaller than for SGD; 5-10 epochs are sufficient to train on MNIST or CIFAR-10/100.

  1. Execute python parle_mpi.py -h to get a list of all arguments and defaults. You can train LeNet on MNIST with 3 replicas using

    python parle_mpi.py -n 3
    
  2. You can train All-CNN on CIFAR-10 with 3 replicas using

    python parle_mpi.py -n 3 -m allcnn
    
  3. You can run the MPI version with 12 replicas as

    mpirun -n 12 python parle_mpi.py
    

Special cases

  1. Setting n=1, L=1, gamma=0, rho=0 makes Parle equivalent to SGD; the implementation here uses Nesterov's momentum.
  2. Setting n=1, rho=0 decouples the replica from the master. In this case, Parle becomes equivalent to Entropy-SGD: biasing gradient descent into wide valleys; see the code for the latter here.
  3. Setting L=1, gamma=0 makes Parle equivalent to Elastic-SGD; the code for the latter by the original authors is here. However, Parle uses an annealing schedule on rho, which makes it converge faster and generalize better than vanilla Elastic-SGD. These settings are summarized in the sketch below.
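
For quick reference, the three special cases above can be written down as hyper-parameter settings. The dictionary below only restates the list, using the parameter names from this README; the actual command-line flags of the scripts may differ, so check the -h output.

    # The special cases above, expressed as hyper-parameter settings.
    # Keys mirror the parameter names used in this README; the command-line
    # flags of the scripts may be named differently (see the -h output).
    special_cases = {
        "sgd":         dict(n=1, L=1, gamma=0.0, rho=0.0),  # plain SGD with Nesterov momentum
        "entropy_sgd": dict(n=1, rho=0.0),                   # replica decoupled from the master
        "elastic_sgd": dict(L=1, gamma=0.0),                 # no pull towards the checkpoint
    }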