uoguelph-mlrg / Theano-MPI

Licence: other
MPI Parallel framework for training deep learning models built in Theano

Programming Languages

python
139335 projects - #7 most used programming language
Cuda
1817 projects
shell
77523 projects

Projects that are alternatives of or similar to Theano-MPI

frovedis
Framework of vectorized and distributed data analytics
Stars: ✭ 59 (+7.27%)
Mutual labels:  mpi, distributed-computing
Easylambda
distributed dataflows with functional list operations for data processing with C++14
Stars: ✭ 475 (+763.64%)
Mutual labels:  mpi, distributed-computing
Mpi Operator
Kubernetes Operator for Allreduce-style Distributed Training
Stars: ✭ 190 (+245.45%)
Mutual labels:  mpi, distributed-computing
Deepjazz
Deep learning driven jazz generation using Keras & Theano!
Stars: ✭ 2,766 (+4929.09%)
Mutual labels:  theano
azurehpc
This repository provides easy automation scripts for building an HPC environment in Azure. It also includes examples for building an end-to-end environment and running some of the key HPC benchmarks and applications.
Stars: ✭ 102 (+85.45%)
Mutual labels:  mpi
pyabc
pyABC: distributed, likelihood-free inference
Stars: ✭ 13 (-76.36%)
Mutual labels:  distributed-computing
fahclient
Dockerized Folding@home client with NVIDIA GPU support to help battle COVID-19
Stars: ✭ 38 (-30.91%)
Mutual labels:  distributed-computing
Keras Gp
Keras + Gaussian Processes: Learning scalable deep and recurrent kernels.
Stars: ✭ 218 (+296.36%)
Mutual labels:  theano
ParMmg
Distributed parallelization of 3D volume mesh adaptation
Stars: ✭ 19 (-65.45%)
Mutual labels:  mpi
jet
Jet is a simple OOP, dynamically typed, functional language that runs on the Erlang virtual machine (BEAM). Jet has a Ruby-like syntax.
Stars: ✭ 22 (-60%)
Mutual labels:  distributed-computing
DistributedCompute-Series
Distributed computing: distributed data processing (stream computing / batch processing), message queues, and data warehouses
Stars: ✭ 24 (-56.36%)
Mutual labels:  distributed-computing
hp2p
Heavy Peer To Peer: an MPI-based benchmark for network diagnostics
Stars: ✭ 17 (-69.09%)
Mutual labels:  mpi
whitepaper
📄 The Ambients protocol white paper
Stars: ✭ 44 (-20%)
Mutual labels:  distributed-computing
SemiDenseNet
Repository containing the code of one of the networks we employed in the iSEG MICCAI Grand Challenge 2017 (infant brain segmentation).
Stars: ✭ 55 (+0%)
Mutual labels:  theano
az-hop
The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
Stars: ✭ 33 (-40%)
Mutual labels:  mpi
Rnn ctc
Recurrent Neural Network and Long Short Term Memory (LSTM) with Connectionist Temporal Classification implemented in Theano. Includes a Toy training example.
Stars: ✭ 220 (+300%)
Mutual labels:  theano
prometheus-spec
Censorship-resistant trustless protocols for smart contract, generic & high-load computing & machine learning on top of Bitcoin
Stars: ✭ 24 (-56.36%)
Mutual labels:  distributed-computing
taskinator
A simple orchestration library for running complex processes or workflows in Ruby
Stars: ✭ 25 (-54.55%)
Mutual labels:  distributed-computing
api-spec
API Specifications
Stars: ✭ 30 (-45.45%)
Mutual labels:  mpi
GenomicsDB
Highly performant data storage in C++ for importing, querying and transforming variant data with C/C++/Java/Spark bindings. Used in gatk4.
Stars: ✭ 77 (+40%)
Mutual labels:  mpi

Theano-MPI

Theano-MPI is a Python framework for the distributed training of deep learning models built in Theano. It implements data parallelism in several ways, e.g., Bulk Synchronous Parallel, Elastic Averaging SGD and Gossip SGD. The project is an extension of theano_alexnet, aiming to scale the training framework up to more than 8 GPUs and across nodes. Please take a look at this technical report for an overview of the implementation details. To cite our work, please use the following BibTeX entry.

@article{ma2016theano,
  title = {Theano-MPI: a Theano-based Distributed Training Framework},
  author = {Ma, He and Mao, Fei and Taylor, Graham~W.},
  journal = {arXiv preprint arXiv:1605.08325},
  year = {2016}
}

Theano-MPI is compatible with models built in different framework libraries, e.g., Lasagne, Keras and Blocks, as long as their parameters can be exposed as Theano shared variables. Theano-MPI also comes with a lightweight layer library for building customized models. See the wiki for a quick guide on building customized neural networks based on it. Check out the examples of building a Lasagne VGGNet, a Wasserstein GAN, an LS-GAN and a Keras Wide-ResNet.
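
If your model is built in Lasagne, for example, exposing its parameters usually amounts to collecting the network's trainable Theano shared variables. The snippet below is a minimal sketch, assuming Lasagne is installed; the tiny network is illustrative only.

# Sketch: collecting a Lasagne network's trainable parameters as the list of
# Theano shared variables that a Theano-MPI ModelClass exposes as self.params.
import lasagne
from lasagne.layers import InputLayer, DenseLayer

network = DenseLayer(InputLayer(shape=(None, 784)), num_units=10,
                     nonlinearity=lasagne.nonlinearities.softmax)

# get_all_params returns Theano shared variables, the form that
# Theano-MPI's parameter-exchange rules operate on.
params = lasagne.layers.get_all_params(network, trainable=True)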

Dependencies

Theano-MPI depends on the following libraries and packages. We provide some guidance on installing them in the wiki.

Installation

Once all dependencies are ready, clone Theano-MPI and install it with the following command:

 $ python setup.py install [--user]

Usage

To accelerate the training of a Theano model in a distributed way, Theano-MPI needs two components to be identified (a toy example of the first one follows this list):

  • the iterative update function of the Theano model
  • the parameter sharing rule between instances of the Theano model
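
As a purely illustrative example of the first component, the snippet below compiles a toy iterative update function: one SGD step on a linear least-squares model. The parameter-sharing rule then exchanges the shared variable w between model instances around such iterations.

# Sketch: a toy train_iter_fn -- one SGD step on a linear model.
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
y = T.vector('y')
w = theano.shared(np.zeros(5, dtype=theano.config.floatX), name='w')

loss = T.mean((T.dot(x, w) - y) ** 2)
updates = [(w, w - 0.01 * T.grad(loss, w))]

# each call performs one iterative update of the model parameter w
train_iter_fn = theano.function([x, y], loss, updates=updates)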

It is recommended to organize your model and data definitions in the following way.

  • launch_session.py or launch_session.cfg
  • models/*.py
    • __init__.py
    • modelfile.py : defines your customized ModelClass
    • data/*.py
      • dataname.py : defines your customized DataClass

Your ModelClass in modelfile.py should at least have the following attributes and methods:

  • self.params : a list of Theano shared variables, i.e. trainable model parameters
  • self.data : an instance of your customized DataClass defined in dataname.py
  • self.compile_iter_fns : a method, your way of compiling train_iter_fn and val_iter_fn
  • self.train_iter : a method, your way of using your train_iter_fn
  • self.val_iter : a method, your way of using your val_iter_fn
  • self.adjust_hyperp : a method, your way of adjusting hyperparameters, e.g., learning rate.
  • self.cleanup : a method, necessary model and data clean-up steps.

Your DataClass in dataname.py should at least have the following attributes (a minimal skeleton combining both classes is sketched after this list):

  • self.n_batch_train : an integer, the number of training batches to go through in an epoch
  • self.n_batch_val : an integer, the number of validation batches to go through during validation
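
Putting the two lists together, a ModelClass/DataClass pair might look like the minimal skeleton below. This is a hypothetical sketch: the toy linear model, the in-memory data and the constructor/method arguments (config, i, recorder, epoch) are placeholders rather than Theano-MPI's exact required signatures.

# models/modelfile.py -- hypothetical minimal skeleton (adapt to your model)
import numpy as np
import theano
import theano.tensor as T


class DataClass(object):
    def __init__(self):
        # toy in-memory data; a real DataClass would load and batch a dataset
        self.batch_size = 32
        self.x = np.random.rand(512, 5).astype(theano.config.floatX)
        self.y = np.random.rand(512).astype(theano.config.floatX)
        self.n_batch_train = len(self.x) // self.batch_size
        self.n_batch_val = 1


class ModelClass(object):
    def __init__(self, config=None):
        self.data = DataClass()
        self.w = theano.shared(np.zeros(5, dtype=theano.config.floatX), name='w')
        self.params = [self.w]  # trainable Theano shared variables

    def compile_iter_fns(self):
        x, y = T.matrix('x'), T.vector('y')
        loss = T.mean((T.dot(x, self.w) - y) ** 2)
        updates = [(self.w, self.w - 0.01 * T.grad(loss, self.w))]
        self.train_iter_fn = theano.function([x, y], loss, updates=updates)
        self.val_iter_fn = theano.function([x, y], loss)

    def train_iter(self, i, recorder=None):
        b = self.data.batch_size
        return self.train_iter_fn(self.data.x[i * b:(i + 1) * b],
                                  self.data.y[i * b:(i + 1) * b])

    def val_iter(self, i, recorder=None):
        return self.val_iter_fn(self.data.x, self.data.y)

    def adjust_hyperp(self, epoch):
        pass  # e.g. decay the learning rate here

    def cleanup(self):
        pass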

After your model definition is complete, you can choose the desired way of sharing parameters among model instances (a toy illustration of the BSP-style exchange follows this list):

  • BSP (Bulk Synchronous Parallel)
  • EASGD (Elastic Averaging SGD)
  • GOSGD (Gossip SGD)
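
Roughly speaking, the BSP rule keeps workers in lockstep by averaging parameter values across all workers after each iteration, while EASGD and GOSGD relax that synchronization. The snippet below is only a toy illustration of such an exchange using mpi4py and NumPy; Theano-MPI itself performs the exchange on GPU buffers with MPI/NCCL collectives.

# Sketch: what a BSP-style parameter exchange amounts to, in plain mpi4py/NumPy.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local_param = np.random.rand(10).astype('float32')  # this worker's parameter

# sum across workers, then divide to get the same average on every worker
averaged = np.empty_like(local_param)
comm.Allreduce(local_param, averaged, op=MPI.SUM)
averaged /= comm.Get_size()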

Below is an example launch config file for training a customized ModelClass on two GPUs.

# launch_session.cfg
RULE=BSP
MODELFILE=models.modelfile
MODELCLASS=ModelClass
DEVICES=cuda0,cuda1

Then you can launch the training session by calling the following command:

 $ tmlauncher -cfg=launch_session.cfg

Alternatively, you can launch sessions within python as shown below:

# launch_session.py
from theanompi import BSP

rule=BSP()
# modelfile: the relative path to the model file
# modelclass: the class name of the model to be imported from that file
rule.init(devices=['cuda0', 'cuda1'],
          modelfile='models.modelfile',
          modelclass='ModelClass')
rule.wait()

More examples can be found here.

Example Performance

BSP tested on up to eight Tesla K80 GPUs

Training (+communication) time per 5120 images in seconds: [allow_gc = True, using nccl32 on copper]

Model           1 GPU     2 GPUs          4 GPUs         8 GPUs
AlexNet-128b    20.50     10.35+0.78      5.13+0.54      2.63+0.61
GoogLeNet-32b   63.89     31.40+1.00      15.51+0.71     7.69+0.80
VGG16-16b       358.29    176.08+13.90    90.44+9.28     55.12+12.59
VGG16-32b       343.37    169.12+7.14     86.97+4.80     43.29+5.41
ResNet50-64b    163.15    80.09+0.81      40.25+0.56     20.12+0.57
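
For perspective, AlexNet-128b goes from 20.50 s per 5120 images on a single GPU to 2.63 + 0.61 = 3.24 s on eight GPUs, roughly a 6.3x speedup.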

More details on the benchmark can be found in this notebook.

Note

  • Test your single-GPU model with theanompi/models/test_model.py before trying a data-parallel rule.

  • You may want to use the helper functions in /theanompi/lib/opt.py to construct optimizers, in order to avoid the common pitfalls mentioned in (#22) and get better convergence.

  • Binding cores according to your NUMA topology may give better performance. Try the -bind option with the launcher (requires the hwloc dependency).

  • Using the launcher script is the preferred way to start training. Starting training from Python currently causes core-binding problems, especially on NUMA systems.

  • Shuffling training examples before asynchronous training makes the loss surface a lot smoother during convergence.

  • Some known bugs and possible enhancements are listed in Issues. We welcome all kinds of participation (bug reports, discussion, pull requests, etc.) in improving the framework.

License

© Contributors, 2016-2017. Licensed under the ECL-2.0 license.
