DRACO: Byzantine-resilient Distributed Training via Redundant Gradients

This repository contains the source code for Draco, a scalable framework for robust distributed training that uses ideas from coding theory. See https://arxiv.org/abs/1803.09877 for detailed information about this project.

Overview:

Draco is a scalable framework for robust distributed training that uses ideas from coding theory. In Draco, compute nodes evaluate redundant gradients, which the parameter server (PS) then uses to eliminate the effects of adversarial updates.

In Draco, each compute node processes rB/P gradients and sends a linear combination of them to the PS; Draco therefore incurs a computational redundancy ratio of r. Upon receiving the P gradient sums, the PS applies a “decoding” function to remove the effect of the adversarial nodes and reconstruct the original desired sum of the B gradients. With redundancy ratio r, we show that Draco can tolerate up to (r − 1)/2 adversaries, which is information-theoretically tight.
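
For intuition, here is a minimal sketch (ours, not the repo's implementation) of the simplest instance of this idea: a repetition code in which every worker in a group of size r reports the same gradient sum, and the PS keeps the copy that the most workers agree on.

import numpy as np

def majority_vote(copies):
    # pick the update that the most workers agree with (exact-match vote)
    counts = [sum(np.array_equal(u, v) for v in copies) for u in copies]
    return copies[int(np.argmax(counts))]

r = 3                                        # redundancy ratio
true_grad = np.array([0.5, -1.0, 2.0])       # gradient sum the group should report
copies = [true_grad, true_grad, -true_grad]  # one reverse-gradient adversary
assert np.array_equal(majority_vote(copies), true_grad)

With r = 3, the two honest copies in each group outvote a single adversary, matching the (r − 1)/2 bound above; Draco's cyclic code achieves the same bound with a different encoding.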

Dependencies:

Tested stable dependencies:

  • python 2.7 (Anaconda)
  • PyTorch 0.3.0 (note that we are migrating to PyTorch 0.4.0 and 1.0.x)
  • torchvision 0.1.18
  • MPI4Py 0.3.0
  • python-blosc 1.5.0
  • hdmedians

We highly recommend installing an Anaconda environment: you get a high-quality BLAS library (MKL) and a controlled compiler version regardless of your Linux distro.

We provide a script that builds all dependencies for you. To use it, run:

bash ./tools/pre_run.sh
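
Once the script finishes, a quick sanity check from Python (a hypothetical snippet, not part of the repo) confirms that the pinned dependencies are importable:

import torch, torchvision, mpi4py, blosc, hdmedians
print(torch.__version__)        # expect 0.3.0
print(torchvision.__version__)  # expect 0.1.18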

Cluster Setup:

To run on a distributed cluster, the first thing you need to do is launch AWS EC2 instances.

Launching Instances:

The script ./tools/pytorch_ec2.py helps you launch EC2 instances automatically, but before running it you should follow the instructions to set up the AWS CLI on your local machine. After that, edit this part of ./tools/pytorch_ec2.py:

cfg = Cfg({
    "name" : "PS_PYTORCH",      # Unique name for this specific configuration
    "key_name": "NameOfKeyFile",          # Necessary to ssh into created instances
    # Cluster topology
    "n_masters" : 1,                      # Should always be 1
    "n_workers" : 8,
    "num_replicas_to_aggregate" : "8", # deprecated, not necessary
    "method" : "spot",
    # Region specification
    "region" : "us-west-2",
    "availability_zone" : "us-west-2b",
    # Machine type - instance type configuration.
    "master_type" : "m4.2xlarge",
    "worker_type" : "m4.2xlarge",
    # please only use this AMI for pytorch
    "image_id": "ami-xxxxxxxx",            # id of AMI
    # Launch specifications
    "spot_price" : "0.15",                 # Has to be a string
    # SSH configuration
    "ssh_username" : "ubuntu",            # For sshing. E.G: ssh ssh_username@hostname
    "path_to_keyfile" : "/dir/to/NameOfKeyFile.pem",

    # NFS configuration
    # To set up these values, go to Services > ElasticFileSystem > Create new filesystem, and follow the directions.
    #"nfs_ip_address" : "172.31.3.173",         # us-west-2c
    #"nfs_ip_address" : "172.31.35.0",          # us-west-2a
    "nfs_ip_address" : "172.31.14.225",          # us-west-2b
    "nfs_mount_point" : "/home/ubuntu/shared",       # NFS base dir

The easiest way to set everything up on an EC2 cluster is to configure one machine, create an AMI from it, and use that AMI id as image_id in pytorch_ec2.py. Then launch the EC2 instances by running

python ./tools/pytorch_ec2.py launch

After all launched instances are ready (this may take a while), get the private IPs of the instances with

python ./tools/pytorch_ec2.py get_hosts

This writes the IPs to a file named hosts_address, which looks like

172.31.16.226 (${PS_IP})
172.31.27.245
172.31.29.131
172.31.18.108
172.31.18.174
172.31.17.228
172.31.16.25
172.31.30.61
172.31.29.30

After generating hosts_address for all EC2 instances, run the following command to copy your keyfile to the parameter server (PS) instance, whose address is always the first one in hosts_address. local_script.sh also does some basic configuration, e.g. cloning this git repo:

bash ./tools/local_script.sh ${PS_IP}
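
Since the PS address is always the first line of hosts_address, a tiny helper (hypothetical, for illustration) can extract ${PS_IP}:

with open("hosts_address") as f:
    ps_ip = f.readline().split()[0]  # the PS is always listed first
print(ps_ip)                         # pass this as ${PS_IP} to local_script.sh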

SSH related:

At this stage, you should ssh into the PS instance; all subsequent operations happen on the PS. In the PS setting, the PS must be able to ssh into every compute node. The following script does that job for you (run it after sshing into the PS):

bash ./tools/remote_script.sh

Prepare Datasets

We currently support the MNIST and CIFAR-10 datasets. Download, split, and transform them by running (./tools/remote_script.sh already does this for you):

bash ./src/data_prepare.sh
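
For reference, the download-and-transform step is conceptually what torchvision provides out of the box; a rough sketch (assumed, not the script's actual contents) for MNIST:

from torchvision import datasets, transforms

# illustrative only: roughly the download/transform step that data_prepare.sh automates
train_set = datasets.MNIST("./mnist_data", train=True, download=True,
                           transform=transforms.ToTensor())
print(len(train_set))  # 60000 training examples, to be split across workers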

Job Launching

Since this project is built on MPI, tasks must be launched from the PS (or master) instance. run_pytorch.sh wraps up the job-launching process. Commonly used options (arguments) are listed below:

  • n: Number of processes (size of the cluster), e.g. with P compute nodes and 1 PS, n = P + 1.
  • hostfile: Path to the file containing the private IPs of every node in the cluster; we use hosts_address here, as mentioned above.
  • lr: Initial learning rate.
  • momentum: Momentum value.
  • network: Type of deep neural net; LeNet, ResNet-18/32/50/110/152, and VGGs are currently supported.
  • dataset: Dataset used for training.
  • batch-size: Batch size for the optimization algorithm.
  • comm-type: A placeholder parameter; please always set it to Bcast.
  • mode: Update mode used on the PS, e.g. geometric median, Krum, or majority vote (see the sketch after this list).
  • approach: Approach used in the experiment, e.g. a baseline method or Draco (repetition code or cyclic code).
  • err-mode: Mode of the simulated adversaries; reverse-gradient and constant adversaries are currently supported.
  • adversarial: Magnitude of the adversaries.
  • worker-fail: Number of adversarial nodes simulated in the cluster.
  • group-size: Group size of workers, used specifically for the repetition code.
  • max-steps: Maximum number of iterations to train.
  • epochs: Maximum number of epochs to train (somewhat redundant given max-steps).
  • eval-freq: Frequency, in iterations, at which to evaluate the model.
  • enable-gpu: Train on CPU/GPU; leave this argument empty for CPU.
  • train-dir: Directory in which to save model checkpoints for evaluation.
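
For intuition about the mode argument, here is a minimal sketch (ours, not the repo's code) of one robust update mode, the geometric median, computed with Weiszfeld's algorithm over the workers' updates:

import numpy as np

def geometric_median(updates, iters=100, eps=1e-8):
    # Weiszfeld's algorithm: a robust "average" of the workers' updates
    points = np.stack(updates)
    y = points.mean(axis=0)  # start from the plain average
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(points - y, axis=1), eps)
        w = 1.0 / dist       # distant (possibly adversarial) updates get low weight
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

honest = [np.array([1.0, 1.0])] * 4
adversary = [np.array([100.0, -100.0])]      # one constant adversary
print(geometric_median(honest + adversary))  # stays close to [1, 1]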

Model Evaluation

The distributed evaluator fetches model checkpoints from the shared directory and evaluates them on the validation set. To evaluate a model, run

bash ./src/evaluate_pytorch.sh

with specified arguments.

Evaluation arguments are listed below (a sketch of the evaluation loop follows the list):

  • eval-batch-size: Batch size (on the validation set) used during model evaluation.
  • eval-freq: Frequency, in iterations, at which to evaluate the model; should be set to the same value as in run_pytorch.sh.
  • network: Type of deep neural net; should be set to the same value as in run_pytorch.sh.
  • dataset: Dataset used for training; should be set to the same value as in run_pytorch.sh.
  • model-dir: Directory holding the model checkpoints to evaluate; should be set to the same value as in run_pytorch.sh.
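
Conceptually, the evaluator loops over the checkpoints in model-dir and scores each one. A minimal sketch (assumed, not the repo's evaluator), with the checkpoint path, model, and validation loader left as placeholders:

import glob
import torch

def accuracy(model, loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():  # no gradients needed during evaluation
        for x, y in loader:
            pred = model(x).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.size(0)
    return correct / total

# for path in sorted(glob.glob("/home/ubuntu/shared/*.pth")):  # model-dir on NFS (assumed)
#     model.load_state_dict(torch.load(path))
#     print(path, accuracy(model, val_loader))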

Future Work

These are potential directions we are actively working on; stay tuned!

  • Reduce the computational cost of Draco by only approximately recovering the desired gradient summation.
  • Explore other coding methods that achieve the same redundancy and computation lower bounds.
  • Move Draco to state-of-the-art PS (or distributed) frameworks, e.g. Ray or TensorFlow.

Citation

@inproceedings{Draco,
  author = {Lingjiao Chen and Hongyi Wang and Zachary Charles and Dimitris Papailiopoulos},
  title = {DRACO: Byzantine-resilient Distributed Training via Redundant Gradients},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning, {ICML} 2018},
  year = {2018},
  month = jul,
  url = {https://arxiv.org/abs/1803.09877},
}