graykode / horovod-ansible

License: other
Create Horovod cluster easily using Ansible

Programming Languages

HCL, C, Shell, Dockerfile

Projects that are alternatives of or similar to horovod-ansible

pinpoint-node-agent
Pinpoint Node.js agent
Stars: ✭ 45 (+104.55%)
Mutual labels:  distributed-training
Fengshenbang-LM
Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center at IDEA, serving as infrastructure for Chinese AIGC and cognitive intelligence.
Stars: ✭ 1,813 (+8140.91%)
Mutual labels:  distributed-training
mpi-parallelization
Examples for MPI Spawning and Splitting, and the differences between two implementations
Stars: ✭ 16 (-27.27%)
Mutual labels:  openmpi
Paddle
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
Stars: ✭ 17,232 (+78227.27%)
Mutual labels:  distributed-training
Pytorch Image Models
PyTorch image models, scripts, pretrained weights -- ResNet, ResNeXT, EfficientNet, EfficientNetV2, NFNet, Vision Transformer, MixNet, MobileNet-V3/V2, RegNet, DPN, CSPNet, and more
Stars: ✭ 15,232 (+69136.36%)
Mutual labels:  distributed-training
Adanet
Fast and flexible AutoML with learning guarantees.
Stars: ✭ 3,340 (+15081.82%)
Mutual labels:  distributed-training
Byteps
A high performance and generic framework for distributed DNN training
Stars: ✭ 3,028 (+13663.64%)
Mutual labels:  distributed-training
Hetu
A high-performance distributed deep learning system targeting large-scale and automated distributed training.
Stars: ✭ 78 (+254.55%)
Mutual labels:  distributed-training
HyperGBM
A full pipeline AutoML tool for tabular data
Stars: ✭ 172 (+681.82%)
Mutual labels:  distributed-training
pytorch-model-parallel
A memory balanced and communication efficient FullyConnected layer with CrossEntropyLoss model parallel implementation in PyTorch
Stars: ✭ 74 (+236.36%)
Mutual labels:  distributed-training
dynamic-training-with-apache-mxnet-on-aws
Dynamic training with Apache MXNet reduces cost and time for training deep neural networks by leveraging AWS cloud elasticity and scale. The system reduces training cost and time by dynamically updating the training cluster size during training, with minimal impact on model training accuracy.
Stars: ✭ 51 (+131.82%)
Mutual labels:  distributed-training
HandyRL
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
Stars: ✭ 228 (+936.36%)
Mutual labels:  distributed-training
torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
Stars: ✭ 165 (+650%)
Mutual labels:  distributed-training
DistributedDeepLearning
Tutorials on running distributed deep learning on Batch AI
Stars: ✭ 23 (+4.55%)
Mutual labels:  distributed-training
PLSC
Paddle Large Scale Classification Tools; supports ArcFace, CosFace, PartialFC, Data Parallel + Model Parallel. Models include ResNet, ViT, DeiT, FaceViT.
Stars: ✭ 113 (+413.64%)
Mutual labels:  distributed-training
sagemaker-xgboost-container
A Docker container based on the open-source XGBoost framework (https://xgboost.readthedocs.io/en/latest/) that allows customers to use their own XGBoost scripts in SageMaker.
Stars: ✭ 93 (+322.73%)
Mutual labels:  distributed-training
basecls
A codebase & model zoo for pretrained backbone based on MegEngine.
Stars: ✭ 29 (+31.82%)
Mutual labels:  distributed-training
libai
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
Stars: ✭ 284 (+1190.91%)
Mutual labels:  distributed-training
l2hmc-qcd
Application of the L2HMC algorithm to simulations in lattice QCD.
Stars: ✭ 33 (+50%)
Mutual labels:  horovod
TonY
TonY is a framework to natively run deep learning frameworks on Apache Hadoop.
Stars: ✭ 687 (+3022.73%)
Mutual labels:  horovod

horovod-ansible

Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. Ansible is a radically simple IT automation system. Together they make it easy to install Horovod on every server automatically, whether on AWS or On-premise.
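
Under the hood, Horovod synchronizes gradients with ring-allreduce. A toy pure-Python simulation of that collective, as an illustration of the idea only, not Horovod's actual implementation (which runs over MPI/NCCL):

```python
def ring_allreduce(vectors):
    """Toy simulation of ring-allreduce across n simulated workers.

    Each worker's vector is split into n segments (one element per segment
    here, for simplicity); after 2*(n-1) neighbor-to-neighbor steps every
    worker holds the elementwise sum of all vectors.
    """
    n = len(vectors)
    data = [list(v) for v in vectors]
    # Phase 1, scatter-reduce: circulate partial sums around the ring so that
    # worker r ends up with the complete sum of segment (r + 1) % n.
    for step in range(n - 1):
        sends = [((r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r in range(n):
            seg, val = sends[(r - 1) % n]  # receive from the left neighbor
            data[r][seg] += val
    # Phase 2, allgather: circulate the completed segments around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r in range(n):
            seg, val = sends[(r - 1) % n]
            data[r][seg] = val
    return data

# Three "workers", three gradient elements each.
summed = ring_allreduce([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
```

Each step only exchanges one segment with a neighbor, so bandwidth per worker stays constant as the cluster grows; that is why the three-node cluster built below scales well.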

Before Start
  • All On-premise nodes should run Ubuntu >= 16.04. I assume that Ansible is already set up on all nodes (On-premise).
  • For now, only the TensorFlow and PyTorch examples can be used (not MXNet, Caffe, etc. yet).
  • AWS steps: 0 → 1 → 3
  • On-premise steps: 0 → 2 → 3

Usage

0. Docker setup (both AWS and On-premise)

All steps are carried out inside a Docker container to keep things simple for beginners.

$ docker run -it --name horovod-ansible graykode/horovod-ansible:0.1 /bin/bash

1. AWS

To create the Horovod cluster environment, start provisioning with the Terraform code. Change any options you want in variables.tf, but do not change anything below the ## DO NOT CHANGE BELOW comment.

If you create EC2 instances with number_of_worker set to 3, the overall architecture matches the picture below.

Export your own AWS access / secret keys:
$ export AWS_ACCESS_KEY_ID=<Your Access Key ID>
$ export AWS_SECRET_ACCESS_KEY=<Your Secret Access Key>

Initialize Terraform and create the private key to use:

$ cd terraform/ && ssh-keygen -t rsa -N "" -f horovod
$ terraform init

Provision all the resources: EC2 and the VPC (gateway, router, subnet, etc.):

$ terraform apply

Then you will get output like this:

Apply complete! Resources: 12 added, 0 changed, 0 destroyed.

Outputs:

horovod_master_public_ip = <master's public IP>
horovod_workers_public_ip = <worker0's public IP>,<worker1's public IP>
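
These two output values map directly onto the Ansible inventory used in step 3. A hypothetical helper, not part of this repo, that splits them, assuming the comma-separated format shown above (the IPs are example values):

```python
def parse_terraform_outputs(outputs):
    """Split Terraform's Horovod outputs into a master IP and a worker IP list.

    `outputs` is a dict keyed like this project's Terraform outputs, e.g.
    obtained from `terraform output -json`.
    """
    master = outputs["horovod_master_public_ip"]
    workers = [ip.strip() for ip in outputs["horovod_workers_public_ip"].split(",")]
    return master, workers

# Example values only; substitute the IPs from your own `terraform apply`.
master, workers = parse_terraform_outputs({
    "horovod_master_public_ip": "54.180.0.1",
    "horovod_workers_public_ip": "54.180.0.2,54.180.0.3",
})
```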

2. On-premise

As noted above, this assumes Ansible is set up on all nodes and the network configuration is finished. If you need to install Ansible, please read the Ansible Install Guide in the documentation.

3. Set up the Horovod configuration using Ansible (both AWS and On-premise)

Install ansible and jinja2 using pip:

$ cd ../ansible && pip install -r requirements.txt

Set up inventory.ini in the ansible folder:

master ansible_host=<master's public IP>
worker0 ansible_host=<worker0's public IP>
worker1 ansible_host=<worker1's public IP>
....
worker[n] ansible_host=

[all]
master
worker0
worker1
...
worker[n]

[master-servers]
master

[worker-servers]
worker0
worker1
...
worker[n]
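
Writing inventory.ini by hand gets tedious as the worker count grows; a hypothetical Python helper, not part of this repo, that renders the same layout (the IPs are example values):

```python
def render_inventory(master_ip, worker_ips):
    """Render an Ansible inventory.ini with one master and n workers."""
    names = ["master"] + [f"worker{i}" for i in range(len(worker_ips))]
    ips = [master_ip] + list(worker_ips)
    # Host lines: "<name> ansible_host=<ip>"
    lines = [f"{name} ansible_host={ip}" for name, ip in zip(names, ips)]
    lines += ["", "[all]"] + names
    lines += ["", "[master-servers]", "master"]
    lines += ["", "[worker-servers]"] + names[1:]
    return "\n".join(lines) + "\n"

ini = render_inventory("54.180.0.1", ["54.180.0.2", "54.180.0.3"])
```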

Ping-test all nodes:

$ chmod +x ping.sh && ./ping.sh

Now configure SSH for Open MPI, then download and build Open MPI:

$ chmod +x playbook.sh && ./playbook.sh

From the master node, test that MPI is working on all nodes:

$ chmod +x test.sh && ./test.sh

# go to master node.
ubuntu@master:~$ mpirun -np 3 -mca btl sm,self,tcp -host master,worker0,worker1 ./test
Processor name: master
master (0/3)
Processor name: worker0
slave  (1/3)
Processor name: worker1
slave  (2/3)

4. Install the deep learning framework you want, plus Horovod (both AWS and On-premise)

Feel free to adapt this part to your needs.

  • Install TensorFlow on CPU plus Horovod, and run distributed training

    $ chmod +x tensorflow.sh && ./tensorflow.sh
    
    # go to master node.
    ubuntu@master:~$ horovodrun -np 3 -H master,worker0,worker1 python3 tensorflow-train.py
  • Install PyTorch on CPU plus Horovod, and run distributed training

    $ chmod +x pytorch.sh && ./pytorch.sh
    
    # go to master node.
    ubuntu@master:~$ horovodrun -np 3 -H master,worker0,worker1 python3 pytorch-train.py
  • Issue note: if you want to change frameworks after installing Horovod, you must reinstall Horovod with the HOROVOD_WITH_* option, where * is the framework name; see the relevant Horovod issue. My Ansible scripts do not add this option yet.
