graykode / horovod-ansible

License: other
Create Horovod cluster easily using Ansible

Programming Languages

HCL, C, Shell, Dockerfile

Projects that are alternatives of or similar to horovod-ansible

pinpoint-node-agent
Pinpoint Node.js agent
Stars: ✭ 45 (+104.55%)
Mutual labels:  distributed-training
Fengshenbang-LM
Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center at IDEA, serving as infrastructure for Chinese AIGC and cognitive intelligence.
Stars: ✭ 1,813 (+8140.91%)
Mutual labels:  distributed-training
mpi-parallelization
Examples for MPI Spawning and Splitting, and the differences between two implementations
Stars: ✭ 16 (-27.27%)
Mutual labels:  openmpi
Paddle
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
Stars: ✭ 17,232 (+78227.27%)
Mutual labels:  distributed-training
Pytorch Image Models
PyTorch image models, scripts, pretrained weights -- ResNet, ResNeXT, EfficientNet, EfficientNetV2, NFNet, Vision Transformer, MixNet, MobileNet-V3/V2, RegNet, DPN, CSPNet, and more
Stars: ✭ 15,232 (+69136.36%)
Mutual labels:  distributed-training
Adanet
Fast and flexible AutoML with learning guarantees.
Stars: ✭ 3,340 (+15081.82%)
Mutual labels:  distributed-training
Byteps
A high performance and generic framework for distributed DNN training
Stars: ✭ 3,028 (+13663.64%)
Mutual labels:  distributed-training
Hetu
A high-performance distributed deep learning system targeting large-scale and automated distributed training.
Stars: ✭ 78 (+254.55%)
Mutual labels:  distributed-training
HyperGBM
A full pipeline AutoML tool for tabular data
Stars: ✭ 172 (+681.82%)
Mutual labels:  distributed-training
pytorch-model-parallel
A memory balanced and communication efficient FullyConnected layer with CrossEntropyLoss model parallel implementation in PyTorch
Stars: ✭ 74 (+236.36%)
Mutual labels:  distributed-training
dynamic-training-with-apache-mxnet-on-aws
Dynamic training with Apache MXNet reduces cost and time for training deep neural networks by leveraging AWS cloud elasticity and scale. The system reduces training cost and time by dynamically updating the training cluster size during training, with minimal impact on model training accuracy.
Stars: ✭ 51 (+131.82%)
Mutual labels:  distributed-training
HandyRL
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
Stars: ✭ 228 (+936.36%)
Mutual labels:  distributed-training
torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
Stars: ✭ 165 (+650%)
Mutual labels:  distributed-training
DistributedDeepLearning
Tutorials on running distributed deep learning on Batch AI
Stars: ✭ 23 (+4.55%)
Mutual labels:  distributed-training
PLSC
Paddle Large Scale Classification Tools; supports ArcFace, CosFace, PartialFC, Data Parallel + Model Parallel. Models include ResNet, ViT, DeiT, FaceViT.
Stars: ✭ 113 (+413.64%)
Mutual labels:  distributed-training
sagemaker-xgboost-container
A Docker container based on the open-source XGBoost framework (https://xgboost.readthedocs.io/en/latest/) that allows customers to use their own XGBoost scripts in SageMaker.
Stars: ✭ 93 (+322.73%)
Mutual labels:  distributed-training
basecls
A codebase & model zoo for pretrained backbone based on MegEngine.
Stars: ✭ 29 (+31.82%)
Mutual labels:  distributed-training
libai
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
Stars: ✭ 284 (+1190.91%)
Mutual labels:  distributed-training
l2hmc-qcd
Application of the L2HMC algorithm to simulations in lattice QCD.
Stars: ✭ 33 (+50%)
Mutual labels:  horovod
TonY
TonY is a framework to natively run deep learning frameworks on Apache Hadoop.
Stars: ✭ 687 (+3022.73%)
Mutual labels:  horovod

horovod-ansible

Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. Ansible is a radically simple IT automation system. Together they make it easy to install Horovod on every server automatically, whether on AWS or On-premise.
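
Under the hood, Horovod synchronizes gradients with ring-allreduce. A toy pure-Python simulation of that collective, as an illustration of the idea only, not Horovod's actual implementation (which runs over MPI/NCCL):

```python
def ring_allreduce(vectors):
    """Toy simulation of ring-allreduce across n simulated workers.

    Each worker's vector is split into n segments (one element per segment
    here, for simplicity); after 2*(n-1) neighbor-to-neighbor steps every
    worker holds the elementwise sum of all vectors.
    """
    n = len(vectors)
    data = [list(v) for v in vectors]
    # Phase 1, scatter-reduce: circulate partial sums around the ring so that
    # worker r ends up with the complete sum of segment (r + 1) % n.
    for step in range(n - 1):
        sends = [((r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r in range(n):
            seg, val = sends[(r - 1) % n]  # receive from the left neighbor
            data[r][seg] += val
    # Phase 2, allgather: circulate the completed segments around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r in range(n):
            seg, val = sends[(r - 1) % n]
            data[r][seg] = val
    return data

# Three "workers", three gradient elements each.
summed = ring_allreduce([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
```

Each step only exchanges one segment with a neighbor, so bandwidth per worker stays constant as the cluster grows; that is why the three-node cluster built below scales well.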

Before Start
  • All On-premise nodes should run Ubuntu >= 16.04. I assume that Ansible is already set up on all nodes (On-premise).
  • For now, only the TensorFlow and PyTorch examples can be used (not MXNet, Caffe, etc. yet).
  • AWS steps: 0 → 1 → 3
  • On-premise steps: 0 → 2 → 3

Usage

0. Docker setup (both AWS and On-premise)

All steps are carried out inside a Docker container to keep things simple for beginners.

$ docker run -it --name horovod-ansible graykode/horovod-ansible:0.1 /bin/bash

1. AWS

To create the Horovod cluster environment, start provisioning with the Terraform code. Change any options you want in variables.tf, but do not change anything below the ## DO NOT CHANGE BELOW comment.

If you create EC2 instances with number_of_worker set to 3, the overall architecture matches the picture below.

Export your own AWS access / secret keys:
$ export AWS_ACCESS_KEY_ID=<Your Access Key ID>
$ export AWS_SECRET_ACCESS_KEY=<Your Secret Access Key>

Initialize Terraform and create the private key to use:

$ cd terraform/ && ssh-keygen -t rsa -N "" -f horovod
$ terraform init

Provision all the resources: EC2 and the VPC (gateway, router, subnet, etc.):

$ terraform apply

Then you will get output like this:

Apply complete! Resources: 12 added, 0 changed, 0 destroyed.

Outputs:

horovod_master_public_ip = <master's public IP>
horovod_workers_public_ip = <worker0's public IP>,<worker1's public IP>
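
These two output values map directly onto the Ansible inventory used in step 3. A hypothetical helper, not part of this repo, that splits them, assuming the comma-separated format shown above (the IPs are example values):

```python
def parse_terraform_outputs(outputs):
    """Split Terraform's Horovod outputs into a master IP and a worker IP list.

    `outputs` is a dict keyed like this project's Terraform outputs, e.g.
    obtained from `terraform output -json`.
    """
    master = outputs["horovod_master_public_ip"]
    workers = [ip.strip() for ip in outputs["horovod_workers_public_ip"].split(",")]
    return master, workers

# Example values only; substitute the IPs from your own `terraform apply`.
master, workers = parse_terraform_outputs({
    "horovod_master_public_ip": "54.180.0.1",
    "horovod_workers_public_ip": "54.180.0.2,54.180.0.3",
})
```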

2. On-premise

As noted above, this assumes Ansible is set up on all nodes and the network configuration is finished. If you need to install Ansible, please read the Ansible Install Guide in the documentation.

3. Set up the Horovod configuration using Ansible (both AWS and On-premise)

Install ansible and jinja2 using pip:

$ cd ../ansible && pip install -r requirements.txt

Set up inventory.ini in the ansible folder:

master ansible_host=<master's public IP>
worker0 ansible_host=<worker0's public IP>
worker1 ansible_host=<worker1's public IP>
....
worker[n] ansible_host=

[all]
master
worker0
worker1
...
worker[n]

[master-servers]
master

[worker-servers]
worker0
worker1
...
worker[n]
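
Writing inventory.ini by hand gets tedious as the worker count grows; a hypothetical Python helper, not part of this repo, that renders the same layout (the IPs are example values):

```python
def render_inventory(master_ip, worker_ips):
    """Render an Ansible inventory.ini with one master and n workers."""
    names = ["master"] + [f"worker{i}" for i in range(len(worker_ips))]
    ips = [master_ip] + list(worker_ips)
    # Host lines: "<name> ansible_host=<ip>"
    lines = [f"{name} ansible_host={ip}" for name, ip in zip(names, ips)]
    lines += ["", "[all]"] + names
    lines += ["", "[master-servers]", "master"]
    lines += ["", "[worker-servers]"] + names[1:]
    return "\n".join(lines) + "\n"

ini = render_inventory("54.180.0.1", ["54.180.0.2", "54.180.0.3"])
```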

Ping-test all nodes:

$ chmod +x ping.sh && ./ping.sh

Now configure SSH for Open MPI, then download and build Open MPI:

$ chmod +x playbook.sh && ./playbook.sh

From the master node, test that MPI is working on all nodes:

$ chmod +x test.sh && ./test.sh

# go to master node.
ubuntu@master:~$ mpirun -np 3 -mca btl sm,self,tcp -host master,worker0,worker1 ./test
Processor name: master
master (0/3)
Processor name: worker0
slave  (1/3)
Processor name: worker1
slave  (2/3)

4. Install the deep learning framework you want, plus Horovod (both AWS and On-premise)

Feel free to adapt this part to your needs.

  • Install TensorFlow on CPU plus Horovod, and run distributed training

    $ chmod +x tensorflow.sh && ./tensorflow.sh
    
    # go to master node.
    ubuntu@master:~$ horovodrun -np 3 -H master,worker0,worker1 python3 tensorflow-train.py
  • Install PyTorch on CPU plus Horovod, and run distributed training

    $ chmod +x pytorch.sh && ./pytorch.sh
    
    # go to master node.
    ubuntu@master:~$ horovodrun -np 3 -H master,worker0,worker1 python3 pytorch-train.py
  • Issue note: if you want to change frameworks after installing Horovod, you must reinstall Horovod with the HOROVOD_WITH_* option, where * is the framework name; see the relevant Horovod issue. My Ansible scripts do not add this option yet.
