TorchElastic
TorchElastic allows you to launch distributed PyTorch jobs in a fault-tolerant and elastic manner. For the latest documentation, please refer to our website.
Requirements
torchelastic requires:
- python3 (3.8+)
- torch
- etcd
Installation
pip install torchelastic
Quickstart
Fault-tolerant on 4 nodes, 8 trainers/node, total 4 * 8 = 32 trainers. Run the following on all nodes:
python -m torchelastic.distributed.launch \
    --nnodes=4 \
    --nproc_per_node=8 \
    --rdzv_id=JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
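The launcher spawns YOUR_TRAINING_SCRIPT.py once per trainer and, like torch.distributed.launch, is expected to populate the standard rendezvous environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, LOCAL_RANK) for each process. A minimal sketch of a script that runs under this launcher, assuming env:// initialization, the gloo backend, and those variable names (illustrative choices, not mandated by this README):

    # minimal elastic-compatible trainer (sketch)
    import os
    import torch.distributed as dist

    def main():
        # Assumes the launcher exports RANK, WORLD_SIZE, MASTER_ADDR,
        # MASTER_PORT, and LOCAL_RANK, so env:// init needs no arguments.
        dist.init_process_group(backend="gloo", init_method="env://")
        rank, world = dist.get_rank(), dist.get_world_size()
        local_rank = os.environ.get("LOCAL_RANK", "0")
        print(f"trainer {rank}/{world} up (local rank {local_rank})")
        # ... build the model, wrap it in DistributedDataParallel,
        # and run the training loop here ...
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Because the launcher supplies the rendezvous variables, the same script runs unchanged under both the fault-tolerant and the elastic invocations below.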
Elastic on 1 ~ 4 nodes, 8 trainers/node, total 8 ~ 32 trainers. The job starts as soon as 1 node is healthy; you may add up to 4 nodes.
python -m torchelastic.distributed.launch \
    --nnodes=1:4 \
    --nproc_per_node=8 \
    --rdzv_id=JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
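On a membership change (a node joins, leaves, or fails), torchelastic re-runs rendezvous and restarts the workers, so a training script should periodically save a checkpoint and resume from the latest one on every (re)start. A hedged sketch of that pattern, where the checkpoint path and state layout are illustrative assumptions rather than part of torchelastic's API:

    # checkpoint-and-resume pattern for elastic restarts (sketch)
    import os
    import torch

    CHECKPOINT = "/tmp/checkpoint.pt"  # assumed path; use durable storage in practice

    def load_checkpoint(model, optimizer):
        # Workers restart on every membership change, so try to resume
        # from the most recent checkpoint at each (re)start.
        start_epoch = 0
        if os.path.exists(CHECKPOINT):
            state = torch.load(CHECKPOINT, map_location="cpu")
            model.load_state_dict(state["model"])
            optimizer.load_state_dict(state["optimizer"])
            start_epoch = state["epoch"] + 1
        return start_epoch

    def save_checkpoint(model, optimizer, epoch):
        # Called at the end of each epoch so little work is lost on a restart.
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            CHECKPOINT,
        )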
Contributing
We welcome PRs. See the CONTRIBUTING file.
License
torchelastic is BSD licensed, as found in the LICENSE file.