FTLib
FTLib (Fault-Tolerant Library) is a framework to keep data-parallel distributed training continue regardless worker loss or join. It exposes collective communication APIs with fault-tolerance support by gluing a consensus
to a communication library
, both of which can be user-specific. A distributed training using FTLib is able to continue as long as at least one single worker is alive and when new workers join the training.
Status
Prototyping
Design
Develop Guide
TODO Please refer to the design docs.
See also
Getting started
Where to use FTLib
- Less reliable infrastructure/script
Distributed training jobs running on less reliable infrastructure risks more as any worker or communication failure will leads to the termination of the entire job.
- Dynamic workload system
A system may reduce the total workload of distributed training jobs to release resources so that resource can be squeezed out for jobs with higher priority. Without such jobs with higher-priority, the system can increase the workload to avoid resource idling.
Requirements
The requirements for using FTLib
differs with choices of consensus and communication library. Please refer the requirements.txt
under each consensus and communication library(Not available, still in todo list).
Usage
Please refer test
for details on how to use FTLib
in distributed training.
Layout
.
├── CHANGELOG.md
├── deploy
├── docs
│ ├── design
│ └── imgs
├── ftlib
│ ├── consensus
│ ├── commlib
│ ├── ftlib_status.py
│ ├── __init__.py
│ └── rank_assign_scheme.py
├── LICENSE
├── OWNERS
├── README.md
├── requirements.txt
├── ROADMAP
├── scripts
└── test
License
FTLib is Apache license. Implementations of consensus and communication library may come with different licenses.