
kleveross / ftlib

Licence: Apache-2.0 License
Fault-tolerant for DL frameworks



FTLib


FTLib (Fault-Tolerant Library) is a framework that keeps data-parallel distributed training running regardless of worker loss or join. It exposes collective communication APIs with fault-tolerance support by gluing a consensus mechanism to a communication library, both of which can be user-specified. Distributed training that uses FTLib can continue as long as at least one worker is alive, and it can keep going when new workers join the training.
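The idea of pairing a membership agreement with a collective operation can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions: `Membership`, `WorkerLost`, `average`, and `ft_average` are hypothetical names invented for this example and are not FTLib's API. The "consensus" is a plain set of live workers, and the "communication library" is a function that fails as a whole when any member is unreachable.

```python
"""Toy simulation of fault-tolerant gradient averaging.

Every name in this file is hypothetical; none of it is FTLib's API.
"""


class WorkerLost(Exception):
    """Raised by the toy communication layer when a peer is unreachable."""

    def __init__(self, worker_id):
        super().__init__(f"worker {worker_id} is unreachable")
        self.worker_id = worker_id


class Membership:
    """Stand-in for the consensus component: the agreed set of live workers."""

    def __init__(self, worker_ids):
        self.alive = set(worker_ids)

    def remove(self, worker_id):
        self.alive.discard(worker_id)

    def add(self, worker_id):
        self.alive.add(worker_id)


def average(gradients, members, failed):
    """Stand-in for the communication library: average over `members`.

    Fails as a whole if any member is in `failed`, like a plain collective.
    """
    for worker_id in sorted(members):
        if worker_id in failed:
            raise WorkerLost(worker_id)
    return sum(gradients[w] for w in members) / len(members)


def ft_average(gradients, membership, failed):
    """Retry the collective after each membership change until it succeeds."""
    while membership.alive:
        try:
            return average(gradients, membership.alive, failed)
        except WorkerLost as err:
            # "Consensus" step: agree to drop the lost worker, then retry.
            membership.remove(err.worker_id)
    raise RuntimeError("no workers left")


gradients = {0: 1.0, 1: 2.0, 2: 6.0, 3: 4.0}
membership = Membership([0, 1, 2])

result_all = ft_average(gradients, membership, failed=set())
result_after_loss = ft_average(gradients, membership, failed={2})
membership.add(3)  # a new worker joins the training
result_after_join = ft_average(gradients, membership, failed=set())
```

In this sketch the averaging step never aborts the job: a lost worker only shrinks the agreed membership, and a joining worker is included in the next round.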

Status

Prototyping

Design

Develop Guide

TODO. Please refer to the design docs.

See also

Getting started

Where to use FTLib

  • Less reliable infrastructure/script

Distributed training jobs running on less reliable infrastructure carry more risk, because any worker or communication failure leads to the termination of the entire job.

  • Dynamic workload system

A system may reduce the total workload of distributed training jobs to release resources for jobs with higher priority. When there are no such higher-priority jobs, the system can increase the workload again to avoid idle resources.
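When the set of workers shrinks or grows, each remaining worker needs a rank and a share of the data that match the new membership. The following is a minimal sketch, assuming ranks are assigned by sorting worker identifiers and data is split round-robin. The function names and the scheme itself are hypothetical illustrations and do not describe what FTLib's `rank_assign_scheme.py` actually does.

```python
def assign_ranks(worker_ids):
    """Map each live worker to a dense rank 0..n-1 in sorted order (hypothetical scheme)."""
    return {worker: rank for rank, worker in enumerate(sorted(worker_ids))}


def shard(num_samples, rank, world_size):
    """Return the sample indices owned by `rank` under round-robin sharding."""
    return list(range(rank, num_samples, world_size))


workers = {"worker-a", "worker-b", "worker-c"}
ranks = assign_ranks(workers)

# The scheduler reclaims one worker for a higher-priority job.
workers.discard("worker-b")
ranks_after_shrink = assign_ranks(workers)
shards_after_shrink = {
    worker: shard(6, rank, len(workers))
    for worker, rank in ranks_after_shrink.items()
}
```

Because ranks are derived from the current membership alone, every surviving worker can recompute the same assignment independently once the membership is agreed upon.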

Requirements

The requirements for using FTLib differ with the choice of consensus and communication library. Please refer to the requirements.txt under each consensus and communication library (not available yet; still on the to-do list).

Usage

Please refer to test for details on how to use FTLib in distributed training.

Layout

.
├── CHANGELOG.md
├── deploy
├── docs
│   ├── design
│   └── imgs
├── ftlib
│   ├── consensus
│   ├── commlib
│   ├── ftlib_status.py
│   ├── __init__.py
│   └── rank_assign_scheme.py
├── LICENSE
├── OWNERS
├── README.md
├── requirements.txt
├── ROADMAP
├── scripts
└── test

License

FTLib is released under the Apache-2.0 license. Implementations of consensus and communication libraries may come with different licenses.
