sql-machine-learning / Elasticdl

License: MIT
Kubernetes-native Deep Learning Framework

Programming Languages

Python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to Elasticdl

Ra
A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.
Stars: ✭ 478 (-20.86%)
Mutual labels:  distributed-systems
Corfudb
A cluster consistency platform
Stars: ✭ 539 (-10.76%)
Mutual labels:  distributed-systems
Neutrino
Privacy-Preserving Bitcoin Light Client
Stars: ✭ 564 (-6.62%)
Mutual labels:  distributed-systems
Package
Metaparticle/Package: Language Fluent Containerization and Deployment in Java, .NET and JavaScript (and more coming soon)
Stars: ✭ 493 (-18.38%)
Mutual labels:  distributed-systems
Awesome Distributed Systems
Awesome list of distributed systems resources
Stars: ✭ 512 (-15.23%)
Mutual labels:  distributed-systems
Copycat
A novel implementation of the Raft consensus algorithm
Stars: ✭ 551 (-8.77%)
Mutual labels:  distributed-systems
Pysyncobj
A library for replicating your Python class between multiple servers, based on the Raft protocol
Stars: ✭ 468 (-22.52%)
Mutual labels:  distributed-systems
Pixie
Instant Kubernetes-Native Application Observability
Stars: ✭ 589 (-2.48%)
Mutual labels:  distributed-systems
Cadence
Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
Stars: ✭ 5,522 (+814.24%)
Mutual labels:  distributed-systems
Faang
Facebook, Amazon, Apple, Netflix, and Google (FAANG) job preparation.
Stars: ✭ 557 (-7.78%)
Mutual labels:  distributed-systems
Verdi
A framework for formally verifying distributed systems implementations in Coq
Stars: ✭ 496 (-17.88%)
Mutual labels:  distributed-systems
Iotex Core
Official implementation of IoTeX blockchain protocol in Go.
Stars: ✭ 505 (-16.39%)
Mutual labels:  distributed-systems
Pachyderm
Reproducible Data Science at Scale!
Stars: ✭ 5,305 (+778.31%)
Mutual labels:  distributed-systems
Scalecube Services
ScaleCube Services is a high-throughput, low-latency reactive microservices library built to scale. It features API gateways, service discovery, and service load balancing; the architecture supports plug-and-play service communication modules. It is built for performance and low-latency real-time stream processing, and it is open and designed to accommodate change (no sidecar in the form of a broker of any kind).
Stars: ✭ 482 (-20.2%)
Mutual labels:  distributed-systems
Golimit
Golimit is an Uber Ringpop-based distributed and decentralized rate limiter
Stars: ✭ 581 (-3.81%)
Mutual labels:  distributed-systems
Nsq
A realtime distributed messaging platform (forked from https://github.com/nsqio/nsq)
Stars: ✭ 476 (-21.19%)
Mutual labels:  distributed-systems
Reactivemanifesto
The Reactive Manifesto
Stars: ✭ 542 (-10.26%)
Mutual labels:  distributed-systems
Memento
Simple + Powerful interface to the Mnesia Distributed Database 💾
Stars: ✭ 597 (-1.16%)
Mutual labels:  distributed-systems
Minecase
Minecraft server based on Orleans
Stars: ✭ 581 (-3.81%)
Mutual labels:  distributed-systems
Git Bug
Distributed, offline-first bug tracker embedded in git, with bridges
Stars: ✭ 5,431 (+799.17%)
Mutual labels:  distributed-systems

ElasticDL: A Kubernetes-native Deep Learning Framework


ElasticDL is a Kubernetes-native deep learning framework built on top of TensorFlow 2.0 that supports fault tolerance and elastic scheduling.

Main Features

Elastic Scheduling and Fault-Tolerance

Through its Kubernetes-native design, ElasticDL enables fault tolerance and works with the priority-based preemption of Kubernetes to achieve elastic scheduling for deep learning tasks.

TensorFlow 2.0 Eager Execution

A distributed deep learning framework needs to know the local gradients before each model update. Eager Execution lets ElasticDL obtain them without hacking into the graph execution process.
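
For illustration, here is a minimal sketch, not ElasticDL's actual source, of how a worker can obtain local gradients under TensorFlow 2.x eager execution with tf.GradientTape; the model and loss function are placeholders:

import tensorflow as tf

# Placeholder model and loss; any Keras model works the same way.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def local_gradients(features, labels):
    # Record the forward pass so the tape can compute gradients eagerly,
    # without touching TensorFlow's graph execution machinery.
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        loss = loss_fn(labels, logits)
    # These local gradients can be sent to a parameter server or
    # aggregated across workers before the model update.
    return tape.gradient(loss, model.trainable_variables)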

Minimalist Interface

Given a model defined with the Keras API, you can train it distributedly with a single command line:

elasticdl train \
  --image_name=elasticdl:mnist \
  --model_zoo=model_zoo \
  --model_def=mnist.mnist_functional_api.custom_model \
  --training_data=/data/mnist/train \
  --job_name=test-mnist \
  --volume="host_path=/data,mount_path=/data"

Integration with SQLFlow

ElasticDL will be integrated seamlessly with SQLFlow to connect SQL to distributed deep learning tasks.

SELECT * FROM employee LABEL income INTO my_elasticdl_model

Quick Start

Please check out our step-by-step tutorial for running ElasticDL on a local laptop, an on-premises cluster, or a public cloud such as Google Kubernetes Engine.

Background

TensorFlow has a native distributed computing feature that is fault-recoverable: if some processes fail, the distributed job fails, but we can restart the job and recover its state from the most recent checkpoint files.
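
As a reminder of what this checkpoint-based recovery looks like in plain TensorFlow, here is a minimal sketch (the checkpoint directory is an assumption):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD()
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="/tmp/ckpt", max_to_keep=3)

# After a restart, resume from the most recent checkpoint; this is a
# no-op if no checkpoint exists yet.
ckpt.restore(manager.latest_checkpoint)

# ... training steps ...
manager.save()  # save periodically so a restart loses little work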

ElasticDL, as an enhancement of TensorFlow's distributed training feature, supports fault tolerance: if some processes fail, the job keeps running. Therefore, ElasticDL needs neither to save checkpoints nor to recover from them.

Fault tolerance makes ElasticDL work with the priority-based preemption of Kubernetes to achieve elastic scheduling. When Kubernetes kills some processes of a job to free resources for newly arrived jobs with higher priority, the current job doesn't fail but continues with fewer resources.

Elastic scheduling can significantly improve the overall utilization of a cluster. Suppose a cluster has N GPUs and a job is using one of them. Without elastic scheduling, a new job claiming N GPUs would have to wait for the first job to complete before starting. The pending time could be hours, days, or even weeks, during which the utilization of the cluster is 1/N. With elastic scheduling, the new job could start running immediately with N-1 GPUs, and Kubernetes might increase its GPU consumption by 1 after the first job completes. In this case, the overall utilization is 100%.
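
A quick back-of-the-envelope check of this claim, assuming a cluster of N = 8 GPUs:

N = 8                   # assumed cluster size in GPUs
util_without = 1 / N    # the new job waits; only the first job's GPU is busy
util_with = N / N       # the new job starts at once on the other N-1 GPUs
print(f"without elastic scheduling: {util_without:.0%}")  # 12%
print(f"with elastic scheduling: {util_with:.0%}")        # 100%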

ElasticDL's elastic scheduling comes from its Kubernetes-native design: it doesn't rely on Kubernetes extensions like Kubeflow to run TensorFlow programs. Instead, the master process of an ElasticDL job calls the Kubernetes API to start workers and parameter servers; it also watches events like pod killings and reacts to them to realize fault tolerance.
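
To give a flavor of this design, here is an illustrative sketch using the official kubernetes Python client, not ElasticDL's actual source; the pod name, image, and namespace are assumptions:

from kubernetes import client, config, watch

config.load_incluster_config()  # the master itself runs as a pod in the cluster
v1 = client.CoreV1Api()

# Start a worker pod (name and image are illustrative).
worker = client.V1Pod(
    metadata=client.V1ObjectMeta(name="elasticdl-worker-0"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(name="worker", image="elasticdl:mnist")],
    ),
)
v1.create_namespaced_pod(namespace="default", body=worker)

# Watch pod events; on deletion (e.g., preemption), react instead of failing.
w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace="default"):
    if event["type"] == "DELETED":
        name = event["object"].metadata.name
        print(f"pod {name} was killed; reassigning its tasks to other workers")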

In short, ElasticDL enhances TensorFlow with fault tolerance and elastic scheduling when you have a Kubernetes cluster. We provide a tutorial showing how to set up a Kubernetes cluster on Google Cloud and run ElasticDL jobs there. We respect TensorFlow's native distributed computing feature, which doesn't require a specific computing platform like Kubernetes and allows TensorFlow to run on any platform.

Development Guide

Please refer to this document for the development guide.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].