petuum / AdaptDL
================

License: Apache-2.0

Resource-adaptive cluster scheduler for deep learning training.


.. image:: _static/img/AdaptDLHorizLogo.png
   :align: center

.. image:: https://img.shields.io/github/workflow/status/petuum/adaptdl/Test
   :target: https://github.com/petuum/adaptdl/actions?query=workflow%3ATest
   :alt: GitHub Workflow Status

.. image:: https://codecov.io/gh/petuum/adaptdl/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/petuum/adaptdl

.. image:: https://readthedocs.org/projects/adaptdl/badge/?version=latest
   :target: https://adaptdl.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

.. image:: https://img.shields.io/pypi/v/adaptdl
   :target: https://pypi.org/project/adaptdl/
   :alt: PyPI

Introduction
------------

`Documentation <https://adaptdl.readthedocs.org>`_ | `Examples <https://github.com/petuum/adaptdl/tree/master/examples>`_

.. include-start-after

AdaptDL is a resource-adaptive deep learning (DL) training and scheduling framework, and is part of the `CASL open source project <https://www.casl-project.ai>`_. The goal of AdaptDL is to make distributed DL easy and efficient in dynamic-resource environments such as shared clusters and the cloud.

AdaptDL consists of two components, which can be used together or separately:

  • adaptdl-sched: A cluster scheduler on Kubernetes optimized for distributed deep learning training.
  • adaptdl: A library for adaptive batch sizes that can efficiently scale distributed training to many nodes.

Some core features offered by AdaptDL are:

  • Elastically schedule distributed DL training jobs in shared clusters.
  • Cost-aware resource auto-scaling in cloud computing environments (e.g. AWS).
  • Automatic batch size and learning rate scaling for distributed training.

AdaptDL supports PyTorch training programs. TensorFlow support coming soon!

Why AdaptDL?
------------

Efficient Resource Management
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The AdaptDL scheduler directly optimizes cluster-wide training performance and resource utilization by using a genetic algorithm to periodically optimize resource allocations for all jobs. Through elastic re-scaling, co-adapting batch sizes and learning rates, and avoiding network interference, AdaptDL significantly accelerates shared-cluster training compared with alternative schedulers. For details, please see our `technical paper <https://arxiv.org/pdf/2008.12260.pdf>`_.
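To build intuition for allocation search, here is a toy greedy allocator that hands out GPUs one at a time to whichever job gains the most goodput from an extra GPU. This is only an illustrative sketch, not AdaptDL's actual genetic algorithm, and the speedup numbers are made up:

.. code-block:: python

   def allocate(num_gpus, goodput_curves):
       """Greedily assign GPUs to maximize total cluster goodput.

       goodput_curves[j][k] = goodput of job j when given k GPUs,
       for k = 0 .. num_gpus (hypothetical, pre-measured values).
       """
       alloc = [0] * len(goodput_curves)
       for _ in range(num_gpus):
           # Marginal goodput gain of one more GPU for each job.
           gains = [curve[a + 1] - curve[a]
                    for a, curve in zip(alloc, goodput_curves)]
           best = max(range(len(gains)), key=gains.__getitem__)
           alloc[best] += 1
       return alloc

For example, with two jobs whose goodput flattens out at different rates, the greedy pass splits four GPUs evenly rather than starving one job.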

.. image:: _static/img/scheduling-performance.png
   :align: center

In the cloud (e.g. AWS), AdaptDL auto-scales the size of the cluster based on how well its resources are utilized. AdaptDL automatically provisions spot instances when available to reduce cost by up to 80%.

Adaptive Batch Size Scaling
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Efficient distributed training requires careful selection of the batch size and learning rate, which can be tricky to find manually. AdaptDL offers automatic batch size and learning rate scaling, which enables efficient distributed training without requiring manual effort. To achieve this, AdaptDL measures the system performance and `gradient noise scale <https://arxiv.org/pdf/1812.06162.pdf>`_ during training, adaptively selects the most efficient batch size, and scales the learning rate using `AdaScale <https://arxiv.org/pdf/2007.05105.pdf>`_.
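The two quantities involved can be sketched numerically. Below is a minimal, self-contained sketch (not AdaptDL's implementation) of the simple gradient noise scale estimated from gradient norms measured at two batch sizes, and the AdaScale learning-rate gain for running at ``S`` times the base batch size:

.. code-block:: python

   def gradient_noise_scale(g2_small, g2_big, b_small, b_big):
       """Estimate the simple noise scale B_noise = tr(Sigma) / |G|^2
       from E[|G_B|^2] = |G|^2 + tr(Sigma)/B measured at two batch sizes."""
       true_g2 = (b_big * g2_big - b_small * g2_small) / (b_big - b_small)
       trace_sigma = (g2_small - g2_big) / (1.0 / b_small - 1.0 / b_big)
       return trace_sigma / true_g2

   def adascale_gain(noise_scale, scale):
       """AdaScale gain r_S = (sigma2 + mu2) / (sigma2/S + mu2),
       written in terms of noise_scale = sigma2 / mu2."""
       return (noise_scale + 1.0) / (noise_scale / scale + 1.0)

The gain lies between 1 (noiseless gradients: large batches add nothing, keep the base learning rate) and ``S`` (very noisy gradients: scale the learning rate linearly with the batch size).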

.. image:: _static/img/autobsz-performance.png
   :align: center

Easy-to-use Elastic API
^^^^^^^^^^^^^^^^^^^^^^^

Making training programs run elastically can be challenging and error-prone. AdaptDL offers APIs which make it easy to enable elasticity for data-parallel PyTorch programs. Simply change a few lines of code, without heavy refactoring!

BEFORE:

.. code-block:: python

   torch.distributed.init_process_group("nccl")
   model = torch.nn.parallel.DistributedDataParallel(model)
   dataloader = torch.utils.data.DataLoader(dataset, batch_size=128)

   for epoch in range(100):
       ...

AFTER:

.. code-block:: python

   adaptdl.torch.init_process_group("nccl")
   model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)
   dataloader = adaptdl.torch.AdaptiveDataLoader(dataset, batch_size=128)

   for epoch in adaptdl.torch.remaining_epochs_until(100):
       ...
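The idea behind the epoch loop can be illustrated with a toy sketch (not AdaptDL's code; the real library persists progress to checkpoints so a rescaled job resumes where it left off). Here the epoch counter lives in an explicit state dict standing in for a checkpoint:

.. code-block:: python

   def remaining_epochs_until(total, state):
       """Yield only the epochs not yet completed according to `state`,
       so a restarted job replays just the remaining work."""
       while state["epoch"] < total:
           yield state["epoch"]
           state["epoch"] += 1  # AdaptDL would checkpoint this instead

For example, a job restarted after completing 97 of 100 epochs runs only epochs 97, 98, and 99.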

.. include-end-before

Getting Started
---------------

AdaptDL consists of a Kubernetes job scheduler and an adaptive training library. They can be used in two ways:

  1. Scheduling multiple training jobs on a shared cluster or the cloud (`Scheduler Installation <https://adaptdl.readthedocs.io/en/latest/installation/index.html>`_).
  2. Adapting the batch size and learning rate for a single training job (`Standalone Training <https://adaptdl.readthedocs.io/en/latest/standalone-training.html>`_).

.. image:: _static/img/Petuum.png
   :align: center
