petuum / AdaptDL
================

License: Apache-2.0

Resource-adaptive cluster scheduler for deep learning training.


.. image:: _static/img/AdaptDLHorizLogo.png
   :align: center

.. image:: https://img.shields.io/github/workflow/status/petuum/adaptdl/Test
   :target: https://github.com/petuum/adaptdl/actions?query=workflow%3ATest
   :alt: GitHub Workflow Status

.. image:: https://codecov.io/gh/petuum/adaptdl/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/petuum/adaptdl

.. image:: https://readthedocs.org/projects/adaptdl/badge/?version=latest
   :target: https://adaptdl.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

.. image:: https://img.shields.io/pypi/v/adaptdl
   :target: https://pypi.org/project/adaptdl/
   :alt: PyPI

Introduction
------------

`Documentation <https://adaptdl.readthedocs.org>`_ | `Examples <https://github.com/petuum/adaptdl/tree/master/examples>`_

.. include-start-after

AdaptDL is a resource-adaptive deep learning (DL) training and scheduling framework, and is part of the `CASL open source project <https://www.casl-project.ai>`_. The goal of AdaptDL is to make distributed DL easy and efficient in dynamic-resource environments such as shared clusters and the cloud.

AdaptDL consists of two components, which can be used together or separately:

  • adaptdl-sched: A cluster scheduler on Kubernetes optimized for distributed deep learning training.
  • adaptdl: A library for adaptive batch sizes that can efficiently scale distributed training to many nodes.

Some core features offered by AdaptDL are:

  • Elastically schedule distributed DL training jobs in shared clusters.
  • Cost-aware resource auto-scaling in cloud computing environments (e.g. AWS).
  • Automatic batch size and learning rate scaling for distributed training.

AdaptDL supports PyTorch training programs. TensorFlow support coming soon!

Why AdaptDL?
------------

Efficient Resource Management
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The AdaptDL scheduler directly optimizes cluster-wide training performance and resource utilization by using a genetic algorithm to periodically optimize resource allocations for all jobs. Through elastic re-scaling, co-adapting batch sizes and learning rates, and avoiding network interference, AdaptDL significantly accelerates shared-cluster training compared with alternative schedulers. For details, please see our `technical paper <https://arxiv.org/pdf/2008.12260.pdf>`_.
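To build intuition for allocation search, here is a toy greedy allocator that hands out GPUs one at a time to whichever job gains the most goodput from an extra GPU. This is only an illustrative sketch, not AdaptDL's actual genetic algorithm, and the speedup numbers are made up:

.. code-block:: python

   def allocate(num_gpus, goodput_curves):
       """Greedily assign GPUs to maximize total cluster goodput.

       goodput_curves[j][k] = goodput of job j when given k GPUs,
       for k = 0 .. num_gpus (hypothetical, pre-measured values).
       """
       alloc = [0] * len(goodput_curves)
       for _ in range(num_gpus):
           # Marginal goodput gain of one more GPU for each job.
           gains = [curve[a + 1] - curve[a]
                    for a, curve in zip(alloc, goodput_curves)]
           best = max(range(len(gains)), key=gains.__getitem__)
           alloc[best] += 1
       return alloc

For example, with two jobs whose goodput flattens out at different rates, the greedy pass splits four GPUs evenly rather than starving one job.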

.. image:: _static/img/scheduling-performance.png
   :align: center

In the cloud (e.g. AWS), AdaptDL auto-scales the size of the cluster based on how well its resources are utilized. AdaptDL automatically provisions spot instances when available to reduce cost by up to 80%.

Adaptive Batch Size Scaling
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Efficient distributed training requires careful selection of the batch size and learning rate, which can be tricky to find manually. AdaptDL offers automatic batch size and learning rate scaling, which enables efficient distributed training without requiring manual effort. To achieve this, AdaptDL measures the system performance and `gradient noise scale <https://arxiv.org/pdf/1812.06162.pdf>`_ during training, adaptively selects the most efficient batch size, and scales the learning rate using `AdaScale <https://arxiv.org/pdf/2007.05105.pdf>`_.
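The two quantities involved can be sketched numerically. Below is a minimal, self-contained sketch (not AdaptDL's implementation) of the simple gradient noise scale estimated from gradient norms measured at two batch sizes, and the AdaScale learning-rate gain for running at ``S`` times the base batch size:

.. code-block:: python

   def gradient_noise_scale(g2_small, g2_big, b_small, b_big):
       """Estimate the simple noise scale B_noise = tr(Sigma) / |G|^2
       from E[|G_B|^2] = |G|^2 + tr(Sigma)/B measured at two batch sizes."""
       true_g2 = (b_big * g2_big - b_small * g2_small) / (b_big - b_small)
       trace_sigma = (g2_small - g2_big) / (1.0 / b_small - 1.0 / b_big)
       return trace_sigma / true_g2

   def adascale_gain(noise_scale, scale):
       """AdaScale gain r_S = (sigma2 + mu2) / (sigma2/S + mu2),
       written in terms of noise_scale = sigma2 / mu2."""
       return (noise_scale + 1.0) / (noise_scale / scale + 1.0)

The gain lies between 1 (noiseless gradients: large batches add nothing, keep the base learning rate) and ``S`` (very noisy gradients: scale the learning rate linearly with the batch size).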

.. image:: _static/img/autobsz-performance.png
   :align: center

Easy-to-use Elastic API
^^^^^^^^^^^^^^^^^^^^^^^

Making training programs run elastically can be challenging and error-prone. AdaptDL offers APIs which make it easy to enable elasticity for data-parallel PyTorch programs. Simply change a few lines of code, without heavy refactoring!

BEFORE:

.. code-block:: python

   torch.distributed.init_process_group("nccl")
   model = torch.nn.parallel.DistributedDataParallel(model)
   dataloader = torch.utils.data.DataLoader(dataset, batch_size=128)

   for epoch in range(100):
       ...

AFTER:

.. code-block:: python

   adaptdl.torch.init_process_group("nccl")
   model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)
   dataloader = adaptdl.torch.AdaptiveDataLoader(dataset, batch_size=128)

   for epoch in adaptdl.torch.remaining_epochs_until(100):
       ...
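The idea behind the epoch loop can be illustrated with a toy sketch (not AdaptDL's code; the real library persists progress to checkpoints so a rescaled job resumes where it left off). Here the epoch counter lives in an explicit state dict standing in for a checkpoint:

.. code-block:: python

   def remaining_epochs_until(total, state):
       """Yield only the epochs not yet completed according to `state`,
       so a restarted job replays just the remaining work."""
       while state["epoch"] < total:
           yield state["epoch"]
           state["epoch"] += 1  # AdaptDL would checkpoint this instead

For example, a job restarted after completing 97 of 100 epochs runs only epochs 97, 98, and 99.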

.. include-end-before

Getting Started
---------------

AdaptDL consists of a Kubernetes job scheduler and an adaptive training library. They can be used in two ways:

  1. Scheduling multiple training jobs on a shared cluster or the cloud (`Scheduler Installation <https://adaptdl.readthedocs.io/en/latest/installation/index.html>`_).
  2. Adapting the batch size and learning rate for a single training job (`Standalone Training <https://adaptdl.readthedocs.io/en/latest/standalone-training.html>`_).

.. image:: _static/img/Petuum.png
   :align: center
