
grailbio / diviner

License: Apache-2.0
Diviner is a serverless machine learning and hyperparameter tuning platform

Programming Languages

Go

Projects that are alternatives to or similar to diviner

Auptimizer
An automatic ML model optimization tool.
Stars: ✭ 166 (+773.68%)
Mutual labels:  hyperparameter-tuning
open-box
Generalized and Efficient Blackbox Optimization System.
Stars: ✭ 64 (+236.84%)
Mutual labels:  hyperparameter-tuning
map-floodwater-satellite-imagery
This repository focuses on training semantic segmentation models to predict the presence of floodwater for disaster prevention. Models were trained using SageMaker and Colab.
Stars: ✭ 21 (+10.53%)
Mutual labels:  hyperparameter-tuning
Coursera Deep Learning Specialization
Notes, programming assignments and quizzes from all courses within the Coursera Deep Learning specialization offered by deeplearning.ai: (i) Neural Networks and Deep Learning; (ii) Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization; (iii) Structuring Machine Learning Projects; (iv) Convolutional Neural Networks; (v) Sequence Models
Stars: ✭ 188 (+889.47%)
Mutual labels:  hyperparameter-tuning
Hypernets
A General Automated Machine Learning framework to simplify the development of End-to-end AutoML toolkits in specific domains.
Stars: ✭ 221 (+1063.16%)
Mutual labels:  hyperparameter-tuning
skrobot
skrobot is a Python module for designing, running and tracking Machine Learning experiments / tasks. It is built on top of scikit-learn framework.
Stars: ✭ 22 (+15.79%)
Mutual labels:  hyperparameter-tuning
Forecasting
Time Series Forecasting Best Practices & Examples
Stars: ✭ 2,123 (+11073.68%)
Mutual labels:  hyperparameter-tuning
irace
Iterated Racing for Automatic Algorithm Configuration
Stars: ✭ 26 (+36.84%)
Mutual labels:  hyperparameter-tuning
scikit-hyperband
A scikit-learn compatible implementation of hyperband
Stars: ✭ 68 (+257.89%)
Mutual labels:  hyperparameter-tuning
naturalselection
A general-purpose pythonic genetic algorithm.
Stars: ✭ 17 (-10.53%)
Mutual labels:  hyperparameter-tuning
Lale
Library for Semi-Automated Data Science
Stars: ✭ 198 (+942.11%)
Mutual labels:  hyperparameter-tuning
Tune Sklearn
A drop-in replacement for Scikit-Learn’s GridSearchCV / RandomizedSearchCV -- but with cutting edge hyperparameter tuning techniques.
Stars: ✭ 241 (+1168.42%)
Mutual labels:  hyperparameter-tuning
mango
Parallel Hyperparameter Tuning in Python
Stars: ✭ 241 (+1168.42%)
Mutual labels:  hyperparameter-tuning
Sentence Classification
Sentence Classifications with Neural Networks
Stars: ✭ 177 (+831.58%)
Mutual labels:  hyperparameter-tuning
differential-privacy-bayesian-optimization
This repo contains the underlying code for all the experiments from the paper: "Automatic Discovery of Privacy-Utility Pareto Fronts"
Stars: ✭ 22 (+15.79%)
Mutual labels:  hyperparameter-tuning
Rl Baselines3 Zoo
A collection of pre-trained RL agents using Stable Baselines3, training and hyperparameter optimization included.
Stars: ✭ 161 (+747.37%)
Mutual labels:  hyperparameter-tuning
mlr3tuning
Hyperparameter optimization package of the mlr3 ecosystem
Stars: ✭ 44 (+131.58%)
Mutual labels:  hyperparameter-tuning
mltb
Machine Learning Tool Box
Stars: ✭ 25 (+31.58%)
Mutual labels:  hyperparameter-tuning
maggy
Distribution transparent Machine Learning experiments on Apache Spark
Stars: ✭ 83 (+336.84%)
Mutual labels:  hyperparameter-tuning
pyAudioProcessing
Audio feature extraction and classification
Stars: ✭ 165 (+768.42%)
Mutual labels:  hyperparameter-tuning

Diviner

Diviner is a serverless machine learning and hyperparameter tuning platform. Diviner runs studies on behalf of a user; each study comprises a set of hyperparameters (e.g., learning rate, data augmentation policy, loss functions) and instructions for how to instantiate a trial based on a set of concrete hyperparameter values. Diviner then manages trial execution, book-keeping, and hyperparameter optimization based on past trials.

Diviner can be used as a scriptable tool or from Go programs through a Go package.

Studies, trials, and runs

Diviner defines a data model that is rooted in user-defined studies. A study contains all the information needed to conduct a number of trials, including a set of hyperparameters over which to conduct the study. A trial is an instantiation of a set of valid hyperparameter values. A run is a trial attempt; runs may fail or be retried.

Diviner stores studies and runs in an underlying database, keyed by study names. The database is used to construct leaderboards that show the best-performing hyperparameter combinations for a study. The database can also be used to query pending runs and detailed information about each.

The diviner tool interprets study definitions written in Starlark. A study definition includes the hyperparameter definitions and a function that determines how to conduct a run based on a set of parameter values selected by an optimizer (called an oracle).
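
Schematically, a study definition ties these pieces together. The outline below uses only intrinsics that appear in the full example in the next section; the names my_study and run_fn are placeholders, and the elided arguments (...) must be filled in:

def run_fn(params):
    # Return a run_config describing how to execute one trial
    # with the given parameter values.
    return run_config(...)

study(
    name="my_study",
    params={...},                # hyperparameter definitions
    objective=minimize("loss"),  # metric to optimize
    oracle=grid_search,          # how to explore the parameter space
    run=run_fn,
)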

Example: optimizing MNIST with PyTorch

In this example, we'll run a hyperparameter search on the example PyTorch convolutional neural network on the MNIST dataset.

By default, Diviner uses a DynamoDB table called "diviner". To set this up, we have to run diviner create-table before we begin using Diviner.

$ diviner create-table

Now we're ready to define our study and instruct Diviner how to run trials.

First, create a file called mnist.dv. We are interested in running training on GPU instances in AWS EC2, so the first thing we do is define an ec2system for the trainers to run on:

ec2gpu = ec2system(
    "ec2gpu",
    region="us-west-2",
    ami="ami-01a4e5be5f289dd12",
    security_group="SECURITY_GROUP",
    instance_type="p3.2xlarge",
    disk_space=100,
    data_space=500,
    on_demand=True,
    flavor="ubuntu")

The AMI named here is the AWS Deep Learning AMI (DLAMI) in us-west-2 as of this writing. The security group should allow external access to port 443 (HTTPS).

We'll run our examples on p3.2xlarge instances, which are the smallest GPU instances provided by EC2.

Next, we'll define a function that is called with a set of parameter values. The function, run_mnist, returns a run configuration based on the provided parameter values. The run config defines how a trial is to be run given these parameters. In our case, we keep it very simple: we use the PyTorch-provided Docker image to invoke the MNIST example with the provided parameter values.

def run_mnist(params):
    return run_config(
        system=ec2gpu,
        script="""
nvidia-docker run -i pytorch/pytorch bash -x <<EOF
git clone https://github.com/pytorch/examples.git
python examples/mnist/main.py \
	--batch-size %(batch_size)d \
	--lr %(learning_rate)f \
	--epochs %(epochs)d  2>&1 | \
	awk '/^Test set:/ {gsub(",", "", \$5); print "METRICS: loss="\$5} {print}'
EOF
""" % params)

The only thing of note here is that we're using awk to pull out the test losses reported by the PyTorch trainer and to format them in the manner expected by Diviner. (Any line in the process's stdout beginning with "METRICS: " and followed by a set of key=value pairs is interpreted by Diviner as metrics reported by the trial.) The run config also names the system definition on which to run the trial.
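
For example, a trial whose script prints the line below reports a single metric named loss with value 0.0398. Each key=value pair on a METRICS: line becomes a metric for that run, and later METRICS: lines report updated values as training progresses:

METRICS: loss=0.0398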

Now that we have a system definition and a run function, we can define our study. The study declares the set of parameters and their ranges. In this case, we're using discrete parameter ranges, but users can also define continuous ranges (a sketch follows the study definition below). The Starlark intrinsics available in Diviner are documented in the package documentation. We define the study's objective to minimize the metric "loss" (as reported by the runner above). The oracle, grid_search, defines how the parameter space is to be explored; grid search explores it exhaustively.

study(
    name="mnist",
    params={
        "epochs": discrete(1, 10, 50, 100),
        "batch_size": discrete(32, 64, 128, 256),
        "learning_rate": discrete(0.001, 0.01, 0.1, 1.0),
    },
    objective=minimize("loss"),
    oracle=grid_search,
    run=run_mnist,
)
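
As a sketch of the continuous alternative mentioned above, the params below replace the discrete learning-rate grid with a continuous interval. The intrinsic name range is an assumption here (consult the documented Starlark intrinsics for the actual name and signature), and a continuous range cannot be exhaustively enumerated, so such a study would pair it with an oracle other than grid_search:

params={
    "epochs": discrete(1, 10, 50, 100),
    "batch_size": discrete(32, 64, 128, 256),
    # Hypothetical continuous interval over [0.001, 1.0];
    # the intrinsic name is an assumption.
    "learning_rate": range(0.001, 1.0),
}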

Finally, we can run the study. We run it in "streaming" mode, meaning that new trials are started as soon as capacity allows. The -trials argument determines how many trials may run in parallel (and, in our case, how many EC2 GPU instances are created at a time).

$ diviner run -stream -trials 5 mnist.dv

While the study is running, we can query the database. For example, to see the current set of trials running:

$ diviner list -runs -state pending -s
mnist:9  6:02PM 11m59s pending running: Train Epoch: 16 [45120/60000 (75%)] Loss: 0.171844
mnist:10 6:08PM 5m29s  pending running: Train Epoch: 9 [14080/60000 (23%)]  Loss: 0.178012
mnist:11 6:09PM 5m14s  pending running:
mnist:12 6:09PM 5m14s  pending running: Train Epoch: 5 [56320/60000 (94%)] Loss: 0.380566
mnist:13 6:09PM 4m59s  pending running: Train Epoch: 7 [19200/60000 (32%)] Loss: 0.069407

(The last column in this output shows the last line each run printed to standard output.) We can also examine the details of a particular run. Runs are named by the study and an index.

$ diviner info mnist:9
run mnist:9:
    state:     pending
    created:   2019-10-25 18:02:14 -0700 PDT
    runtime:   13m29.972611766s
    restarts:  0
    replicate: 0
    values:
        batch_size:    32
        epochs:        50
        learning_rate: 0.001
    metrics:
        loss: 0.0398

Here we can see the parameter values used in this run, the latest reported metrics, and some other relevant metadata. Passing -v to diviner info gives even more detail, including all of the reported metrics and the rendered script.

$ diviner info -v mnist:9
run mnist:9:
    state:     pending
    created:   2019-10-25 18:02:14 -0700 PDT
    runtime:   13m59.973265371s
    restarts:  0
    replicate: 0
    values:
        batch_size:    32
        epochs:        50
        learning_rate: 0.001
    metrics[0]:
        loss: 0.2996
    metrics[1]:
        loss: 0.1848
    ...
    metrics[17]:
        loss: 0.0398
    script:
        nvidia-docker run -i pytorch/pytorch bash -x <<EOF
        git clone https://github.com/pytorch/examples.git
        python examples/mnist/main.py  --batch-size 32  --lr 0.001000  --epochs 50  2>&1 |  awk '/^Test set:/ {gsub(",", "", \$5); print "METRICS: loss="\$5} {print}'
        EOF

After Diviner has accumulated a number of trials, we can request the current leaderboard:

$ diviner leaderboard mnist
study    replicates loss   batch_size epochs learning_rate
mnist:21 0          0.0255 32         10     0.01
mnist:28 0          0.0265 256        50     0.01
mnist:9  0          0.027  32         50     0.001
mnist:14 0          0.0285 64         100    0.001
mnist:27 0          0.029  128        50     0.01
mnist:13 0          0.0298 32         100    0.001
...

This tells us that the best hyperparameters (so far) for this MNIST classification task are batch_size=32, epochs=10, and learning_rate=0.01.

In addition to tracking studies and runs, Diviner maintains logs for each run. These can be useful when debugging issues or monitoring ongoing jobs. For example, to view the logs of the best entry in the above study:

$ diviner logs mnist:21
diviner: started run (try 1) at 2019-10-25 18:41:57.84792 -0700 PDT on https://ec2-52-88-125-113.us-west-2.compute.amazonaws.com/
+ git clone https://github.com/pytorch/examples.git
Cloning into 'examples'...
+ python examples/mnist/main.py --batch-size 32 --lr 0.010000 --epochs 10
+ awk '/^Test set:/ {gsub(",", "", $5); print "METRICS: loss="$5} {print}'
9920512it [00:01, 8438927.61it/s]
32768it [00:00, 138802.83it/s]
1654784it [00:00, 2387828.32it/s]
8192it [00:00, 53052.53it/s]            Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
Processing...
Done!
Train Epoch: 1 [0/60000 (0%)]	Loss: 2.301286
Train Epoch: 1 [320/60000 (1%)]	Loss: 2.246368
Train Epoch: 1 [640/60000 (1%)]	Loss: 2.143235
Train Epoch: 1 [960/60000 (2%)]	Loss: 2.020995
Train Epoch: 1 [1280/60000 (2%)]	Loss: 1.889598
Train Epoch: 1 [1600/60000 (3%)]	Loss: 1.387788
Train Epoch: 1 [1920/60000 (3%)]	Loss: 1.103807
...