What happens when an ML pipeline meets Hydra?
Hydra is a handy and powerful tool that can dramatically reduce boilerplate code and dynamically compose various configurations. I started with the idea that Hydra
could be used in an ML pipeline as well, and this is a Python app in the form of a template that I quickly implemented. Feedback is always welcome.
Assumption
Our ML pipeline consists of the following three steps. I think these are the minimum steps for an ML pipeline, and you can add other steps as you need.
preprocessing
: prepare data
modeling
: train and validate model
deployment
: deploy model to serving cluster
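The three steps above can be sketched as plain Python functions chained in order. This is a minimal illustration only; the function bodies and config keys here are hypothetical, not the app's actual code.

```python
# Minimal sketch of the three pipeline steps chained in order.
# All names and return values are illustrative, not the app's actual code.

def preprocessing(cfg: dict) -> str:
    """Prepare data; return an identifier for the prepared dataset."""
    return f"prepared:{cfg['dataset']}"

def modeling(cfg: dict, data: str) -> str:
    """Train and validate a model on the prepared data."""
    return f"model[{cfg['model']}]({data})"

def deployment(cfg: dict, model: str) -> str:
    """Deploy the trained model to a serving cluster."""
    return f"deployed {model} -> {cfg['cluster']}"

def run_pipeline(cfg: dict) -> str:
    data = preprocessing(cfg)
    model = modeling(cfg, data)
    return deployment(cfg, model)

result = run_pipeline({"dataset": "dataset_1", "model": "model_1", "cluster": "cluster_1"})
```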
Command Architecture
This app has a two-level command architecture under the top-level config group c. The command line arguments for executing each command are as follows:
├── preprocessing
│   ├── foo -> c=preprocessing c/preprocessing_sub=foo
│   └── bar -> c=preprocessing c/preprocessing_sub=bar
├── modeling
│   ├── foo -> c=modeling c/modeling_sub=foo
│   └── bar -> c=modeling c/modeling_sub=bar
├── deployment
│   ├── foo -> c=deployment c/deployment_sub=foo
│   └── bar -> c=deployment c/deployment_sub=bar
└── help -> c=help
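One way to realize this two-level dispatch is a nested lookup table keyed by command and subcommand. The sketch below assumes the config has already been resolved (e.g. by Hydra) into plain strings; the handler names are hypothetical.

```python
# Sketch of two-level command dispatch: the top-level command `c`
# selects a step, and `c/<step>_sub` selects a subcommand within it.
# Handler names and messages are hypothetical.

def make_handler(step: str, sub: str):
    def handler() -> str:
        return f"run {step}'s '{sub}' subcommand"
    return handler

COMMANDS = {
    step: {sub: make_handler(step, sub) for sub in ("foo", "bar")}
    for step in ("preprocessing", "modeling", "deployment")
}

def dispatch(c: str, sub: str) -> str:
    """Look up and execute the subcommand for the given top-level command."""
    return COMMANDS[c][sub]()

msg = dispatch("modeling", "foo")
```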
Prepared Configuration
Here are the configurations prepared for this app (preprocessing, modeling, deployment). The command line arguments for using each configuration are as follows:
├── preprocessing
│   ├── dataset
│   │   ├── dataset_1.yaml -> preprocessing/dataset=dataset_1
│   │   └── dataset_2.yaml -> preprocessing/dataset=dataset_2
│   └── param
│       ├── param_1.yaml -> preprocessing/param=param_1
│       └── param_2.yaml -> preprocessing/param=param_2
├── modeling
│   ├── model
│   │   ├── model_1.yaml -> modeling/model=model_1
│   │   └── model_2.yaml -> modeling/model=model_2
│   └── param
│       ├── param_1.yaml -> modeling/param=param_1
│       └── param_2.yaml -> modeling/param=param_2
└── deployment
    └── cluster
        ├── cluster_1.yaml -> deployment/cluster=cluster_1
        └── cluster_2.yaml -> deployment/cluster=cluster_2
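For illustration, a config file such as preprocessing/dataset/dataset_1.yaml could look like the fragment below. This is a hypothetical sketch inferred from the job output shown later in this README; check the repo for the real files.

```yaml
# Hypothetical sketch of preprocessing/dataset/dataset_1.yaml,
# inferred from the printed job output; the real file lives in the repo.
dataset:
  name: dataset_1
  path: /path/of/dataset/1
```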
How to Install
# Create a new Anaconda environment if needed.
$ conda create --name when_ml_pipeline_meets_hydra python=3.6 -y
$ conda activate when_ml_pipeline_meets_hydra
# Clone this repo.
$ git clone https://github.com/withsmilo/When-ML-pipeline-meets-Hydra.git
$ cd When-ML-pipeline-meets-Hydra
# Install this app.
$ python setup.py develop
$ when_ml_pipeline_meets_hydra --help
ML Pipeline Test
1. First taste
I will construct a new ML pipeline dynamically using all *_1.yaml
configurations and executing the same foo
subcommand for each step. The command you need is simple and structured.
$ when_ml_pipeline_meets_hydra \
preprocessing/dataset=dataset_1 \
preprocessing/param=param_1 \
modeling/model=model_1 \
modeling/param=param_1 \
deployment/cluster=cluster_1 \
c/preprocessing_sub=foo \
c/modeling_sub=foo \
c/deployment_sub=foo \
c=preprocessing,modeling,deployment \
--multirun
[2019-10-13 22:12:22,032] - Launching 3 jobs locally
[2019-10-13 22:12:22,032] - Sweep output dir : .multirun/2019-10-13
[2019-10-13 22:12:22,032] - #0 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_1 deployment/cluster=cluster_1 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=preprocessing
========== Run preprocessing's 'foo' subcommand ==========
dataset:
name: dataset_1
path: /path/of/dataset/1
p_param:
key_1_1: value_1_1
key_1_2: value_1_2
name: param_1
output_path: /path/of/output/path/1
Do something here!
[2019-10-13 22:12:22,175] - #1 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_1 deployment/cluster=cluster_1 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=modeling
========== Run modeling's 'foo' subcommand ==========
model:
input_path: /path/of/input/path/1
name: model_1
output_path: /path/of/output/path/1
m_param:
hyperparam_key_1_1: hyperparam_value_1_1
hyperparam_key_1_2: hyperparam_value_1_2
name: param_1
Do something here!
[2019-10-13 22:12:22,314] - #2 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_1 deployment/cluster=cluster_1 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=deployment
========== Run deployment's 'foo' subcommand ==========
cluster:
id: user_1
name: cluster_1
pw: pw_1
url: https://cluster/1/url
Do something here!
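Under --multirun, Hydra sweeps over comma-separated choices, so c=preprocessing,modeling,deployment expands into three jobs, one per value, which is why the log reports "Launching 3 jobs locally". The expansion can be sketched as follows; this is a simplified illustration, not Hydra's actual implementation.

```python
from itertools import product

# Simplified sketch of how a multirun sweep expands comma-separated
# overrides into one job per combination (not Hydra's actual code).

def expand_sweep(overrides):
    choices = []
    for ov in overrides:
        key, _, values = ov.partition("=")
        choices.append([f"{key}={v}" for v in values.split(",")])
    # Cartesian product over all override choices yields one job each.
    return [list(job) for job in product(*choices)]

jobs = expand_sweep([
    "modeling/model=model_1",
    "c=preprocessing,modeling,deployment",
])
# Three jobs: one per value of `c`.
```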
2. Change hyperparameters and serving cluster for your model.
After that, if you'd like to deploy a model with changed hyperparameter settings to another serving cluster, you can simply change modeling/param
to param_2.yaml
and deployment/cluster
to cluster_2.yaml
before executing your command. That's it!
$ when_ml_pipeline_meets_hydra \
preprocessing/dataset=dataset_1 \
preprocessing/param=param_1 \
modeling/model=model_1 \
modeling/param=param_2 \
deployment/cluster=cluster_2 \
c/preprocessing_sub=foo \
c/modeling_sub=foo \
c/deployment_sub=foo \
c=preprocessing,modeling,deployment \
--multirun
[2019-10-13 22:13:13,898] - Launching 3 jobs locally
[2019-10-13 22:13:13,898] - Sweep output dir : .multirun/2019-10-13
[2019-10-13 22:13:13,898] - #0 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=preprocessing
========== Run preprocessing's 'foo' subcommand ==========
dataset:
name: dataset_1
path: /path/of/dataset/1
p_param:
key_1_1: value_1_1
key_1_2: value_1_2
name: param_1
output_path: /path/of/output/path/1
Do something here!
[2019-10-13 22:13:14,040] - #1 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=modeling
========== Run modeling's 'foo' subcommand ==========
model:
input_path: /path/of/input/path/1
name: model_1
output_path: /path/of/output/path/1
m_param:
hyperparam_key_2_1: hyperparam_value_2_1
hyperparam_key_2_2: hyperparam_value_2_2
name: param_2
Do something here!
[2019-10-13 22:13:14,179] - #2 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=deployment
========== Run deployment's 'foo' subcommand ==========
cluster:
id: user_2
name: cluster_2
pw: pw_3 # For testing purposes, assume that this data is wrong
url: https://cluster/2/url
Do something here!!
3. Fix wrong configuration dynamically.
Oops. You found a wrong configuration ("pw": "pw_3"
) and want to fix it quickly. To do this, you only need to add cluster.pw=pw_2
to your command line.
$ when_ml_pipeline_meets_hydra \
preprocessing/dataset=dataset_1 \
preprocessing/param=param_1 \
modeling/model=model_1 \
modeling/param=param_2 \
deployment/cluster=cluster_2 \
cluster.pw=pw_2 \
c/preprocessing_sub=foo \
c/modeling_sub=foo \
c/deployment_sub=foo \
c=preprocessing,modeling,deployment \
--multirun
[2019-10-13 22:13:43,246] - Launching 3 jobs locally
[2019-10-13 22:13:43,246] - Sweep output dir : .multirun/2019-10-13
[2019-10-13 22:13:43,246] - #0 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=preprocessing cluster.pw=pw_2
========== Run preprocessing's 'foo' subcommand ==========
dataset:
name: dataset_1
path: /path/of/dataset/1
p_param:
key_1_1: value_1_1
key_1_2: value_1_2
name: param_1
output_path: /path/of/output/path/1
Do something here!
[2019-10-13 22:13:43,391] - #1 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=modeling cluster.pw=pw_2
========== Run modeling's 'foo' subcommand ==========
model:
input_path: /path/of/input/path/1
name: model_1
output_path: /path/of/output/path/1
m_param:
hyperparam_key_2_1: hyperparam_value_2_1
hyperparam_key_2_2: hyperparam_value_2_2
name: param_2
Do something here!
[2019-10-13 22:13:43,531] - #2 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=deployment cluster.pw=pw_2
========== Run deployment's 'foo' subcommand ==========
cluster:
id: user_2
name: cluster_2
pw: pw_2
url: https://cluster/2/url
Do something here!
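The cluster.pw=pw_2 override works because the dotted key is merged into the composed config after the config groups are loaded (Hydra delegates this to OmegaConf). A simplified sketch of that merge on a plain nested dict, for illustration only:

```python
# Simplified sketch of applying a dotted-key override such as
# `cluster.pw=pw_2` to a nested config dict (illustrative only;
# Hydra/OmegaConf handle this for the real app).

def apply_override(cfg, override):
    dotted, _, value = override.partition("=")
    keys = dotted.split(".")
    node = cfg
    # Walk down to the parent of the target key, creating nodes if needed.
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return cfg

cfg = {"cluster": {"name": "cluster_2", "pw": "pw_3", "url": "https://cluster/2/url"}}
apply_override(cfg, "cluster.pw=pw_2")
```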
Beyond these scenarios, you can think of many other use cases.
Note
This project has been set up using PyScaffold 3.2.2. For details and usage information on PyScaffold see https://pyscaffold.org/.
License
This app is licensed under the MIT License.