PeterFogh / dvc_dask_use_case

Licence: other
A use case of a reproducible machine learning pipeline using Dask, DVC, and MLflow.

DVC and Dask use case

This repository contains the description and code for setting up DVC to use a remote compute server via Dask. Note that this use case relies on the original DVC tutorial and its code, found at https://dvc.org/doc/tutorial.

How to set up the use case

Prerequisites

The use case has the following prerequisites:

  1. A remote server with:
    1. SSH installed.
    2. A unix user you have the username and password for.
    3. A folder for your remote shared DVC cache; mine is at /scratch/dvc_project_cache/.
    4. A folder for your remote DVC data directories; mine is at /scratch/dvc_users/[REMOTE_USERNAME]/.
    5. A Dask scheduler installed and running at port 8786, see http://docs.dask.org/en/latest/setup.html for a guide.
    6. An MLflow tracking server installed and running at host 0.0.0.0 and port 5000, started with mlflow server --host 0.0.0.0 --file-store /projects/mlflow_runs/.
  2. A local SSH keyfile (created with ssh-keygen), which has been copied to the remote server with ssh-copy-id [REMOTE_USERNAME]@[REMOTE_IP].
  3. An open SSH port-forward to the Dask scheduler and MLflow tracking server from your local machine to the remote server, with ssh -L 8786:localhost:8786 -L 5000:localhost:5000 [REMOTE_USERNAME]@[REMOTE_IP] (here localhost is resolved on the remote server).
  4. A local DVC development repository (set up following https://dvc.org/doc/user-guide/contributing/) with a conda environment:
    1. Fork https://github.com/iterative/dvc on GitHub.
    2. git clone git@github.com:<GITHUB_USERNAME>/dvc.git
    3. cd dvc
    4. conda create -n py36_open_source_dvc python=3.6
    5. conda activate py36_open_source_dvc
    6. pip install -r requirements.txt
    7. pip install -r tests/requirements.txt
    8. pip install -e .
    9. pip install pre-commit
    10. pre-commit install
    11. Verify the installation: which dvc should print [HOME]/anaconda3/envs/py36_open_source_dvc/bin/dvc and dvc --version should print the exact version available in your local DVC development repository.
  5. DVC configured globally (using the --global flag) on your local machine - note that I call my remote server "ahsoka":
    1. conda activate py36_open_source_dvc
    2. dvc remote add ahsoka ssh://[REMOTE_IP]/ --global
    3. dvc remote modify ahsoka user [REMOTE_USERNAME] --global
    4. dvc remote modify ahsoka port 22 --global
    5. dvc remote modify ahsoka keyfile [PATH_TO_YOUR_PRIVATE_SSH_KEY] --global
    6. dvc remote add ahsoka_user_workspace remote://ahsoka/scratch/dvc_users/[REMOTE_USERNAME]/ --global
    • These globally configured DVC remotes are referenced by the DVC config file in the Git repository (see .dvc/config) to specify project-specific remotes for the DVC cache and the DVC data workspace.
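
Before running the pipeline, it can help to confirm that the SSH port-forward from prerequisite 3 is actually open. The following is a minimal sketch using only the Python standard library; the function names port_is_open and check_forwarded_ports are hypothetical and not part of this repository. It only checks that something accepts TCP connections on the forwarded ports, not that the Dask scheduler or MLflow server behind them is healthy.

```python
import socket

# Ports taken from the prerequisites above: Dask scheduler and MLflow server.
FORWARDED_PORTS = {"Dask scheduler": 8786, "MLflow tracking server": 5000}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_forwarded_ports(host="localhost", ports=FORWARDED_PORTS):
    """Map each service name to whether its forwarded port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling check_forwarded_ports() on the local machine returns a dictionary such as {"Dask scheduler": True, "MLflow tracking server": True} when both forwards are open; a False value suggests that the ssh -L session is not running.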

Use case

This use case of DVC and Dask has been set up as follows.

On your local machine do the following:

  1. Clone this test repository from my GitHub: git clone git@github.com:PeterFogh/dvc_dask_use_case.git
  2. Install the Conda environment for this repository - note that the new environment must point to your local DVC development repository:
    1. conda env create -f conda_env.yml, which was created by the following commands (executed on 26-04-2019):
      1. conda create --name py36_open_source_dvc_dask_use_case --clone py36_open_source_dvc
      2. conda install -n py36_open_source_dvc_dask_use_case dask scikit-learn
      3. conda activate py36_open_source_dvc_dask_use_case && pip install mlflow matplotlib
      4. conda env export -n py36_open_source_dvc_dask_use_case > conda_env.yml
    2. Check that the dvc version matches your development repository version: conda activate py36_open_source_dvc && which dvc && dvc --version and conda activate py36_open_source_dvc_dask_use_case && which dvc && dvc --version
  3. Reproduce the DVC pipeline: dvc repro - the pipeline has been specified by the following DVC stages:
    1. conda activate py36_open_source_dvc_dask_use_case
    2. dvc run -d download_xml.py -d conf.py -o remote://ahsoka_project_data/download_xml/ -f download_xml.dvc python download_xml.py
    3. dvc run -d xml_to_tsv.py -d conf.py -d remote://ahsoka_project_data/download_xml/ -o remote://ahsoka_project_data/xml_to_tsv/ -f xml_to_tsv.dvc python xml_to_tsv.py
    4. dvc run -d split_train_test.py -d conf.py -d remote://ahsoka_project_data/xml_to_tsv/ -o remote://ahsoka_project_data/split_train_test/ -f split_train_test.dvc python split_train_test.py
    5. dvc run -d featurization.py -d conf.py -d remote://ahsoka_project_data/split_train_test/ -o remote://ahsoka_project_data/featurization/ -f featurization.dvc python featurization.py
    6. dvc run -d train_model.py -d conf.py -d remote://ahsoka_project_data/featurization/ -o remote://ahsoka_project_data/train_model/ -f train_model.dvc python train_model.py
    7. dvc run -d evaluate.py -d conf.py -d remote://ahsoka_project_data/featurization/ -d remote://ahsoka_project_data/train_model/ -o remote://ahsoka_project_data/evaluate/ -m eval.txt -f Dvcfile python evaluate.py
  4. Show the DVC metrics: dvc metrics show -a
  5. Visit the MLflow tracking server web UI from your local browser at http://localhost:5000/ to see the results of the pipeline.
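
The -d and -o flags in the dvc run commands above define a dependency graph between the stages, which is what lets dvc repro work out a valid execution order. The sketch below only illustrates that idea with the Python standard library (graphlib, Python 3.9+); it is not how DVC itself is implemented. The stage names are shorthand for the stage files listed above ("evaluate" stands for the Dvcfile stage), and execution_order is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages whose outputs it reads (its -d remote:// inputs).
STAGE_DEPENDENCIES = {
    "download_xml": set(),
    "xml_to_tsv": {"download_xml"},
    "split_train_test": {"xml_to_tsv"},
    "featurization": {"split_train_test"},
    "train_model": {"featurization"},
    "evaluate": {"featurization", "train_model"},
}


def execution_order(dependencies):
    """Return the stages in an order where every stage follows its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, stage in enumerate(execution_order(STAGE_DEPENDENCIES), start=1):
        print(position, stage)
```

Because every stage here depends on the one before it, only one order is possible: download_xml, xml_to_tsv, split_train_test, featurization, train_model, evaluate.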

Problems with MLflow for the use case

  • MLflow artifacts do not support our SSH setup. mlflow.log_artifacts() does not support files saved on the remote server. Artifact files must be located in a directory shared by both the client machine and the server, using the methods described here. Read mlflow/mlflow#572 (comment) for more details on the problem. However, we can circumvent this problem by using Dask to execute the MLflow run on the remote server. That way, both the client and the MLflow tracking server have no problem reading and writing to the same folder, as they are executed on the same machine.
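
The workaround described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in this repository: the names log_run_on_server, run_on_cluster, and main are hypothetical, as are the artifact directory and the logged parameter. It also assumes that the Dask worker runs on the same machine as the MLflow tracking server, so http://localhost:5000 is resolved on the remote server.

```python
def log_run_on_server(tracking_uri, artifact_dir, params):
    """Executed on a Dask worker, i.e. on the remote server itself."""
    import mlflow  # imported on the worker, where MLflow must be installed

    mlflow.set_tracking_uri(tracking_uri)
    with mlflow.start_run():
        for key, value in params.items():
            mlflow.log_param(key, value)
        # artifact_dir is a path on the remote server, so both the MLflow
        # client (this function) and the tracking server can read it.
        mlflow.log_artifacts(artifact_dir)


def run_on_cluster(client, func, *args):
    """Submit func to the Dask cluster and block until it has finished."""
    future = client.submit(func, *args)
    return future.result()


def main():
    from dask.distributed import Client

    # The scheduler is reached through the SSH port-forward on port 8786.
    client = Client("localhost:8786")
    run_on_cluster(
        client,
        log_run_on_server,
        "http://localhost:5000",  # assumption: worker and MLflow share a host
        "/scratch/dvc_users/[REMOTE_USERNAME]/evaluate",  # hypothetical path
        {"stage": "evaluate"},  # hypothetical parameter
    )
```

Calling main() from the local machine sends the function to the remote server through the forwarded scheduler port, so log_artifacts() runs where the files actually live.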