KubeDL
KubeDL is short for Kubernetes-Deep-Learning. It is a unified operator that supports running multiple types of distributed deep learning/machine learning workloads on Kubernetes.
Currently, KubeDL supports the following ML/DL jobs:
Features
- Support running prevalent ML/DL workloads in a single operator while maintaining API compatibility with Kubeflow job operators.
- Support running jobs with custom artifacts downloaded from a remote repository such as GitHub, saving users from manually baking the artifacts into the image.
- Instrumented with unified Prometheus metrics for different types of DL jobs, such as job launch delay and the number of pending/running jobs.
- Support job metadata persistence with a pluggable storage backend such as MySQL.
- Show more granular job status information on the kubectl command line.
- Enable specific job types, either automatically based on the installed CRDs or explicitly through startup flags.
- Support advanced scheduling features such as gang scheduling with pluggable backend schedulers.
- A modular architecture that can be easily extended for more types of DL/ML workloads with shared libraries; see how to add a custom job workload.
- [Work-in-progress] Provide a dashboard for monitoring the jobs' lifecycle and stats.
Getting started
You can deploy KubeDL using a single Helm command or just YAML files.
Deploy KubeDL using Helm
KubeDL can be deployed with a single command using the Helm chart:
helm install kubedl ./helm/kubedl
You can override the default values defined in ./helm/kubedl/values.yaml with the --set flag, for example:
helm install kubedl ./helm/kubedl --set kubedlSysNamespace=kube-system --set resources.requests.cpu=1024m --set resources.requests.memory=2Gi
Helm renders the templates and applies them to the cluster. Run the command above from the repository root and you are ready to go.
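When several values are overridden, the same settings can be kept in a values file and passed with Helm's -f flag instead of repeating --set. The sketch below only mirrors the three keys from the --set example above; the file name kubedl-overrides.yaml and the helper write_overrides are hypothetical.

```shell
#!/bin/sh
# Sketch: write the overrides from the --set example into a values file.
# write_overrides is a hypothetical helper, not part of KubeDL.
write_overrides() {
  # $1: path of the values file to create
  cat > "$1" <<'EOF'
kubedlSysNamespace: kube-system
resources:
  requests:
    cpu: 1024m
    memory: 2Gi
EOF
}

# Usage (from the repository root):
#   write_overrides kubedl-overrides.yaml
#   helm install kubedl ./helm/kubedl -f kubedl-overrides.yaml
```

Values passed with --set still take precedence over values from a file, so both forms can be combined.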
Deploy KubeDL using YAML File
Install CRDs
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/kubeflow.org_pytorchjobs.yaml
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/kubeflow.org_tfjobs.yaml
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/xdl.kubedl.io_xdljobs.yaml
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/xgboostjob.kubeflow.org_xgboostjobs.yaml
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/kubedl.io_marsjobs.yaml
Install KubeDL operator
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/manager/all_in_one.yaml
The official KubeDL operator image is hosted on Docker Hub.
Optional: Enable job type selectively
If you only need some job types and want to disable the others, you can use any of the following three options, alone or in combination:
- [DEFAULT] Install only the CRDs you need; KubeDL automatically enables the corresponding workload controllers. You can also set --workloads auto or WORKLOADS_ENABLE=auto explicitly. This is the default approach.
- Set the WORKLOADS_ENABLE environment variable in the KubeDL container. The value is a list of job types to enable. For example, WORKLOADS_ENABLE=TFJob,PytorchJob means only the TFJob and PytorchJob workloads are enabled; the others are disabled.
- Set the --workloads startup argument in the KubeDL container command args. The value is a list of job types to enable, such as TFJob,PytorchJob.
Check the documents for a full list of operator startup flags.
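As a rough sketch of the environment-variable option, the snippet below builds the comma-separated value and prints a kubectl set env command for review rather than running it. The namespace and deployment names are placeholders (assumptions, not taken from the KubeDL manifests); replace them with the names used in your installation.

```shell
#!/bin/sh
# Placeholder names (assumptions): set these to match your installation.
NAMESPACE="${NAMESPACE:-my-namespace}"
DEPLOYMENT="${DEPLOYMENT:-my-kubedl-deployment}"

# join_workloads: join job type names with commas, as WORKLOADS_ENABLE expects.
join_workloads() {
  (IFS=,; printf '%s\n' "$*")
}

# enable_cmd: print (not run) the command that would set WORKLOADS_ENABLE.
enable_cmd() {
  printf 'kubectl set env deployment/%s -n %s WORKLOADS_ENABLE=%s\n' \
    "$DEPLOYMENT" "$NAMESPACE" "$(join_workloads "$@")"
}

enable_cmd TFJob PytorchJob
```

Printing the command first makes it easy to check the target deployment before changing the operator's configuration.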
Run an Example Job
This example demonstrates how to run a simple MNIST TensorFlow job with KubeDL.
Submit the TFJob
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/example/tf/tf_job_mnist.yaml
Monitor the status of the TensorFlow job
kubectl get tfjobs -n kubedl
kubectl describe tfjob mnist -n kubedl
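In scripts, polling with kubectl get can be replaced by kubectl wait. The sketch below assumes the TFJob reports a kubeflow-style Succeeded condition in its status, which is worth confirming with the describe output above; wait_for_tfjob is a hypothetical helper and the 600s default timeout is an arbitrary choice.

```shell
#!/bin/sh
# KUBECTL can be overridden, e.g. to point at a wrapper script.
KUBECTL="${KUBECTL:-kubectl}"

# wait_for_tfjob NAME NAMESPACE [TIMEOUT]
# Blocks until the job reports the Succeeded condition or the timeout expires.
# Assumption: the TFJob status exposes a condition of type "Succeeded".
wait_for_tfjob() {
  "$KUBECTL" wait --for=condition=Succeeded "tfjob/$1" -n "$2" \
    --timeout="${3:-600s}"
}

# Usage:
#   wait_for_tfjob mnist kubedl 10m
```

Note that a job that fails never reaches the Succeeded condition, so in that case the command only returns, with a non-zero exit status, once the timeout expires.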
Delete the job
kubectl delete tfjob mnist -n kubedl
Workload types
Supported workload types are tfjob, pytorchjob, and marsjob, e.g.:
kubectl get tfjob
KubeDL Metrics
Check the documents for the Prometheus metrics supported by the KubeDL operator.
Download Artifacts from Remote Repository
KubeDL supports running jobs with custom artifacts downloaded dynamically from a remote repository, saving users from rebuilding the image to include the artifacts. Currently, only GitHub is supported. The framework is pluggable and can easily support other repositories such as HDFS. Check the documents for details.
Tutorial
Job Dashboard
A dashboard for monitoring the jobs' lifecycle and stats is currently in progress. The dashboard also provides convenient job operations, including job creation, termination, and deletion. See the demo below.
Developer Guide
Build the controller manager binary
make manager
Run the tests
make test
Generate manifests, e.g. CRD and RBAC YAML files
make manifests
Build the docker image
export IMG=<your_image_name> && make docker-build
Push the image
docker push <your_image_name>
To develop/debug KubeDL controller manager locally, please check the debug guide.
Community
If you have any questions or want to contribute, GitHub issues or pull requests are warmly welcome. You can also contact us via the following channels:
- DingTalk Group (钉钉讨论群)