douban / Tfmesos

Licence: bsd-3-clause
Tensorflow in Docker on Mesos #tfmesos #tensorflow #mesos

TFMesos

.. image:: https://badges.gitter.im/douban/tfmesos.svg
   :alt: Join the chat at https://gitter.im/douban/tfmesos
   :target: https://gitter.im/douban/tfmesos?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge

.. image:: https://img.shields.io/travis/douban/tfmesos.svg
   :target: https://travis-ci.org/douban/tfmesos/

.. image:: https://img.shields.io/pypi/v/tfmesos.svg
   :target: https://pypi.python.org/pypi/tfmesos

.. image:: https://img.shields.io/docker/automated/tfmesos/tfmesos.svg
   :target: https://hub.docker.com/r/tfmesos/tfmesos/

``TFMesos`` is a lightweight framework that helps run distributed `TensorFlow <https://www.tensorflow.org>`_ machine learning tasks on `Apache Mesos <http://mesos.apache.org>`_ within `Docker <https://www.docker.com>`_ and `Nvidia-Docker <https://github.com/NVIDIA/nvidia-docker/>`_.

``TFMesos`` dynamically allocates resources from a Mesos cluster, builds a distributed TensorFlow training cluster, and uses Docker to keep different training tasks managed and isolated within the shared Mesos cluster.

Prerequisites

  • For Mesos >= 1.0.0:
  1. A Mesos cluster (see `Mesos Getting Started <http://mesos.apache.org/documentation/latest/getting-started>`_). All nodes in the cluster should be reachable by hostname, and all nodes should have identical ``/etc/passwd`` and ``/etc/group`` files.

  2. Set up the Mesos agent to enable the `Mesos Containerizer <http://mesos.apache.org/documentation/container-image/>`_ and `Mesos Nvidia GPU Support <https://issues.apache.org/jira/browse/MESOS-4626>`_ (optional), e.g. ``mesos-agent --containerizers=mesos --image_providers=docker --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia``

  3. (optional) A distributed filesystem (e.g. `MooseFS <https://moosefs.com>`_)

  4. Ensure that the latest TFMesos Docker image (`tfmesos/tfmesos <https://hub.docker.com/r/tfmesos/tfmesos/>`_) is pulled on every node in the cluster

  • For Mesos < 1.0.0:
  1. A Mesos cluster (see `Mesos Getting Started <http://mesos.apache.org/documentation/latest/getting-started>`_). All nodes in the cluster should be reachable by hostname, and all nodes should have identical ``/etc/passwd`` and ``/etc/group`` files.

  2. Docker (see `Docker Get Start Tutorial <https://docs.docker.com/engine/installation/linux/>`_)

  3. Mesos Docker Containerizer support (see `Mesos Docker Containerizer <http://mesos.apache.org/documentation/latest/docker-containerizer/>`_)

  4. (optional) Install Nvidia-docker (see `Nvidia-docker installation <https://github.com/NVIDIA/nvidia-docker/wiki/Installation>`_) and make sure ``nvidia-plugin`` is accessible from remote hosts (with ``-l 0.0.0.0:3476``)

  5. (optional) A distributed filesystem (e.g. `MooseFS <https://moosefs.com>`_)

  6. Ensure that the latest TFMesos Docker image (`tfmesos/tfmesos <https://hub.docker.com/r/tfmesos/tfmesos/>`_) is pulled on every node in the cluster

If you are using an AWS G2 instance, here is a `sample <https://github.com/douban/tfmesos/blob/master/misc/setup-aws-g2.sh>`_ script to set up most of these prerequisites.
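The hostname requirement in the prerequisites can be spot-checked before launching anything. The following is a minimal Python sketch, not part of TFMesos; the function name ``unresolved_hosts`` is hypothetical. It only checks that each name resolves from the machine it runs on, and it does not compare ``/etc/passwd`` or ``/etc/group`` across nodes.

```python
import socket
from typing import Callable, Iterable, List


def unresolved_hosts(
    hostnames: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Return the hostnames that cannot be resolved from this machine.

    `resolve` defaults to a real DNS/hosts-file lookup; it is a parameter
    so the check can be exercised without touching the network.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            failed.append(name)
    return failed
```

Calling ``unresolved_hosts(["mesos-master", "mesos-agent-1"])`` (hypothetical node names) on each node returns the names that node cannot resolve; an empty list means every name resolved.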

Running a simple test

After setting up Mesos and pulling the Docker image on a single node (or a cluster), you should be able to run a simple test with the following command.

.. code:: bash

    $ docker run -e MESOS_MASTER=mesos-master:5050 \
        -e DOCKER_IMAGE=tfmesos/tfmesos \
        --net=host \
        -v /path-to-your-tfmesos-code/tfmesos/examples/plus.py:/tmp/plus.py \
        --rm \
        -it \
        tfmesos/tfmesos \
        python /tmp/plus.py mesos-master:5050

Successfully running the test should result in an output of 42 on the console.

Running in replica mode

This mode is called "Between-graph replication" in the official `Distributed TensorFlow Howto <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/distributed/index.md#replicated-training>`_.

Most distributed training models that Google has open sourced (such as `mnist_replica <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py>`_ and `inception <https://github.com/tensorflow/models/blob/master/inception/inception/inception_distributed_train.py>`_) use this mode. In this mode, two kinds of jobs are defined, named 'ps' and 'worker'. 'ps' tasks act as parameter servers, and 'worker' tasks run the actual training process.
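The two job types can be pictured as a mapping from job name to task addresses, which is the dictionary shape that TensorFlow's ``tf.train.ClusterSpec`` accepts. The sketch below builds such a mapping in plain Python without importing TensorFlow; the helper name ``build_cluster_spec``, the comma-separated input format, and the ``host:port`` values are assumptions made for illustration.

```python
from typing import Dict, List


def build_cluster_spec(ps_hosts: str, worker_hosts: str) -> Dict[str, List[str]]:
    """Map each job name to its list of "host:port" task addresses.

    Both arguments are assumed to be comma-separated "host:port" strings.
    """
    def split(hosts: str) -> List[str]:
        return [h.strip() for h in hosts.split(",") if h.strip()]

    return {"ps": split(ps_hosts), "worker": split(worker_hosts)}


# Hypothetical addresses: one parameter server and two workers.
spec = build_cluster_spec("node-a:2222", "node-b:2222,node-c:2222")
for job_name, tasks in spec.items():
    for task_index, address in enumerate(tasks):
        # A task is identified by its job name and its position in the list.
        print(f"/job:{job_name}/task:{task_index} -> {address}")
```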

Here we use our modified `mnist_replica <https://github.com/douban/tfmesos/blob/master/examples/mnist/mnist_replica.py>`_ as an example:

  1. Check out the MNIST example code into a directory on a shared filesystem, e.g. ``/nfs/mnist``
  2. Assume the Mesos master is ``mesos-master:5050``
  3. Now we can launch this script using the following commands:

CPU:

.. code:: bash

    $ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
                 --net=host \
                 -v /nfs/mnist:/nfs/mnist \
                 -v /etc/passwd:/etc/passwd:ro \
                 -v /etc/group:/etc/group:ro \
                 -u `id -u` \
                 -w /nfs/mnist \
                 tfmesos/tfmesos \
                 tfrun -w 1 -s 1 \
                 -V /nfs/mnist:/nfs/mnist \
                 -- python mnist_replica.py \
                 --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
                 --job_name {job_name} --worker_index {task_index}

GPU (1 GPU per worker):

.. code:: bash

    $ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
                 --net=host \
                 -v /nfs/mnist:/nfs/mnist \
                 -v /etc/passwd:/etc/passwd:ro \
                 -v /etc/group:/etc/group:ro \
                 -u `id -u` \
                 -w /nfs/mnist \
                 tfmesos/tfmesos \
                 tfrun -w 1 -s 1 -Gw 1 -- python mnist_replica.py \
                 --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
                 --job_name {job_name} --worker_index {task_index}

Note:

In this mode, ``tfrun`` prepares the cluster and launches the training script on each node; worker #0 (the chief worker) is launched in the local container. ``tfrun`` substitutes ``{ps_hosts}``, ``{worker_hosts}``, ``{job_name}``, and ``{task_index}`` with the corresponding values for each task.
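The placeholder substitution described in this note can be illustrated with Python's built-in ``str.format``. This is only a sketch of the idea, not the actual ``tfrun`` implementation; the function name ``render_command`` and the ``host:port`` values are hypothetical, and joining the host lists with commas is an assumption.

```python
from typing import Dict, List


def render_command(template: str, cluster: Dict[str, List[str]],
                   job_name: str, task_index: int) -> str:
    """Fill the four placeholders for one task of the cluster."""
    return template.format(
        ps_hosts=",".join(cluster["ps"]),
        worker_hosts=",".join(cluster["worker"]),
        job_name=job_name,
        task_index=task_index,
    )


template = (
    "python mnist_replica.py "
    "--ps_hosts {ps_hosts} --worker_hosts {worker_hosts} "
    "--job_name {job_name} --worker_index {task_index}"
)
# Hypothetical addresses for one parameter server and two workers.
cluster = {"ps": ["node-a:2222"], "worker": ["node-b:2222", "node-c:2222"]}

# Every task receives the same host lists but its own job name and index.
for job_name, tasks in cluster.items():
    for task_index in range(len(tasks)):
        print(render_command(template, cluster, job_name, task_index))
```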

Running in fine-grained mode

This mode is called "In-graph replication" in the official `Distributed TensorFlow Howto <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/distributed/index.md#replicated-training>`_.

In this mode, we have more control over the cluster spec. All nodes in the cluster are remote and only run a gRPC server. Each worker is driven by a local thread that runs the training task.
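The "one local thread per remote worker" arrangement can be sketched with Python's ``threading`` module. The code below is an illustration only: ``run_training_steps`` is a hypothetical stand-in for code that would open a session against a worker's gRPC target and run training steps, and the target addresses are made up.

```python
import threading
from typing import Dict, List


def run_training_steps(target: str, steps: int, results: Dict[str, int],
                       lock: threading.Lock) -> None:
    """Stand-in for driving one remote worker; it only counts steps."""
    done = 0
    for _ in range(steps):
        done += 1  # a real driver would run one training step on `target`
    with lock:
        results[target] = done


def drive_workers(targets: List[str], steps: int) -> Dict[str, int]:
    """Start one local thread per remote worker and wait for all of them."""
    results: Dict[str, int] = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=run_training_steps,
                         args=(t, steps, results, lock))
        for t in targets
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical gRPC targets of two remote workers.
print(drive_workers(["grpc://node-b:2222", "grpc://node-c:2222"], steps=3))
```

The remote nodes stay passive in this picture: all control flow lives in the local process, which is what gives this mode finer control over the cluster.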

Here we use our modified `mnist <https://github.com/douban/tfmesos/blob/master/examples/mnist/mnist.py>`_ as an example:

  1. Check out the MNIST example code into a directory, e.g. ``/tmp/mnist``
  2. Assume the Mesos master is ``mesos-master:5050``
  3. Now we can launch this script using the following commands:

CPU:

.. code:: bash

    $ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
                 --net=host \
                 -v /tmp/mnist:/tmp/mnist \
                 -v /etc/passwd:/etc/passwd:ro \
                 -v /etc/group:/etc/group:ro \
                 -u `id -u` \
                 -w /tmp/mnist \
                 tfmesos/tfmesos \
                 python mnist.py

GPU (1 GPU per worker):

.. code:: bash

    $ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
                 --net=host \
                 -v /tmp/mnist:/tmp/mnist \
                 -v /etc/passwd:/etc/passwd:ro \
                 -v /etc/group:/etc/group:ro \
                 -u `id -u` \
                 -w /tmp/mnist \
                 tfmesos/tfmesos \
                 python mnist.py --worker-gpus 1