
feifeibear / Distributed-ResNet-Tensorflow

Licence: other
A distributed ResNet on multiple machines, each with one GPU card.

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives to or similar to Distributed-ResNet-Tensorflow

Dweb.page
Your Gateway to the Distributed Web
Stars: ✭ 239 (+1095%)
Mutual labels:  distributed
itc.lua
A Lua implementation of Interval Tree Clocks
Stars: ✭ 21 (+5%)
Mutual labels:  distributed
spicedb
Open Source, Google Zanzibar-inspired fine-grained permissions database
Stars: ✭ 3,358 (+16690%)
Mutual labels:  distributed
Spring Boot Start Current
A Spring Boot scaffold with MyBatis, Spring Security, JWT permissions, and Spring Cache + Redis
Stars: ✭ 246 (+1130%)
Mutual labels:  distributed
Ray
An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.
Stars: ✭ 18,547 (+92635%)
Mutual labels:  distributed
CNN-models
YOLO-v2, ResNet-32, GoogLeNet-lite
Stars: ✭ 32 (+60%)
Mutual labels:  resnet
Brainiak
Brain Imaging Analysis Kit
Stars: ✭ 232 (+1060%)
Mutual labels:  distributed
mxnet-retrain
Create an MXNet finetuner (retrain) for mac/linux; no need to install Docker; supports CPU and GPU (eGPU/cuDNN); supports Inception, ResNet, SqueezeNet, MobileNet...
Stars: ✭ 32 (+60%)
Mutual labels:  resnet
Tensorflow
An Open Source Machine Learning Framework for Everyone
Stars: ✭ 161,335 (+806575%)
Mutual labels:  distributed
MQBench Quantize
QAT (quantization-aware training) for classification with MQBench
Stars: ✭ 29 (+45%)
Mutual labels:  resnet
Shardingsphere Elasticjob Cloud
Stars: ✭ 248 (+1140%)
Mutual labels:  distributed
Cat
CAT is a foundational component for server-side projects, providing clients in Java, C/C++, Node.js, Python, Go, and other languages. It is deeply integrated into Meituan-Dianping's infrastructure middleware (MVC framework, RPC framework, database framework, cache framework, message queues, configuration system, etc.), providing rich performance metrics, health status, and real-time alerting for Meituan-Dianping's business lines.
Stars: ✭ 16,236 (+81080%)
Mutual labels:  distributed
Multi-Node-TimescaleDB
The multi-node setup of TimescaleDB 🐯🐯🐯 🐘 🐯🐯🐯
Stars: ✭ 42 (+110%)
Mutual labels:  distributed
Powerjob
Enterprise job scheduling middleware with distributed computing ability.
Stars: ✭ 3,231 (+16055%)
Mutual labels:  distributed
osilo
Personal data silos with secure sharing
Stars: ✭ 15 (-25%)
Mutual labels:  distributed
Flambe
An ML framework to accelerate research and its path to production.
Stars: ✭ 236 (+1080%)
Mutual labels:  distributed
celery-monitor
A Celery monitor app written with Django.
Stars: ✭ 92 (+360%)
Mutual labels:  distributed
tool-db
A peer-to-peer decentralized database
Stars: ✭ 15 (-25%)
Mutual labels:  distributed
majordodo
Distributed Operations and Data Organizer built on Apache BookKeeper
Stars: ✭ 25 (+25%)
Mutual labels:  distributed
webhunger
WebHunger is an extensible, full-scale crawler framework that supports distributed crawling, aiming to let users focus on web page parsing without worrying about the crawling process.
Stars: ✭ 17 (-15%)
Mutual labels:  distributed

Distributed ResNet on the CIFAR and ImageNet datasets.

This repo contains code for distributed ResNet training and scripts to submit distributed tasks on a Slurm system, specific to multiple machines each having one GPU card. I use the official ResNet model provided by Google and wrap it with my distributed code using SyncReplicasOptimizer. Some modifications to the official r1.4 model were made to fit TensorFlow r1.3.
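For reference, a minimal TensorFlow 1.x sketch of this kind of optimizer wrapping; the variables (num_workers, w, loss) are illustrative stand-ins, not the repo's actual model code:

import tensorflow as tf

num_workers = 4  # assumed number of training workers
global_step = tf.train.get_or_create_global_step()
w = tf.Variable(1.0)
loss = tf.square(w)  # stand-in for the ResNet loss
opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=num_workers,  # wait for gradients from all workers
    total_num_replicas=num_workers)
train_op = opt.minimize(loss, global_step=global_step)
# Each worker creates this hook; the chief also initializes the sync queues.
sync_hook = opt.make_session_run_hook(is_chief=True)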

Problem encountered:

I met the same problem with SyncReplicasOptimizer as mentioned in:

a GitHub issue

a Stack Overflow question

If you have any idea how to fix this problem, please contact the author: Jiarui Fang ([email protected])

Results with this code:

  1. CIFAR-10: global batch size = 128; evaluation results on the test data are as follows.

A. One CPU host with 4 Titan Xp GPUs:

| CIFAR-10 Model | Horovod Best Precision | #nodes | Steps | Speed (steps/sec) |
| --- | --- | --- | --- | --- |
| 50 layer | 93.3% | 4 | ~90k | 21.82 |

B. Each node has one P100 GPU:

| CIFAR-10 Model | TF Best Precision | PS-WK | Steps | Speed (steps/sec) | Horovod Best Prec. | #nodes | Speed (steps/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 50 layer | 93.6% | local | ~80k | 13.94 | - | - | - |
| 50 layer | 85.2% | 1ps-1wk | ~80k | 10.19 | - | - | - |
| 50 layer | 86.4% | 2ps-4wk | ~80k | 20.3 | - | - | - |
| 50 layer | 87.3% | 4ps-8wk | ~60k | 19.19 | - | 8 | 28.66 |

The best evaluation precisions are illustrated in the following picture. Jumps in the curves are due to restarting evaluation from a checkpoint, which loses the previous best-precision values and shows up as sudden drops in the curves. (Figure: best evaluation precision curves.)
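The mechanism behind the drops, as an illustrative Python sketch (not the repo's eval code; evaluate_checkpoint is a hypothetical stand-in): the best precision lives in a process-local variable, so a restart resets it to zero.

best_precision = 0.0  # reset whenever the eval process restarts
def evaluate_checkpoint():
    """Hypothetical stand-in for evaluating the latest checkpoint."""
    return 0.9
for _ in range(3):  # one pass per new checkpoint
    precision = evaluate_checkpoint()
    best_precision = max(best_precision, precision)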

The distributed versions get lower evaluation accuracy than the results provided in TensorFlow Model Research.

  2. ImageNet: we set the global batch size to 128 × 8 = 1024, following the hyperparameter settings in Intel-Caffe, i.e. a sub-batch size of 128 on each node. An out-of-memory warning will occur with a sub-batch size of 128.
| Model Layers | Batch Size | TF Best Precision | PS-WK | Steps | Speed (steps/sec) | Horovod Best Prec. | #nodes | Speed (steps/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 50 | 128 | 62.6% | 8-ps-8wk | ~76k | 0.93 | - | - | - |
| 50 | 128 | 64.4% | 4-ps-8wk | ~75k | 0.90 | - | - | - |
| 50 | 64 | - | 1-ps-1wk | - | 1.56 | - | - | - |
| 50 | 32 | - | 1-ps-1wk | - | 2.20 | - | - | - |
| 50 | 128 | - | 1-ps-1wk | - | 0.96 | - | - | - |
| 50 | 128 | - | 8-ps-128wk | - | 0.285 | - | - | - |
| 50 | 32 | - | 8-ps-128wk | - | 0.292 | - | - | - |

The ImageNet runs also get lower evaluation accuracy values.

Usage

Prerequisites

  1. Install TensorFlow and Bazel. I installed a conda2 package on Daint; Bazel and the other required packages are installed with virtualenv inside conda2.

  2. Download the ImageNet dataset to Daint. To avoid the error raised when the relative directory path is not recognized, make the following modification in download_and_preprocess_imagenet.sh: replace

WORK_DIR="$0.runfiles/inception/inception"

with

WORK_DIR="$(realpath -s "$0").runfiles/inception/inception"

After a few days, you will see the data in your data path. Because the Daint file system does not support storing millions of files, you have to delete the raw-data directory.

  3. Download the CIFAR-10/CIFAR-100 datasets:
curl -o cifar-10-binary.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
curl -o cifar-100-binary.tar.gz https://www.cs.toronto.edu/~kriz/cifar-100-binary.tar.gz
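A minimal Python sketch for unpacking the archives, assuming both tarballs sit in the current directory and data/ is the desired target directory:

import tarfile

for archive in ("cifar-10-binary.tar.gz", "cifar-100-binary.tar.gz"):
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall("data")  # assumed target directory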

How to run:

$ cd scripts 
# run local for cifar10. It will launch 1 ps and 2 workers
$ sh submit_local_dist.sh
# run distributed for cifar
$ sh submit_cifar_daint_dist.sh #server #worker #batch_size
# run distributed for Imagenet
$ sh submit_imagenet_daint_dist.sh #server #worker

I leave one node for evaluation, so #worker should be the number of training workers plus one. For example, suppose you would like to launch a job with 2 ps and 4 workers and evaluate your model simultaneously on another node; then pass 2 and 5. The ps and worker tasks are assigned to the same node by default.

$ cd scripts
$ sh submit_imagenet_daint_dist.sh 2 5
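For orientation, a hedged TensorFlow 1.x sketch of how such a cluster layout might be described; this is not the repo's actual submission script, and the hostnames (node01..node05) are placeholders:

import tensorflow as tf

# 2 ps and 4 training workers; ps and worker share node01/node02 by default.
cluster = tf.train.ClusterSpec({
    "ps": ["node01:2222", "node02:2222"],
    "worker": ["node01:2223", "node02:2223",
               "node03:2222", "node04:2222"],
})
# A fifth node (node05) would run evaluation against the shared checkpoints.
# Each process starts a server for its own role and index, e.g. worker 0:
server = tf.train.Server(cluster, job_name="worker", task_index=0)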

Related papers:

Identity Mappings in Deep Residual Networks

https://arxiv.org/pdf/1603.05027v2.pdf

Deep Residual Learning for Image Recognition

https://arxiv.org/pdf/1512.03385v1.pdf

Wide Residual Networks

https://arxiv.org/pdf/1605.07146v1.pdf
