
Qihoo360 / Xlearning

Licence: apache-2.0
AI on Hadoop

Programming Languages

java
68154 projects - #9 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to Xlearning

Netron
Visualizer for neural network, deep learning, and machine learning models
Stars: ✭ 17,193 (+906.03%)
Mutual labels:  ai, deeplearning, caffe, mxnet, machinelearning
Clearml Server
ClearML - Auto-Magical Suite of tools to streamline your ML workflow. Experiment Manager, ML-Ops and Data-Management
Stars: ✭ 186 (-89.12%)
Mutual labels:  ai, deeplearning, machinelearning
Best ai paper 2020
A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code
Stars: ✭ 2,140 (+25.22%)
Mutual labels:  ai, deeplearning, machinelearning
DLInfBench
CNN model inference benchmarks for some popular deep learning frameworks
Stars: ✭ 51 (-97.02%)
Mutual labels:  caffe, mxnet, deeplearning
All4nlp
All For NLP, especially Chinese.
Stars: ✭ 141 (-91.75%)
Mutual labels:  ai, deeplearning, machinelearning
Text summurization abstractive methods
Multiple implementations for abstractive text summurization , using google colab
Stars: ✭ 359 (-78.99%)
Mutual labels:  ai, deeplearning, machinelearning
Clearml
ClearML - Auto-Magical CI/CD to streamline your ML workflow. Experiment Manager, MLOps and Data-Management
Stars: ✭ 2,868 (+67.82%)
Mutual labels:  ai, deeplearning, machinelearning
XLearning-GPU
qihoo360 xlearning with GPU support; AI on Hadoop
Stars: ✭ 22 (-98.71%)
Mutual labels:  caffe, hadoop, mxnet
Tensorwatch
Debugging, monitoring and visualization for Python Machine Learning and Data Science
Stars: ✭ 3,191 (+86.72%)
Mutual labels:  ai, deeplearning, machinelearning
Polyaxon
Machine Learning Platform for Kubernetes (MLOps tools for experimentation and automation)
Stars: ✭ 2,966 (+73.55%)
Mutual labels:  ai, caffe, mxnet
Ffdl
Fabric for Deep Learning (FfDL, pronounced fiddle) is a Deep Learning Platform offering TensorFlow, Caffe, PyTorch etc. as a Service on Kubernetes
Stars: ✭ 640 (-62.55%)
Mutual labels:  ai, deeplearning, caffe
Deeplearning
Introductory deep learning tutorials and excellent articles (Deep Learning Tutorial)
Stars: ✭ 6,783 (+296.9%)
Mutual labels:  deeplearning, mxnet, machinelearning
Horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Stars: ✭ 11,943 (+598.83%)
Mutual labels:  deeplearning, mxnet, machinelearning
Tools To Design Or Visualize Architecture Of Neural Network
Tools to Design or Visualize Architecture of Neural Network
Stars: ✭ 1,143 (-33.12%)
Mutual labels:  deeplearning, machinelearning
Mtcnn
face detection and alignment with mtcnn
Stars: ✭ 66 (-96.14%)
Mutual labels:  caffe, mxnet
Tf Yarn
Train TensorFlow models on YARN in just a few lines of code!
Stars: ✭ 76 (-95.55%)
Mutual labels:  hadoop, yarn
Dlcv for beginners
Companion code for the book "Deep Learning and Computer Vision"
Stars: ✭ 1,244 (-27.21%)
Mutual labels:  caffe, mxnet
Pwc Net
PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume, CVPR 2018 (Oral)
Stars: ✭ 1,142 (-33.18%)
Mutual labels:  deeplearning, caffe
Kamonohashi
KAMONOHASHI, an AI development platform
Stars: ✭ 80 (-95.32%)
Mutual labels:  deeplearning, machinelearning
Mobilenet Ssd
MobileNet-SSD(MobileNetSSD) + Neural Compute Stick(NCS) Faster than YoloV2 + Explosion speed by RaspberryPi · Multiple moving object detection with high accuracy.
Stars: ✭ 84 (-95.08%)
Mutual labels:  deeplearning, caffe


XLearning is a convenient and efficient scheduling platform that combines big data and artificial intelligence and supports a variety of machine learning and deep learning frameworks. XLearning runs on Hadoop YARN and integrates frameworks such as TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, and XGBoost. It offers good scalability and compatibility.

Chinese documentation

Architecture

[Figure: XLearning architecture]
There are three essential components in XLearning:

  • Client: starts the application and retrieves its state.
  • ApplicationMaster (AM): handles internal scheduling and lifecycle management, including input data distribution and container management.
  • Container: the actual executor of the application. It starts the Worker or PS (Parameter Server) process, monitors that process and reports its status to the AM, and saves the output. For TensorFlow applications, it can also start the TensorBoard service.

Functions

1 Support Multiple Deep Learning Frameworks

In addition to the distributed mode of TensorFlow and MXNet, XLearning supports the standalone mode of all deep learning frameworks, such as Caffe, Theano, and PyTorch. XLearning also flexibly supports custom versions and multiple versions of each framework.

2 Unified Data Management Based On HDFS

The input strategy for the input data (--input) can be specified with the --input-strategy parameter or the xlearning.input.strategy configuration property. XLearning supports three ways to read HDFS input data:

  • Download: The AM traverses all files under the specified HDFS path and distributes the data to workers at file granularity. Each worker downloads its files from HDFS to the local disk.
  • Placeholder: Unlike Download mode, the AM only sends the list of relevant HDFS files to the workers. The process in each worker reads the data from HDFS directly.
  • InputFormat: Building on the InputFormat mechanism of MapReduce, XLearning lets the user specify any InputFormat implementation for the input data. The AM splits the input data and assigns the fragments to different workers. Each worker passes its assigned fragments through a pipeline to the executing process.
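
The file-level distribution used by the Download and Placeholder strategies can be pictured with a short sketch. The text above does not say how the AM decides which file goes to which worker, so the round-robin policy, the function name assign_files, and the sample file names are assumptions made for illustration only, not XLearning's implementation.

```python
from typing import Dict, List


def assign_files(files: List[str], num_workers: int) -> Dict[int, List[str]]:
    """Assign whole files to workers in round-robin order (assumed policy)."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    assignment: Dict[int, List[str]] = {w: [] for w in range(num_workers)}
    # Sort first so the result does not depend on the listing order.
    for i, path in enumerate(sorted(files)):
        assignment[i % num_workers].append(path)
    return assignment


# Hypothetical file names under the example input path.
files = [
    "/tmp/data/tensorflow/part-0",
    "/tmp/data/tensorflow/part-1",
    "/tmp/data/tensorflow/part-2",
]
plan = assign_files(files, num_workers=2)
for worker, paths in plan.items():
    print(worker, paths)
```

Under Download, each worker would copy its listed files to local disk before the program starts. Under Placeholder, the worker would receive the same list and read the files from HDFS itself.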

Similarly, the output strategy for the output data (--output) can be specified with the --output-strategy parameter or the xlearning.output.strategy configuration property. There are two output modes:

  • Upload: After the program finishes, each worker uploads its local output directory directly to the specified HDFS path. The "Save Model" button on the web interface lets the user upload intermediate results to HDFS during execution.
  • OutputFormat: Building on the OutputFormat mechanism of MapReduce, XLearning lets the user specify any OutputFormat implementation for saving the results to HDFS.

For more details, see data management.

3 Visualization Display

The application interface can be divided into four parts:

  • All Containers: displays the container list and the corresponding information, including the container host, container role, current container state, start time, finish time, and current progress.
  • View TensorBoard: if the TensorBoard service is enabled for a TensorFlow application, provides a link to open TensorBoard for real-time viewing.
  • Save Model: if the application produces output, the user can upload the intermediate output to the specified HDFS path during execution with the "Save Model" button. After the upload finishes, the list of saved intermediate paths is displayed.
  • Worker Metrics: displays the resource usage metrics of each worker.

The interface is shown below:

[Screenshot: XLearning application interface]

4 Compatible With Native Framework Code

Apart from the ClusterSpec, which XLearning constructs automatically for distributed TensorFlow, programs written for standalone TensorFlow and for other deep learning frameworks can run on XLearning directly.
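
For distributed TensorFlow, the user program still has to pick up the cluster description that XLearning builds. A minimal sketch of that step follows. It assumes the description arrives as JSON in an environment variable, together with the role and task index of the current container. The variable names TF_CLUSTER_DEF, TF_ROLE and TF_INDEX and the host names are assumptions that this README does not confirm, so check them against the XLearning examples before relying on them.

```python
import json
from typing import Dict, List, Mapping, Tuple


def read_cluster(env: Mapping[str, str]) -> Tuple[Dict[str, List[str]], str, int]:
    """Return (cluster, role, index) parsed from the environment.

    The variable names are assumptions, not confirmed by this README.
    """
    cluster = json.loads(env["TF_CLUSTER_DEF"])  # e.g. {"ps": [...], "worker": [...]}
    role = env["TF_ROLE"]                        # "ps" or "worker"
    index = int(env["TF_INDEX"])                 # position within that role
    return cluster, role, index


# Stand-in for os.environ so the sketch runs without a cluster.
fake_env = {
    "TF_CLUSTER_DEF": json.dumps(
        {"ps": ["host1:2222"], "worker": ["host2:2222", "host3:2222"]}
    ),
    "TF_ROLE": "worker",
    "TF_INDEX": "1",
}
cluster, role, index = read_cluster(fake_env)
print(role, index, cluster[role][index])
```

In a real program the mapping would be os.environ, and the resulting dictionary could be passed to tf.train.ClusterSpec.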

Compilation & Deployment Instructions

1 Compilation Environment Requirements

  • JDK >= 1.7
  • Maven >= 3.3

2 Compilation Method

Run the following command in the root directory of the source code:

mvn package

After compilation, a distribution package named xlearning-1.1-dist.tar.gz is generated under the target directory in the source root.
Unpacking the distribution package produces the following subdirectories under its root directory:

  • bin: scripts for submitting applications
  • lib: jars for XLearning and dependencies
  • conf: configuration files
  • sbin: scripts for history service
  • data: data and files for examples
  • examples: XLearning examples

3 Deployment Environment Requirements

  • CentOS 7.2
  • Java >= 1.7
  • Hadoop = 2.6, 2.7, 2.8
  • [Optional] Runtime dependencies of the deep learning frameworks on the cluster nodes, such as TensorFlow, NumPy, and Caffe.

4 XLearning Client Deployment Guide

In the "conf" directory of the unpacked distribution package ($XLEARNING_HOME), configure the following files:

  • xlearning-env.sh: set the environment variables, such as:

    • JAVA_HOME
    • HADOOP_CONF_DIR
  • xlearning-site.xml: configure the related properties. Note that the properties associated with the history service need to be consistent with those configured when the history service was started. For more details, see the Configuration section.

  • log4j.properties: configure the log level

5 Starting the XLearning History Service [Optional]

  • Run $XLEARNING_HOME/sbin/start-history-server.sh.

Quick Start

Use $XLEARNING_HOME/bin/xl-submit to submit an application to the cluster from the XLearning client.
The following is a submission example for a TensorFlow application.

1 Upload data to HDFS

Upload the "data" directory under the root of the unpacked distribution package to HDFS:

cd $XLEARNING_HOME  
hadoop fs -put data /tmp/ 

2 Submit

cd $XLEARNING_HOME/examples/tensorflow
$XLEARNING_HOME/bin/xl-submit \
   --app-type "tensorflow" \
   --app-name "tf-demo" \
   --input /tmp/data/tensorflow#data \
   --output /tmp/tensorflow_model#model \
   --files demo.py,dataDeal.py \
   --launch-cmd "python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10" \
   --worker-memory 10G \
   --worker-num 2 \
   --worker-cores 3 \
   --ps-memory 1G \
   --ps-num 1 \
   --ps-cores 2 \
   --queue default

The parameters have the following meanings:

| Property Name | Meaning |
|---|---|
| app-name | application name, here "tf-demo" |
| app-type | application type, here "tensorflow" |
| input | input data; the HDFS path "/tmp/data/tensorflow" is mapped to the local directory "./data" |
| output | output data; the HDFS path "/tmp/tensorflow_model" is mapped to the local directory "./model" |
| files | application program and required local files, here demo.py and dataDeal.py |
| launch-cmd | command used to launch the application |
| worker-memory | amount of memory for the worker process, here 10 GB |
| worker-num | number of worker containers for the application, here 2 |
| worker-cores | number of cores for the worker process, here 3 |
| ps-memory | amount of memory for the ps process, here 1 GB |
| ps-num | number of ps containers for the application, here 1 |
| ps-cores | number of cores for the ps process, here 2 |
| queue | queue that the application is submitted to |

For more details, see the Submit Parameter section.
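
The path#name form of --input and --output, and the flag list as a whole, can also be handled from a script. The sketch below splits such a value into its HDFS path and local directory, and assembles the same xl-submit command as an argument list. Only flags that appear in the example above are used. The helper names split_mapping and build_submit_cmd are hypothetical, and /opt/xlearning is a placeholder for $XLEARNING_HOME.

```python
from typing import List, Tuple


def split_mapping(value: str) -> Tuple[str, str]:
    """Split 'hdfs_path#local_dir' into its two parts."""
    hdfs_path, sep, local_dir = value.partition("#")
    if not sep or not hdfs_path or not local_dir:
        raise ValueError(f"expected 'hdfs_path#local_dir', got {value!r}")
    return hdfs_path, local_dir


def build_submit_cmd(xlearning_home: str) -> List[str]:
    """Assemble the TensorFlow demo submission as an argv list."""
    launch = (
        "python demo.py --data_path=./data --save_path=./model "
        "--log_dir=./eventLog --training_epochs=10"
    )
    return [
        f"{xlearning_home}/bin/xl-submit",
        "--app-type", "tensorflow",
        "--app-name", "tf-demo",
        "--input", "/tmp/data/tensorflow#data",
        "--output", "/tmp/tensorflow_model#model",
        "--files", "demo.py,dataDeal.py",
        "--launch-cmd", launch,
        "--worker-memory", "10G",
        "--worker-num", "2",
        "--worker-cores", "3",
        "--ps-memory", "1G",
        "--ps-num", "1",
        "--ps-cores", "2",
        "--queue", "default",
    ]


print(split_mapping("/tmp/data/tensorflow#data"))
cmd = build_submit_cmd("/opt/xlearning")  # placeholder install path
print(cmd[0])
```

The list can be passed to subprocess.run(cmd, check=True) from the examples/tensorflow directory. Passing a list rather than a single string avoids shell quoting problems with the launch command.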

FAQ

XLearning FAQ

Authors

XLearning is designed, authored, reviewed, and tested by the following team on GitHub:

@Yuance Li, @Wen OuYang, @Runying Jia, @YuHan Jia, @Lei Wang

Contact us

Mail: [email protected]
QQ group: 588356340
