linkedin / Tony

Licence: other
TonY is a framework to natively run deep learning frameworks on Apache Hadoop.

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to Tony

Bigdl
Building Large-Scale AI Applications for Distributed Big Data
Stars: ✭ 3,813 (+509.11%)
Mutual labels:  hadoop
God Of Bigdata
Focused on big data study and interview preparation; the road to big data mastery starts here. Flink/Spark/Hadoop/HBase/Hive...
Stars: ✭ 6,008 (+859.74%)
Mutual labels:  hadoop
Hadoop study
Regularly updated documentation for commonly used big data components in the Hadoop ecosystem, focusing in order on: Flink, Solr, SparkSQL, ES, Scala, Kafka, HBase/Phoenix, Redis, Kerberos (the project includes a Hadoop mind map, Evernote notes, simple Scala demos, common utility classes, and de-sensitized training code; continuously updated!!!)
Stars: ✭ 567 (-9.42%)
Mutual labels:  hadoop
Orc
Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
Stars: ✭ 389 (-37.86%)
Mutual labels:  hadoop
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (-35.14%)
Mutual labels:  hadoop
Pdf
Programming e-books covering C, C#, Docker, Elasticsearch, Git, Hadoop, HeadFirst, Java, JavaScript, JVM, Kafka, Linux, Maven, MongoDB, MyBatis, MySQL, Netty, Nginx, Python, RabbitMQ, Redis, Scala, Solr, Spark, Spring, SpringBoot, SpringCloud, TCP/IP, Tomcat, ZooKeeper, artificial intelligence, big data, concurrent programming, databases, data mining, new interview questions, architecture design, algorithms, computer science, design patterns, software testing, refactoring and optimization, and many more categories
Stars: ✭ 12,009 (+1818.37%)
Mutual labels:  hadoop
Wedatasphere
WeDataSphere is a financial level one-stop open-source suitcase for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!
Stars: ✭ 372 (-40.58%)
Mutual labels:  hadoop
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+803.51%)
Mutual labels:  hadoop
Marmaray
Generic Data Ingestion & Dispersal Library for Hadoop
Stars: ✭ 414 (-33.87%)
Mutual labels:  hadoop
Bigdata
💎🔥 Big data study notes
Stars: ✭ 488 (-22.04%)
Mutual labels:  hadoop
Iceberg
Iceberg is a table format for large, slow-moving tabular data
Stars: ✭ 393 (-37.22%)
Mutual labels:  hadoop
Kafka Connect Hdfs
Kafka Connect HDFS connector
Stars: ✭ 400 (-36.1%)
Mutual labels:  hadoop
School Of Sre
At LinkedIn, we are using this curriculum for onboarding our entry-level talents into the SRE role.
Stars: ✭ 5,141 (+721.25%)
Mutual labels:  hadoop
Ignite
Apache Ignite
Stars: ✭ 4,027 (+543.29%)
Mutual labels:  hadoop
Alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
Stars: ✭ 5,379 (+759.27%)
Mutual labels:  hadoop
Hive
Apache Hive
Stars: ✭ 4,031 (+543.93%)
Mutual labels:  hadoop
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+3422.04%)
Mutual labels:  hadoop
Javapdf
🍣 100 Java e-books and technical books in PDF (take pride in downloading and reading them, not merely starring and bookmarking)
Stars: ✭ 609 (-2.72%)
Mutual labels:  hadoop
Dist Keras
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Stars: ✭ 613 (-2.08%)
Mutual labels:  hadoop
Gis Tools For Hadoop
The GIS Tools for Hadoop are a collection of GIS tools for spatial analysis of big data.
Stars: ✭ 485 (-22.52%)
Mutual labels:  hadoop

TonY

TonY is a framework to natively run deep learning jobs on Apache Hadoop. It currently supports TensorFlow, PyTorch, MXNet and Horovod. TonY enables running either single node or distributed training as a Hadoop application. This native connector, together with other TonY features, aims to run machine learning jobs reliably and flexibly. For a quick overview of TonY and comparisons to other frameworks, please see this presentation.

Compatibility Notes

It is recommended to run TonY with Hadoop 3.1.1 and above. TonY itself is compatible with Hadoop 2.7.4 and above. If you need GPU isolation from TonY, you need Hadoop 3.1.0 or higher.

Build

How to build

TonY is built using Gradle. To build TonY, run:

./gradlew build

This will automatically run the tests. If you want to build without running the tests, run:

./gradlew build -x test

The jar required to run TonY will be located in ./tony-cli/build/libs/.

Publishing (for admins)

Follow this guide to generate a key pair using GPG. Publish your public key.

Create a Nexus account at https://oss.sonatype.org/ and request access to publish to com.linkedin.tony. Here's an example Jira ticket: https://issues.sonatype.org/browse/OSSRH-47350.

Configure your ~/.gradle/gradle.properties file:

# signing plugin uses these
signing.keyId=...
signing.secretKeyRingFile=/home/<ldap>/.gnupg/secring.gpg
signing.password=...

# maven repo credentials
mavenUser=...
mavenPassword=...

# gradle-nexus-staging-plugin uses these
nexusUsername=<sameAsMavenUser>
nexusPassword=<sameAsMavenPassword>

Now you can publish and release artifacts by running ./gradlew publish closeAndReleaseRepository.

Usage

TonY is a Java library, so launching a job is as simple as running a Java program. There are two ways to launch your deep learning jobs with TonY:

  • Use a Docker container.
  • Use a zipped Python virtual environment.

Use a Docker container

Note that this requires a properly configured Hadoop cluster with Docker support. Check this documentation if you are unsure how to set it up. Assuming you have set up your Hadoop cluster with the Docker container runtime, you should already have built a Docker image containing the required Hadoop configurations. You then need to install your job's Python dependencies - TensorFlow or PyTorch - inside that Docker image.
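As a rough sketch (the base image name below is a placeholder; in practice you would start from the image you already built with your cluster's Hadoop configurations), the dependency-install step might look like:

```dockerfile
# Placeholder base image; replace with the image carrying your
# cluster's Hadoop configurations.
FROM your-registry/hadoop-base:latest

# Install the Python dependencies your training job needs.
RUN pip install tensorflow
# ...or, for PyTorch jobs:
# RUN pip install torch
```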

Below is a folder structure of what you need to launch the job:

MyJob/
  > src/
    > models/
      mnist_distributed.py
  tony.xml
  tony-cli-0.1.5-all.jar

The src/ folder contains all of your training scripts. The tony.xml file is used to configure your training job. Specifically, when using Docker as the container runtime, your configuration should look similar to the following:

$ cat MyJob/tony.xml
<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>4</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
  </property>
  <property>
    <name>tony.docker.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>tony.docker.containers.image</name>
    <value>YOUR_DOCKER_IMAGE_NAME</value>
  </property>
</configuration>

For a full list of configurations, please see the wiki.

Now you're ready to launch your job:

$ java -cp "`hadoop classpath --glob`:MyJob/*:MyJob/" \
        com.linkedin.tony.cli.ClusterSubmitter \
        -executes models/mnist_distributed.py \
        -task_params '--input_dir /path/to/hdfs/input --output_dir /path/to/hdfs/output' \
        -src_dir src \
        -python_binary_path /home/user_name/python_virtual_env/bin/python

Use a zipped Python virtual environment

The difference between this approach and the one with Docker is

  • You don't need to set up your Hadoop cluster with Docker support.
  • There is no requirement on a Docker image registry.

Nothing comes for free, though. If you don't want to set up your cluster with Docker support, you need to prepare a zipped virtual environment for your job, and your cluster must run the same OS version as the machine on which you build the Python virtual environment.

Python virtual environment in a zip

$ unzip -Z1 my-venv.zip | head -n 10
  Python/
  Python/bin/
  Python/bin/rst2xml.py
  Python/bin/wheel
  Python/bin/rst2html5.py
  Python/bin/rst2odt.py
  Python/bin/rst2s5.py
  Python/bin/pip2.7
  Python/bin/saved_model_cli
  Python/bin/rst2pseudoxml.pyc
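A minimal sketch of producing such a zip (illustrative only, not part of TonY: function and file names are placeholders, and a real job environment would also pip-install your ML dependencies before zipping):

```python
import os
import venv
import zipfile


def build_venv_zip(zip_path="my-venv.zip", env_dir="Python"):
    """Create a virtual environment rooted at Python/ and zip it up.

    with_pip=False keeps this sketch fast; a real environment would
    include pip and your job's dependencies (e.g. TensorFlow).
    """
    venv.create(env_dir, with_pip=False)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(env_dir):
            for name in files:
                path = os.path.join(root, name)
                # Keep archive entries rooted at Python/ so that
                # -python_binary_path Python/bin/python resolves.
                zf.write(path, arcname=path)
    return zip_path
```

Remember that the zip must be built on a machine with the same OS version as your cluster nodes.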

TonY jar and tony.xml

MyJob/
  > src/
    > models/
      mnist_distributed.py
  tony.xml
  tony-cli-0.1.5-all.jar
  my-venv.zip # The additional file you need.

A similar tony.xml, but without the Docker-related configurations:

$ cat tony/tony.xml
<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>4</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
  </property>
</configuration>

Then you can launch your job:

$ java -cp "`hadoop classpath --glob`:MyJob/*:MyJob" \
            com.linkedin.tony.cli.ClusterSubmitter \
            -executes models/mnist_distributed.py \
            -task_params '--input_dir /path/to/hdfs/input --output_dir /path/to/hdfs/output' \
            -python_venv my-venv.zip \
            -python_binary_path Python/bin/python \
            -src_dir src

Here -executes is the path to the model program relative to src_dir, and -python_binary_path is the path to the Python binary relative to the root of my-venv.zip.

TonY arguments

The command line arguments are as follows:

  • executes (required), e.g. -executes model/mnist.py - Path to the entry point of your training code.
  • src_dir (required), e.g. -src_dir src/ - Name of the local root directory containing all of your Python model source code. This directory is copied to all worker nodes.
  • task_params (optional), e.g. -task_params '--input_dir /hdfs/input --output_dir /hdfs/output' - Command line arguments passed to your entry point.
  • python_venv (optional), e.g. -python_venv venv.zip - Path to the zipped local Python virtual environment.
  • python_binary_path (optional), e.g. -python_binary_path Python/bin/python - Used together with python_venv; either the relative path to the Python binary inside your virtual environment, or an absolute path to a Python binary already installed on all worker nodes.
  • shell_env (optional), e.g. -shell_env LD_LIBRARY_PATH=/usr/local/lib64/ - Key-value pairs for environment variables to set in your Python worker/ps processes.
  • conf_file (optional), e.g. -conf_file tony-local.xml - Location of a TonY configuration file.
  • conf (optional), e.g. -conf tony.application.security.enabled=false - Overrides configurations from your configuration file via the command line.
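Inside your entry point, the task_params string arrives as ordinary command-line arguments. A minimal sketch of parsing them (the function name and flags here are illustrative, matching the mnist_distributed.py examples above, not a TonY API):

```python
import argparse


def parse_task_params(argv=None):
    # These flags mirror the -task_params string passed to ClusterSubmitter;
    # TonY hands them to your training script as regular argv entries.
    parser = argparse.ArgumentParser(description="mnist_distributed entry point")
    parser.add_argument("--input_dir", required=True, help="HDFS input directory")
    parser.add_argument("--output_dir", required=True, help="HDFS output directory")
    return parser.parse_args(argv)
```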

TonY configurations

There are multiple ways to specify configurations for your TonY job. As above, you can create an XML file called tony.xml and add its parent directory to your java classpath.

Alternatively, you can pass -conf_file <name_of_conf_file> to the java command line if you have a file not named tony.xml containing your configurations. (As before, the parent directory of this file must be added to the java classpath.)

If you wish to override configurations from your configuration file via command line, you can do so by passing -conf <tony.conf.key>=<tony.conf.value> argument pairs on the command line.
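The resulting precedence can be illustrated with a small sketch (this models the documented behavior, where -conf pairs win over file entries; it is not TonY's actual implementation):

```python
def effective_conf(file_conf, cli_overrides):
    """Merge tony.xml entries with -conf key=value override pairs.

    Illustrative only: command-line -conf pairs take precedence
    over values loaded from the configuration file.
    """
    merged = dict(file_conf)
    for pair in cli_overrides:
        key, _, value = pair.partition("=")
        merged[key] = value
    return merged
```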

Please check our wiki for all TonY configurations and their default values.

TonY Examples

Examples of running distributed deep learning jobs with TonY can be found in the tony-examples directory of the repository.

More information

For more information about TonY, please check out the wiki.

FAQ

  1. My TensorFlow process hangs with:

    2018-09-13 03:02:31.538790: E tensorflow/core/distributed_runtime/master.cc:272] CreateSession failed because worker /job:worker/replica:0/task:0 returned error: Unavailable: OS Error
    INFO:tensorflow:An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session will be created. Error: OS Error
    INFO:tensorflow:Graph was finalized.
    2018-09-13 03:03:33.792490: I tensorflow/core/distributed_runtime/master_session.cc:1150] Start master session ea811198d338cc1d with config: 
    INFO:tensorflow:Waiting for model to be ready.  Ready_for_local_init_op:  Variables not initialized: conv1/Variable, conv1/Variable_1, conv2/Variable, conv2/Variable_1, fc1/Variable, fc1/Variable_1, fc2/Variable, fc2/Variable_1, global_step, adam_optimizer/beta1_power, adam_optimizer/beta2_power, conv1/Variable/Adam, conv1/Variable/Adam_1, conv1/Variable_1/Adam, conv1/Variable_1/Adam_1, conv2/Variable/Adam, conv2/Variable/Adam_1, conv2/Variable_1/Adam, conv2/Variable_1/Adam_1, fc1/Variable/Adam, fc1/Variable/Adam_1, fc1/Variable_1/Adam, fc1/Variable_1/Adam_1, fc2/Variable/Adam, fc2/Variable/Adam_1, fc2/Variable_1/Adam, fc2/Variable_1/Adam_1, ready: None
    

    Why?

    Try adding the path to your libjvm.so shared library to your LD_LIBRARY_PATH environment variable for your workers. See above for an example.

  2. How do I configure arbitrary TensorFlow job types?

    Please see the wiki on TensorFlow task configuration for details.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].