hortonworks / spark-native-yarn

License: Apache-2.0
Tez port for Spark API

Programming languages: Scala, Java

spark-native-yarn

Native YARN integration with Apache Spark

For feedback and suggestions please use this project's Issues feature.

============

IMPORTANT: At the time of writing, the project is a prototype whose goal is to demonstrate the validity of the approach described in SPARK-3561. To get an idea of the currently supported functionality, please refer to APIDemoTests as well as the Samples project.

==

spark-native-yarn is an extension to Apache Spark that enables DAGs assembled with the Spark API to run on Apache Tez, so that applications can benefit from native Tez features, especially those aimed at large-scale batch/ETL workloads.

Besides running Spark DAGs on Apache Tez, this project provides additional functionality aimed at developer productivity, including but not limited to:

  • executing your code on a YARN cluster directly from the IDE (Eclipse and/or IntelliJ)
  • remote submission (submission from a remote client)
  • transparent classpath management
  • seamless and simplified integration with a mini-cluster environment
  • enhanced debugging capabilities: the ability to set breakpoints in Spark application code and step through them when using the mini-cluster (see InJvmContainerExecutor provided with mini-dev-cluster)
  • ability to utilize Tez local mode

At the time of writing, spark-native-yarn depends on the modifications to Spark code described in SPARK-3561. To use it, you therefore need a custom build of Spark that incorporates the pending GitHub pull request. You can build your own by following the instructions below, or you can download a pre-built distribution from here.

IMPORTANT: If you opt for the pre-built distribution, keep in mind that it is based on the Spark 1.1 release, which means you have to use the compatible spark-native-yarn branch, 1.1.1.

If you want to take your chances with the latest Spark snapshot, follow the instructions below. Otherwise (for the pre-built distribution), skip them and go straight to building spark-native-yarn, or follow the pre-built spark-shell and/or spark-submit instructions.

Below are the prerequisites and instructions on how to proceed.

IMPORTANT: Please complete the prerequisites described below and then continue to the Getting Started guide.

Checkout and Build SPARK-3561

$> git clone https://github.com/olegz/spark-1.git
$> cd spark-1
$> git fetch --all

Switch to SPARK-3561 branch

$> git branch --track SH-1 origin/SH-1
$> git checkout SH-1

Spark uses Maven for its build, so Maven must be installed. To avoid out-of-memory (OOM) errors during the build, set the Maven options as shown below. See Spark's documentation for more details.

$> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

Build and install SPARK-3561 into your local Maven repository

$> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean install
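
Before starting the build, it can help to confirm that the required tools are on the PATH and that MAVEN_OPTS has been set. The following is a minimal sketch and not part of the project: `require_cmd` and `preflight` are hypothetical helper names introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical preflight helpers (not part of spark-native-yarn).

# Succeeds if the named command can be found on the PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Reports every missing prerequisite on stderr and returns non-zero
# if any check fails.
preflight() {
    status=0
    for cmd in git mvn; do
        if ! require_cmd "$cmd"; then
            echo "missing required command: $cmd" >&2
            status=1
        fi
    done
    if [ -z "${MAVEN_OPTS:-}" ]; then
        echo "MAVEN_OPTS is not set; the build may run out of memory" >&2
        status=1
    fi
    return "$status"
}
```

With these helpers defined, `preflight && mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean install` runs the build only when all checks pass.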

The build should take 20-30 minutes, depending on your machine. When it finishes, you should see output indicating a successful build:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [  2.281 s]
[INFO] Spark Project Core ................................ SUCCESS [02:33 min]
[INFO] Spark Project Bagel ............................... SUCCESS [ 18.959 s]
. . .
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
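
If the build output is saved to a file, the result can be checked automatically by looking for the `BUILD SUCCESS` line shown above. This is a minimal sketch, assuming the log was captured with something like `mvn ... | tee build.log`; the function name `build_succeeded` and the file name `build.log` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of spark-native-yarn): succeeds only if the
# Maven log file given as $1 contains a line starting with
# "[INFO] BUILD SUCCESS".
build_succeeded() {
    grep -q '^\[INFO\] BUILD SUCCESS' "$1"
}
```

For example, `build_succeeded build.log || echo "Spark build failed" >&2` prints a message only when the success line is absent.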

Clone spark-native-yarn

$> git clone https://github.com/hortonworks/spark-native-yarn.git
$> cd spark-native-yarn

To switch to the 1.1.1 branch:

$> git fetch --all
$> git branch --track 1.1.1 origin/1.1.1
$> git checkout 1.1.1
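
Because the pre-built Spark distribution is only compatible with the 1.1.1 branch, it may be worth confirming which branch is checked out before building. The following is a minimal sketch; `current_branch` and `on_branch` are hypothetical helper names, and the repository path is passed as the first argument.

```shell
#!/bin/sh
# Hypothetical helpers (not part of spark-native-yarn).

# Prints the short name of the branch checked out in the repository at $1.
# Fails if HEAD is detached (that is, not on any branch).
current_branch() {
    git -C "$1" symbolic-ref --short HEAD
}

# Succeeds if the repository at $1 is on the branch named $2.
on_branch() {
    [ "$(current_branch "$1" 2>/dev/null)" = "$2" ]
}
```

From inside the clone, `on_branch . 1.1.1 || echo "not on branch 1.1.1" >&2` warns when a different branch is checked out.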

This completes the prerequisites required to run spark-native-yarn; you can now continue to the Getting Started guide.
