
Mellanox / SparkRDMA

License: Apache-2.0
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark

Programming Languages

  • Java (68,154 projects; the #9 most used programming language on this index)
  • Scala (5,932 projects)

Projects that are alternatives to or similar to SparkRDMA

leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large to render a heatmap directly in the browser, the heatmap-rendering step is moved offline for computation and analysis: the data is first processed in parallel with Apache Spark, the heatmap is then drawn with Apache Spark, and leafletjs loads an OpenStreetMap layer plus the heatmap layer for a good interactive experience. The rendering is currently implemented in Apache Spark; perhaps Spark is not well suited to this kind of computation, or the algorithm is poorly designed, since the parallel computation is slower than a single machine. The Spark heatmap-rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .
Stars: ✭ 13 (-93.95%)
Mutual labels:  big-data, spark, apache-spark, hadoop, bigdata
Bigdata Notes
A getting-started guide to big data ⭐
Stars: ✭ 10,991 (+5012.09%)
Mutual labels:  spark, big-data, hadoop, bigdata
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-30.23%)
Mutual labels:  spark, big-data, hadoop, apache-spark
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-48.37%)
Mutual labels:  big-data, spark, apache-spark, hadoop
Apache Spark Hands On
Educational notes and hands-on problems with solutions for the Hadoop ecosystem
Stars: ✭ 74 (-65.58%)
Mutual labels:  spark, hadoop, bigdata
Big Data Engineering Coursera Yandex
Big Data for Data Engineers Coursera Specialization from Yandex
Stars: ✭ 71 (-66.98%)
Mutual labels:  spark, big-data, bigdata
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+522.33%)
Mutual labels:  spark, big-data, bigdata
Hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Stars: ✭ 126 (-41.4%)
Mutual labels:  spark, hadoop, bigdata
Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+332.09%)
Mutual labels:  spark, bigdata, apache-spark
Bigdata Notebook
Stars: ✭ 100 (-53.49%)
Mutual labels:  spark, hadoop, bigdata
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+1248.37%)
Mutual labels:  spark, big-data, apache-spark
Docker Spark Cluster
A Spark cluster setup running on Docker containers
Stars: ✭ 57 (-73.49%)
Mutual labels:  spark, big-data, hadoop
Bigdata Playground
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (-17.67%)
Mutual labels:  big-data, hadoop, apache-spark
Bigdata Interview
🎯 🌟 [Big data interview questions] Big-data interview questions collected from around the web, together with my own summarized answers. Currently covers the Hadoop/Hive/Spark/Flink/HBase/Kafka/Zookeeper frameworks.
Stars: ✭ 857 (+298.6%)
Mutual labels:  spark, hadoop, bigdata
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+663.72%)
Mutual labels:  spark, big-data, hadoop
Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (+700.47%)
Mutual labels:  spark, bigdata, apache-spark
Splash
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
Stars: ✭ 105 (-51.16%)
Mutual labels:  spark, bigdata, apache-spark
Azure Event Hubs Spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Stars: ✭ 140 (-34.88%)
Mutual labels:  spark, bigdata, apache-spark
Bigdataguide
Big data learning: learn big data from scratch, with learning videos for every stage and interview materials.
Stars: ✭ 817 (+280%)
Mutual labels:  spark, hadoop, bigdata
Hadoop For Geoevent
ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.
Stars: ✭ 5 (-97.67%)
Mutual labels:  big-data, hadoop, bigdata

SparkRDMA ShuffleManager Plugin

SparkRDMA is a high-performance ShuffleManager plugin for Apache Spark that uses RDMA (instead of TCP) to perform shuffle data transfers in Spark jobs.

This open-source project is developed, maintained and supported by Mellanox Technologies.

Performance results

TeraSort

TeraSort results

Running a 320GB TeraSort workload with SparkRDMA is 2.63x faster than standard Spark (runtimes in seconds)

Test environment:

7 Spark standalone workers on Azure "H16mr" VM instances: Intel Haswell E5-2667 v3, 224GB RAM, 2000GB SSD for temporary storage, Mellanox InfiniBand FDR (56Gb/s)

SparkRDMA was also featured at Spark+AI Summit 2018; for more information, see our session: https://databricks.com/session/accelerated-spark-on-azure-seamless-and-scalable-hardware-offloads-in-the-cloud

PageRank

PageRank results

Running a 19GB PageRank workload with SparkRDMA is 2.01x faster than standard Spark (runtimes in seconds)

Test environment:

5 Spark standalone workers: 2x Intel Xeon E5-2697 v3 @ 2.60GHz with 25 cores per Worker, 150GB RAM, non-flash storage (HDD), Mellanox ConnectX-5 network adapters on a 100GbE RoCE fabric, connected by a Mellanox Spectrum switch

Wiki pages

For more information on configuration, performance tuning and troubleshooting, please visit the SparkRDMA GitHub Wiki

Runtime requirements

  • Apache Spark 2.0.0/2.1.0/2.2.0/2.3.0/2.4.0
  • Java 8
  • An RDMA-capable network, e.g. RoCE or InfiniBand (a quick sanity check is sketched below)
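
A quick way to sanity-check these requirements on each node is to query the local RDMA devices and the Java runtime. This is a minimal sketch assuming the libibverbs utilities are installed (they ship with standard RDMA userspace packages):

ibv_devinfo    # lists RDMA devices, ports and link state (should show your RoCE/InfiniBand NIC)
java -version  # should report a Java 8 runtime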

Installation

Obtain SparkRDMA and DiSNI binaries

Please use the "Releases" page to download pre-built binaries.
If you would like to build the project yourself, please refer to the "Build" section below.

The pre-built binaries are packed as an archive containing the following files (a download-and-extract sketch follows the list):

  • spark-rdma-3.1-for-spark-2.0.0-jar-with-dependencies.jar
  • spark-rdma-3.1-for-spark-2.1.0-jar-with-dependencies.jar
  • spark-rdma-3.1-for-spark-2.2.0-jar-with-dependencies.jar
  • spark-rdma-3.1-for-spark-2.3.0-jar-with-dependencies.jar
  • spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
  • libdisni.so
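
For example, the archive can be fetched and unpacked as follows. This is only a sketch: the URL and target directory are placeholders, and the actual download link should be taken from the "Releases" page:

wget https://github.com/Mellanox/SparkRDMA/releases/download/<version>/<archive>.tgz  # placeholder URL
mkdir -p /opt/sparkrdma                                                               # placeholder path
tar -xzf <archive>.tgz -C /opt/sparkrdma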

libdisni.so must be in java.library.path on every Spark Master and Worker (usually in /usr/lib)
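
The simplest approach is to copy the library into /usr/lib on every node. Alternatively, Spark's extraLibraryPath options can point at the directory holding libdisni.so; in this sketch, /opt/sparkrdma is a placeholder for wherever the archive was extracted:

sudo cp /opt/sparkrdma/libdisni.so /usr/lib/  # run on every Spark Master and Worker

or, equivalently, in spark-defaults.conf:

spark.driver.extraLibraryPath   /opt/sparkrdma
spark.executor.extraLibraryPath /opt/sparkrdma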

Configuration

Provide Spark with the location of the SparkRDMA plugin JARs by using the extraClassPath option. For standalone mode this can be added to either spark-defaults.conf or any runtime configuration file; for client mode it must be added to spark-defaults.conf. For Spark 2.0.0 (replace with 2.1.0, 2.2.0, 2.3.0 or 2.4.0 according to your Spark version):

spark.driver.extraClassPath   /path/to/SparkRDMA/target/spark-rdma-3.1-for-spark-2.0.0-jar-with-dependencies.jar
spark.executor.extraClassPath /path/to/SparkRDMA/target/spark-rdma-3.1-for-spark-2.0.0-jar-with-dependencies.jar
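
When submitting in cluster mode, the same properties can also be passed per job with --conf flags; in client mode the driver side must instead come from spark-defaults.conf (as noted above) or spark-submit's --driver-class-path flag. A sketch, with a placeholder jar path:

spark-submit \
  --driver-class-path /path/to/spark-rdma-3.1-for-spark-2.0.0-jar-with-dependencies.jar \
  --conf spark.executor.extraClassPath=/path/to/spark-rdma-3.1-for-spark-2.0.0-jar-with-dependencies.jar \
  <application jar and arguments>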

Running

To enable the SparkRDMA Shuffle Manager plugin, add the following line to either spark-defaults.conf or any runtime configuration file:

spark.shuffle.manager   org.apache.spark.shuffle.rdma.RdmaShuffleManager
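
Putting the pieces together, a complete spark-defaults.conf fragment for, e.g., Spark 2.4.0 could look like this (the /opt/sparkrdma directory is a placeholder):

spark.driver.extraClassPath     /opt/sparkrdma/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
spark.executor.extraClassPath   /opt/sparkrdma/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
spark.shuffle.manager           org.apache.spark.shuffle.rdma.RdmaShuffleManager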

Build

Building the SparkRDMA plugin requires Apache Maven and Java 8

  1. Obtain a clone of SparkRDMA

  2. Build the plugin for your Spark version (either 2.0.0, 2.1.0, 2.2.0, 2.3.0 or 2.4.0), e.g. for Spark 2.0.0:

mvn -DskipTests clean package -Pspark-2.0.0

  3. Obtain a clone of DiSNI for building libdisni:

git clone https://github.com/zrlio/disni.git
cd disni
git checkout tags/v1.7 -b v1.7

  4. Compile and install only libdisni (the jars are already included in the SparkRDMA plugin):

cd libdisni
./autoprepare.sh
./configure --with-jdk=/path/to/java8/jdk
make
make install
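
With the default autotools prefix, make install places libdisni.so under /usr/local/lib; from there it still needs to reach java.library.path on every node, as described in the Installation section. A quick check and copy, assuming the default paths:

ls -l /usr/local/lib/libdisni*                # confirm the library was installed
sudo cp /usr/local/lib/libdisni.so /usr/lib/  # place it on java.library.path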

Community discussions and support

For any questions, issues or suggestions, please use our Google group: https://groups.google.com/forum/#!forum/sparkrdma

Contributions

Any PR submissions are welcome
