All Projects → ehiggs → Spark Terasort

ehiggs / Spark Terasort

Licence: apache-2.0
Spark Terasort

Programming Languages

java
68154 projects - #9 most used programming language

Labels

Projects that are alternatives of or similar to Spark Terasort

Flint
Webex Bot SDK for Node.js (deprecated in favor of https://github.com/webex/webex-bot-node-framework)
Stars: ✭ 85 (-15.84%)
Mutual labels:  spark
Big Data
🔧 Use dplyr to analyze Big Data 🐘
Stars: ✭ 93 (-7.92%)
Mutual labels:  spark
Logisland
Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.
Stars: ✭ 97 (-3.96%)
Mutual labels:  spark
Laravel Spark Google2fa
Google Authenticator support for Laravel Spark
Stars: ✭ 86 (-14.85%)
Mutual labels:  spark
Udacity Data Engineering
Udacity Data Engineering Nano Degree (DEND)
Stars: ✭ 89 (-11.88%)
Mutual labels:  spark
Repository
个人学习知识库涉及到数据仓库建模、实时计算、大数据、Java、算法等。
Stars: ✭ 92 (-8.91%)
Mutual labels:  spark
Spark States
Custom state store providers for Apache Spark
Stars: ✭ 83 (-17.82%)
Mutual labels:  spark
Bigdata Notebook
Stars: ✭ 100 (-0.99%)
Mutual labels:  spark
Spark On Kubernetes Helm
Spark on Kubernetes infrastructure Helm charts repo
Stars: ✭ 92 (-8.91%)
Mutual labels:  spark
Schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
Stars: ✭ 97 (-3.96%)
Mutual labels:  spark
Spark python ml examples
Spark 2.0 Python Machine Learning examples
Stars: ✭ 87 (-13.86%)
Mutual labels:  spark
Ammonite Spark
Run spark calculations from Ammonite
Stars: ✭ 88 (-12.87%)
Mutual labels:  spark
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+1224.75%)
Mutual labels:  spark
Cuesheet
A framework for writing Spark 2.x applications in a pretty way
Stars: ✭ 86 (-14.85%)
Mutual labels:  spark
Almond
A Scala kernel for Jupyter
Stars: ✭ 1,354 (+1240.59%)
Mutual labels:  spark
Hops Examples
Examples for Deep Learning/Feature Store/Spark/Flink/Hive/Kafka jobs and Jupyter notebooks on Hops
Stars: ✭ 84 (-16.83%)
Mutual labels:  spark
Spark Summit 2017 Sanfrancisco
spark summit 2017 SanFrancisco
Stars: ✭ 93 (-7.92%)
Mutual labels:  spark
Spark Ffm
FFM (Field-Awared Factorization Machine) on Spark
Stars: ✭ 101 (+0%)
Mutual labels:  spark
Bigdata Notes
大数据入门指南 ⭐
Stars: ✭ 10,991 (+10782.18%)
Mutual labels:  spark
Relation extraction
Relation Extraction using Deep learning(CNN)
Stars: ✭ 96 (-4.95%)
Mutual labels:  spark

TeraSort benchmark for Spark

Build Status

This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch, but it is not the same TeraSort program that currently holds the record. That program is here.

Building

mvn install

The default is to link against Spark 2.4.4 jars (released September 2019). If you plan to run using an older version of Spark (e.g. 1.6) you will have to try -Dspark.version=1.6. If possible, it's probably a better idea to just update to a more recent version or Spark.

Running

cd to your your Spark install.

Generate data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraGen 
path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar 
1g file://$HOME/data/terasort_in 

Sort the data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort
path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar 
file://$HOME/data/terasort_in file://$HOME/data/terasort_out

Validate the data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraValidate
path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar 
file://$HOME/data/terasort_out file://$HOME/data/terasort_validate

Known issues

Performance

This terasort doesn't use the partitioning scheme that Hadoop's Terasort uses. This results in not very good performance. I could copy the Partitioning code from the Hadoop tree verbatim but I thought it would be more appropriate to rewrite more of it in Scala.

I haven't pulled the DaytonaPartitioner from the record holding sort yet because it's pretty intertwined into the rest of the code and AFAIK it's not really idiomatic Spark.

Functionality on native file systems

TeraValidate can read the file parts in the wrong order on native file systems (e.g. if you run Spark on your laptop, on Lustre, Panasas, etc). HDFS apparently always returns the files in alphanumeric order so most Hadoop users aren't affected. I thought I fixed this in the TeraInputFormat, but I was able to reproduce it since migrating the code from my Spark terasort branch.

Contributing

PRs are very welcome!

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].