Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.

Stars: ✭ 97 (-3.96%)

Mutual labels: spark

Laravel Spark Google2fa

Google Authenticator support for Laravel Spark

Stars: ✭ 86 (-14.85%)

Mutual labels: spark

Udacity Data Engineering

Udacity Data Engineering Nano Degree (DEND)

Stars: ✭ 89 (-11.88%)

Mutual labels: spark

Repository

个人学习知识库涉及到数据仓库建模、实时计算、大数据、Java、算法等。

Stars: ✭ 92 (-8.91%)

Mutual labels: spark

Spark States

Custom state store providers for Apache Spark

Stars: ✭ 83 (-17.82%)

Mutual labels: spark

Bigdata Notebook

Stars: ✭ 100 (-0.99%)

Mutual labels: spark

Spark On Kubernetes Helm

Spark on Kubernetes infrastructure Helm charts repo

Stars: ✭ 92 (-8.91%)

Mutual labels: spark

Schemer

Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

Stars: ✭ 97 (-3.96%)

Mutual labels: spark

Spark python ml examples

Spark 2.0 Python Machine Learning examples

Stars: ✭ 87 (-13.86%)

Mutual labels: spark

Ammonite Spark

Run spark calculations from Ammonite

Stars: ✭ 88 (-12.87%)

Mutual labels: spark

Spark Py Notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Stars: ✭ 1,338 (+1224.75%)

Mutual labels: spark

Cuesheet

A framework for writing Spark 2.x applications in a pretty way

Stars: ✭ 86 (-14.85%)

Mutual labels: spark

Almond

A Scala kernel for Jupyter

Stars: ✭ 1,354 (+1240.59%)

Mutual labels: spark

Hops Examples

Examples for Deep Learning/Feature Store/Spark/Flink/Hive/Kafka jobs and Jupyter notebooks on Hops

Stars: ✭ 84 (-16.83%)

Mutual labels: spark

Spark Summit 2017 Sanfrancisco

spark summit 2017 SanFrancisco

Stars: ✭ 93 (-7.92%)

Mutual labels: spark

Spark Ffm

FFM (Field-Awared Factorization Machine) on Spark

Stars: ✭ 101 (+0%)

Mutual labels: spark

Bigdata Notes

大数据入门指南 ⭐

Stars: ✭ 10,991 (+10782.18%)

Mutual labels: spark

Relation extraction

Relation Extraction using Deep learning(CNN)

Stars: ✭ 96 (-4.95%)

Mutual labels: spark

View All Similar Projects ➔

TeraSort benchmark for Spark

This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch, but it is not the same TeraSort program that currently holds the record. That program is here.

Building

mvn install

The default is to link against Spark 2.4.4 jars (released September 2019). If you plan to run using an older version of Spark (e.g. 1.6) you will have to try -Dspark.version=1.6. If possible, it's probably a better idea to just update to a more recent version or Spark.

Running

cd to your your Spark install.

Generate data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraGen 
path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar 
1g file://$HOME/data/terasort_in

Sort the data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort
path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar 
file://$HOME/data/terasort_in file://$HOME/data/terasort_out

Validate the data

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraValidate
path/to/spark-terasort/target/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar 
file://$HOME/data/terasort_out file://$HOME/data/terasort_validate

Known issues

Performance

This terasort doesn't use the partitioning scheme that Hadoop's Terasort uses. This results in not very good performance. I could copy the Partitioning code from the Hadoop tree verbatim but I thought it would be more appropriate to rewrite more of it in Scala.

I haven't pulled the DaytonaPartitioner from the record holding sort yet because it's pretty intertwined into the rest of the code and AFAIK it's not really idiomatic Spark.

Functionality on native file systems

TeraValidate can read the file parts in the wrong order on native file systems (e.g. if you run Spark on your laptop, on Lustre, Panasas, etc). HDFS apparently always returns the files in alphanumeric order so most Hadoop users aren't affected. I thought I fixed this in the TeraInputFormat, but I was able to reproduce it since migrating the code from my Spark terasort branch.

Contributing

PRs are very welcome!

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 101

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (2) 🔗