jgperrin / Net.jgp.labs.spark

Licence: apache-2.0
Apache Spark examples exclusively in Java

Programming Languages

Java

Projects that are alternatives of or similar to Net.jgp.labs.spark

Spark Redis
A connector for Spark that allows reading and writing to/from Redis cluster
Stars: ✭ 773 (+1305.45%)
Mutual labels:  dataframe, spark
Geni
A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (+176.36%)
Mutual labels:  dataframe, spark
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+172.73%)
Mutual labels:  dataframe, spark
Koalas
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (+5434.55%)
Mutual labels:  dataframe, spark
Spark Daria
Essential Spark extensions and helper methods ✨😲
Stars: ✭ 553 (+905.45%)
Mutual labels:  dataframe, spark
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+101.82%)
Mutual labels:  spark, dataframe
Ballista
Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Stars: ✭ 2,274 (+4034.55%)
Mutual labels:  dataframe, spark
Datafusion
DataFusion has now been donated to the Apache Arrow project
Stars: ✭ 611 (+1010.91%)
Mutual labels:  dataframe, spark
Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+1589.09%)
Mutual labels:  dataframe, spark
Snappydata
Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
Stars: ✭ 995 (+1709.09%)
Mutual labels:  spark
Spark As Service Using Embedded Server
This application provides Spark 2.1 as a service, using an embedded, Reactive-Streams-based, fully asynchronous HTTP server
Stars: ✭ 46 (-16.36%)
Mutual labels:  spark
Real Time Stream Processing Engine
This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.
Stars: ✭ 37 (-32.73%)
Mutual labels:  spark
Data Ingestion Platform
Stars: ✭ 39 (-29.09%)
Mutual labels:  spark
Awesome Recommendation Engine
The purpose of this tiny project is to put together the know-how learned from the big data expert course at formacionhadoop.com. The idea is to show how to play with Apache Spark Streaming, Kafka, MongoDB, and Spark machine learning algorithms.
Stars: ✭ 47 (-14.55%)
Mutual labels:  spark
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (+1692.73%)
Mutual labels:  spark
Spark Submit Ui
A Play Framework-based UI for submitting Spark applications
Stars: ✭ 53 (-3.64%)
Mutual labels:  spark
Weblogsanalysissystem
A big data platform for analyzing web access logs
Stars: ✭ 37 (-32.73%)
Mutual labels:  spark
Learning Spark
Learn Spark from scratch; big data learning
Stars: ✭ 37 (-32.73%)
Mutual labels:  spark
Docker Hadoop
A Docker container with a full Hadoop cluster setup with Spark and Zeppelin
Stars: ✭ 54 (-1.82%)
Mutual labels:  spark
Play Spark Scala
Stars: ✭ 51 (-7.27%)
Mutual labels:  spark

Some Java examples for Apache Spark

Welcome to this project, which I started several years ago with a simple idea: let's use Spark with Java and not have to learn all that complex stuff like Hadoop or Scala first. I am not that smart anyway...

This project has evolved into a book, creatively named "Spark in Action, 2nd edition, with Java", published by Manning. If you want to know more, and be guided through your Java and Spark learning process, I can only recommend reading the book. Find out more about Spark in Action, 2nd edition, on the Manning website. The book contains more examples and more explanation, and it is professionally written and edited. The book also covers Spark with Python (PySpark) and Scala.

Here are the repos with the book examples:

Chapter 1 So, what is Spark, anyway? An introduction to Spark with a simple ingestion example.

Chapter 2 Architecture and flows: mental model around Spark, and exporting data to PostgreSQL from Spark.

Chapter 3 The majestic role of the dataframe.

Chapter 4 Fundamentally lazy.

Chapters 5 and 6 Building a simple app for deployment and Deploying your simple app.

Chapter 7 Ingestion from files.

Chapter 8 Ingestion from databases.

Chapter 9 Advanced ingestion: finding data sources & building your own.

Chapter 10 Ingestion through structured streaming.

Chapter 11 Working with Spark SQL.

Chapter 12 Transforming your data.

Chapter 13 Transforming entire documents.

Chapter 14 Extending transformations with user-defined functions (UDFs).

Chapter 15 Aggregating your data.

Chapter 16 Cache and checkpoint: enhancing Spark’s performances.

Chapter 17 Exporting data & building full data pipelines.

In the meantime, this project is still live, with more raw, lab-level examples that may (or may not) work.
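
To give a flavor of what these labs look like, here is a minimal sketch of a CSV ingestion in Java, in the spirit of the simplest ingestion examples. The file name, its columns, and the class name are hypothetical; the session runs against a local master.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvIngestionSketch {

  public static void main(String[] args) {
    // Local session, matching the Java 8 / Spark v3.0.0 environment used by the labs
    SparkSession spark = SparkSession.builder()
        .appName("CSV ingestion sketch")
        .master("local[*]")
        .getOrCreate();

    // Hypothetical input file; the real labs ship their own sample data
    Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("data/books.csv");

    df.show(5);
    df.printSchema();

    spark.stop();
  }
}
```

You can run such a class directly from your IDE or via spark-submit; no cluster is required with a local master.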

Environment

These labs rely on:

  • Apache Spark v3.0.0 (based on Scala v2.12)
  • Java 8

Notes on Branches

The master branch will always contain the latest version of Spark, currently v3.0.0.

Labs

A few labs around Apache Spark, exclusively in Java.

The examples are now organized in subpackages:

  • l000_ingestion: Data ingestion from various sources.
  • l020_streaming: Data ingestion via streaming. Special note on Streaming.
  • l050_connection: Connect to Spark.
  • l100_checkpoint: Checkpoint introduced in Spark v2.1.0.
  • l150_udf: UDF (user-defined functions); a minimal sketch appears after this list.
  • l200_join: join examples.
  • l250_map: map (in the context of mapping, not always linked to map/reduce).
  • l300_reduce: reduce.
  • l400_industry_formats: working with industry formats, limited, for now, to HL7 and FHIR.
  • l500_misc: other examples.
  • l600_ml: ML (Machine Learning).
  • l700_save: saving your results.
  • l800_concurrency: labs around concurrent access; work in progress.
  • l900_analytics: More complex examples of using Spark for Analytics.
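
As an illustration of the kind of code you will find under l150_udf, here is a minimal, self-contained sketch of registering and calling a user-defined function. The UDF name, the in-memory data, and the class name are made up for this example.

```java
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class UdfSketch {

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("UDF sketch")
        .master("local[*]")
        .getOrCreate();

    // Hypothetical in-memory data, just to have something to transform
    Dataset<Row> df = spark
        .createDataset(Arrays.asList("Chapel Hill", "Paris", "Asheville"), Encoders.STRING())
        .toDF("city");

    // Register a UDF that returns the length of a string
    spark.udf().register(
        "cityLength",
        (UDF1<String, Integer>) String::length,
        DataTypes.IntegerType);

    // Call the registered UDF on the "city" column
    df.withColumn("length", callUDF("cityLength", col("city"))).show();

    spark.stop();
  }
}
```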

If you would like to see more labs, send your request to jgp at jgp dot net or @jgperrin on Twitter.
