
tomaztk / Spark-for-data-engineers

Licence: MIT
Apache Spark for data engineers

Programming Languages

  Jupyter Notebook (11,667 projects)
  R (7,636 projects)
  Python (139,335 projects; #7 most used programming language)

Projects that are alternatives of or similar to Spark-for-data-engineers

SynapseML
Simple and Distributed Machine Learning
Stars: ✭ 3,355 (+15150%)
Mutual labels:  apache-spark, pyspark
Quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
Stars: ✭ 217 (+886.36%)
Mutual labels:  apache-spark, pyspark
Pyspark Stubs
Apache (Py)Spark type annotations (stub files).
Stars: ✭ 98 (+345.45%)
Mutual labels:  apache-spark, pyspark
Pyspark Boilerplate
A boilerplate for writing PySpark Jobs
Stars: ✭ 318 (+1345.45%)
Mutual labels:  apache-spark, pyspark
spark-twitter-sentiment-analysis
Sentiment Analysis of a Twitter Topic with Spark Structured Streaming
Stars: ✭ 55 (+150%)
Mutual labels:  apache-spark, pyspark
Live log analyzer spark
Spark application for analysis of Apache access logs and anomaly detection, along with a Medium article.
Stars: ✭ 14 (-36.36%)
Mutual labels:  apache-spark, pyspark
Azure Cosmosdb Spark
Apache Spark Connector for Azure Cosmos DB
Stars: ✭ 165 (+650%)
Mutual labels:  apache-spark, pyspark
pyspark-asyncactions
Asynchronous actions for PySpark
Stars: ✭ 30 (+36.36%)
Mutual labels:  apache-spark, pyspark
learn-by-examples
Real-world Spark pipelines examples
Stars: ✭ 84 (+281.82%)
Mutual labels:  apache-spark, pyspark
isarn-sketches-spark
Routines and data structures for using isarn-sketches idiomatically in Apache Spark
Stars: ✭ 28 (+27.27%)
Mutual labels:  apache-spark, pyspark
Spark Gotchas
Spark Gotchas. A subjective compilation of the Apache Spark tips and tricks
Stars: ✭ 308 (+1300%)
Mutual labels:  apache-spark, pyspark
Sparkora
Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟
Stars: ✭ 51 (+131.82%)
Mutual labels:  apache-spark, pyspark
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+13077.27%)
Mutual labels:  apache-spark, pyspark
Awesome Spark
A curated list of awesome Apache Spark packages and resources.
Stars: ✭ 1,061 (+4722.73%)
Mutual labels:  apache-spark, pyspark
mmtf-workshop-2018
Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (+127.27%)
Mutual labels:  apache-spark, pyspark
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+581.82%)
Mutual labels:  apache-spark, pyspark
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (+422.73%)
Mutual labels:  apache-spark, pyspark
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+404.55%)
Mutual labels:  apache-spark, pyspark
spark3D
Spark extension for processing large-scale 3D data sets: Astrophysics, High Energy Physics, Meteorology, …
Stars: ✭ 23 (+4.55%)
Mutual labels:  apache-spark, pyspark
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+77.27%)
Mutual labels:  apache-spark, pyspark

Spark for data engineers

Spark for data engineers is a repository that provides readers with an overview, code samples, and examples for tackling Spark more effectively.

What is Spark and why does it matter for Data Engineers

Data analysts, data scientists, business intelligence analysts, and many other roles require data on demand. Fighting with data silos, scattered databases, Excel files, CSV files, JSON files, APIs, and potentially different flavours of cloud storage can be tedious, nerve-wracking, and time-consuming.

An automated process that follows a set of steps and procedures, taking subsets of data, columns from databases, and binary files, and merging them together to serve business needs, is and will remain a sought-after capability for many organizations and teams.

Spark is an absolute winner for these tasks and a great choice for adoption.

Data engineers should have the breadth and capability to handle:

  1. System architecture
  2. Programming
  3. Database design and configuration
  4. Interface and sensor configuration

In addition, as important as familiarity with the technical tools is, the concepts of data architecture and pipeline design are even more important. The tools are worthless without a solid conceptual understanding of:

  1. Data models
  2. Relational and non-relational database design
  3. Information flow
  4. Query execution and optimization (see the sketch after this list)
  5. Comparative analysis of data stores
  6. Logical operations
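
For item 4, query execution and optimization, Spark makes its internals visible through explain(). A minimal sketch that needs no external data: it builds a synthetic DataFrame with spark.range, filters and aggregates it, and asks Spark to print the plans the Catalyst optimizer produced.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("explain-demo").getOrCreate()

    # A synthetic DataFrame: a single column "id" with one million rows.
    df = spark.range(1000000)

    # Filter and aggregate; Catalyst rewrites the logical plan before
    # choosing a physical plan.
    result = (
        df.filter(df.id % 2 == 0)
          .groupBy((df.id % 10).alias("bucket"))
          .count()
    )

    # Prints the parsed, analyzed, and optimized logical plans plus the
    # physical plan.
    result.explain(True)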

Apache Spark has all of this technology built in to cover these topics, and it has the capacity to assemble functional systems that achieve concrete goals.

Apache Spark™ is designed to build faster and more reliable data pipelines. It covers both the low-level and the structured APIs, and it brings tools and packages for streaming data, machine learning, data engineering, building pipelines, and extending the Spark ecosystem.
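
As a small taste of the structured API, the sketch below reads a CSV file into a DataFrame and aggregates it declaratively. The file name events.csv and its timestamp column are hypothetical placeholders, not part of this repository.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("structured-api-demo").getOrCreate()

    # Read a CSV file into a DataFrame (structured API); the file and its
    # schema are assumed for illustration only.
    events = spark.read.csv("events.csv", header=True, inferSchema=True)

    # Declarative transformations: Spark optimizes the whole chain before
    # any data is actually read.
    daily_counts = (
        events
        .withColumn("day", F.to_date("timestamp"))
        .groupBy("day")
        .count()
        .orderBy("day")
    )

    daily_counts.show()

Nothing runs until show() is called; transformations are lazy, which is exactly what lets Spark optimize the pipeline as a whole.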

Spark’s Basic Architecture

Single machines do not have enough power and resources to perform computations on huge amounts of information (or the user may not have time to wait for the computation to finish).

A cluster, or group of machines, pools the resources of many machines together, allowing us to use all of the cumulative resources as if they were one. A group of machines alone is not powerful, however; you need a framework to coordinate work across them. Spark is a tool for just that: managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will leverage to execute tasks is managed by a cluster manager such as Spark's standalone cluster manager, YARN, or Mesos. We then submit Spark Applications to these cluster managers, which grant resources to our application so that we can complete our work.
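
In code, the cluster manager is selected through the master URL when the session is built. A minimal sketch; the host name master-host and the resource values are placeholders for your own setup.

    from pyspark.sql import SparkSession

    # The master URL tells Spark which cluster manager will grant resources
    # to the application; "master-host" below is a placeholder.
    spark = (
        SparkSession.builder
        .appName("cluster-demo")
        .master("spark://master-host:7077")     # Spark standalone cluster manager
        # .master("yarn")                       # Hadoop YARN instead
        # .master("local[4]")                   # no cluster: 4 threads on one machine
        .config("spark.executor.memory", "2g")  # resources requested per executor
        .config("spark.executor.cores", "2")
        .getOrCreate()
    )

In production the same choice is usually made on the command line via spark-submit's --master flag rather than hard-coded in the application.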

Spark Applications

Spark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or input; and analyzing, distributing, and scheduling work across the executors (defined momentarily). The driver process is absolutely essential - it’s the heart of a Spark Application and maintains all relevant information during the lifetime of the application.

The executors are responsible for actually executing the work that the driver assigns to them. This means each executor is responsible for only two things: executing code assigned to it by the driver, and reporting the state of the computation on that executor back to the driver node.
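
The division of labour is easy to observe. In the sketch below, the driver defines the job and collects the results, while the lambda runs inside executor processes; on a single-machine local setup all tasks will report the same host name.

    import socket
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("driver-executor-demo").getOrCreate()
    sc = spark.sparkContext

    # The driver defines the work: 1,000 numbers split into 8 partitions,
    # i.e. 8 tasks to hand out to executors.
    rdd = sc.parallelize(range(1000), 8)

    # Each executor runs this function on its partitions and sends the
    # result back to the driver.
    hosts = rdd.mapPartitions(lambda part: [socket.gethostname()]).distinct().collect()
    total = rdd.sum()

    print("Tasks ran on:", hosts)   # printed on the driver
    print("Total:", total)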

Learning Spark for Data Engineers

The data engineer position is slightly different from analytical positions. Instead of mathematics, statistics, and advanced analytics skills, learning Spark for data engineers will focus on these topics:

  1. Installation and setting up the environment
  2. Data transformation, data modeling
  3. Using relational and non-relational data
  4. Designing pipelines, ETL and data movement (see the sketch after this list)
  5. Orchestration and architectural view
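
For item 4, here is a compact ETL sketch in PySpark. All file names, paths, and column names (orders.json, customers.csv, customer_id, order_date) are hypothetical placeholders used only to show the shape of such a pipeline.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("mini-etl").getOrCreate()

    # Extract: the source files and their schemas are assumed for illustration.
    orders = spark.read.json("raw/orders.json")
    customers = spark.read.csv("raw/customers.csv", header=True, inferSchema=True)

    # Transform: drop incomplete rows, join the two sources, derive a column.
    enriched = (
        orders
        .dropna(subset=["customer_id"])
        .join(customers, on="customer_id", how="left")
        .withColumn("order_month", F.date_format("order_date", "yyyy-MM"))
    )

    # Load: write partitioned Parquet, a common data lake layout.
    enriched.write.mode("overwrite").partitionBy("order_month").parquet("curated/orders")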

Table of contents / Featured blog posts

  1. What is Apache Spark (blogpost)
  2. Installing Apache Spark (blogpost)
  3. Getting around CLI and WEB UI in Apache Spark (blogpost)
  4. Spark Architecture – Local and cluster mode (blogpost)
  5. Setting up Spark Cluster (blogpost)
  6. Setting up IDE (blogpost)
  7. Starting Spark with R and Python (blogpost)
  8. Creating RDD files (blogpost)
  9. RDD Operations (blogpost)
  10. Working with data frames (blogpost)
  11. Working with packages and spark DataFrames (blogpost)
  12. Spark SQL (blogpost)
  13. Spark SQL bucketing and partitioning (blogpost)
  14. Spark SQL query hints and executions (blogpost)
  15. Introduction to Spark Streaming (blogpost)
  16. Dataframe operations for Spark streaming (blogpost)
  17. Watermarking and joins for Spark streaming (blogpost)
  18. Time windows for Spark streaming (blogpost)
  19. Data Engineering for Spark Streaming (blogpost)
  20. Spark GraphX processing (blogpost)
  21. Spark GraphX operators (blogpost)
  22. Spark in Azure Databricks (blogpost)
  23. Delta live tables with Azure Databricks (blogpost)
  24. Data visualisation with Spark (blogpost)
  25. Spark literature, documentation, courses and books (blogpost)

Blog

All posts were originally published on my blog and copied here to GitHub. On GitHub it is extremely simple to clone the code, the markdown files, and all the materials.

Cloning the repository

You can follow the steps below to clone the repository.

git clone https://github.com/tomaztk/Spark-for-data-engineers.git

Contact

Get in contact if you would like to contribute, or simply fork the repository and alter the code.

Contributing

Do the usual GitHub fork and pull request dance. Add yourself to the contributors section (or I will add you) if you want to.

Suggestions

Feel free to suggest any new topics that you would like to be covered.

Github.io

All code is also available at tomaztk.github.io and in this repository.

The book is created using mdBook (with Rust and Cargo).

License

MIT © Tomaž Kaštrun
