Spark for data engineers
Spark for data engineers is a repository that provides readers with an overview, code samples, and examples for tackling Spark.
What is Spark and why does it matter for Data Engineers
Data Analysts, Data Scientists, Business Intelligence analysts and many other roles require data on demand. Fighting with data silos, scattered databases, Excel files, CSV files, JSON files, APIs and different flavours of cloud storage can be tedious, nerve-wracking and time-consuming.
An automated process that follows a defined set of steps and procedures, taking subsets of data, columns from databases and binary files, and merging them together to serve business needs, is and will remain a highly valued capability for many organizations and teams.
Spark is an absolute winner for these tasks and a great choice for adoption.
Data Engineers should have the breadth and capability to cover:
- System architecture
- Programming
- Database design and configuration
- Interface and sensor configuration
In addition, as important as familiarity with the technical tools is, the concepts of data architecture and pipeline design matter even more. The tools are worthless without a solid conceptual understanding of:
- Data models
- Relational and non-relational database design
- Information flow
- Query execution and optimization
- Comparative analysis of data stores
- Logical operations
Apache Spark has all of this technology built in to cover these topics and the capacity to assemble functional systems that achieve a concrete goal.
Apache Spark™ is designed to build faster and more reliable data pipelines. It covers both the low-level and structured APIs and brings tools and packages for streaming data, machine learning, data engineering, building pipelines, and extending the Spark ecosystem.
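A minimal sketch of the structured API in PySpark (the file name and column names below are illustrative and not part of this repository):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; this code runs in the driver process
spark = SparkSession.builder.appName("structured-api-sketch").getOrCreate()

# Read a CSV file (file and column names are placeholders) and aggregate with the structured API
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
daily_totals = (
    orders.groupBy("order_date")
          .agg(F.sum("amount").alias("total_amount"))
)
daily_totals.show()
```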
Spark’s Basic Architecture
Single machines do not have enough power and resources to perform computations on huge amounts of information (or the user may not have time to wait for the computation to finish).
A cluster, or group of machines, pools the resources of many machines together, allowing us to use all of the cumulative resources as if they were one. A group of machines alone is not powerful, though; you need a framework to coordinate work across them. Spark is a tool for just that: managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will leverage to execute tasks is managed by a cluster manager like Spark's Standalone cluster manager, YARN, or Mesos. We then submit Spark Applications to these cluster managers, which grant resources to our application so that we can complete our work.
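As an illustration, an application is handed to a cluster manager with spark-submit. A minimal sketch, assuming a standalone cluster manager (the master URL and script name are placeholders):

```bash
# Submit an application to a standalone cluster manager
# (the master URL and my_pipeline.py are placeholders)
spark-submit \
  --master spark://cluster-host:7077 \
  --deploy-mode client \
  my_pipeline.py
```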
Spark Applications
Spark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or input; and analyzing, distributing, and scheduling work across the executors (defined momentarily). The driver process is absolutely essential - it’s the heart of a Spark Application and maintains all relevant information during the lifetime of the application.
The executors are responsible for actually executing the work that the driver assigns them. This means that each executor is responsible for only two things: executing code assigned to it by the driver and reporting the state of the computation on that executor back to the driver node.
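A minimal PySpark sketch of this division of labour (the application name and numbers are arbitrary): the session is created in the driver, which splits the job into tasks that executors run and report back.

```python
from pyspark.sql import SparkSession

# The SparkSession is created in the driver process, which runs main()
spark = SparkSession.builder.appName("driver-executor-sketch").getOrCreate()
sc = spark.sparkContext

# The driver splits this job into tasks; the executors run them
# and report their results back to the driver
total = sc.parallelize(range(1, 1_000_001), numSlices=8) \
          .map(lambda x: x * x) \
          .sum()
print(total)
```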
Learning Spark for Data Engineers
The data engineer position is slightly different from analytical positions. Instead of mathematics, statistics and advanced analytics skills, learning Spark for data engineers focuses on the following topics:
- Installation and setting up the environment
- Data transformation, data modeling
- Using relational and non-relational data
- Designing pipelines, ETL and data movement (a short sketch follows this list)
- Orchestration and architectural view
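A small ETL sketch of the kind of work these topics cover, assuming PySpark with a PostgreSQL JDBC driver on the classpath; the connection details, table name and output path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Read from a relational source over JDBC
# (URL, table and credentials are placeholders; a JDBC driver must be available)
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/sales")
    .option("dbtable", "public.customers")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Simple transformation step, then persist to a columnar, file-based store
customers.filter("active = true").write.mode("overwrite").parquet("/data/customers_active")
```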
Table of contents / Featured blog posts
- What is Apache Spark (blogpost)
- Installing Apache Spark (blogpost)
- Getting around CLI and WEB UI in Apache Spark (blogpost)
- Spark Architecture – Local and cluster mode (blogpost)
- Setting up Spark Cluster (blogpost)
- Setting up IDE (blogpost)
- Starting Spark with R and Python (blogpost)
- Creating RDD files (blogpost)
- RDD Operations (blogpost)
- Working with data frames (blogpost)
- Working with packages and spark DataFrames (blogpost)
- Spark SQL (blogpost)
- Spark SQL bucketing and partitioning (blogpost)
- Spark SQL query hints and executions (blogpost)
- Introduction to Spark Streaming (blogpost)
- Dataframe operations for Spark streaming (blogpost)
- Watermarking and joins for Spark streaming (blogpost)
- Time windows for Spark streaming (blogpost)
- Data Engineering for Spark Streaming (blogpost)
- Spark GraphX processing (blogpost)
- Spark GraphX operators (blogpost)
- Spark in Azure Databricks (blogpost)
- Delta live tables with Azure Databricks (blogpost)
- Data visualisation with Spark (blogpost)
- Spark literature, documentation, courses and books (blogpost)
Blog
All posts were originally published on my blog and copied here to GitHub. On GitHub it is extremely simple to clone the code, markdown files and all the materials.
Cloning the repository
You can follow the steps below to clone the repository.
git clone https://github.com/tomaztk/Spark-for-data-engineers.git
Contact
Get in contact if you would like to contribute, or simply fork the repository and alter the code.
Contributing
Do the usual GitHub fork and pull request dance. Add yourself (or I will add you to the contributors section) if you want to.
Suggestions
Feel free to suggest any new topics that you would like to be covered.
Github.io
All code is also available at tomaztk.github.io and in this repository.
The book is created using mdBook (with Rust and Cargo).
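A minimal sketch of building the book locally, assuming a Rust toolchain is installed:

```bash
# Install mdBook with Cargo (Rust toolchain required), then build and preview the book
cargo install mdbook
mdbook build
mdbook serve    # serves the book locally at http://localhost:3000
```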
License
MIT © Tomaž Kaštrun