# Spark-Syntax

This is a public repo documenting all of the "best practices" of writing PySpark code, based on what I have learnt from working with PySpark for 3 years. It will mainly focus on the Spark DataFrames and SQL library.

You can also visit ericxiao251.github.io/spark-syntax/ for an online book version.
## Contributing/Topic Requests

If you notice any improvements to be made (typos, spelling, grammar, etc.), feel free to create a PR and I'll review it 😁; you'll most likely be right.

If you have any topics that I could potentially go over, please create an issue and describe the topic. I'll try my best to address it 😁.
## Acknowledgement

Huge thanks to Levon for turning everything into a GitBook. You can follow his GitHub at https://github.com/tumregels.
## Table of Contents:
### Chapter 1 - Getting Started with Spark:

- 1.1 - Useful Material
- 1.2 - Creating your First DataFrame
- 1.3 - Reading your First Dataset
- 1.4 - More Comfortable with SQL?
### Chapter 2 - Exploring the Spark APIs:

- 2.1.1 - Struct Types (`StructType`)
- 2.1.2 - Arrays and Lists (`ArrayType`)
- 2.1.3 - Maps and Dictionaries (`MapType`)
- 2.1.4 - Decimals and Why did my Decimals overflow :( (`DecimalType`)
- 2.2 - Performing your First Transformations
  - 2.2.1 - Looking at Your Data (`collect`/`head`/`take`/`first`/`toPandas`/`show`)
  - 2.2.2 - Selecting a Subset of Columns (`drop`/`select`)
  - 2.2.3 - Creating New Columns and Transforming Data (`withColumn`/`withColumnRenamed`)
  - 2.2.4 - Constant Values and Column Expressions (`lit`/`col`)
  - 2.2.5 - Casting Columns to a Different Type (`cast`)
  - 2.2.6 - Filtering Data (`where`/`filter`/`isin`)
  - 2.2.7 - Equality Statements in Spark and Comparisons with Nulls (`isNotNull()`/`isNull()`)
  - 2.2.8 - Case Statements (`when`/`otherwise`)
  - 2.2.9 - Filling in Null Values (`fillna`/`coalesce`)
  - 2.2.10 - Spark Functions aren't Enough, I Need my Own! (`udf`/`pandas_udf`)
  - 2.2.11 - Unionizing Multiple Dataframes (`union`)
  - 2.2.12 - Performing Joins (clean one) (`join`)
- 2.3.1 - One to Many Rows (`explode`)
- 2.3.2 - Range Join Conditions (WIP) (`join`)
- (`repartition`)
- (`coalesce`)
- (`cache`)
- (`broadcast`)
### Chapter 3 - Aggregates:

- 3.1 - Clean Aggregations
- 3.2 - Non Deterministic Behaviours
### Chapter 4 - Window Objects:
### Chapter 5 - Error Logs:
### Chapter 6 - Understanding Spark Performance:

- 6.1.1 - Understanding how Spark Works
### Chapter 7 - High Performance Code:

- Broadcast Join
- `BroadcastHashJoin`
- `SortMergeJoin`
- 7.1.1 - Filter Pushdown
- 7.1.2 - Joining on Skewed Data (Null Keys)
- 7.1.3 - Joining on Skewed Data (High Frequency Keys I)
- (`caching`)
- (dynamic allocation)
- 2001 (partitions)
- `UDF`s? (python memory)