MrPowers / Spark Daria

License: MIT
Essential Spark extensions and helper methods ✨😲


Projects that are alternatives to or similar to Spark Daria

Koalas
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (+450.45%)
Mutual labels:  dataframe, spark
Spark Redis
A connector for Spark that allows reading and writing to/from Redis cluster
Stars: ✭ 773 (+39.78%)
Mutual labels:  dataframe, spark
Datafusion
DataFusion has now been donated to the Apache Arrow project
Stars: ✭ 611 (+10.49%)
Mutual labels:  dataframe, spark
Net.jgp.labs.spark
Apache Spark examples exclusively in Java
Stars: ✭ 55 (-90.05%)
Mutual labels:  dataframe, spark
Geni
A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (-72.51%)
Mutual labels:  dataframe, spark
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-72.88%)
Mutual labels:  dataframe, spark
Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+67.99%)
Mutual labels:  dataframe, spark
Ballista
Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Stars: ✭ 2,274 (+311.21%)
Mutual labels:  dataframe, spark
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-79.93%)
Mutual labels:  spark, dataframe
Moonbox
Moonbox is a DVtaaS (Data Virtualization as a Service) Platform
Stars: ✭ 424 (-23.33%)
Mutual labels:  spark
Spark
Cross-platform real-time collaboration client optimized for business and organizations.
Stars: ✭ 471 (-14.83%)
Mutual labels:  spark
Learningspark
Scala examples for learning to use Spark
Stars: ✭ 421 (-23.87%)
Mutual labels:  spark
Dji Firmware Tools
Tools for handling firmware of DJI products, with a focus on quadcopters.
Stars: ✭ 424 (-23.33%)
Mutual labels:  spark
Pdf
Programming e-books covering C, C#, Docker, Elasticsearch, Git, Hadoop, Head First, Java, JavaScript, JVM, Kafka, Linux, Maven, MongoDB, MyBatis, MySQL, Netty, Nginx, Python, RabbitMQ, Redis, Scala, Solr, Spark, Spring, Spring Boot, Spring Cloud, TCP/IP, Tomcat, ZooKeeper, artificial intelligence, big data, concurrent programming, databases, data mining, interview questions, architecture design, algorithms, computer science, design patterns, software testing, refactoring and optimization, and more categories
Stars: ✭ 12,009 (+2071.61%)
Mutual labels:  spark
Featran
A Scala feature transformation library for data science and machine learning
Stars: ✭ 420 (-24.05%)
Mutual labels:  spark
Cdap
An open source framework for building data analytic applications.
Stars: ✭ 509 (-7.96%)
Mutual labels:  spark
Listenbrainz Server
Server for the ListenBrainz project
Stars: ✭ 420 (-24.05%)
Mutual labels:  spark
Sparkle
Haskell on Apache Spark.
Stars: ✭ 419 (-24.23%)
Mutual labels:  spark
Lopq
Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark.
Stars: ✭ 530 (-4.16%)
Mutual labels:  spark
Magellan
Geo Spatial Data Analytics on Spark
Stars: ✭ 507 (-8.32%)
Mutual labels:  spark

spark-daria

Spark helper methods to maximize developer productivity.

CI: GitHub Build Status

Code quality: Codacy Badge Maintainability


Setup

Fetch the JAR file from Maven.

// Spark 3
libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "1.0.0"

// Spark 2
libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.39.0"

You can find the spark-daria Scala 2.11 versions here and the Scala 2.12 versions here. The legacy versions are here.

Writing Beautiful Spark Code

Reading Beautiful Spark Code is the best way to learn how to build Spark projects and leverage spark-daria.

spark-daria will make you a more productive Spark programmer. Studying the spark-daria codebase will help you understand how to organize Spark codebases.

PySpark

Use quinn to access similar functions in PySpark.

Usage

spark-daria provides different types of functions that will make your life as a Spark developer easier:

  1. Core extensions
  2. Column functions / UDFs
  3. Custom transformations
  4. Helper methods
  5. DataFrame validators

The following overview will give you an idea of the types of functions that are provided by spark-daria, but you'll need to dig into the docs to learn about all the methods.

Core extensions

The core extensions add methods to existing Spark classes that will help you write beautiful code.

The native Spark API forces you to write code like this.

col("is_nice_person").isNull && col("likes_peanut_butter") === false

When you import the spark-daria ColumnExt class, you can write idiomatic Scala code like this:

import com.github.mrpowers.spark.daria.sql.ColumnExt._

col("is_nice_person").isNull && col("likes_peanut_butter").isFalse

This blog post describes how to use the spark-daria createDF() method, which is more flexible than the toDF() and createDataFrame() methods provided by Spark.
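As a sketch of the createDF() usage pattern (the column names and data here are illustrative; assumes a SparkSession named spark and the SparkSessionExt import):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType}
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

// Build a DataFrame with explicit column names, types, and
// nullability in a single call
val sourceDF = spark.createDF(
  List(
    (1, "alice"),
    (2, null)
  ),
  List(
    ("num", IntegerType, true),
    ("name", StringType, true)
  )
)
```

Unlike toDF(), this lets you control each column's type and nullability inline, which is especially handy in tests.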

See the ColumnExt, DataFrameExt, and SparkSessionExt objects for all the core extensions offered by spark-daria.

Column functions

Column functions can be used in addition to the org.apache.spark.sql.functions.

Here is how to remove all whitespace from a string with the native Spark API:

import org.apache.spark.sql.functions._

regexp_replace(col("first_name"), "\\s+", "")

The spark-daria removeAllWhitespace() function lets you express this logic with code that's more readable.

import com.github.mrpowers.spark.daria.sql.functions._

removeAllWhitespace(col("first_name"))

Datetime functions

  • beginningOfWeek
  • endOfWeek
  • beginningOfMonth
  • endOfMonth
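A sketch of how the datetime functions can be applied (assuming they are imported from the same functions object shown above; check the docs for the exact signatures):

```scala
import org.apache.spark.sql.functions.col
import com.github.mrpowers.spark.daria.sql.functions._

// Assuming a DataFrame `df` with an "event_date" DateType column,
// add the first and last day of each row's month
val withBounds = df
  .withColumn("month_start", beginningOfMonth(col("event_date")))
  .withColumn("month_end", endOfMonth(col("event_date")))
```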

Custom transformations

Custom transformations have the following method signature so they can be passed as arguments to the Spark DataFrame#transform() method.

def someCustomTransformation(arg1: String)(df: DataFrame): DataFrame = {
  // code that returns a DataFrame
}
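For instance, a hypothetical withGreeting transformation (the name and column are illustrative, not part of spark-daria) that follows this signature:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// The first parameter list takes configuration and the second takes
// the DataFrame, so the partially applied function composes with
// df.transform()
def withGreeting(greeting: String)(df: DataFrame): DataFrame = {
  df.withColumn("greeting", lit(greeting))
}

// Assuming an existing DataFrame `df`
val greetedDF = df.transform(withGreeting("hello"))
```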

The spark-daria snakeCaseColumns() custom transformation snake_cases all of the column names in a DataFrame.

import com.github.mrpowers.spark.daria.sql.transformations._

val betterDF = df.transform(snakeCaseColumns())

Protip: You'll always want to deal with snake_case column names in Spark - use this function if your column names contain spaces or uppercase letters.

Helper methods

The DataFrame helper methods make it easy to convert DataFrame columns into Arrays or Maps. Here's how to convert a column to an Array.

import com.github.mrpowers.spark.daria.sql.DataFrameHelpers._

val arr = columnToArray[Int](sourceDF, "num")
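The helpers can also build a Map from two columns; a sketch using twoColumnsToMap (the column names are illustrative; verify the method signature in the DataFrameHelpers docs):

```scala
import com.github.mrpowers.spark.daria.sql.DataFrameHelpers._

// Assuming a DataFrame `sourceDF` with "city" (StringType) and
// "population" (IntegerType) columns, collect them into a Map
// keyed by city
val cityPopulations = twoColumnsToMap[String, Int](sourceDF, "city", "population")
```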

DataFrame validators

DataFrame validators check that DataFrames contain certain columns or a specific schema. They throw descriptive error messages if the DataFrame schema is not as expected. DataFrame validators are a great way to make sure your application gives descriptive error messages.

Let's look at a method that makes sure a DataFrame contains the expected columns.

val sourceDF = Seq(
  ("jets", "football"),
  ("nacional", "soccer")
).toDF("team", "sport")

val requiredColNames = Seq("team", "sport", "country", "city")

validatePresenceOfColumns(sourceDF, requiredColNames)

// throws this error message: com.github.mrpowers.spark.daria.sql.MissingDataFrameColumnsException: The [country, city] columns are not included in the DataFrame with the following columns [team, sport]
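There is also a validator for checking a full schema; a sketch using validateSchema, accessed the same way as validatePresenceOfColumns above (the required schema here is illustrative; check the docs for the exact exception type):

```scala
import org.apache.spark.sql.types._

val requiredSchema = StructType(
  List(
    StructField("team", StringType, true),
    StructField("sport", StringType, true)
  )
)

// Throws a descriptive error if sourceDF's schema doesn't
// contain these fields; passes silently for the sourceDF above
validateSchema(sourceDF, requiredSchema)
```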

Documentation

Here is the latest spark-daria documentation.

Studying these docs will make you a better Spark developer!

👭 👬 👫 Contribution Criteria

We are actively looking for contributors to add functionality that fills in the gaps of the Spark source code.

To get started, fork the project and submit a pull request. Please write tests!

After submitting a couple of good pull requests, you'll be added as a contributor to the project.

Publishing

  1. Version bump commit and create GitHub tag

  2. Publish documentation with sbt ghpagesPushSite

  3. Publish JAR

Run sbt to open the SBT console.

At the sbt prompt, run ; + publishSigned; sonatypeBundleRelease to create the JAR files and release them to Maven. These commands are made available by the sbt-sonatype plugin.

When the release command is run, you'll be prompted to enter your GPG passphrase.

The Sonatype credentials should be stored in the ~/.sbt/sonatype_credentials file in this format:

realm=Sonatype Nexus Repository Manager
host=oss.sonatype.org
user=$USERNAME
password=$PASSWORD