MrPowers / Quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

Quinn

PySpark helper methods to maximize developer productivity.

Quinn validates DataFrames, extends core classes, defines DataFrame transformations, and provides SQL functions.

Setup

Quinn is published on PyPI and can be installed with this command:

pip install quinn

PySpark Core Class Extensions

import pyspark.sql.functions as F  # the examples below refer to this module as F
from quinn.extensions import *

Column Extensions

isFalsy()

source_df.withColumn("is_stuff_falsy", F.col("has_stuff").isFalsy())

Returns True if has_stuff is None or False.

isTruthy()

source_df.withColumn("is_stuff_truthy", F.col("has_stuff").isTruthy())

Returns True unless has_stuff is None or False.
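The two predicates are complements of each other. The truth table can be sketched in plain Python, assuming a null column value corresponds to None; the names is_falsy and is_truthy are hypothetical stand-ins for illustration, not Quinn's Column implementation:

```python
def is_falsy(value):
    """True when value is None or False, following the isFalsy() description."""
    return value is None or value is False


def is_truthy(value):
    """Complement of is_falsy(): True unless value is None or False."""
    return not is_falsy(value)


# Print the truth table for the three possible values of a nullable boolean.
for value in (True, False, None):
    print(value, is_falsy(value), is_truthy(value))
```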

isNullOrBlank()

source_df.withColumn("is_blah_null_or_blank", F.col("blah").isNullOrBlank())

Returns True if blah is null or blank (the empty string or a string that only contains whitespace).
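The rule described above can be sketched for a single Python value, assuming null corresponds to None. The name is_null_or_blank is a hypothetical stand-in, and this is an illustration of the rule rather than Quinn's Column code:

```python
def is_null_or_blank(value):
    """True for None, the empty string, or a whitespace-only string."""
    # str.strip() removes leading and trailing whitespace, so a
    # whitespace-only string collapses to the empty string.
    return value is None or value.strip() == ""


for value in (None, "", "   ", " blah "):
    print(repr(value), is_null_or_blank(value))
```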

isNotIn()

source_df.withColumn("is_not_bobs_hobby", F.col("fun_thing").isNotIn(bobs_hobbies))

Returns True if fun_thing is not included in the bobs_hobbies list.

nullBetween()

source_df.withColumn("is_between", F.col("age").nullBetween(F.col("lower_age"), F.col("upper_age")))

Returns True if age is between lower_age and upper_age. If lower_age is populated and upper_age is null, it returns True if age is greater than or equal to lower_age. If lower_age is null and upper_age is populated, it returns True if age is less than or equal to upper_age.
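The null-handling rules are easier to check as a small decision function. This plain-Python sketch assumes null corresponds to None and that the bounds are inclusive. The description does not say what happens when age is null or when both bounds are null, so returning False in those cases is an assumption made only for this sketch. The name null_between is hypothetical.

```python
def null_between(value, lower, upper):
    """Inclusive range check that treats a missing bound as unbounded."""
    if value is None or (lower is None and upper is None):
        # Assumption: not covered by the description; treated as False here.
        return False
    if upper is None:
        return value >= lower
    if lower is None:
        return value <= upper
    return lower <= value <= upper


print(null_between(25, 20, 30))      # both bounds populated
print(null_between(25, 20, None))    # only the lower bound populated
print(null_between(25, None, 20))    # only the upper bound populated
```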

SparkSession Extensions

create_df()

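from pyspark.sql.types import StringType

# Rows first, then one (name, type, nullable) tuple per column.
spark.create_df(
    [("jose", "a"), ("li", "b"), ("sam", "c")],
    [("name", StringType(), True), ("blah", StringType(), True)]
)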

Creates a DataFrame with a syntax that's less verbose than the built-in createDataFrame method.

DataFrame Extensions

transform()

source_df\
    .transform(lambda df: with_greeting(df))\
    .transform(lambda df: with_something(df, "crazy"))

Allows multiple DataFrame transformations to be chained, so that each one runs on the result of the previous one.
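The chaining idea can be sketched without Spark. TinyFrame below is a hypothetical stand-in for a DataFrame (a list of row dictionaries), and the bodies of with_greeting and with_something are invented for illustration, since the README does not show them:

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame: a list of row dictionaries."""

    def __init__(self, rows):
        self.rows = rows

    def transform(self, func):
        # Pass this frame to func and return its result, which is
        # what makes successive transform() calls chainable.
        return func(self)


def with_greeting(df):
    # Illustrative transformation: add a constant "greeting" column.
    return TinyFrame([{**row, "greeting": "hi"} for row in df.rows])


def with_something(df, something):
    # Illustrative transformation that takes an extra argument.
    return TinyFrame([{**row, "something": something} for row in df.rows])


source_df = TinyFrame([{"name": "jose"}, {"name": "li"}])
actual_df = (
    source_df
    .transform(with_greeting)
    .transform(lambda df: with_something(df, "crazy"))
)
print(actual_df.rows)
```

A function that takes only the DataFrame can be passed directly; the lambda is only needed when the transformation takes extra arguments.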

Quinn Helper Functions

import quinn
from pyspark.sql.functions import col  # used by the examples below

DataFrame Validations

validate_presence_of_columns()

quinn.validate_presence_of_columns(source_df, ["name", "age", "fun"])

Raises an exception unless source_df contains the name, age, and fun columns.

validate_schema()

quinn.validate_schema(source_df, required_schema)

Raises an exception unless source_df contains all the StructFields defined in the required_schema.

validate_absence_of_columns()

quinn.validate_absence_of_columns(source_df, ["age", "cool"])

Raises an exception if source_df contains the age or cool column.
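The presence and absence checks come down to comparing lists of column names. The sketch below works on a plain list of names (in PySpark that list would come from source_df.columns). The exception class ColumnValidationError and the error messages are hypothetical; the README does not say which exception Quinn raises.

```python
class ColumnValidationError(ValueError):
    """Hypothetical exception type used only in this sketch."""


def validate_presence_of_columns(columns, required):
    """Raise unless every required column name is present."""
    missing = [name for name in required if name not in columns]
    if missing:
        raise ColumnValidationError(f"missing columns: {missing}")


def validate_absence_of_columns(columns, prohibited):
    """Raise if any prohibited column name is present."""
    present = [name for name in prohibited if name in columns]
    if present:
        raise ColumnValidationError(f"prohibited columns present: {present}")


columns = ["name", "age", "fun"]
validate_presence_of_columns(columns, ["name", "age", "fun"])  # passes silently
validate_absence_of_columns(columns, ["cool"])                 # passes silently
```

Failing early with the list of offending columns tends to give a clearer error than a failure deep inside a later transformation.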

Functions

single_space()

actual_df = source_df.withColumn(
    "words_single_spaced",
    quinn.single_space(col("words"))
)

Replaces runs of multiple spaces with a single space (e.g. changes "this  has   some" to "this has some").
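The same string operation on a single Python value, as a rough sketch. The regular expression is an assumption chosen to match the description and is not taken from Quinn's source:

```python
import re


def single_space(text):
    """Collapse every run of two or more spaces into one space."""
    return re.sub(r" {2,}", " ", text)


print(single_space("this  has   some"))
```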

remove_all_whitespace()

actual_df = source_df.withColumn(
    "words_without_whitespace",
    quinn.remove_all_whitespace(col("words"))
)

Removes all whitespace in a string (e.g. changes "this has some" to "thishassome").
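A plain-Python sketch of the same operation on one string. The pattern is an assumption that matches the description (any whitespace character, including tabs and newlines), not necessarily the one Quinn uses:

```python
import re


def remove_all_whitespace(text):
    """Delete every whitespace character from text."""
    return re.sub(r"\s+", "", text)


print(remove_all_whitespace("this has some"))
```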

anti_trim()

actual_df = source_df.withColumn(
    "words_anti_trimmed",
    quinn.anti_trim(col("words"))
)

Removes all inner whitespace, but doesn't delete leading or trailing whitespace (e.g. changes " this has some " to " thishassome ").
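One way to get this behavior is to remove only whitespace that has a word character on both sides. The sketch below uses word boundaries for that; it is an illustration on a single Python string under that assumption, not a copy of Quinn's implementation:

```python
import re


def anti_trim(text):
    """Remove whitespace between words, keeping leading and trailing whitespace."""
    # \b only matches next to a word character, so whitespace at the very
    # start or end of the string has no boundary on its outer side and stays.
    return re.sub(r"\b\s+\b", "", text)


print(repr(anti_trim(" this has some ")))
```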

remove_non_word_characters()

actual_df = source_df.withColumn(
    "words_without_nonword_chars",
    quinn.remove_non_word_characters(col("words"))
)

Removes all non-word characters from a string (e.g. changes "si%$#@!#mpsons" to "simpsons").
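A plain-Python sketch of the idea on one string. It assumes that "non-word" means anything other than letters, digits, and underscores, and that whitespace is kept; both points are assumptions for illustration:

```python
import re


def remove_non_word_characters(text):
    """Drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


print(remove_non_word_characters("si%$#@!#mpsons"))
```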


exists()

source_df.withColumn(
    "any_num_greater_than_5",
    quinn.exists(lambda n: n > 5)(col("nums"))
)

nums contains lists of numbers and exists() returns True if any of the numbers in the list are greater than 5. It's similar to the Python any function.

forall()

source_df.withColumn(
    "all_nums_greater_than_3",
    quinn.forall(lambda n: n > 3)(col("nums"))
)

nums contains lists of numbers and forall() returns True if all of the numbers in the list are greater than 3. It's similar to the Python all function.

multi_equals()

source_df.withColumn(
    "are_s1_and_s2_cat",
    quinn.multi_equals("cat")(col("s1"), col("s2"))
)

multi_equals() returns True if s1 and s2 are both equal to "cat".
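exists(), forall(), and multi_equals() share a two-step call shape: the first call fixes the predicate or value, and the second call applies it to the data. A plain-Python sketch of that shape, operating on ordinary lists and values instead of columns (these are illustrative stand-ins with the same names, not Quinn's code):

```python
def exists(predicate):
    """Return a function that is True if any list element satisfies predicate."""
    def apply(values):
        return any(predicate(v) for v in values)
    return apply


def forall(predicate):
    """Return a function that is True if every list element satisfies predicate."""
    def apply(values):
        return all(predicate(v) for v in values)
    return apply


def multi_equals(expected):
    """Return a function that is True if every argument equals expected."""
    def apply(*values):
        return all(v == expected for v in values)
    return apply


print(exists(lambda n: n > 5)([1, 3, 7]))
print(forall(lambda n: n > 3)([4, 5, 2]))
print(multi_equals("cat")("cat", "cat"))
```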

Transformations

snake_case_col_names()

quinn.snake_case_col_names(source_df)

Converts all the column names in a DataFrame to snake_case. It's annoying to write SQL queries when columns aren't snake cased.

sort_columns()

quinn.sort_columns(source_df, "asc")

Sorts the DataFrame columns in alphabetical order.  Wide DataFrames are easier to navigate when they're sorted alphabetically.
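Both transformations only touch column names, so they can be sketched on a plain list of names. The snake-casing rule below (trim, lowercase, spaces to underscores) and the "desc" option are assumptions for illustration; the README does not spell out Quinn's exact rules.

```python
def snake_case(name):
    """Assumed rule: trim, lowercase, and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


def snake_case_col_names(columns):
    """Apply snake_case() to every column name."""
    return [snake_case(name) for name in columns]


def sort_columns(columns, sort_order):
    """Return the column names in alphabetical order ("asc") or reversed ("desc")."""
    if sort_order not in ("asc", "desc"):
        raise ValueError(f"sort_order must be 'asc' or 'desc', got {sort_order!r}")
    return sorted(columns, reverse=(sort_order == "desc"))


print(snake_case_col_names(["First Name", "Favorite Color"]))
print(sort_columns(["name", "age", "fun"], "asc"))
```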

DataFrame Helpers

column_to_list()

quinn.column_to_list(source_df, "name")

Converts a column in a DataFrame to a list of values.

two_columns_to_dictionary()

quinn.two_columns_to_dictionary(source_df, "name", "age")

Converts two columns of a DataFrame into a dictionary. In this example, name is the key and age is the value.

to_list_of_dictionaries()

quinn.to_list_of_dictionaries(source_df)

Converts an entire DataFrame into a list of dictionaries.
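The shapes these three helpers return can be sketched with ordinary Python data. Here a DataFrame is modeled as a list of column names plus a list of row tuples; the sample rows are made up, and the functions are stand-ins that take this simplified representation rather than a real DataFrame:

```python
columns = ["name", "age"]
rows = [("jose", 30), ("li", 25)]  # made-up sample data


def column_to_list(columns, rows, col_name):
    """Return the values of one column as a list."""
    index = columns.index(col_name)
    return [row[index] for row in rows]


def two_columns_to_dictionary(columns, rows, key_col, value_col):
    """Return a dictionary that maps key_col values to value_col values."""
    key_index = columns.index(key_col)
    value_index = columns.index(value_col)
    return {row[key_index]: row[value_index] for row in rows}


def to_list_of_dictionaries(columns, rows):
    """Return one dictionary per row, keyed by column name."""
    return [dict(zip(columns, row)) for row in rows]


print(column_to_list(columns, rows, "name"))
print(two_columns_to_dictionary(columns, rows, "name", "age"))
print(to_list_of_dictionaries(columns, rows))
```

Because each result is an ordinary Python object, the data has to fit in the memory of the process that receives it, so these helpers suit small DataFrames.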

Contributing
We are actively looking for feature requests, pull requests, and bug fixes.
Any developer who demonstrates excellence will be invited to be a maintainer of the project.
