Top 95 pyspark open source projects

lineage
Generate beautiful documentation for your data pipelines in markdown format
DataEngineering
This repo contains commands that data engineers use in day-to-day work.
kuwala
Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We have set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data into data sc…
jobAnalytics and search
The JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
check-engine
Data validation library for PySpark 3.0.0
phrase-at-scale
Detect common phrases in large amounts of text using a data-driven approach. Discovered phrases can be of arbitrary size. Can be used with languages other than English.
pyspark-ML-in-Colab
PySpark in Google Colab: A simple machine learning (Linear Regression) model
pyspark-for-data-processing
Code for my presentation: Using PySpark to Process Boat Loads of Data
oshinko-s2i
This is a place to put s2i images and utilities for Spark application builders for OpenShift
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
OSCI
Open Source Contributor Index
learn-by-examples
Real-world Spark pipeline examples
soda-spark
Soda Spark is a PySpark library that helps you test your data in Spark DataFrames
jgit-spark-connector
jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.
pyspark-cassandra
pyspark-cassandra is a Python port of the awesome @datastax Spark Cassandra connector. Compatible with Spark 2.0, 2.1, 2.2, 2.3, and 2.4
spark3D
Spark extension for processing large-scale 3D data sets: Astrophysics, High Energy Physics, Meteorology, …
workshop-spark
Code for Spark workshops with a Docker-based development environment
spark-dgraph-connector
A connector for Apache Spark and PySpark to Dgraph databases.