lineage: Generate beautiful documentation for your data pipelines in markdown format
DataEngineering: This repo contains commands that data engineers use in day-to-day work.
kuwala: Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, and Great Expectations, together in one intuitive interface built with React Flow. In addition we provide third-party data into data sc…
sparklanes: A lightweight data processing framework for Apache Spark
jobAnalytics and search: The JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
SynapseML: Simple and Distributed Machine Learning
jupyterlab-sparkmonitor: JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
phrase-at-scale: Detect common phrases in large amounts of text using a data-driven approach. Discovered phrases can be of arbitrary size. Can be used with languages other than English.
ceja: PySpark phonetic and string matching algorithms
pyspark-ML-in-Colab: PySpark in Google Colab: a simple machine learning (Linear Regression) model
Sparkora: Powerful, rapid, automatic EDA and feature engineering library with a very easy-to-use API 🌟
oshinko-s2i: This is a place to put s2i images and utilities for Spark application builders for OpenShift
datalake-etl-pipeline: Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
anovos: Anovos is an open source library for scalable feature engineering using Apache Spark
OSCI: Open Source Contributor Index
pyspark-algorithms: PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
soda-spark: Soda Spark is a PySpark library that helps you test your data in Spark DataFrames
jgit-spark-connector: jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.
isarn-sketches-spark: Routines and data structures for using isarn-sketches idiomatically in Apache Spark
pyspark-cassandra: pyspark-cassandra is a Python port of the awesome @datastax Spark Cassandra connector. Compatible with Spark 2.0, 2.1, 2.2, 2.3 and 2.4
spark3D: Spark extension for processing large-scale 3D data sets: Astrophysics, High Energy Physics, Meteorology, …
optimus: 🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
workshop-spark: Code for Spark workshops with a Docker-based development environment