
ksbg / sparklanes

License: MIT
A lightweight data processing framework for Apache Spark

Programming Languages

python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects

Projects that are alternatives of or similar to sparklanes

basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (+47.06%)
Mutual labels:  pipeline, etl, pyspark
lineage
Generate beautiful documentation for your data pipelines in markdown format
Stars: ✭ 16 (-5.88%)
Mutual labels:  pipeline, etl, pyspark
Mara Pipelines
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
Stars: ✭ 1,841 (+10729.41%)
Mutual labels:  pipeline, etl
Metl
mito ETL tool
Stars: ✭ 153 (+800%)
Mutual labels:  pipeline, etl
Morphl Community Edition
MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Stars: ✭ 253 (+1388.24%)
Mutual labels:  pipeline, pyspark
Stetl
Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
Stars: ✭ 64 (+276.47%)
Mutual labels:  pipeline, etl
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+364.71%)
Mutual labels:  pipeline, etl
Bulk Writer
Provides guidance for fast ETL jobs, an IDataReader implementation for SqlBulkCopy (or the MySql or Oracle equivalents) that wraps an IEnumerable, and libraries for mapping entities to table columns.
Stars: ✭ 210 (+1135.29%)
Mutual labels:  pipeline, etl
machine-learning-data-pipeline
Pipeline module for parallel real-time data processing for machine learning models development and production purposes.
Stars: ✭ 22 (+29.41%)
Mutual labels:  data-preprocessing, data-processing
naas
⚙️ Schedule notebooks, run them like APIs, securely expose your assets: Jupyter as a viable ⚡️ production environment
Stars: ✭ 219 (+1188.24%)
Mutual labels:  pipeline, etl
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+129.41%)
Mutual labels:  etl, pyspark
Phila Airflow
Stars: ✭ 16 (-5.88%)
Mutual labels:  pipeline, etl
Go Streams
A lightweight stream processing library for Go
Stars: ✭ 615 (+3517.65%)
Mutual labels:  pipeline, etl
Forte
Forte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 89 (+423.53%)
Mutual labels:  pipeline, data-processing
Datavec
ETL Library for Machine Learning - data pipelines, data munging and wrangling
Stars: ✭ 272 (+1500%)
Mutual labels:  pipeline, etl
Airbyte
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+28835.29%)
Mutual labels:  pipeline, etl
SeqTools
A python library to manipulate and transform indexable data (lists, arrays, ...)
Stars: ✭ 42 (+147.06%)
Mutual labels:  pipeline, preprocessing
skippa
SciKIt-learn Pipeline in PAndas
Stars: ✭ 33 (+94.12%)
Mutual labels:  pipeline, preprocessing
dropEst
Pipeline for initial analysis of droplet-based single-cell RNA-seq data
Stars: ✭ 71 (+317.65%)
Mutual labels:  pipeline, preprocessing
etl
[READ-ONLY] PHP - ETL (Extract Transform Load) data processing library
Stars: ✭ 279 (+1541.18%)
Mutual labels:  etl, data-processing

sparklanes


sparklanes is a lightweight data processing framework for Apache Spark, written in Python. It was built to make building complex Spark processing pipelines simpler, by shifting the focus toward writing data processing code without having to spend much time on the surrounding application architecture.

Data processing pipelines, or lanes, are built by stringing together encapsulated processor classes. This makes it possible to define lanes with an arbitrary processor order, in which processors can easily be added, removed, or swapped.
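Conceptually, a lane is an ordered sequence of processors whose steps run one after another over shared data. The sketch below illustrates that idea in plain Python; it deliberately does not use sparklanes or Spark, and all class and method names here are invented for illustration, not the sparklanes API.

```python
# Conceptual illustration of chaining encapsulated processors into a "lane".
# Plain Python only -- names are invented for illustration.

class Lane:
    """Runs a list of processor instances in order, passing data along."""

    def __init__(self, processors):
        self.processors = list(processors)

    def run(self, data):
        for processor in self.processors:
            data = processor.process(data)
        return data


class Lowercase:
    """Example processor: lowercases every record."""
    def process(self, records):
        return [r.lower() for r in records]


class DropEmpty:
    """Example processor: removes empty records."""
    def process(self, records):
        return [r for r in records if r]


# Processors can be reordered, added, or removed without touching their code.
lane = Lane([Lowercase(), DropEmpty()])
print(lane.run(["Spark", "", "LANES"]))  # ['spark', 'lanes']
```

Because each processor only exposes a single processing step, swapping the order (or dropping a step entirely) is a one-line change to the lane definition.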

Processing pipelines can be defined in lane-configuration YAML files, then packaged and submitted to Spark with a single command. Alternatively, the same can be achieved manually through the framework's API.
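A lane definition file might look roughly like the following. This is an illustrative sketch only: the key names (`lane`, `tasks`, `class`, `kwargs`), the dotted class paths, and the schema as a whole are assumptions here, so check the official documentation for the exact format before use.

```yaml
# Hypothetical lane definition -- key names are illustrative, not verified
# against the sparklanes docs.
lane:
  name: example-preprocessing
  tasks:
    - class: my_tasks.ExtractCSV        # dotted path to a processor class
      kwargs:
        path: data/input.csv
    - class: my_tasks.NormalizeColumns  # processors run in the listed order
```

The appeal of a file-based definition is that reordering or swapping processors becomes an edit to the YAML rather than a code change.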

Usage

Check out the documentation at sparklanes.readthedocs.io, as well as the example Jupyter notebook.

Installation

Using pip:

pip install sparklanes

Tests & Docs

Install the development requirements:

pip install -r requirements-dev.txt

Run the test suite from the project root using:

python -m tests

Build the documentation:

cd docs && make html

Disclaimer

I don't recommend using this in production, as I'm not actively maintaining it.
