
mara / Mara Pipelines

Licence: mit
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

Programming Languages

python
139335 projects - #7 most used programming language
javascript
184084 projects - #8 most used programming language
PLpgSQL
1095 projects
CSS
56736 projects

Projects that are alternatives of or similar to Mara Pipelines

Airbyte
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+167.19%)
Mutual labels:  pipeline, etl, data, data-integration
Stetl
Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
Stars: ✭ 64 (-96.52%)
Mutual labels:  pipeline, etl
Kiba Plus
Kiba enhancement for Ruby ETL.
Stars: ✭ 47 (-97.45%)
Mutual labels:  etl, postgresql
Locopy
locopy: Loading/Unloading to Redshift and Snowflake using Python.
Stars: ✭ 73 (-96.03%)
Mutual labels:  etl, data
Reddit Detective
Play detective on Reddit: Discover political disinformation campaigns, secret influencers and more
Stars: ✭ 129 (-92.99%)
Mutual labels:  etl, data
Ether sql
A python library to push ethereum blockchain data into an sql database.
Stars: ✭ 41 (-97.77%)
Mutual labels:  etl, postgresql
Luigi Warehouse
A luigi powered analytics / warehouse stack
Stars: ✭ 72 (-96.09%)
Mutual labels:  etl, postgresql
Pdpipe
Easy pipelines for pandas DataFrames.
Stars: ✭ 590 (-67.95%)
Mutual labels:  pipeline, data
Udacity Data Engineering
Udacity Data Engineering Nano Degree (DEND)
Stars: ✭ 89 (-95.17%)
Mutual labels:  etl, postgresql
Od
Česká otevřená data
Stars: ✭ 99 (-94.62%)
Mutual labels:  etl, postgresql
Csv2db
The CSV to database command line loader
Stars: ✭ 102 (-94.46%)
Mutual labels:  etl, postgresql
Phila Airflow
Stars: ✭ 16 (-99.13%)
Mutual labels:  pipeline, etl
Metabase
The simplest, fastest way to get business intelligence and analytics to everyone in your company 😋
Stars: ✭ 26,803 (+1355.89%)
Mutual labels:  postgresql, data
Ensembl Hive
EnsEMBL Hive - a system for creating and running pipelines on a distributed compute resource
Stars: ✭ 44 (-97.61%)
Mutual labels:  pipeline, postgresql
Go Streams
A lightweight stream processing library for Go
Stars: ✭ 615 (-66.59%)
Mutual labels:  pipeline, etl
Transporter
Sync data between persistence engines, like ETL only not stodgy
Stars: ✭ 1,175 (-36.18%)
Mutual labels:  etl, postgresql
Riko
A Python stream processing engine modeled after Yahoo! Pipes
Stars: ✭ 1,571 (-14.67%)
Mutual labels:  etl, data
Datacleaner
The premier open source Data Quality solution
Stars: ✭ 391 (-78.76%)
Mutual labels:  etl, data
Pglogical
Logical Replication extension for PostgreSQL 13, 12, 11, 10, 9.6, 9.5, 9.4 (Postgres), providing much faster replication than Slony, Bucardo or Londiste, as well as cross-version upgrades.
Stars: ✭ 455 (-75.29%)
Mutual labels:  etl, postgresql
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-95.71%)
Mutual labels:  pipeline, etl

Mara Pipelines


This package contains a lightweight data transformation framework with a focus on transparency and complexity reduction. It has a number of baked-in assumptions and principles:

  • Data integration pipelines as code: pipelines, tasks and commands are created using declarative Python code.

  • PostgreSQL as a data processing engine.

  • Extensive web ui. The web browser as the main tool for inspecting, running and debugging pipelines.

  • GNU make semantics. Nodes depend on the completion of upstream nodes. No data dependencies or data flows.

  • No in-app data processing: command line tools as the main tool for interacting with databases and data.

  • Single machine pipeline execution based on Python's multiprocessing. No need for distributed task queues. Easy debugging and output logging.

  • Cost based priority queues: nodes with higher cost (based on recorded run times) are run first.
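The combination of make-style dependencies and cost-based priorities can be illustrated with a small standalone sketch. It is not the framework's scheduler: the function name `schedule`, the node names and the cost values are hypothetical, and it runs nodes one after another, whereas the framework executes nodes in parallel processes.

```python
import heapq


def schedule(upstreams, cost):
    """Return one run order: a node becomes ready once all of its upstreams
    have completed; among the ready nodes, the most expensive one goes first."""
    remaining = {node: set(ups) for node, ups in upstreams.items()}
    # heapq is a min-heap, so costs are negated to pop the highest cost first
    ready = [(-cost[node], node) for node, ups in remaining.items() if not ups]
    heapq.heapify(ready)
    order = []
    while ready:
        _, node = heapq.heappop(ready)
        order.append(node)
        del remaining[node]
        # completing a node may unblock some of its downstream nodes
        for other, ups in remaining.items():
            if node in ups:
                ups.discard(node)
                if not ups:
                    heapq.heappush(ready, (-cost[other], other))
    if remaining:
        raise ValueError(f'Cyclic dependencies among: {sorted(remaining)}')
    return order


# hypothetical nodes; costs stand in for recorded run times in seconds
upstreams = {'load_a': [], 'load_b': [],
             'transform': ['load_a', 'load_b'], 'export': ['transform']}
cost = {'load_a': 5.0, 'load_b': 42.0, 'transform': 10.0, 'export': 1.0}

print(schedule(upstreams, cost))  # ['load_b', 'load_a', 'transform', 'export']
```

Note that `transform` only waits for its upstreams to finish. Nothing is passed between nodes, which is what "no data dependencies or data flows" means here.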

 

Installation

To use the library directly, use pip:

pip install mara-pipelines

or

pip install git+https://github.com/mara/mara-pipelines.git

For an example of an integration into a Flask application, have a look at the mara example project 1 and mara example project 2.

Due to the heavy use of forking, Mara Pipelines does not run natively on Windows. If you want to run it on Windows, then please use Docker or the Windows Subsystem for Linux.
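Whether the current platform supports forking can be checked with the standard library before trying to run anything. The following is only a sketch, and the helper name `fork_available` is made up for this example; it is not part of Mara Pipelines.

```python
import multiprocessing
import sys


def fork_available() -> bool:
    """True if this platform offers the 'fork' start method for multiprocessing."""
    return 'fork' in multiprocessing.get_all_start_methods()


if not fork_available():
    # native Windows lands here; Docker or WSL provide a Linux environment
    print(f'No fork support on {sys.platform}: consider Docker or WSL',
          file=sys.stderr)
```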

 

Example

Here is a pipeline "demo" consisting of three nodes that depend on each other: the task ping_localhost, the pipeline sub_pipeline and the task sleep:

from mara_pipelines.commands.bash import RunBash
from mara_pipelines.pipelines import Pipeline, Task
from mara_pipelines.ui.cli import run_pipeline, run_interactively

pipeline = Pipeline(
    id='demo',
    description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')

pipeline.add(Task(id='ping_localhost', description='Pings localhost',
                  commands=[RunBash('ping -c 3 localhost')]))

sub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')

for host in ['google', 'amazon', 'facebook']:
    sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',
                          commands=[RunBash(f'ping -c 3 {host}.com')]))

sub_pipeline.add_dependency('ping_amazon', 'ping_facebook')
sub_pipeline.add(Task(id='ping_foo', description='Pings foo',
                      commands=[RunBash('ping foo')]), ['ping_amazon'])

pipeline.add(sub_pipeline, ['ping_localhost'])

pipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',
                  commands=[RunBash('sleep 2')]), ['sub_pipeline'])

Tasks contain lists of commands, which do the actual work (in this case running bash commands that ping various hosts).
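To make the dependency structure of the example easier to see, the sketch below transcribes it into plain dictionaries (node to set of upstream nodes) and asks the standard library for a valid execution order. It does not use Mara Pipelines at all, assumes Python 3.9 or later for `graphlib`, and the names `demo_graph`, `sub_pipeline_graph` and `run_order` are invented for this illustration.

```python
from graphlib import TopologicalSorter

# node -> set of upstream nodes, transcribed from the pipeline definition above
demo_graph = {
    'ping_localhost': set(),
    'sub_pipeline': {'ping_localhost'},
    'sleep': {'sub_pipeline'},
}

# add_dependency('ping_amazon', 'ping_facebook') names the upstream node first
sub_pipeline_graph = {
    'ping_google': set(),
    'ping_amazon': set(),
    'ping_facebook': {'ping_amazon'},
    'ping_foo': {'ping_amazon'},
}


def run_order(graph):
    """Return one order in which every node comes after all of its upstreams."""
    return list(TopologicalSorter(graph).static_order())


print(run_order(demo_graph))  # ['ping_localhost', 'sub_pipeline', 'sleep']
print(run_order(sub_pipeline_graph))  # several valid orders exist
```

The top-level pipeline is a simple chain, so only one order is possible. Inside sub_pipeline, only ping_facebook and ping_foo have to wait for ping_amazon, while ping_google is independent of the rest.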

 

In order to run the pipeline, a PostgreSQL database needs to be configured for storing run-time information, run output and status of incremental processing:

import mara_db.auto_migration
import mara_db.config
import mara_db.dbs

mara_db.config.databases \
    = lambda: {'mara': mara_db.dbs.PostgreSQLDB(host='localhost', user='root', database='example_etl_mara')}

mara_db.auto_migration.auto_discover_models_and_migrate()

Given that PostgreSQL is running and the credentials work, the output looks like this (a database with a number of tables is created):

Created database "postgresql+psycopg2://root@localhost/example_etl_mara"

CREATE TABLE data_integration_file_dependency (
    node_path TEXT[] NOT NULL, 
    dependency_type VARCHAR NOT NULL, 
    hash VARCHAR, 
    timestamp TIMESTAMP WITHOUT TIME ZONE, 
    PRIMARY KEY (node_path, dependency_type)
);

.. more tables
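The database configuration shown earlier hard-codes host, user and database name. One possible way to avoid that is to read them from environment variables, as in the following sketch. The helper `db_settings_from_env` and the `MARA_DB_*` variable names are assumptions made for this example; they are not defined by mara_db.

```python
import os


def db_settings_from_env(prefix='MARA_DB_'):
    """Collect host, user and database from environment variables, falling
    back to the values used in the configuration example."""
    defaults = {'host': 'localhost', 'user': 'root', 'database': 'example_etl_mara'}
    return {key: os.environ.get(prefix + key.upper(), default)
            for key, default in defaults.items()}


# Possible usage together with the configuration above (requires mara_db):
# mara_db.config.databases = lambda: {
#     'mara': mara_db.dbs.PostgreSQLDB(**db_settings_from_env())}
print(db_settings_from_env())
```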

CLI UI

This runs a pipeline with output to stdout:

from mara_pipelines.ui.cli import run_pipeline

run_pipeline(pipeline)

Example run cli 1

 

And this runs a single node of pipeline sub_pipeline together with all the nodes that it depends on:

run_pipeline(sub_pipeline, nodes=[sub_pipeline.nodes['ping_amazon']], with_upstreams=True)

Example run cli 2
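Conceptually, running a node "together with all the nodes that it depends on" means taking the transitive closure over the upstream relation. The sketch below shows that idea on a plain dictionary; how the framework resolves this internally is not shown here, and the function name `with_upstreams` and the graph literal are only illustrative.

```python
def with_upstreams(selected, upstreams):
    """Return the selected nodes plus everything they transitively depend on."""
    result = set()
    stack = list(selected)
    while stack:
        node = stack.pop()
        if node not in result:
            result.add(node)
            # also visit the direct upstreams of this node
            stack.extend(upstreams.get(node, ()))
    return result


# node -> upstream nodes inside sub_pipeline, transcribed from the example
sub_pipeline_upstreams = {
    'ping_google': [],
    'ping_amazon': [],
    'ping_facebook': ['ping_amazon'],
    'ping_foo': ['ping_amazon'],
}

print(sorted(with_upstreams(['ping_facebook'], sub_pipeline_upstreams)))
# ['ping_amazon', 'ping_facebook']
```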

 

And finally, there is a menu based on pythondialog that lets you navigate and run pipelines like this:

from mara_pipelines.ui.cli import run_interactively

run_interactively()

Example run cli 3

Web UI

More importantly, this package provides an extensive web interface. It can be easily integrated into any Flask-based app, and the mara example project demonstrates how to do this using mara-app.

For each pipeline, there is a page that shows

  • a graph of all child nodes and the dependencies between them
  • a chart of the overall run time of the pipeline and its most expensive nodes over the last 30 days (configurable)
  • a table of all the pipeline's nodes with their average run times and the resulting queuing priority
  • output and timeline for the last runs of the pipeline

Mara pipelines web ui 1

For each task, there is a page showing

  • the upstreams and downstreams of the task in the pipeline
  • the run times of the task in the last 30 days
  • all commands of the task
  • output of the last runs of the task

Mara pipelines web ui 2

Pipelines and tasks can be run from the web ui directly, which is probably one of the main features of this package:

Example run web ui

 

Getting started

The documentation is currently a work in progress. Please use the mara example project 1 and mara example project 2 as a reference for getting started.
