Ploomber
Ploomber is the simplest way to build reliable data pipelines for Data Science and Machine Learning. Provide your source code in a standard form, and Ploomber automatically constructs the pipeline for you. Tasks can be Python functions, Jupyter notebooks, Python/R/shell scripts, or SQL scripts.
When you're ready, deploy to Airflow or Kubernetes (using Argo) without code changes.
Here's what pipeline tasks look like:
**Function**

```python
def clean_users(product, upstream):
    # run 'get_users' before this function.
    # upstream['get_users'] returns the output
    # of that task, used as input here
    df = pd.read_csv(upstream['get_users'])

    # your code here...

    # save output using the provided
    # product variable
    df.to_csv(product)
```

**Jupyter notebook or Python script**

```python
# + tags=["parameters"]
# run 'clean_users' and 'clean_activity'
# before this script/notebook
upstream = ['clean_users', 'clean_activity']
# -

# a new cell is injected here with
# the product variable
# e.g., product = '/path/output.csv'
# and a new upstream variable:
# e.g., upstream = {'clean_users': '/path/...',
#                   'clean_activity': '/another/...'}

# your code here...

# save output using the provided product variable
Path(product).write_bytes(pickle.dumps(model))
```

**SQL script**

```sql
-- {{product}} is replaced by the table name
CREATE TABLE {{product}} AS

/*
run 'raw_data' before this task.
{{upstream['raw_data']}} is replaced by the
table name at runtime
*/
SELECT * FROM {{upstream['raw_data']}}
```

**Pipeline declaration**

```yaml
tasks:
  # function
  - source: functions.clean_users
    product: output/users-clean.csv

  # python script (or notebook)
  - source: notebooks/model-template.py
    product:
      model: output/model.pickle
      nb: output/model-evaluation.html

  # sql script
  - source: scripts/some_script.sql
    product: [schema, name, table]
    client: db.get_client
```
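The `upstream` declarations above are all Ploomber needs to infer execution order. Conceptually this works like a topological sort over the declared dependencies; here is a minimal sketch of the idea using the standard-library `graphlib` (Python 3.9+). The task names and dependency map are illustrative, not Ploomber's internals:

```python
from graphlib import TopologicalSorter

# each task maps to the tasks it names in its 'upstream' variable
# (hypothetical task names for illustration; Ploomber extracts these
# from your source files automatically)
upstream = {
    "get_users": [],
    "get_activity": [],
    "clean_users": ["get_users"],
    "clean_activity": ["get_activity"],
    "model": ["clean_users", "clean_activity"],
}

# TopologicalSorter expects {node: predecessors}, which is exactly
# the shape of the upstream map
order = list(TopologicalSorter(upstream).static_order())
print(order)  # raw tasks first, 'model' last
```

Every task is guaranteed to run after all of its declared upstream tasks, which is why you only declare dependencies and never a global execution order.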
Resources
- Documentation
- Sample projects (Machine Learning pipeline, ETL, among others)
- Watch JupyterCon 2020 talk
Installation
pip install ploomber
Compatible with Python 3.6 and higher.
Try it out!
You can choose from one of the hosted options:
Or run locally:
```shell
# ML pipeline example
ploomber examples --name ml-basic
cd ml-basic

# if using pip
pip install -r requirements.txt

# if using conda
conda env create --file environment.yml
conda activate ml-basic

# run pipeline
ploomber build
```
Pipeline output is saved in the `output/` folder. Check out the pipeline definition in the `pipeline.yaml` file.

To get a list of examples, run `ploomber examples`.
Main features
- **Jupyter integration**. When you open your notebooks, Ploomber automatically injects a new cell with the location of your input files, as inferred from your `upstream` variable. If you open a Python or R script, it's converted to a notebook on the fly.
- **Incremental builds**. Speed up execution by skipping tasks whose source code hasn't changed.
- **Parallelization**. Run tasks in parallel to speed up computations.
- **Pipeline testing**. Run tests upon task execution to verify that the output data has the right properties (e.g., values within expected range).
- **Pipeline inspection**. Start an interactive session with `ploomber interact` to debug your pipeline. Call `dag['task_name'].debug()` to start a debugging session.
- **Deployment to Kubernetes and Airflow**. You can develop and execute locally. Once you are ready to deploy, export to Kubernetes or Airflow.
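Incremental builds can be understood as comparing a hash of each task's current source against the hash recorded after its last successful run. A simplified sketch of that idea (the function name and metadata shape are illustrative, not Ploomber's actual implementation):

```python
import hashlib
from typing import Optional

def should_run(source: str, stored_hash: Optional[str]) -> bool:
    """Return True if the task must execute: either there is no
    record of a previous run, or the source changed since then."""
    current = hashlib.sha256(source.encode()).hexdigest()
    return stored_hash is None or stored_hash != current

# first run: no stored hash, so the task executes
src = "df = pd.read_csv(upstream['get_users'])"
assert should_run(src, None)

# unchanged source: the task is skipped
stored = hashlib.sha256(src.encode()).hexdigest()
assert not should_run(src, stored)

# edited source: the task runs again
assert should_run(src + "  # tweak", stored)
```

Because the check is per task, editing one script only re-executes that task and its downstream dependents, which keeps iteration fast during experimentation.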
How does Ploomber compare to X?
Ploomber has two goals:
- Provide an excellent development experience for Data Science/Machine learning projects, which require a lot of experimentation/iteration: incremental builds and Jupyter integration are a fundamental part of this.
- Integrate with deployment tools (Airflow and Argo) to streamline deployment.
For a complete comparison, read our survey on workflow management tools.