
ploomber / Ploomber

License: Apache-2.0
A convention over configuration workflow orchestrator. Develop locally (Jupyter or your favorite editor), deploy to Airflow or Kubernetes.

Programming Languages

Python

Projects that are alternatives to or similar to Ploomber

Polyaxon
Machine Learning Platform for Kubernetes (MLOps tools for experimentation and automation)
Stars: ✭ 2,966 (+1242.08%)
Mutual labels:  data-science, jupyter, workflow
Prefect
The easiest way to automate your data
Stars: ✭ 7,956 (+3500%)
Mutual labels:  data-science, data-engineering, workflow
Vds
Verteego Data Suite
Stars: ✭ 9 (-95.93%)
Mutual labels:  data-science, jupyter, workflow
Gspread Pandas
A package to easily open an instance of a Google spreadsheet and interact with worksheets through Pandas DataFrames.
Stars: ✭ 226 (+2.26%)
Mutual labels:  data-science, data-engineering
Ml Workspace
🛠 All-in-one web-based IDE specialized for machine learning and data science.
Stars: ✭ 2,337 (+957.47%)
Mutual labels:  data-science, jupyter
Data Science Stack Cookiecutter
🐳📊🤓 Cookiecutter template to launch an awesome dockerized Data Science toolstack (incl. Jupyter, Superset, Postgres, Minio, Airflow & API Star)
Stars: ✭ 153 (-30.77%)
Mutual labels:  data-science, jupyter
Beyond Jupyter
🐍💻📊 All material from the PyCon.DE 2018 Talk "Beyond Jupyter Notebooks - Building your own data science platform with Python & Docker" (incl. Slides, Video, Udemy MOOC & other References)
Stars: ✭ 135 (-38.91%)
Mutual labels:  data-science, jupyter
Learnpythonforresearch
This repository provides everything you need to get started with Python for (social science) research.
Stars: ✭ 163 (-26.24%)
Mutual labels:  data-science, jupyter
Geni
A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (-31.22%)
Mutual labels:  data-science, data-engineering
Auptimizer
An automatic ML model optimization tool.
Stars: ✭ 166 (-24.89%)
Mutual labels:  data-science, data-engineering
Lets Plot Kotlin
Kotlin API for Lets-Plot - an open-source plotting library for statistical data.
Stars: ✭ 181 (-18.1%)
Mutual labels:  data-science, jupyter
Dash
Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required.
Stars: ✭ 15,592 (+6955.2%)
Mutual labels:  data-science, jupyter
Ml Hub
🧰 Multi-user development platform for machine learning teams. Simple to set up within minutes.
Stars: ✭ 148 (-33.03%)
Mutual labels:  data-science, jupyter
Batchflow
BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.
Stars: ✭ 156 (-29.41%)
Mutual labels:  data-science, workflow
Accelerator
The Accelerator is a tool for fast and reproducible processing of large amounts of data.
Stars: ✭ 137 (-38.01%)
Mutual labels:  data-science, data-engineering
Primehub
A toil-free multi-tenancy machine learning platform in your Kubernetes cluster
Stars: ✭ 160 (-27.6%)
Mutual labels:  data-science, jupyter
Soda Sql
Metric collection, data testing and monitoring for SQL accessible data
Stars: ✭ 173 (-21.72%)
Mutual labels:  data-science, data-engineering
Plynx
PLynx is a domain agnostic platform for managing reproducible experiments and data-oriented workflows.
Stars: ✭ 192 (-13.12%)
Mutual labels:  data-science, workflow
Python For Data Science
A collection of Jupyter Notebooks for learning Python for Data Science.
Stars: ✭ 205 (-7.24%)
Mutual labels:  data-science, jupyter
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (-42.99%)
Mutual labels:  data-science, data-engineering


Ploomber is the simplest way to build reliable data pipelines for Data Science and Machine Learning. Provide your source code in a standard form, and Ploomber automatically constructs the pipeline for you. Tasks can be Python functions, Jupyter notebooks, Python/R/shell scripts, or SQL scripts.

When you're ready, deploy to Airflow or Kubernetes (using Argo) without code changes.

Here's what pipeline tasks look like:

Function:

import pandas as pd

def clean_users(product, upstream):
    # run 'get_users' before this function.
    # upstream['get_users'] returns the output
    # of such task, used as input here
    df = pd.read_csv(upstream['get_users'])

    # your code here...

    # save output using the provided
    # product variable
    df.to_csv(product)

Jupyter notebook or Python script:

# + tags=["parameters"]
# run 'clean_users' and 'clean_activity'
# before this script/notebook
upstream = ['clean_users', 'clean_activity']
# -

# a new cell is injected here with
# the product variable, e.g.:
# product = {'model': '/path/model.pickle',
#            'nb': '/path/model-evaluation.html'}
# and a new upstream variable, e.g.:
# upstream = {'clean_users': '/path/...',
#             'clean_activity': '/another/...'}

from pathlib import Path
import pickle

# your code here...

# save output using the provided product variable
# (a dict here, since this task declares two products)
Path(product['model']).write_bytes(pickle.dumps(model))

SQL script:

-- {{product}} is replaced by the table name
CREATE TABLE {{product}} AS
/*
run 'raw_data' before this task. Replace
{{upstream['raw_data']}} with table name
at runtime
*/
SELECT * FROM {{upstream['raw_data']}}

Pipeline declaration (pipeline.yaml):

tasks:
  # function
  - source: functions.clean_users
    product: output/users-clean.csv

  # python script (or notebook)
  - source: notebooks/model-template.py
    product:
      model: output/model.pickle
      nb: output/model-evaluation.html
  
  # sql script
  - source: scripts/some_script.sql
    product: [schema, name, table]
    client: db.get_client
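
The SQL task above references client: db.get_client, a dotted path to a function that returns the database client Ploomber uses to execute the script. Here's a minimal sketch of such a module, assuming a db.py at the project root; the SQLite URI is an assumption, not part of the original example:

# db.py -- minimal sketch
from ploomber.clients import SQLAlchemyClient

def get_client():
    # Ploomber calls this function to obtain the client
    # that runs scripts/some_script.sql; swap the URI
    # for your own database
    return SQLAlchemyClient('sqlite:///data.db')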

Installation

pip install ploomber

Compatible with Python 3.6 and higher.

Try it out!

You can choose from one of the hosted options:

Binder · Deepnote

Or run locally:

# ML pipeline example
ploomber examples --name ml-basic
cd ml-basic

# if using pip
pip install -r requirements.txt

# if using conda
conda env create --file environment.yml
conda activate ml-basic

# run pipeline
ploomber build

Pipeline output is saved in the output/ folder. Check out the pipeline definition in the pipeline.yaml file.

To get a list of examples, run ploomber examples.

Main features

  1. Jupyter integration. When you open your notebooks, Ploomber will automatically inject a new cell with the location of your input files, as inferred from your upstream variable. If you open a Python or R script, it's converted to a notebook on the fly.
  2. Incremental builds. Speed up execution by skipping tasks whose source code hasn't changed.
  3. Parallelization. Run tasks in parallel to speed up computations.
  4. Pipeline testing. Run tests upon task execution to verify that the output data has the right properties (e.g., values within expected range).
  5. Pipeline inspection. Start an interactive session with ploomber interact to debug your pipeline. Call dag['task_name'].debug() to start a debugging session (see the sketch after this list).
  6. Deployment to Kubernetes and Airflow. You can develop and execute locally. Once you are ready to deploy, export to Kubernetes or Airflow.
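
Features 2 and 5 are also available from Python via Ploomber's DAGSpec API. A minimal sketch, assuming a pipeline.yaml in the current directory; the task name 'fit' is a placeholder, not from the original:

from ploomber.spec import DAGSpec

# load the pipeline defined in pipeline.yaml
dag = DAGSpec('pipeline.yaml').to_dag()

# incremental build: tasks whose source code
# hasn't changed since the last run are skipped
dag.build()

# start a debugging session for a single task
dag['fit'].debug()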

How does Ploomber compare to X?

Ploomber has two goals:

  1. Provide an excellent development experience for Data Science/Machine Learning projects, which require a lot of experimentation and iteration; incremental builds and Jupyter integration are a fundamental part of this.
  2. Integrate with deployment tools (Airflow and Argo) to streamline deployment.

For a complete comparison, read our survey on workflow management tools.
