Ploomber
Ploomber is the simplest way to build reliable data pipelines for Data Science and Machine Learning. Provide your source code in a standard form, and Ploomber automatically constructs the pipeline for you. Tasks can be Python functions, Jupyter notebooks, Python/R/shell scripts, or SQL scripts.
When you're ready, deploy to Airflow or Kubernetes (using Argo) without code changes.
Here's what pipeline tasks look like:
**Function**

```python
def clean_users(product, upstream):
    # run 'get_users' before this function.
    # upstream['get_users'] returns the output
    # of that task, used as input here
    df = pd.read_csv(upstream['get_users'])

    # your code here...

    # save output using the provided
    # product variable
    df.to_csv(product)
```

**Jupyter notebook or Python script**

```python
# + tags=["parameters"]
# run 'clean_users' and 'clean_activity'
# before this script/notebook
upstream = ['clean_users', 'clean_activity']
# -

# a new cell is injected here with
# the product variable
# e.g., product = '/path/output.csv'
# and a new upstream variable:
# e.g., upstream = {'clean_users': '/path/...',
#                   'clean_activity': '/another/...'}

# your code here...

# save output using the provided product variable
Path(product).write_bytes(pickle.dumps(model))
```

**SQL script**

```sql
-- {{product}} is replaced by the table name
CREATE TABLE {{product}} AS

/*
run 'raw_data' before this task.
{{upstream['raw_data']}} is replaced by the
table name at runtime
*/
SELECT * FROM {{upstream['raw_data']}}
```

**Pipeline declaration**

```yaml
tasks:
  # function
  - source: functions.clean_users
    product: output/users-clean.csv

  # python script (or notebook)
  - source: notebooks/model-template.py
    product:
      model: output/model.pickle
      nb: output/model-evaluation.html

  # sql script
  - source: scripts/some_script.sql
    product: [schema, name, table]
    client: db.get_client
```
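The `upstream` declarations above are all Ploomber needs to infer execution order. Conceptually this works like a topological sort over the declared dependencies; here is a minimal sketch of the idea using the standard-library `graphlib` (Python 3.9+). The task names and dependency map are illustrative, not Ploomber's internals:

```python
from graphlib import TopologicalSorter

# each task maps to the tasks it names in its 'upstream' variable
# (hypothetical task names for illustration; Ploomber extracts these
# from your source files automatically)
upstream = {
    "get_users": [],
    "get_activity": [],
    "clean_users": ["get_users"],
    "clean_activity": ["get_activity"],
    "model": ["clean_users", "clean_activity"],
}

# TopologicalSorter expects {node: predecessors}, which is exactly
# the shape of the upstream map
order = list(TopologicalSorter(upstream).static_order())
print(order)  # raw tasks first, 'model' last
```

Every task is guaranteed to run after all of its declared upstream tasks, which is why you only declare dependencies and never a global execution order.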
Resources
- Documentation
- Sample projects (Machine Learning pipeline, ETL, among others)
- Watch JupyterCon 2020 talk
Installation
pip install ploomber
Compatible with Python 3.6 and higher.
Try it out!
You can choose from one of the hosted options:
Or run locally:
```shell
# ML pipeline example
ploomber examples --name ml-basic
cd ml-basic

# if using pip
pip install -r requirements.txt

# if using conda
conda env create --file environment.yml
conda activate ml-basic

# run pipeline
ploomber build
```
Pipeline output is saved in the `output/` folder. Check out the pipeline definition in the `pipeline.yaml` file.

To get a list of examples, run `ploomber examples`.
Main features
- **Jupyter integration**. When you open your notebooks, Ploomber automatically injects a new cell with the location of your input files, as inferred from your `upstream` variable. If you open a Python or R script, it's converted to a notebook on the fly.
- **Incremental builds**. Speed up execution by skipping tasks whose source code hasn't changed.
- **Parallelization**. Run tasks in parallel to speed up computations.
- **Pipeline testing**. Run tests upon task execution to verify that the output data has the right properties (e.g., values within expected range).
- **Pipeline inspection**. Start an interactive session with `ploomber interact` to debug your pipeline. Call `dag['task_name'].debug()` to start a debugging session.
- **Deployment to Kubernetes and Airflow**. You can develop and execute locally. Once you are ready to deploy, export to Kubernetes or Airflow.
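Incremental builds can be understood as comparing a hash of each task's current source against the hash recorded after its last successful run. A simplified sketch of that idea (the function name and metadata shape are illustrative, not Ploomber's actual implementation):

```python
import hashlib
from typing import Optional

def should_run(source: str, stored_hash: Optional[str]) -> bool:
    """Return True if the task must execute: either there is no
    record of a previous run, or the source changed since then."""
    current = hashlib.sha256(source.encode()).hexdigest()
    return stored_hash is None or stored_hash != current

# first run: no stored hash, so the task executes
src = "df = pd.read_csv(upstream['get_users'])"
assert should_run(src, None)

# unchanged source: the task is skipped
stored = hashlib.sha256(src.encode()).hexdigest()
assert not should_run(src, stored)

# edited source: the task runs again
assert should_run(src + "  # tweak", stored)
```

Because the check is per task, editing one script only re-executes that task and its downstream dependents, which keeps iteration fast during experimentation.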
How does Ploomber compare to X?
Ploomber has two goals:
- Provide an excellent development experience for Data Science/Machine learning projects, which require a lot of experimentation/iteration: incremental builds and Jupyter integration are a fundamental part of this.
- Integrate with deployment tools (Airflow and Argo) to streamline deployment.
For a complete comparison, read our survey on workflow management tools.