All Projects → stitchfix → hamilton

stitchfix / hamilton

Licence: BSD-3-Clause-Clear license
A scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to hamilton

gallia-core
A schema-aware Scala library for data transformation
Stars: ✭ 44 (-92.81%)
Mutual labels:  etl, data-engineering, feature-engineering
csvplus
csvplus extends the standard Go encoding/csv package with fluent interface, lazy stream operations, indices and joins.
Stars: ✭ 67 (-89.05%)
Mutual labels:  etl, etl-framework, etl-pipeline
redis-connect-dist
Real-Time Event Streaming & Change Data Capture
Stars: ✭ 21 (-96.57%)
Mutual labels:  etl, etl-framework, etl-pipeline
etlflow
EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for writing various different tasks, jobs on GCP and AWS.
Stars: ✭ 38 (-93.79%)
Mutual labels:  etl, etl-framework, etl-pipeline
vixtract
www.vixtract.ru
Stars: ✭ 40 (-93.46%)
Mutual labels:  etl, etl-framework, etl-pipeline
DaFlow
Apache-Spark based Data Flow(ETL) Framework which supports multiple read, write destinations of different types and also support multiple categories of transformation rules.
Stars: ✭ 24 (-96.08%)
Mutual labels:  etl, etl-framework, etl-pipeline
Aws Data Wrangler
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Stars: ✭ 2,385 (+289.71%)
Mutual labels:  etl, pandas, data-engineering
saddle
SADDLE: Scala Data Library
Stars: ✭ 23 (-96.24%)
Mutual labels:  numpy, pandas, dataframe
pyjanitor
Clean APIs for data cleaning. Python implementation of R package Janitor
Stars: ✭ 970 (+58.5%)
Mutual labels:  pandas, data-engineering, dataframe
Eland
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
Stars: ✭ 235 (-61.6%)
Mutual labels:  etl, pandas, dataframe
Mars
Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.
Stars: ✭ 2,308 (+277.12%)
Mutual labels:  numpy, pandas, dataframe
DIRECT
DIRECT, the Data Integration Run-time Execution Control Tool, is a data logistics framework that can be used to monitor, log, audit and control data integration / ETL processes.
Stars: ✭ 20 (-96.73%)
Mutual labels:  etl, etl-framework, etl-pipeline
Ditching Excel For Python
Functionalities in Excel translated to Python
Stars: ✭ 172 (-71.9%)
Mutual labels:  numpy, pandas, dataframe
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (-93.63%)
Mutual labels:  etl, etl-framework, etl-pipeline
Panthera
Data-frames & arrays on Clojure
Stars: ✭ 168 (-72.55%)
Mutual labels:  numpy, pandas, dataframe
Baby Names Analysis
Data ETL & Analysis on the dataset 'Baby Names from Social Security Card Applications - National Data'.
Stars: ✭ 557 (-8.99%)
Mutual labels:  etl, numpy, pandas
Pyjanitor
Clean APIs for data cleaning. Python implementation of R package Janitor
Stars: ✭ 647 (+5.72%)
Mutual labels:  pandas, data-engineering, dataframe
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (-79.41%)
Mutual labels:  etl, data-engineering, etl-framework
AirflowETL
Blog post on ETL pipelines with Airflow
Stars: ✭ 20 (-96.73%)
Mutual labels:  etl, data-engineering, etl-pipeline
covid-19
Data ETL & Analysis on the global and Mexican datasets of the COVID-19 pandemic.
Stars: ✭ 14 (-97.71%)
Mutual labels:  etl, numpy, pandas

Welcome to Hamilton's Github Repository

Hamilton CircleCI Hamilton Slack Twitter
Python supported PyPi Version Total Downloads

Hamilton

The general purpose micro-framework for creating dataflows from python functions!

Specifically, Hamilton defines a novel paradigm, that allows you to specify a flow of (delayed) execution, that forms a Directed Acyclic Graph (DAG). It was original built to solve creating wide (1000+) column dataframes. Core to the design of Hamilton is a clear mapping of function name to dataflow output. That is, Hamilton forces a certain paradigm with writing functions, and aims for DAG clarity, easy modifications, with always unit testable and naturally documentable code.

For the backstory on how Hamilton came about, see our blog post!.

Getting Started

Here's a quick getting started guide to get you up and running in less than 15 minutes. If you need help join our slack community to chat/ask Qs/etc. For the latest updates, follow us on twitter!

Installation

Requirements:

  • Python 3.6+

To get started, first you need to install hamilton. It is published to pypi under sf-hamilton:

pip install sf-hamilton

Note: to use the DAG visualization functionality, you should instead do:

pip install sf-hamilton[visualization]

While it is installing we encourage you to start on the next section.

Note: the content (i.e. names, function bodies) of our example code snippets are for illustrative purposes only, and don't reflect what we actually do internally.

Hamilton in <15 minutes

Hamilton is a new paradigm when it comes to creating, um, dataframes (let's use dataframes as an example, otherwise you can create ANY python object). Rather than thinking about manipulating a central dataframe, as is normal in some data engineering/data science work, you instead think about the column(s) you want to create, and what inputs are required. There is no need for you to think about maintaining this dataframe, meaning you do not need to think about any "glue" code; this is all taken care of by the Hamilton framework.

For example rather than writing the following to manipulate a central dataframe object df:

df['col_c'] = df['col_a'] + df['col_b']

you write

def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
    """Creating column c from summing column a and column b."""
    return col_a + col_b

In diagram form: example The Hamilton framework will then be able to build a DAG from this function definition.

So let's create a "Hello World" and start using Hamilton!

Your first hello world.

By now, you should have installed Hamilton, so let's write some code.

  1. Create a file my_functions.py and add the following functions:
import pandas as pd

def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Rolling 3 week average spend."""
    return spend.rolling(3).mean()

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """The cost per signup in relation to spend."""
    return spend / signups

The astute observer will notice we have not defined spend or signups as functions. That is okay, this just means these need to be provided as input when we come to actually wanting to create a dataframe.

Note: functions can take or create scalar values, in addition to any python object type.

  1. Create a my_script.py which is where code will live to tell Hamilton what to do:
import logging
import importlib

import pandas as pd
from hamilton import driver

logging.basicConfig(stream=sys.stdout)
initial_columns = {  # load from actuals or wherever -- this is our initial data we use as input.
    # Note: these do not have to be all series, they could be scalar inputs.
    'signups': pd.Series([1, 10, 50, 100, 200, 400]),
    'spend': pd.Series([10, 10, 20, 40, 40, 50]),
}
# we need to tell hamilton where to load function definitions from
module_name = 'my_functions'
module = importlib.import_module(module_name)
dr = driver.Driver(initial_columns, module)  # can pass in multiple modules
# we need to specify what we want in the final dataframe.
output_columns = [
    'spend',
    'signups',
    'avg_3wk_spend',
    'spend_per_signup',
]
# let's create the dataframe!
# if you only did `pip install sf-hamilton` earlier:
df = dr.execute(output_columns)
# else if you did `pip install sf-hamilton[visualization]` earlier:
# dr.visualize_execution(output_columns, './my-dag.dot', {})
print(df)
  1. Run my_script.py

python my_script.py

You should see the following output:

   spend  signups  avg_3wk_spend  spend_per_signup
0     10        1            NaN            10.000
1     10       10            NaN             1.000
2     20       50      13.333333             0.400
3     40      100      23.333333             0.400
4     40      200      33.333333             0.200
5     50      400      43.333333             0.125

You should see the following image if you ran dr.visualize_execution(output_columns, './my-dag.dot', {}):

hello_world_image

Congratulations - you just created your Hamilton dataflow that created a dataframe!

Example Hamilton Dataflows

We have a growing list of examples showcasing how one might use Hamilton. You can find them all under the examples/ directory. E.g.

Slack Community

We have a small but active community on slack. Come join us!

License

Hamilton is released under the BSD 3-Clause Clear License. If you need to get in touch about something, contact us at algorithms-opensource (at) stitchfix.com.

Used by

To add your company, make a pull request to add it here.

Contributing

We take contributions, large and small. We operate via a Code of Conduct and expect anyone contributing to do the same.

To see how you can contribute, please read our contributing guidelines and then our developer setup guide.

Blog Posts

Videos of talks

Prescribed Development Workflow

In general we prescribe the following:

  1. Ensure you understand Hamilton Basics.
  2. Familiarize yourself with some of the Hamilton decorators. They will help keep your code DRY.
  3. Start creating Hamilton Functions that represent your work. We suggest grouping them in modules where it makes sense.
  4. Write a simple script so that you can easily run things end to end.
  5. Join our Slack community to chat/ask Qs/etc.

For the backstory on Hamilton we invite you to watch a roughly-9 minute lightning talk on it that we gave at the apply conference: video, slides.

PyCharm Tips

If you're using Hamilton, it's likely that you'll need to migrate some code. Here are some useful tricks we found to speed up that process.

Live templates

Live templates are a cool feature and allow you to type in a name which expands into some code.

E.g. For example, we wrote one to make it quick to stub out Hamilton functions: typing graphfunc would turn into ->

def _(_: pd.Series) -> pd.Series:
   """"""
   return _

Where the blanks are where you can tab with the cursor and fill things in. See your pycharm preferences for setting this up.

Multiple Cursors

If you are doing a lot of repetitive work, one might consider multiple cursors. Multiple cursors allow you to do things on multiple lines at once.

To use it hit option + mouse click to create multiple cursors. Esc to revert back to a normal mode.

Contributors

Code Contributors

  • Stefan Krawczyk (@skrawcz)
  • Elijah ben Izzy (@elijahbenizzy)
  • Danielle Quinn (@danfisher-sf)
  • Rachel Insoft (@rinsoft-sf)
  • Shelly Jang (@shellyjang)
  • Vincent Chu (@vslchusf)
  • Christopher Prohm (@chmp)
  • James Lamb (@jameslamb)

Bug Hunters/Special Mentions

  • Nils Olsson (@nilsso)
  • Michał Siedlaczek (@elshize)
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].