rstojnic / Lazydata

Licence: Apache-2.0
Lazydata: Scalable data dependencies for Python projects


Projects that are alternatives of or similar to Lazydata

Pygam
[HELP REQUESTED] Generalized Additive Models in Python
Stars: ✭ 569 (-9.25%)
Mutual labels:  data-science
Datasheets
Read data from, write data to, and modify the formatting of Google Sheets
Stars: ✭ 593 (-5.42%)
Mutual labels:  data-science
Elki
ELKI Data Mining Toolkit
Stars: ✭ 613 (-2.23%)
Mutual labels:  data-science
Data Science Competitions
Goal of this repo is to provide the solutions of all Data Science Competitions(Kaggle, Data Hack, Machine Hack, Driven Data etc...).
Stars: ✭ 572 (-8.77%)
Mutual labels:  data-science
Getting Started With Genomics Tools And Resources
Unix, R and python tools for genomics and data science
Stars: ✭ 587 (-6.38%)
Mutual labels:  data-science
Moviegeek
A django website used in the book Practical Recommender Systems to illustrate how recommender algorithms can be implemented.
Stars: ✭ 608 (-3.03%)
Mutual labels:  data-science
Alphapy
Automated Machine Learning [AutoML] with Python, scikit-learn, Keras, XGBoost, LightGBM, and CatBoost
Stars: ✭ 564 (-10.05%)
Mutual labels:  data-science
Matrixprofile Ts
A Python library for detecting patterns and anomalies in massive datasets using the Matrix Profile
Stars: ✭ 621 (-0.96%)
Mutual labels:  data-science
Pdpipe
Easy pipelines for pandas DataFrames.
Stars: ✭ 590 (-5.9%)
Mutual labels:  data-science
Dist Keras
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Stars: ✭ 613 (-2.23%)
Mutual labels:  data-science
Vehicle counting tensorflow
🚘 "MORE THAN VEHICLE COUNTING!" This project provides prediction for speed, color and size of the vehicles with TensorFlow Object Counting API.
Stars: ✭ 582 (-7.18%)
Mutual labels:  data-science
Awesome Ai Usecases
A list of awesome and proven Artificial Intelligence use cases and applications
Stars: ✭ 587 (-6.38%)
Mutual labels:  data-science
Book sample
another book on data science
Stars: ✭ 611 (-2.55%)
Mutual labels:  data-science
Baikal
A graph-based functional API for building complex scikit-learn pipelines.
Stars: ✭ 573 (-8.61%)
Mutual labels:  data-science
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+802.07%)
Mutual labels:  data-science
Datasets For Recommender Systems
This is a repository of a topic-centric public data sources in high quality for Recommender Systems (RS)
Stars: ✭ 564 (-10.05%)
Mutual labels:  data-science
Smile
Statistical Machine Intelligence & Learning Engine
Stars: ✭ 5,412 (+763.16%)
Mutual labels:  data-science
Nfstream
NFStream: a Flexible Network Data Analysis Framework.
Stars: ✭ 622 (-0.8%)
Mutual labels:  data-science
Engsoccerdata
English and European soccer results 1871-2020
Stars: ✭ 615 (-1.91%)
Mutual labels:  data-science
Sigma coding youtube
This is a collection of all the code that can be found on my YouTube channel Sigma Coding.
Stars: ✭ 611 (-2.55%)
Mutual labels:  data-science

lazydata: scalable data dependencies

lazydata is a minimalist library for including data dependencies into Python projects.

Problem: Keeping all data files in git (e.g. via git-lfs) results in a bloated repository copy that takes ages to pull. Keeping code and data out of sync is a disaster waiting to happen.

Solution: lazydata only stores references to data files in git, and syncs data files on-demand when they are needed.

Why: The semantics of code and data are different - code needs to be versioned to merge it, and data just needs to be kept in sync. lazydata achieves exactly this in a minimal way.

Benefits:

  • Keeps your git repository clean with just code, while enabling seamless access to any number of linked data files
  • Data consistency assured using file hashes and automatic versioning
  • Choose your own remote storage backend: AWS S3 or, coming soon, a directory over SSH
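
The "data consistency assured using file hashes" point can be made concrete with a small, stdlib-only sketch of content hashing. The chunked-read helper below is an illustrative assumption about how a reference-based tracker might hash large files, not lazydata's actual internals:

```python
import hashlib
import os
import tempfile

def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks
    so large data files never have to fit in memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example: hash a small throwaway data file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"a,b\n1,2\n")
digest = file_hash(tmp.name)
print(digest[:16])  # first 16 of the 64 hex characters
os.unlink(tmp.name)
```

Storing only this digest in git keeps the repository small while still letting any machine verify that its copy of the data is byte-identical.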

lazydata is primarily designed for machine learning and data science projects. See this medium post for more.

Getting started

In this section we'll show how to use lazydata on an example project.

Installation

Install with pip (requires Python 3.5+):

$ pip install lazydata

Add to your project

To enable lazydata, run the following in your project root:

$ lazydata init 

This will initialise lazydata.yml, which holds the list of files managed by lazydata.
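
A rough sketch of what an init step like this can do is: create the config file only if it is not already there. The exact file contents below are an assumption for illustration, not lazydata's real template:

```python
import tempfile
from pathlib import Path

def init_config(project_root):
    """Create a minimal lazydata.yml in the project root,
    refusing to overwrite an existing one."""
    config = Path(project_root) / "lazydata.yml"
    if config.exists():
        return False  # already initialised: leave it alone
    config.write_text("version: 1\nfiles: []\n")
    return True

# Example: initialise inside a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    created = init_config(root)
    again = init_config(root)  # second call is a no-op
print(created, again)
```

Making init idempotent means re-running it on an existing project can never clobber the list of tracked files.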

Tracking a file

To start tracking a file use track("<path_to_file>") in your code:

my_script.py

from lazydata import track

# store the file when loading  
import pandas as pd
df = pd.read_csv(track("data/my_big_table.csv"))

print("Data shape:", df.shape)

Running the script the first time will start tracking the file:

$ python my_script.py
## lazydata: Tracking a new file data/my_big_table.csv
## Data shape: (10000, 100)

The file is now tracked and has been backed up to your local lazydata cache in ~/.lazydata and added to lazydata.yml:

files:
  - path: data/my_big_table.csv
    hash: 2C94697198875B6E...
    usage: my_script.py
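
A content-addressed cache like ~/.lazydata can be sketched as copy-on-track, with the file's hash as the storage key. The flat directory layout shown here is an assumption about how such a cache might be organised, not lazydata's documented format:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def cache_file(path, cache_dir):
    """Copy a tracked file into the cache under its content hash,
    skipping the copy when that exact version is already cached."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    dest = Path(cache_dir) / digest
    if not dest.exists():
        shutil.copy2(path, dest)
    return digest

# Example: cache one small file inside a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    src = Path(root) / "my_big_table.csv"
    src.write_bytes(b"a,b\n1,2\n")
    cache = Path(root) / "cache"
    cache.mkdir()
    key = cache_file(src, cache)
    cached = (cache / key).exists()
print(key[:12], cached)
```

Keying the cache by content hash means every past version of a file keeps its own slot, which is what makes switching between data versions cheap.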

If you re-run the script without modifying the data file, lazydata will just quickly check that the data file hasn't changed and won't do anything else.

If you modify the data file and re-run the script, this will add another entry to the yml file with the new hash of the data file, i.e. data files are automatically versioned. If you don't want to keep past versions, simply remove them from the yml.
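
The versioning behaviour described above boils down to an append-on-change rule: a new record is added only when the content hash is one the yml file has not seen. The record layout mirrors the yml snippet earlier; the helper name is hypothetical:

```python
import hashlib

def record_version(entries, path, data, usage="my_script.py"):
    """Append a new {path, hash, usage} record only when the file's
    content hash differs from every previously recorded version."""
    h = hashlib.sha256(data).hexdigest()
    if any(e["path"] == path and e["hash"] == h for e in entries):
        return entries  # unchanged file: nothing to do
    entries.append({"path": path, "hash": h, "usage": usage})
    return entries

entries = []
record_version(entries, "data/my_big_table.csv", b"v1 contents")
record_version(entries, "data/my_big_table.csv", b"v1 contents")  # re-run, no change
record_version(entries, "data/my_big_table.csv", b"v2 contents")  # modified file
print(len(entries))  # one entry per distinct content hash
```

Deleting an entry from the list is exactly the "remove past versions from the yml" step the text describes.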

And you are done! This data file is now tracked and linked to your local repository.

Sharing your tracked files

To access your tracked files from multiple machines, add a remote storage backend where they can be uploaded. To use S3 as a remote storage backend run:

$ lazydata add-remote s3://mybucket/lazydata

This will configure the S3 backend and also add it to lazydata.yml for future reference.

You can now git commit and push your my_script.py and lazydata.yml files as you normally would.

To copy the stored data files to S3 use:

$ lazydata push

When your collaborator pulls the latest version of the git repository, they will get the script and the lazydata.yml file as usual.

Data files will be downloaded when your collaborator runs my_script.py and the track("data/my_big_table.csv") call is executed:

$ python my_script.py
## lazydata: Downloading stored file my_big_table.csv ...
## Data shape: (10000, 100)

To get the data files without running the code, you can also use the command line utility:

# download just this file
$ lazydata pull my_big_table.csv

# download everything used in this script
$ lazydata pull my_script.py

# download everything stored in the data/ directory and subdirs
$ lazydata pull data/

# download the latest version of all data files
$ lazydata pull
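
These four forms of pull can be thought of as one matching rule over the tracked entries: match a tracked file path, match the script recorded in an entry's usage field, match a directory prefix, or match everything. This is a hedged sketch of that rule; lazydata's real resolution logic may differ:

```python
def select_files(entries, target=None):
    """Pick which tracked files a pull-style command should fetch.

    target can be a tracked file path, a script name recorded in an
    entry's usage field, a directory prefix ending in '/', or None
    (meaning: everything)."""
    if target is None:
        return [e["path"] for e in entries]
    if target.endswith("/"):
        return [e["path"] for e in entries if e["path"].startswith(target)]
    return [e["path"] for e in entries
            if e["path"] == target or e["usage"] == target]

# Hypothetical tracked entries for illustration.
entries = [
    {"path": "data/my_big_table.csv", "usage": "my_script.py"},
    {"path": "data/labels.csv", "usage": "train.py"},
    {"path": "models/weights.bin", "usage": "train.py"},
]
by_dir = select_files(entries, "data/")      # directory pull
by_script = select_files(entries, "train.py")  # everything a script uses
everything = select_files(entries)             # latest version of all files
print(by_dir, by_script, len(everything))
```

The usage field recorded at track() time is what makes the by-script pull possible without executing any code.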

Because lazydata.yml is tracked by git you can safely make and switch git branches.

Data dependency scenarios

You can achieve multiple data dependency scenarios by putting lazydata.track() into different parts of the code:

  • Jupyter notebook data dependencies by using tracking in notebooks
  • Data pipeline output tracking by tracking saved files
  • Class-level data dependencies by tracking files in __init__(self)
  • Module-level data dependencies by tracking files in __init__.py
  • Package-level data dependencies by tracking files in setup.py
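
The class-level case, for example, just means the tracking call runs when the object is constructed. The track function below is a stand-in stub used instead of lazydata.track, purely to illustrate where the dependency gets declared:

```python
pulled = []

def track(path):
    """Stand-in for lazydata.track: record the request and hand the
    path straight back, as the real call does once the file is synced."""
    pulled.append(path)
    return path

class TableLoader:
    def __init__(self):
        # Class-level data dependency: resolved on construction,
        # so every instance is guaranteed to have the file available.
        self.path = track("data/my_big_table.csv")

loader = TableLoader()
print(pulled)
```

Moving the same call into a module's __init__.py shifts the resolution point to import time, and into setup.py to install time, which is all the list above is describing.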

Coming soon...

  • Examine stored file provenance and properties
  • Faceting multiple files into portable datasets
  • Storing data coming from databases and APIs
  • More remote storage options

Stay in touch

This is an early beta release. To find out about new releases subscribe to our new releases mailing list.

Contributing

The library is licenced under the Apache-2.0 licence. All contributions are welcome!

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].