
Yu-Group / veridical-flow

Licence: MIT License
Making it easier to build stable, trustworthy data-science pipelines.

Programming languages: Jupyter Notebook, Python

Projects that are alternatives of or similar to veridical-flow

Weightedcalcs
Pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
Stars: ✭ 83 (+196.43%)
Mutual labels:  statistics, pandas
Ee Outliers
Open-source framework to detect outliers in Elasticsearch events
Stars: ✭ 172 (+514.29%)
Mutual labels:  statistics, ml
Sweetviz
Visualize and compare datasets, target values and associations, with one line of code.
Stars: ✭ 1,851 (+6510.71%)
Mutual labels:  statistics, pandas
Pandas Profiling
Create HTML profiling reports from pandas DataFrame objects
Stars: ✭ 8,329 (+29646.43%)
Mutual labels:  statistics, pandas
Data Science Free
Free Resources For Data Science created by Shubham Kumar
Stars: ✭ 232 (+728.57%)
Mutual labels:  statistics, ml
Pycm
Multi-class confusion matrix library in Python
Stars: ✭ 1,076 (+3742.86%)
Mutual labels:  statistics, ml
Machine Learning With Python
Practice and tutorial-style notebooks covering wide variety of machine learning techniques
Stars: ✭ 2,197 (+7746.43%)
Mutual labels:  statistics, pandas
Stats Maths With Python
General statistics, mathematical programming, and numerical/scientific computing scripts and notebooks in Python
Stars: ✭ 381 (+1260.71%)
Mutual labels:  statistics, pandas
Algorithmic-Trading
I have been deeply interested in algorithmic trading and systematic trading algorithms. This repository contains the code of what I have learnt on the way. It starts from some basic simple statistics and will lead up to complex machine learning algorithms.
Stars: ✭ 47 (+67.86%)
Mutual labels:  statistics, pandas
Imodels
Interpretable ML package 🔍 for concise, transparent, and accurate predictive modeling (sklearn-compatible).
Stars: ✭ 194 (+592.86%)
Mutual labels:  statistics, ml
Fecon235
Notebooks for financial economics. Keywords: Jupyter notebook pandas Federal Reserve FRED Ferbus GDP CPI PCE inflation unemployment wage income debt Case-Shiller housing asset portfolio equities SPX bonds TIPS rates currency FX euro EUR USD JPY yen XAU gold Brent WTI oil Holt-Winters time-series forecasting statistics econometrics
Stars: ✭ 708 (+2428.57%)
Mutual labels:  statistics, pandas
Polyaxon
Machine Learning Platform for Kubernetes (MLOps tools for experimentation and automation)
Stars: ✭ 2,966 (+10492.86%)
Mutual labels:  workflow, ml
Pingouin
Statistical package in Python based on Pandas
Stars: ✭ 651 (+2225%)
Mutual labels:  statistics, pandas
Fecon236
Tools for financial economics. Curated wrapper over Python ecosystem. Source code for fecon235 Jupyter notebooks.
Stars: ✭ 72 (+157.14%)
Mutual labels:  statistics, pandas
Dataframe Go
DataFrames for Go: For statistics, machine-learning, and data manipulation/exploration
Stars: ✭ 487 (+1639.29%)
Mutual labels:  statistics, pandas
Ml Dl Scripts
The repository provides useful Python scripts for ML and data analysis
Stars: ✭ 119 (+325%)
Mutual labels:  statistics, ml
fairlens
Identify bias and measure fairness of your data
Stars: ✭ 51 (+82.14%)
Mutual labels:  statistics, pandas
Csinva.github.io
Slides, paper notes, class notes, blog posts, and research on ML 📉, statistics 📊, and AI 🤖.
Stars: ✭ 342 (+1121.43%)
Mutual labels:  statistics, ml
Choochoo
Training Diary
Stars: ✭ 186 (+564.29%)
Mutual labels:  statistics, pandas
prosto
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
Stars: ✭ 54 (+92.86%)
Mutual labels:  workflow, pandas


A library for making stability analysis simple. Easily evaluate the effect of judgment calls on your data-science pipeline (e.g. the choice of imputation strategy)!

MIT licensed. Requires Python 3.6+.

Why use vflow?

vflow's simple wrappers make it easy to apply many data-science best practices and to write pipelines that follow the veridical data-science framework.

Stability: Replace a single function (e.g. preprocessing) with a set of functions and easily assess the stability of downstream results.
Computation: Automatic parallelization and caching throughout the pipeline.
Reproducibility: Automatic experiment tracking and saving.
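
To make the stability idea concrete without depending on vflow itself, here is a minimal plain-Python sketch. It does not use vflow's API: the helper names (impute_mean, impute_median, downstream_mean) and the toy data are hypothetical, chosen only to show what replacing one preprocessing function with a set of functions and comparing the downstream results might look like.

```python
from statistics import mean, median

# Toy data with missing values (None); purely illustrative, not real data.
data = [1.0, 2.0, None, 4.0, 100.0, None]


def impute_mean(values):
    """Fill missing entries with the mean of the observed entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_median(values):
    """Fill missing entries with the median of the observed entries."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def downstream_mean(values):
    """Stand-in for whatever result the rest of the pipeline produces."""
    return mean(values)


# The "set of functions" that replaces a single preprocessing step.
imputers = {"mean": impute_mean, "median": impute_median}

# Run the downstream step once per imputation choice.
results = {name: downstream_mean(func(data)) for name, func in imputers.items()}

# A simple stability summary: how far apart the downstream results are.
spread = max(results.values()) - min(results.values())
print(results, spread)
```

A large spread relative to the quantity of interest suggests that the downstream result is sensitive to this particular judgment call.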

Here we show a simple example of an entire data-science pipeline with several perturbations (e.g. different data subsamples, models, and metrics) written simply using vflow.

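# import the sklearn submodules explicitly; a bare `import sklearn`
# does not guarantee that they are loaded
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.tree
import sklearn.utils
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from vflow import init_args, Vset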

# initialize data
X, y = sklearn.datasets.make_classification()
X_train, X_test, y_train, y_test = init_args(
    sklearn.model_selection.train_test_split(X, y),
    names=['X_train', 'X_test', 'y_train', 'y_test']  # optionally name the args
)

# subsample data
subsampling_funcs = [
    sklearn.utils.resample for _ in range(3)
]
subsampling_set = Vset(name='subsampling',
                       modules=subsampling_funcs,
                       output_matching=True)
X_trains, y_trains = subsampling_set(X_train, y_train)

# fit models
models = [
    sklearn.linear_model.LogisticRegression(),
    sklearn.tree.DecisionTreeClassifier()
]
modeling_set = Vset(name='modeling',
                    modules=models,
                    module_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)
preds_test = modeling_set.predict(X_test)

# get metrics
binary_metrics_set = Vset(name='binary_metrics',
                          modules=[accuracy_score, balanced_accuracy_score],
                          module_keys=["Acc", "Bal_Acc"])
binary_metrics = binary_metrics_set.evaluate(preds_test, y_test)

Once we've written this pipeline, we can easily measure the stability of metrics (e.g. "Accuracy") to our choice of subsampling or model.
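
As an illustration of what measuring stability can mean here, the sketch below summarises a metric across subsamples for each model, using only the standard library. The dictionary layout, the function name stability_by_model, and the accuracy values are hypothetical placeholders; they are not vflow's actual output format and not real results.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical accuracies keyed by (subsample, model); placeholder values only.
accuracy = {
    ("subsample_0", "LR"): 0.80,
    ("subsample_1", "LR"): 0.84,
    ("subsample_2", "LR"): 0.82,
    ("subsample_0", "DT"): 0.70,
    ("subsample_1", "DT"): 0.90,
    ("subsample_2", "DT"): 0.80,
}


def stability_by_model(scores):
    """Return {model: (mean, sample std)} of the metric across subsamples."""
    grouped = defaultdict(list)
    for (_subsample, model), value in scores.items():
        grouped[model].append(value)
    return {model: (mean(vals), stdev(vals)) for model, vals in grouped.items()}


summary = stability_by_model(accuracy)
for model, (avg, sd) in summary.items():
    print(f"{model}: mean={avg:.3f}, std={sd:.3f}")
```

A smaller standard deviation across subsamples suggests the metric is more stable to the subsampling choice for that model. Grouping by subsample instead of by model would give the analogous summary for the choice of model.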

Documentation

See the docs for the API reference.

Notebook examples (note that some of these require more dependencies than vflow alone; to install them all, use the notebooks dependencies in the setup.py file)

Synthetic classification

Enhancer genomics

fMRI voxel prediction

Fashion mnist classification

Feature importance stability

Clinical decision rule vetting

Installation

Install with pip install vflow (see here for help). For the development version (unstable), clone the repo and run python setup.py develop from the repo directory.
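
One way to confirm that the installation worked is to ask Python for the installed version. The helper below is a small sketch and not part of vflow; the name installed_version is hypothetical, and it relies on importlib.metadata, which is only in the standard library from Python 3.8 onward.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed vflow version, or None if vflow is not installed.
print(installed_version("vflow"))
```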

References

@software{duncan2020vflow,
   author = {Duncan, James and Kapoor, Rush and Agarwal, Abhineet and Singh, Chandan and Yu, Bin},
   doi = {10.21105/joss.03895},
   month = {1},
   title = {{VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS}},
   url = {https://doi.org/10.21105/joss.03895},
   year = {2022}
}