dcherian / flox

License: Apache-2.0
Fast & furious GroupBy operations for dask.array

Programming Languages

python

Projects that are alternatives of or similar to flox

Xarray
N-D labeled arrays and datasets in Python
Stars: ✭ 2,353 (+5502.38%)
Mutual labels:  xarray, dask
xarray-beam
Distributed Xarray with Apache Beam
Stars: ✭ 83 (+97.62%)
Mutual labels:  xarray, dask
esmlab
Earth System Model Lab (esmlab). ⚠️⚠️ ESMLab functionality has been moved into <https://github.com/NCAR/geocat-comp>. ⚠️⚠️
Stars: ✭ 23 (-45.24%)
Mutual labels:  xarray, dask
aospy
Python package for automated analysis and management of gridded climate data
Stars: ✭ 80 (+90.48%)
Mutual labels:  xarray
dvc dask use case
A use case of a reproducible machine learning pipeline using Dask, DVC, and MLflow.
Stars: ✭ 22 (-47.62%)
Mutual labels:  dask
spyndex
Awesome Spectral Indices in Python.
Stars: ✭ 56 (+33.33%)
Mutual labels:  xarray
pypar
Efficient and scalable parallelism using the message passing interface (MPI) to handle big data and highly computational problems.
Stars: ✭ 66 (+57.14%)
Mutual labels:  map-reduce
FoldsCUDA.jl
Data-parallelism on CUDA using Transducers.jl and for loops (FLoops.jl)
Stars: ✭ 48 (+14.29%)
Mutual labels:  map-reduce
gcpy
Python toolkit for GEOS-Chem.
Stars: ✭ 34 (-19.05%)
Mutual labels:  xarray
climate system
Notes and practicals for my "Physics of the Climate System" lecture
Stars: ✭ 13 (-69.05%)
Mutual labels:  xarray
future.mapreduce
[EXPERIMENTAL] R package: future.mapreduce - Utility Functions for Future Map-Reduce API Packages
Stars: ✭ 12 (-71.43%)
Mutual labels:  map-reduce
xpublish
Publish Xarray Datasets via a REST API.
Stars: ✭ 86 (+104.76%)
Mutual labels:  xarray
php-uavt-adreskodu-botu
Php ile uavt adres kodu botu
Stars: ✭ 2 (-95.24%)
Mutual labels:  dask
clisops
Climate Simulation Operations
Stars: ✭ 17 (-59.52%)
Mutual labels:  xarray
hypothesis-gufunc
Extension to hypothesis for testing numpy general universal functions
Stars: ✭ 32 (-23.81%)
Mutual labels:  xarray
madpy-dask
MadPy Dask talk materials
Stars: ✭ 33 (-21.43%)
Mutual labels:  dask
restee
Python package to call processed EE objects via the REST API to local data
Stars: ✭ 26 (-38.1%)
Mutual labels:  xarray
arboreto
A scalable python-based framework for gene regulatory network inference using tree-based ensemble regressors.
Stars: ✭ 33 (-21.43%)
Mutual labels:  dask
open-soql
Open source implementation of the SOQL.
Stars: ✭ 15 (-64.29%)
Mutual labels:  map-reduce
dask-rasterio
Read and write rasters in parallel using Rasterio and Dask
Stars: ✭ 82 (+95.24%)
Mutual labels:  dask

flox

This project explores strategies for fast GroupBy reductions with dask.array. It used to be called dask_groupby. It was motivated by:

  1. Dask Dataframe GroupBy blogpost
  2. numpy_groupies in Xarray issue

(See a presentation about this package, from the Pangeo Showcase).

Acknowledgements

This work was funded in part by NASA-ACCESS 80NSSC18M0156 "Community tools for analysis of NASA Earth Observing System Data in the Cloud" (PI J. Hamman), and NCAR's Earth System Data Science Initiative. It was motivated by many discussions in the Pangeo community.

API

There are two main functions:

  1. flox.groupby_reduce(dask_array, by_dask_array, "mean"): a "pure" dask array interface.
  2. flox.xarray.xarray_reduce(xarray_object, by_dataarray, "mean"): a "pure" xarray interface, though work is ongoing to integrate this package in xarray.
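
To make concrete what a grouped "mean" reduction computes, here is a minimal NumPy-only sketch on in-memory data. It does not call flox and says nothing about how flox does this internally; the helper name naive_groupby_mean is hypothetical and exists only for illustration.

```python
import numpy as np


def naive_groupby_mean(array, by):
    """Reference groupby-mean (hypothetical helper, not part of flox)."""
    # Map each label to an integer code; `groups` holds the sorted unique labels.
    groups, codes = np.unique(by, return_inverse=True)
    # Per-group sum and per-group count, then divide.
    sums = np.bincount(codes, weights=array, minlength=len(groups))
    counts = np.bincount(codes, minlength=len(groups))
    return groups, sums / counts


array = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
by = np.array(["a", "b", "a", "b", "a", "c"])
groups, result = naive_groupby_mean(array, by)
```

Here groups holds the unique labels and result holds one mean per label, in the same order.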

Implementation

See the documentation for details on the implementation.

Custom reductions

flox implements all common reductions provided by numpy_groupies in aggregations.py. It also allows you to specify a custom Aggregation (again inspired by dask.dataframe), though this might not be fully functional at the moment. See aggregations.py for examples.

    import numpy as np
    from flox.aggregations import Aggregation

    mean = Aggregation(
        # name used for dask tasks
        name="mean",
        # operation to use for pure-numpy inputs
        numpy="mean",
        # blockwise reduction
        chunk=("sum", "count"),
        # combine intermediate results: sum the sums, sum the counts
        combine=("sum", "sum"),
        # generate final result as sum / count
        finalize=lambda sum_, count: sum_ / count,
        # Used when "reindexing" at combine-time
        fill_value=0,
        # Used when any member of `expected_groups` is not found
        final_fill_value=np.nan,
    )
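
The chunk / combine / finalize stages described in the comments above can be sketched with plain NumPy. This is an illustration of the map-reduce idea under simplifying assumptions (integer group codes, a known number of groups, two list slices standing in for dask blocks), not flox's actual implementation; the function names chunk_step, combine_step and finalize_step are hypothetical.

```python
import numpy as np

NGROUPS = 3  # assumed known ahead of time, similar in spirit to `expected_groups`


def chunk_step(values, codes):
    """Blockwise reduction: per-group sum and count for one chunk."""
    # Groups absent from this chunk get 0 for both sum and count,
    # which plays the role of fill_value=0.
    sums = np.bincount(codes, weights=values, minlength=NGROUPS)
    counts = np.bincount(codes, minlength=NGROUPS)
    return sums, counts


def combine_step(intermediates):
    """Combine intermediate results: sum the sums, sum the counts."""
    sums = np.sum([s for s, _ in intermediates], axis=0)
    counts = np.sum([c for _, c in intermediates], axis=0)
    return sums, counts


def finalize_step(sums, counts, final_fill_value=np.nan):
    """Final result is sum / count; groups never seen keep final_fill_value."""
    out = np.full(NGROUPS, final_fill_value, dtype=float)
    np.divide(sums, counts, out=out, where=counts > 0)
    return out


values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
codes = np.array([0, 1, 0, 1, 0, 1])  # integer group labels; group 2 never occurs

# Two slices stand in for two dask blocks.
chunks = [(values[:3], codes[:3]), (values[3:], codes[3:])]
intermediates = [chunk_step(v, c) for v, c in chunks]
result = finalize_step(*combine_step(intermediates))
```

Because sums and counts can each be added across chunks, the mean never needs all members of a group in one place; only the small per-group intermediates are combined.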