Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Created with love in Canada, visit hostnodejs.com today

Feel like to post an Ad? Learn Details

All Projects → pydata → Pyrallel

pydata / Pyrallel

Licence: mit

Experimental parallel data analysis toolkit.

Programming Languages

python

139335 projects - #7 most used programming language

Pyrallel - Parallel Data Analytics in Python

Unmaintained warning: this project has no future, use dask and dask-distributed instead.

Overview: experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.

Scope:

focus on small to medium dataset that fits in memory on a small (10+ nodes) to medium cluster (100+ nodes).
focus on small to medium data (with data locality when possible).
focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.
do not focus on HA / Fault Tolerance (yet).
do not try to invent new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and messages transfered and help identify what are the practical underlying constraints in distributed machine learning setting.

Disclaimer: the public API of this library will probably not be stable soon as the current goal of this project is to experiment.

Dependencies

The usual suspects: Python 2.7, NumPy, SciPy.

Fetch the development version (master branch) from:

StarCluster develop branch and its IPCluster plugin is also required to easily startup a bunch of nodes with IPython.parallel setup.

Patterns currently under investigation

Asynchronous & randomized hyper-parameters search (a.k.a. Randomized Grid Search) for machine learning models
Share numerical arrays efficiently over the nodes and make them available to concurrently running Python processes without making copies in memory using memory-mapped files.
Distributed Random Forests fitting.
Ensembling heterogeneous library models.
Parallel implementation of online averaged models using a MPI AllReduce, for instance using MiniBatchKMeans on partitioned data.

See the content of the examples/ folder for more details.

License

MIT

History

This project started at the PyCon 2012 PyData sprint as a set of proof of concept IPython.parallel scripts.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 116

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (1) 🔗