bshillingford / python-sharearray

License: Apache-2.0
Share numpy arrays across processes efficiently; ideal for large, read-only datasets

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to python-sharearray

portable-memory-mapping
Portable Memory Mapping C++ Class (Windows/Linux)
Stars: ✭ 34 (+6.25%)
Mutual labels:  mmap
seapy
State Estimation and Analysis in Python
Stars: ✭ 25 (-21.87%)
Mutual labels:  numpy
UDACITY-Deep-Learning-Nanodegree-PROJECTS
These are the projects I did on my Udacity Deep Learning Nanodegree 🌟 💻 💻. 💥 🌈
Stars: ✭ 18 (-43.75%)
Mutual labels:  numpy
lidar-buster
Collection of Python snippets for processing LiDAR point cloud.
Stars: ✭ 15 (-53.12%)
Mutual labels:  numpy
python demo
Some simple, fun little Python demos
Stars: ✭ 109 (+240.63%)
Mutual labels:  numpy
gau2grid
Fast computation of a gaussian and its derivative on a grid.
Stars: ✭ 23 (-28.12%)
Mutual labels:  numpy
onelinerhub
2.5k code solutions with clear explanation @ onelinerhub.com
Stars: ✭ 645 (+1915.63%)
Mutual labels:  numpy
Information-Retrieval
Information Retrieval algorithms developed in python. To follow the blog posts, click on the link:
Stars: ✭ 103 (+221.88%)
Mutual labels:  numpy
npbench
NPBench - A Benchmarking Suite for High-Performance NumPy
Stars: ✭ 40 (+25%)
Mutual labels:  numpy
ml-workflow-automation
Python Machine Learning (ML) project that demonstrates the archetypal ML workflow within a Jupyter notebook, with automated model deployment as a RESTful service on Kubernetes.
Stars: ✭ 44 (+37.5%)
Mutual labels:  numpy
NDScala
N-dimensional arrays in Scala 3. Think NumPy ndarray, but type-safe over shapes, array/axis labels & numeric data types
Stars: ✭ 37 (+15.63%)
Mutual labels:  numpy
datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Stars: ✭ 13,870 (+43243.75%)
Mutual labels:  numpy
AmeisenNavigation
Navigation mesh server for my bot, based on the TrinityCore MMAPs and Recast & Detour
Stars: ✭ 35 (+9.38%)
Mutual labels:  mmap
natural-neighbor-interpolation
Fast, discrete natural neighbor interpolation in 3D on the CPU.
Stars: ✭ 63 (+96.88%)
Mutual labels:  numpy
Object-sorting-using-Robotic-arm-and-Image-processing
Sorting objects of different colors using robotic arm and using computer vision (image processing).
Stars: ✭ 21 (-34.37%)
Mutual labels:  numpy
audiophile
Audio fingerprinting and recognition
Stars: ✭ 17 (-46.87%)
Mutual labels:  numpy
Poke-Pi-Dex
Our deep learning for computer vision related project for nostalgic poke weebs (Sistemi digitali, Unibo).
Stars: ✭ 18 (-43.75%)
Mutual labels:  numpy
vidpipe
Video data processing pipeline using OpenCV
Stars: ✭ 33 (+3.13%)
Mutual labels:  numpy
Python-for-data-analysis
No description or website provided.
Stars: ✭ 18 (-43.75%)
Mutual labels:  numpy
equilib
🌎→🗾Equirectangular (360/panoramic) image processing library for Python with minimal dependencies only using Numpy and PyTorch
Stars: ✭ 43 (+34.38%)
Mutual labels:  numpy

sharearray

Have you ever worried about wasting RAM by creating identical large numpy arrays across processes, e.g. datasets small enough to fit in RAM but large enough to be a concern when several jobs use the same data? sharearray efficiently caches numpy arrays in RAM (using shared memory in /dev/shm, no root needed) locally on a machine.

Usage is simple, via the cache function or the decorator decorator. The first call saves the result of the wrapped call into the built-in RAM disk and returns a read-only memory-mapped view of it. Since the data lives in RAM, there is no performance penalty. Any subsequent call with the same ID returns an identical read-only memory-mapped view, even across processes: IDs are global to the machine.

Installation:

pip install git+https://github.com/bshillingford/python-sharearray

or

git clone https://github.com/bshillingford/python-sharearray
python setup.py install

Usage

Using the decorator:

import numpy as np
import sharearray

@sharearray.decorator('some_unique_id', verbose=False)
def get_training_data():
    # create largish / expensive-to-generate data
    my_array = np.random.rand(10000, 128)  # some instance of np.ndarray
    return my_array

# first call, across all processes, creates the array
arr_view = get_training_data()

# all further calls are cached/memoized: we return a view into memory
arr_view_2 = get_training_data()
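Because the views are read-only, accidental mutation fails loudly. A quick check (plain numpy behaviour, nothing sharearray-specific):

arr_view.flags.writeable  # False: the view is read-only
# arr_view[0, 0] = 1.0    # would raise ValueError: assignment destination is read-only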

Using the cache function:

import sharearray
import numpy as np
arr = sharearray.cache('my_global_id', lambda: create_large_array())

where, for instance, create_large_array returns a large training set, potentially performing expensive feature transformations or data augmentations first.
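For concreteness, here is a hedged sketch of such a callback; the file path and the log-transform are made-up placeholders:

import numpy as np
import sharearray

def create_large_array():
    # hypothetical: load raw data and apply an expensive feature transformation
    raw = np.load('/data/train_raw.npy')     # placeholder path
    return np.log1p(raw).astype(np.float32)  # placeholder transform

arr = sharearray.cache('my_global_id', lambda: create_large_array())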

By default, the file is at /dev/shm/sharearray_my_global_id.npy. To avoid concurrency issues while the array is first being generated, and to avoid duplicated computation, concurrent first calls with the same ID are synchronized so the array is generated only once.

For further details, read the docstrings. You may be interested in the timeout, verbose, and log_func arguments (accepted by both cache and decorator).
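As illustration, a hedged sketch of passing those options; the exact semantics are in the docstrings, and the comments below are assumptions:

import logging
import sharearray

logger = logging.getLogger('sharearray-demo')

arr = sharearray.cache(
    'my_global_id',
    lambda: create_large_array(),
    timeout=600,           # assumed: how long to wait for another process generating the same ID
    verbose=True,          # assumed: emit progress/log messages
    log_func=logger.info,  # assumed: where those messages are sent
)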

PyTorch

Since PyTorch does not yet support memory-mapped files (at the time of writing), we can instead create torch Tensors that point at the memory already mapped by numpy:

import torch

data_numpy = get_training_data()           # numpy.ndarray (read-only memmap view)
data_torch = torch.from_numpy(data_numpy)  # torch.Tensor sharing the same memory, zero-copy
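The resulting tensor behaves like any other CPU tensor as long as it is not written to (recent PyTorch versions warn when wrapping a non-writable array for exactly this reason). A minimal sketch of feeding it to a DataLoader; the labels here are hypothetical:

import torch
from torch.utils.data import TensorDataset, DataLoader

labels = torch.zeros(data_torch.shape[0], dtype=torch.long)  # hypothetical labels
dataset = TensorDataset(data_torch, labels)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for batch_x, batch_y in loader:
    pass  # training step goes here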

Notes

TODO: support returning multiple arrays (e.g. as a tuple or dict) from the callback / decorated function
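Until that lands, one workaround using only the existing API is to cache each array under its own ID (the IDs and shapes below are made up):

import numpy as np
import sharearray

features = sharearray.cache('myds_features', lambda: np.random.rand(10000, 128))
labels = sharearray.cache('myds_labels', lambda: np.random.randint(0, 10, size=10000))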

Similar libraries already exist in Python, but this one packages the idea as a simple memoization-style API. It is also a single pure-Python file, with no C extensions.
