podgorskiy / Dareblopy

Licence: apache-2.0
Data Reading Blocks for Python

Framework-agnostic, faster data reading for deep learning.

A native extension for Python built with C++ and pybind11.

Installation | Why? | What is the performance gain? | Tutorial | License

DataReadingBlocks for Python is a Python module that provides a collection of C++-backed data reading primitives. It targets deep-learning needs, but it is framework agnostic.

Installation

Available as a PyPI package:

$ pip install dareblopy

To build from source, refer to the wiki page.

Why?

Development initially started in order to speed up reading from ZIP archives, reduce data copying, and keep the GIL released for longer to improve concurrency.

But why read from a ZIP archive? Reading a ton of small files (which is often the case) can be slow, especially if the drive is network attached, e.g. with NFS. However, the bottleneck here is hardly the disk speed, but rather the overhead of the filesystem: name lookup, creating file descriptors, and additional network usage if NFS is used.

If all the small files are agglomerated into a larger file (or several large files), performance improves substantially. This is exactly the reason behind TFRecords in TensorFlow:

To read data efficiently it can be helpful to serialize your data and store it in a set of files (100-200MB each) that can each be read linearly. This is especially true if the data is being streamed over a network. This can also be useful for caching any data-preprocessing.

The downside of TFRecords is that it's TensorFlow only.

A much simpler, yet still effective, solution is to store the data in a ZIP archive with zero compression. However, using the zipfile package from the standard library can be slow, since it is implemented purely in Python and in certain cases can cause unnecessary data copying.
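As a minimal sketch of the "ZIP archive with zero compression" idea, the snippet below packs a directory into a stored (uncompressed) archive using only the standard library. The helper name `pack_uncompressed` and the sample file names are hypothetical and are not part of DareBlopy.

```python
import os
import tempfile
import zipfile


def pack_uncompressed(src_dir, zip_path):
    """Store every file under src_dir in zip_path without compression."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for dirpath, _dirnames, filenames in os.walk(src_dir):
            for filename in sorted(filenames):
                full_path = os.path.join(dirpath, filename)
                # Store paths relative to src_dir so the archive is relocatable.
                zf.write(full_path, arcname=os.path.relpath(full_path, src_dir))


with tempfile.TemporaryDirectory() as root:
    # Create a few small placeholder files to pack.
    src_dir = os.path.join(root, "images")
    os.makedirs(src_dir)
    for i in range(3):
        with open(os.path.join(src_dir, f"{i}.bin"), "wb") as f:
            f.write(bytes([i]) * 1024)

    zip_path = os.path.join(root, "images.zip")
    pack_uncompressed(src_dir, zip_path)

    # Check that every entry was stored rather than deflated.
    with zipfile.ZipFile(zip_path) as zf:
        infos = zf.infolist()
        stored = all(info.compress_type == zipfile.ZIP_STORED for info in infos)
        names = sorted(info.filename for info in infos)
```

Already compressed payloads such as JPEG gain little from a second compression pass, so storing them as-is keeps reads cheap.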

That's precisely the reason behind the development of DareBlopy. In addition, it also has the following features:

  • Reading JPEG images directly to numpy arrays (from ZIP and from the filesystem), to reduce memory usage and unnecessary data copies.
  • Two JPEG backends selectable at run-time: libjpeg and libjpeg-turbo. Both backends are embedded into DareBlopy and do not depend on any system package.
  • Reading of TFRecords (not all features are supported, though) without a dependency on TensorFlow, which makes datasets stored as TFRecords usable with ML frameworks other than TensorFlow, e.g. PyTorch.
  • Random yielders, iterators, and dataloaders to simplify doing deep learning with TFRecords in other ML frameworks.
  • No dependency on system packages. You install it from pip - it works.
  • Support for compressed ZIP archives, including LZ4 compression.
  • Virtual filesystem. Allows mounting of zip archives.

What is the performance gain?

Well, it depends a lot on the particular use case. Let's consider several. All details of the benchmarks can be found in run_benchmark.py. You can also run it on your machine and compare the results with the ones reported here.

Reading files to bytes

Python's bytes object can be a bit nasty. Generally speaking, you cannot return data from C/C++ land as a bytes object without making a copy. That's because the memory for a bytes object must be allocated as one chunk holding both the header and the data itself. In DareBlopy this extra copy is eliminated; you can find details here.

In this test scenario, we read 200 files of about 30 KB each. Reading is done from the local filesystem and from a ZIP archive.

Reading files with DareBlopy is faster even from the filesystem, and when reading from ZIP it provides a substantial improvement.
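To make the scenario concrete, here is a rough sketch of the pure-Python baseline only: 200 files of 30 KB read with `open()` and with the standard `zipfile` module. It is not the project's run_benchmark.py, it does not call DareBlopy, and all helper names are hypothetical. Timings vary by machine, so none are quoted here.

```python
import os
import tempfile
import time
import zipfile


def make_dataset(root, count=200, size=30 * 1024):
    """Write `count` random files to root/files and to a stored ZIP."""
    files_dir = os.path.join(root, "files")
    os.makedirs(files_dir)
    zip_path = os.path.join(root, "files.zip")
    names = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for i in range(count):
            name = f"{i}.bin"
            payload = os.urandom(size)
            with open(os.path.join(files_dir, name), "wb") as f:
                f.write(payload)
            zf.writestr(name, payload)
            names.append(name)
    return files_dir, zip_path, names


def read_from_filesystem(files_dir, names):
    """Read every file into a bytes object using plain open()."""
    out = []
    for name in names:
        with open(os.path.join(files_dir, name), "rb") as f:
            out.append(f.read())
    return out


def read_from_zip(zip_path, names):
    """Read every archive member into a bytes object using zipfile."""
    with zipfile.ZipFile(zip_path) as zf:
        return [zf.read(name) for name in names]


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


with tempfile.TemporaryDirectory() as root:
    files_dir, zip_path, names = make_dataset(root)
    fs_data, fs_seconds = timed(read_from_filesystem, files_dir, names)
    zip_data, zip_seconds = timed(read_from_zip, zip_path, names)
    print(f"filesystem: {fs_seconds * 1e3:.2f} ms, zipfile: {zip_seconds * 1e3:.2f} ms")
```

A single timed pass like this is noisy and is affected by the OS page cache; repeating the measurement and taking the median gives a fairer baseline.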

Reading JPEGs to numpy's ndarray

This is where DareBlopy's feature of reading directly to a numpy array is demonstrated. When the file is read, it is decompressed directly into a preallocated numpy array, and all of that happens in C++ land while the GIL is released.

Note: here PIL v.7.0.0 is used, on Ubuntu 18. In my installation, it does not use libjpeg-turbo.

In this case, the difference between ZIP and filesystem is quite insignificant, but things change dramatically if the filesystem is streamed over a network.

Reading TFRecords

DareBlopy can read TensorFlow records. This functionality was originally developed for reading the FFHQ dataset from TFRecords.

It introduces an alias for the string type, uint8, which allows a numpy array to be returned directly if the shape is known beforehand.

For example, code like:

        features = {
            'data': db.FixedLenFeature([], db.string)
        }

Can be replaced with:

        features = {
            'data': db.FixedLenFeature([3, 32, 32], db.uint8)
        }
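For intuition, the sketch below shows the manual step that a string-typed feature leaves to the caller: turning raw bytes into a `[3, 32, 32]` array of `uint8`. It uses only numpy, with synthetic bytes standing in for a record's `data` field, and it is an illustration rather than DareBlopy's internal implementation.

```python
import numpy as np

# Stand-in for the raw bytes stored in the 'data' field of one record.
raw = bytes(range(256)) * 12  # 3 * 32 * 32 = 3072 bytes

# With a string feature the caller receives bytes and must reinterpret
# and reshape them manually. np.frombuffer creates a read-only view
# over the bytes, without copying them.
image = np.frombuffer(raw, dtype=np.uint8).reshape(3, 32, 32)
```

Declaring the feature with a shape and `db.uint8` moves this step into the parser, so user code receives the array directly.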

This decoding to a numpy array comes at zero cost, as the benchmark demonstrates.

Tutorial

Import DareBlopy

import dareblopy as db
from IPython.display import Image, display
import PIL.Image

Open a ZIP archive:

archive = db.open_zip_archive("test_utils/test_image_archive.zip")

Read an image to bytes and display it:

b = archive.open_as_bytes('0.jpg')
Image(b)

Alternatively, read the image to a numpy array:

img = archive.read_jpg_as_numpy('0.jpg')
img.shape
(256, 256, 3)
display(PIL.Image.fromarray(img))

For more advanced usage please refer to:

License

Apache License 2.0
