maki-nage / distogram

Licence: MIT license
A library to compute histograms on distributed environments, on streaming data

Programming Languages

python

DistoGram

DistoGram is a library for computing histograms on streaming data in distributed environments. The implementation follows the algorithms described in Ben-Haim's Streaming Parallel Decision Trees.
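
To give an intuition for that family of algorithms, here is a minimal pure-Python sketch of the bin update and merge steps as they are commonly described for this kind of streaming histogram. It is not DistoGram's code: the names sketch_update, sketch_merge, shrink and max_bins are hypothetical, and a histogram is assumed to be a plain sorted list of (centroid, count) pairs.

```python
import bisect
import random


def shrink(bins, max_bins):
    """Merge the two closest adjacent bins until at most max_bins remain."""
    while len(bins) > max_bins:
        # Index of the adjacent pair separated by the smallest gap.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (c1, n1), (c2, n2) = bins[i], bins[i + 1]
        # The merged bin sits at the count-weighted average of both centroids.
        bins[i:i + 2] = [((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)]
    return bins


def sketch_update(bins, value, max_bins=100):
    """Add one value to the sorted list of (centroid, count) bins, in place."""
    centroids = [c for c, _ in bins]
    i = bisect.bisect_left(centroids, value)
    if i < len(bins) and bins[i][0] == value:
        bins[i] = (value, bins[i][1] + 1)  # value already has its own bin
    else:
        bins.insert(i, (value, 1))         # otherwise open a new bin
    return shrink(bins, max_bins)


def sketch_merge(bins_a, bins_b, max_bins=100):
    """Combine two histograms, e.g. computed on two different workers."""
    return shrink(sorted(bins_a + bins_b), max_bins)


# Two "workers" each summarize half of the data, then the results are merged.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(2000)]
left, right = [], []
for x in data[:1000]:
    sketch_update(left, x, max_bins=20)
for x in data[1000:]:
    sketch_update(right, x, max_bins=20)

combined = sketch_merge(left, right, max_bins=20)
total = sum(n for _, n in combined)
mean = sum(c * n for c, n in combined) / total
```

Because merged centroids are count-weighted averages, the total count and the mean survive both steps (up to floating-point rounding), while quantiles and bin counts become approximations whose accuracy depends on max_bins.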

Get Started

First create a compressed representation of a distribution:

import numpy as np
import distogram

distribution = np.random.normal(size=10000)

# Create a Distogram and feed it from the distribution.
# In real usage, the data would come from an event stream.
h = distogram.Distogram()
for i in distribution:
    h = distogram.update(h, i)

Compute statistics on the distribution:

nmin, nmax = distogram.bounds(h)
print("count: {}".format(distogram.count(h)))
print("mean: {}".format(distogram.mean(h)))
print("stddev: {}".format(distogram.stddev(h)))
print("min: {}".format(nmin))
print("5%: {}".format(distogram.quantile(h, 0.05)))
print("25%: {}".format(distogram.quantile(h, 0.25)))
print("50%: {}".format(distogram.quantile(h, 0.50)))
print("75%: {}".format(distogram.quantile(h, 0.75)))
print("95%: {}".format(distogram.quantile(h, 0.95)))
print("max: {}".format(nmax))

Example output:

count: 10000
mean: -0.005082954640481095
stddev: 1.0028524290149186
min: -3.5691130319855047
5%: -1.6597242392338374
25%: -0.6785107421744653
50%: -0.008672960012168916
75%: 0.6720718926935414
95%: 1.6476822301131866
max: 3.8800560034877427

Compute and display the histogram of the distribution:

import pandas as pd
import plotly.express as px

hist = distogram.histogram(h)
df_hist = pd.DataFrame(np.array(hist), columns=["bin", "count"])
fig = px.bar(df_hist, x="bin", y="count", title="distogram")
fig.update_layout(height=300)
fig.show()

[Figure: docs/normal_histogram.png]

Install

DistoGram is available on PyPI and can be installed with pip:

pip install distogram

Play With Me

You can test this library directly on this live notebook.

Performance

Distogram is designed for fast updates when using native Python types. The following numbers are the results of the benchmark program located in the examples.

On an Intel i7-9800X CPU, the results are:

Interpreter    Operation    NumPy    Req/s
PyPy 7.3       update       no       6563311
PyPy 7.3       update       yes      111318
CPython 3.7    update       no       436709
CPython 3.7    update       yes      251603

On a modest 2014 13" MacBook Pro, the results are:

Interpreter    Operation    NumPy    Req/s
PyPy 7.3       update       no       3572436
PyPy 7.3       update       yes      37630
CPython 3.7    update       no       112749
CPython 3.7    update       yes      81005

As these numbers show, you are encouraged to use PyPy with native Python types. PyPy's JIT is penalized by NumPy native types, which causes a large performance hit. Moreover, the streaming philosophy of Distogram is better suited to native Python types, whereas NumPy is optimized for batch computations, even with CPython.
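
One way to act on this advice, when the values happen to arrive as NumPy scalars, is to convert each one to a built-in float before it reaches the update call. The helper below is a hypothetical sketch (native_floats is not part of Distogram); it relies only on the fact that float() accepts ints, floats and NumPy scalars such as np.float64.

```python
def native_floats(values):
    """Yield each incoming value as a built-in Python float."""
    for v in values:
        # float() returns a plain Python float for ints, floats and
        # NumPy scalars alike.
        yield float(v)


samples = list(native_floats([1, 2.5, 3]))
```

In the Get Started loop this would read `for i in native_floats(distribution): h = distogram.update(h, i)`. For an array that is already in memory, `distribution.tolist()` performs the same conversion in a single call. Whether the conversion pays off depends on the interpreter and the workload, so it is worth measuring with the benchmark program.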

Credits

Although this code has been written by following the aforementioned research paper, some parts are also inspired by the implementation from Carson Farmer.

Thanks to John Belmonte for his help with performance and accuracy improvements.
