
CamDavidsonPilon / Tdigest

License: MIT
t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark

Programming Languages

Python
139,335 projects; #7 most used programming language

Projects that are alternatives to or similar to Tdigest

pyspark-algorithms
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (-73.72%)
Mutual labels:  distributed-computing, pyspark, mapreduce
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-45.26%)
Mutual labels:  pyspark, distributed-computing
Data Algorithms Book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Stars: ✭ 949 (+246.35%)
Mutual labels:  mapreduce, distributed-computing
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (-87.59%)
Mutual labels:  pyspark, mapreduce
ParallelUtilities.jl
Fast and easy parallel mapreduce on HPC clusters
Stars: ✭ 28 (-89.78%)
Mutual labels:  distributed-computing, mapreduce
dlsa
Distributed least squares approximation (dlsa) implemented with Apache Spark
Stars: ✭ 25 (-90.88%)
Mutual labels:  distributed-computing, pyspark
data-algorithms-with-spark
O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
Stars: ✭ 34 (-87.59%)
Mutual labels:  pyspark, mapreduce
server
Hashtopolis - A Hashcat wrapper for distributed hashcracking
Stars: ✭ 954 (+248.18%)
Mutual labels:  distributed-computing
Awesome-Federated-Machine-Learning
Everything about federated learning, including research papers, books, codes, tutorials, videos and beyond
Stars: ✭ 190 (-30.66%)
Mutual labels:  distributed-computing
dtail
DTail is a distributed DevOps tool for tailing, grepping, catting logs and other text files on many remote machines at once.
Stars: ✭ 112 (-59.12%)
Mutual labels:  mapreduce
frovedis
Framework of vectorized and distributed data analytics
Stars: ✭ 59 (-78.47%)
Mutual labels:  distributed-computing
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-59.49%)
Mutual labels:  pyspark
mobius
Mobius is an AI infra platform including realtime computing and training.
Stars: ✭ 22 (-91.97%)
Mutual labels:  distributed-computing
st-hadoop
ST-Hadoop is an open-source MapReduce extension of Hadoop designed specifically to analyze spatio-temporal data efficiently
Stars: ✭ 17 (-93.8%)
Mutual labels:  mapreduce
interbit
To the end of servers
Stars: ✭ 23 (-91.61%)
Mutual labels:  distributed-computing
incubator-linkis
Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,459 (+797.45%)
Mutual labels:  pyspark
Charm4py
Parallel Programming with Python and Charm++
Stars: ✭ 259 (-5.47%)
Mutual labels:  distributed-computing
SciFlow
Scientific workflow management
Stars: ✭ 49 (-82.12%)
Mutual labels:  distributed-computing
spark-extension
A library that provides useful extensions to Apache Spark and PySpark.
Stars: ✭ 25 (-90.88%)
Mutual labels:  pyspark
SadlyDistributed
Distributing your code(soul), in almost any language(state), among a cluster of idle browsers(voids)
Stars: ✭ 20 (-92.7%)
Mutual labels:  distributed-computing

tdigest

Efficient percentile estimation of streaming or distributed data


This is a Python implementation of Ted Dunning's t-digest data structure. The t-digest is designed to compute accurate estimates of percentiles, quantiles, trimmed means, and similar statistics from streaming or distributed data. Two t-digests can be added together, which makes the data structure ideal for map-reduce settings, and a digest can be serialized into much less than 10 kB (instead of storing the entire list of data).

See a blog post about it here: Percentile and Quantile Estimation of Big Data: The t-Digest

Installation

tdigest is compatible with both Python 2 and Python 3.

pip install tdigest

Usage

Update the digest sequentially

from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15))  # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution

Update the digest in batches

another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))

Sum two digests to create a new digest

sum_digest = digest + another_digest 
sum_digest.percentile(30)  # about 0.3
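
This addition is what makes the structure map-reduce friendly: build one digest per data partition, then merge. Below is a minimal PySpark sketch of that pattern (the session setup, example data, and the partition_digest helper are illustrative assumptions, not part of this library's API):

from pyspark.sql import SparkSession
from tdigest import TDigest

spark = SparkSession.builder.appName("tdigest-demo").getOrCreate()

# Hypothetical example data: 100,000 integers spread across 8 partitions.
rdd = spark.sparkContext.parallelize(range(100000), numSlices=8)

def partition_digest(values):
    # Map step: build one TDigest per partition.
    digest = TDigest()
    for v in values:
        digest.update(v)
    yield digest

# Reduce step: merge the per-partition digests with +.
merged = rdd.mapPartitions(partition_digest).reduce(lambda a, b: a + b)
print(merged.percentile(50))  # approximately the median, ~50000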

Converting a digest to a dict or serializing with JSON

You can use the to_dict() method to turn a TDigest object into a standard Python dictionary.

digest = TDigest()
digest.update(1)
digest.update(2)
digest.update(3)
print(digest.to_dict())
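
The result is a plain dictionary of the digest's parameters and its centroids, roughly of the following shape (assuming the library's defaults of K=25 and delta=0.01; exact centroid contents can vary by version):

{'K': 25, 'delta': 0.01, 'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]}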

Or you can get only a list of Centroids with centroids_to_list().

digest.centroids_to_list()

Similarly, you can restore a digest from a Python dict of its values with update_from_dict(). Centroids are merged with any existing ones in the digest. For example, make a fresh digest and restore values from a Python dictionary.

digest = TDigest()
digest.update_from_dict({'K': 25, 'delta': 0.01, 'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]})

The K and delta values are optional; alternatively, you can provide only a list of centroids with update_centroids_from_list().

digest = TDigest()
digest.update_centroids_from_list([{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}])

If you want to serialize with other tools like JSON, you can first convert the digest with to_dict().

import json

json.dumps(digest.to_dict())

Alternatively, make a custom encoder function to provide as default to the standard json module.

def encoder(digest_obj):
    return digest_obj.to_dict()

Then pass the encoder function as the default parameter.

json.dumps(digest, default=encoder)
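
Putting the two halves together, a round trip through JSON might look like this (a sketch; the restored digest should report a median near 2 for the three values inserted):

import json
from tdigest import TDigest

digest = TDigest()
digest.batch_update([1, 2, 3])

payload = json.dumps(digest.to_dict())          # digest -> dict -> JSON string

restored = TDigest()
restored.update_from_dict(json.loads(payload))  # JSON string -> dict -> digest
print(restored.percentile(50))                  # ~2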

API

TDigest.

  • update(x, w=1): update the tdigest with value x and weight w.
  • batch_update(x, w=1): update the tdigest with the values in array x, each with weight w.
  • compress(): perform a compression on the underlying data structure that will shrink its memory footprint without hurting accuracy. Good to perform after adding many values (see the example after this list).
  • percentile(p): return the pth percentile. Example: p=50 is the median.
  • cdf(x): return the value of the CDF at x.
  • trimmed_mean(p1, p2): return the mean of the data set, excluding values below the p1 percentile and above the p2 percentile.
  • to_dict(): return a Python dictionary of the TDigest and internal Centroid values.
  • update_from_dict(dict_values): update from serialized dictionary values into the TDigest object.
  • centroids_to_list(): return a Python list of the TDigest object's internal Centroid values.
  • update_centroids_from_list(list_values): update Centroids from a Python list.
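
A short example exercising the methods not shown above (compress, cdf, and trimmed_mean), using a uniform sample so the expected results are easy to sanity-check (all values approximate):

from numpy.random import random
from tdigest import TDigest

digest = TDigest()
digest.batch_update(random(10000))
digest.compress()  # shrink the centroid set after the large batch of updates

print(digest.cdf(0.5))             # ~0.5: half of the Uniform(0,1) mass lies below 0.5
print(digest.trimmed_mean(5, 95))  # mean of the middle 90% of the data, ~0.5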