JDASoftwareGroup / rle-array

Licence: MIT license
Run-length encoded arrays for pandas.

rle-array

Extension Array for Pandas that implements Run-length Encoding.

Quick Start

Some basic setup first:

>>> import pandas as pd
>>> pd.set_option("display.max_rows", 40)
>>> pd.set_option("display.width", None)

We need some example data, so let's create some pseudo-weather data:

>>> from rle_array.testing import generate_example
>>> df = generate_example()
>>> df.head(10)
        date  month  year    city    country   avg_temp   rain   mood
0 2000-01-01      1  2000  city_0  country_0  12.400000  False     ok
1 2000-01-02      1  2000  city_0  country_0   4.000000  False     ok
2 2000-01-03      1  2000  city_0  country_0  17.200001  False  great
3 2000-01-04      1  2000  city_0  country_0   8.400000  False     ok
4 2000-01-05      1  2000  city_0  country_0   6.400000  False     ok
5 2000-01-06      1  2000  city_0  country_0  14.400000  False     ok
6 2000-01-07      1  2000  city_0  country_0  14.300000   True     ok
7 2000-01-08      1  2000  city_0  country_0   6.800000  False     ok
8 2000-01-09      1  2000  city_0  country_0  10.100000  False     ok
9 2000-01-10      1  2000  city_0  country_0  -1.200000  False     ok

Due to the large number of attributes for locations and the date, the data size is quite large:

>>> df.memory_usage()
Index            128
date        32000000
month        4000000
year         8000000
city        32000000
country     32000000
avg_temp    16000000
rain         4000000
mood        32000000
dtype: int64
>>> df.memory_usage().sum()
160000128

To compress the data, we can use rle-array:

>>> import rle_array
>>> df_rle = df.astype({
...     "city": "RLEDtype[object]",
...     "country": "RLEDtype[object]",
...     "month": "RLEDtype[int8]",
...     "mood": "RLEDtype[object]",
...     "rain": "RLEDtype[bool]",
...     "year": "RLEDtype[int16]",
... })
>>> df_rle.memory_usage()
Index            128
date        32000000
month        1188000
year          120000
city           32000
country           64
avg_temp    16000000
rain         6489477
mood        17153296
dtype: int64
>>> df_rle.memory_usage().sum()
72982965

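Compression improves the longer the runs are. In the example above it does not work well for "rain": that column has many short runs, and its RLE version (6489477 bytes) is larger than the original (4000000 bytes).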
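To judge in advance whether a column is a good candidate, it can help to count its runs. The following is a minimal sketch; `count_runs` is a hypothetical helper written for this illustration and is not part of rle-array. Since each run stores one value plus an int64 end position (see the Implementation section), a column only shrinks when it has far fewer runs than elements.

```python
import numpy as np
import pandas as pd


def count_runs(series: pd.Series) -> int:
    """Count runs of consecutive equal values.

    Hypothetical helper for illustration, not part of rle-array.
    Note: NaN != NaN, so every NaN is counted as its own run here.
    """
    values = series.to_numpy()
    if len(values) == 0:
        return 0
    # A new run starts wherever a value differs from its predecessor.
    return int(np.count_nonzero(values[1:] != values[:-1])) + 1


# Long runs (like "year"): few runs relative to the number of elements.
long_runs = pd.Series([2000] * 6 + [2001] * 6)
# Short runs (like "rain"): as many runs as elements in the worst case.
short_runs = pd.Series([False, True] * 6)

print(count_runs(long_runs), len(long_runs))
print(count_runs(short_runs), len(short_runs))
```

The first series has 2 runs for 12 elements and should compress well; the second has 12 runs for 12 elements, so run-length encoding can only add overhead.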

Development Plan

The development of rle-array has the following priorities (in decreasing order):

  1. Correctness: All results must be correct. The Pandas-provided test suite must pass. Approximations are not allowed.
  2. Transparency: The user can use :class:`~rle_array.RLEDtype` and :class:`~rle_array.RLEArray` like other Pandas types. No special parameters or extra functions are required.
  3. Features: Support all features that Pandas offers, even if it is slow (but inform the user using a :class:`pandas.errors.PerformanceWarning`).
  4. Simplicity: Do not use Python C Extensions or Cython (NumPy and Numba are allowed).
  5. Memory Reduction: Do not decompress the encoded data when not required; try to do as many calculations as possible directly on the compressed representation.
  6. Performance: It should be fast, and for large data ideally faster than working on the uncompressed data. Use Numba to speed up code.
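As an illustration of priority 5, some calculations can be carried out on the runs alone. The sketch below uses plain NumPy arrays of run lengths and run values chosen for this example; it does not show how rle-array implements these operations internally.

```python
import numpy as np

# Hypothetical run-length representation of [5, 5, 5, 2, 7, 7]:
# three runs, given by their lengths and values.
lengths = np.array([3, 1, 2], dtype=np.int64)
values = np.array([5, 2, 7], dtype=np.int64)

# Reference data, decompressed only to check the results below.
decompressed = np.repeat(values, lengths)

# Sum: each run contributes value * length, so no decompression is needed.
rle_sum = int(np.dot(values, lengths))

# Element-wise operations with a scalar only touch the run values;
# the run lengths stay unchanged.
doubled_values = values * 2

print(rle_sum, int(decompressed.sum()))
```

Both approaches give the same sum, but the compressed variant works on three runs instead of six elements.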

Implementation

Imagine the following data array:

Index  Data
1      "a"
2      "a"
3      "a"
4      "x"
5      "c"
6      "c"
7      "a"
8      "a"

Some values repeat over multiple consecutive entries:

Index  Data
1      "a"
2
3
4      "x"
5      "c"
6
7      "a"
8

These sections are also called runs and can be encoded by their value and their length:

Length  Value
3       "a"
1       "x"
2       "c"
2       "a"
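The encoding step above can be sketched with NumPy as follows. `rle_encode` is an illustrative helper written for this example, not a function exported by rle-array.

```python
import numpy as np


def rle_encode(data):
    """Return (lengths, values) for the runs in ``data`` (illustrative helper)."""
    data = np.asarray(data)
    if data.size == 0:
        return np.array([], dtype=np.int64), data
    # Positions where a new run starts: index 0 plus every change point.
    change = np.flatnonzero(data[1:] != data[:-1]) + 1
    starts = np.concatenate(([0], change))
    # A run's length is the distance to the next run start (or to the end).
    lengths = np.diff(np.concatenate((starts, [data.size])))
    return lengths.astype(np.int64), data[starts]


lengths, values = rle_encode(["a", "a", "a", "x", "c", "c", "a", "a"])
print(lengths.tolist())  # [3, 1, 2, 2]
print(values.tolist())   # ['a', 'x', 'c', 'a']
```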

This representation is called run-length encoding. To integrate it better with Pandas and NumPy and to support operations like slicing and random access (e.g. via :func:`pandas.api.extensions.ExtensionArray.take`), we store the end position of each run (the cumulative sum of the length column) instead of its length:

End-position  Value
3             "a"
4             "x"
6             "c"
8             "a"

The value array is a :class:`numpy.ndarray` with the same dtype as the original data, and the end positions are a :class:`numpy.ndarray` with dtype int64.
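The following sketch shows why end positions are convenient for random access: a binary search over the sorted end positions finds the run that contains a given index. The helpers `take` and `decode` are written for this illustration and only mimic the idea; they are not the rle-array implementation. Note that the code uses 0-based indices, whereas the tables above count from 1.

```python
import numpy as np

# The runs from the tables above.
lengths = np.array([3, 1, 2, 2], dtype=np.int64)
values = np.array(["a", "x", "c", "a"])

# End positions are the cumulative sum of the run lengths.
end_positions = np.cumsum(lengths)


def take(end_positions, values, indices):
    """Look up elements by 0-based index without decompressing (illustrative)."""
    indices = np.asarray(indices, dtype=np.int64)
    # The run containing index i is the first run whose end position exceeds i.
    run_ids = np.searchsorted(end_positions, indices, side="right")
    return values[run_ids]


def decode(end_positions, values):
    """Expand the runs back into the full array (illustrative)."""
    run_lengths = np.diff(end_positions, prepend=0)
    return np.repeat(values, run_lengths)


print(end_positions.tolist())                              # [3, 4, 6, 8]
print(take(end_positions, values, [0, 3, 5, 7]).tolist())  # ['a', 'x', 'c', 'a']
print(decode(end_positions, values).tolist())
```

Because the end positions are sorted, each lookup costs a binary search over the number of runs rather than a scan over the decompressed data.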

License

Licensed under the MIT license.
