rle-array

Extension Array for Pandas that implements Run-length Encoding.

Table of Contents

Quick Start
Development Plan
Implementation
License

Quick Start

Some basic setup first:

>>> import pandas as pd
>>> pd.set_option("display.max_rows", 40)
>>> pd.set_option("display.width", None)

We need some example data, so let's create some pseudo-weather data:

>>> from rle_array.testing import generate_example
>>> df = generate_example()
>>> df.head(10)
        date  month  year    city    country   avg_temp   rain   mood
0 2000-01-01      1  2000  city_0  country_0  12.400000  False     ok
1 2000-01-02      1  2000  city_0  country_0   4.000000  False     ok
2 2000-01-03      1  2000  city_0  country_0  17.200001  False  great
3 2000-01-04      1  2000  city_0  country_0   8.400000  False     ok
4 2000-01-05      1  2000  city_0  country_0   6.400000  False     ok
5 2000-01-06      1  2000  city_0  country_0  14.400000  False     ok
6 2000-01-07      1  2000  city_0  country_0  14.300000   True     ok
7 2000-01-08      1  2000  city_0  country_0   6.800000  False     ok
8 2000-01-09      1  2000  city_0  country_0  10.100000  False     ok
9 2000-01-10      1  2000  city_0  country_0  -1.200000  False     ok

Because of the many location and date attributes, the DataFrame is quite large (about 160 MB in total):

>>> df.memory_usage()
Index            128
date        32000000
month        4000000
year         8000000
city        32000000
country     32000000
avg_temp    16000000
rain         4000000
mood        32000000
dtype: int64
>>> df.memory_usage().sum()
160000128

To compress the data, we can use rle-array:

>>> import rle_array
>>> df_rle = df.astype({
...     "city": "RLEDtype[object]",
...     "country": "RLEDtype[object]",
...     "month": "RLEDtype[int8]",
...     "mood": "RLEDtype[object]",
...     "rain": "RLEDtype[bool]",
...     "year": "RLEDtype[int16]",
... })
>>> df_rle.memory_usage()
Index            128
date        32000000
month        1188000
year          120000
city           32000
country           64
avg_temp    16000000
rain         6489477
mood        17153296
dtype: int64
>>> df_rle.memory_usage().sum()
72982965

This works better the longer the runs are. In the above example, "country" shrinks from 32,000,000 bytes to 64 bytes, whereas "rain" has short runs and actually grows from 4,000,000 to 6,489,477 bytes.
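To gauge in advance whether a column is a good candidate, one can count its runs. The following is a minimal sketch using only NumPy; ``count_runs`` is a hypothetical helper for illustration and is not part of rle-array:

```python
import numpy as np


def count_runs(values):
    """Count maximal runs of equal consecutive values (hypothetical helper)."""
    values = np.asarray(values)
    if len(values) == 0:
        return 0
    # A new run starts wherever an element differs from its predecessor.
    # Note: NaN != NaN, so every NaN is counted as its own run here.
    return int(np.count_nonzero(values[1:] != values[:-1])) + 1


long_runs = np.repeat([2000, 2001, 2002], 4)  # 12 elements in 3 runs
short_runs = np.array([True, False] * 6)      # 12 elements in 12 runs
print(count_runs(long_runs), count_runs(short_runs))  # 3 12
```

With the layout described in the Implementation section, each run costs one value plus one int64 end position, so a column only shrinks when it has far fewer runs than rows.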

Development Plan

The development of rle-array has the following priorities (in decreasing order):

Correctness: All results must be correct. The Pandas-provided test suite must pass. Approximations are not allowed.
Transparency: The user can use :class:`~rle_array.RLEDtype` and :class:`~rle_array.RLEArray` like other Pandas types. No special parameters or extra functions are required.
Features: Support all features that Pandas offers, even if they are slow (but inform the user using a :class:`pandas.errors.PerformanceWarning`).
Simplicity: Do not use Python C Extensions or Cython (NumPy and Numba are allowed).
Memory Reduction: Do not decompress the encoded data unless required; perform as many calculations as possible directly on the compressed representation.
Performance: It should be quick; for large data, ideally faster than working on the uncompressed data. Use Numba to speed up code.
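As an illustration of the "Memory Reduction" goal, the sketch below computes a sum and a mean directly on run values and end positions (the layout described in the Implementation section) without expanding the data. The function names are hypothetical and this is not necessarily how rle-array implements these reductions:

```python
import numpy as np

# Compressed form of [10, 10, 10, 20, 30, 30, 10, 10]
ends = np.array([3, 4, 6, 8], dtype=np.int64)
values = np.array([10, 20, 30, 10], dtype=np.int64)


def rle_sum(ends, values):
    """Sum of the decoded data, computed run by run (illustrative only)."""
    lengths = np.diff(ends, prepend=0)  # run lengths from end positions
    return int(np.dot(lengths, values))


def rle_mean(ends, values):
    """Mean of the decoded data; assumes at least one run."""
    return rle_sum(ends, values) / int(ends[-1])


print(rle_sum(ends, values), rle_mean(ends, values))  # 130 16.25
```

The work here is proportional to the number of runs rather than the number of rows, which is where both the memory and the performance benefit come from.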

Implementation

Imagine the following data array:

Index	Data
1	"a"
2	"a"
3	"a"
4	"x"
5	"c"
6	"c"
7	"a"
8	"a"

Some values are repeated across multiple consecutive entries:

Index	Data
1	"a"
2
3
4	"x"
5	"c"
6
7	"a"
8

These sections are also called runs and can be encoded by their value and their length:

Length	Value
3	"a"
1	"x"
2	"c"
2	"a"
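The encoding step can be sketched with plain NumPy as follows. ``rle_encode`` is a hypothetical helper written for this example and is not the library's API:

```python
import numpy as np


def rle_encode(data):
    """Return (lengths, values) for the runs in ``data`` (illustrative only)."""
    data = np.asarray(data)
    if len(data) == 0:
        return np.array([], dtype=np.int64), data
    # Mark every position that starts a new run; position 0 always does.
    is_start = np.concatenate(([True], data[1:] != data[:-1]))
    starts = np.flatnonzero(is_start)
    # A run ends where the next one starts, or at the end of the data.
    lengths = np.diff(np.concatenate((starts, [len(data)])))
    return lengths, data[starts]


lengths, values = rle_encode(["a", "a", "a", "x", "c", "c", "a", "a"])
print(lengths.tolist(), values.tolist())  # [3, 1, 2, 2] ['a', 'x', 'c', 'a']
```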

This representation is called Run-length Encoding. To integrate this encoding better with Pandas and NumPy and to support operations like slicing and random access (e.g. via :func:`pandas.api.extensions.ExtensionArray.take`), we store the end position of each run (the cumulative sum of the length column) instead of its length:

End-position	Value
3	"a"
4	"x"
6	"c"
8	"a"

The value array is a :class:`numpy.ndarray` with the same dtype as the original data, and the end positions are a :class:`numpy.ndarray` with dtype int64.
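Storing end positions makes random access a binary search. The sketch below shows the idea with :func:`numpy.searchsorted`; the helper names ``take`` and ``decode`` are assumptions made for this example, and positions are 0-based (unlike the 1-based index column in the tables above):

```python
import numpy as np

values = np.array(["a", "x", "c", "a"])
lengths = np.array([3, 1, 2, 2], dtype=np.int64)
ends = np.cumsum(lengths)  # end positions: [3, 4, 6, 8]


def take(ends, values, indices):
    """Look up 0-based positions without decompressing (illustrative only)."""
    indices = np.asarray(indices)
    # The run containing position i is the first run whose end position is > i.
    run_idx = np.searchsorted(ends, indices, side="right")
    return values[run_idx]


def decode(ends, values):
    """Expand the compressed form back into the full array."""
    return np.repeat(values, np.diff(ends, prepend=0))


print(take(ends, values, [0, 3, 5, 7]).tolist())  # ['a', 'x', 'c', 'a']
print(decode(ends, values).tolist())
```

Each lookup costs O(log r) for r runs, whereas with a length column the positions would first have to be accumulated.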

License

Licensed under:

MIT License (LICENSE.txt or https://opensource.org/licenses/MIT)
