All Projects → afiodorov → radixmmap

afiodorov / radixmmap

Licence: MIT license
Mmap radix sort file by a fixed length prefix of each line

Programming Languages

go
31211 projects - #10 most used programming language

Projects that are alternatives of or similar to radixmmap

mmap-io
Clean straight forward mmap-bindings for node.js
Stars: ✭ 62 (+19.23%)
Mutual labels:  mmap
python-sharearray
Share numpy arrays across processes efficiently ideal for large, read-only datasets
Stars: ✭ 32 (-38.46%)
Mutual labels:  mmap
AmeisenNavigation
Navigationmesh Server for my bot based on the TrinityCore MMAP's and Recast & Detour
Stars: ✭ 35 (-32.69%)
Mutual labels:  mmap
portable-memory-mapping
Portable Memory Mapping C++ Class (Windows/Linux)
Stars: ✭ 34 (-34.62%)
Mutual labels:  mmap
SortingLab.jl
Faster sorting algorithms (sort and sortperm) for Julia
Stars: ✭ 20 (-61.54%)
Mutual labels:  radix-sort
polardb-race
第一届POLARDB数据库性能大赛-初赛第5名(JAVA)-复赛第7名(CPP)
Stars: ✭ 24 (-53.85%)
Mutual labels:  mmap
bytekit
Java 字节操作的工具库(不是字节码的工具库)
Stars: ✭ 40 (-23.08%)
Mutual labels:  mmap
radix-sorting
Radix sorting from the ground up
Stars: ✭ 27 (-48.08%)
Mutual labels:  radix-sort
mmappickle
Python 3 library to store memory mappable objects into pickle-compatible files
Stars: ✭ 34 (-34.62%)
Mutual labels:  mmap
Flatbuffers
FlatBuffers: Memory Efficient Serialization Library
Stars: ✭ 17,180 (+32938.46%)
Mutual labels:  mmap
s3parcp
Faster than s3cp
Stars: ✭ 31 (-40.38%)
Mutual labels:  mmap
mmap-object
Shared Memory Objects for Node
Stars: ✭ 90 (+73.08%)
Mutual labels:  mmap
box
Box - Open Standard Archive Format, a zip killer.
Stars: ✭ 38 (-26.92%)
Mutual labels:  mmap

radixmmap: sort massive files by prefix in memory

In this example we will sort a file in chronological order, where first 19 bytes of each line are assumed to be RFC3339 date.

Example 1:

> cat /tmp/test
2011-01-10T15:30:45Z,bla
2009-01-10T15:30:47Z,abc
2009-01-10T15:30:45Z,def

> ./radixmmap /tmp/test
2009-01-10T15:30:45Z,def
2009-01-10T15:30:47Z,abc
2011-01-10T15:30:45Z,bla

Example 2, multiple csv files with header:

> cat /tmp/a /tmp/b
date,string
2009-01-10T15:30:45Z,def
2009-01-10T15:30:47Z,abc
2011-01-10T15:30:45Z,bla
date,string
2020-01-10T15:30:45Z,key
2005-01-10T15:30:47Z,lkj
1999-01-10T15:30:45Z,zxc

> ./radixmmap -skip-header /tmp/a /tmp/b
1999-01-10T15:30:45Z,zxc
2005-01-10T15:30:47Z,lkj
2009-01-10T15:30:45Z,def
2009-01-10T15:30:47Z,abc
2011-01-10T15:30:45Z,bla
2020-01-10T15:30:45Z,key

How?

First we load each file into memory & then use radix sort to sort by first 19 bytes of each line.

Why?

The idea is to sort big files as fast as possible with as little overhead as possible.

I find that memory mapped files allow for optimal loading of the file: this way OS allocates just as much memory as needed.

Prior to this utility I have been using sort command found in shells:

LC_ALL=C sort --parallel=16 -t, -k1 -S100% /tmp/test

but found it quite memory hungry.

Main idea behind the implementation

This implementation uses as little RAM as possible without compromising on performance too much.

Additionally to loading file in RAM, we need 16 bytes per line to remember where the lines start and where they end (8 bytes to remember start position and 8 bytes to remember end position). For a file with 1.2 billion lines this results in 19.2 GB overhead.

Benchmark

Currently sorts 44GB file using 63.2GB RAM, 16 cores in 19 minutes 37 seconds:

> go build && time ./radixmmap -d sorted.csv bigfile.csv
real    19m37.992s
user    96m27.955s
sys     1m7.665s

In comparison sort takes 29 minutes and 51 seconds, and uses >80GB of RAM:

> time LC_ALL=C sort --parallel=16 -t, -k1 -S100% -o sorted.csv bigfile.csv
real    29m51.346s
user    71m34.478s
sys     3m17.947s

The file has ~1.22 billion lines.

Credits

Big thanks to edsrzf and twotwotwo for providing underlying implementations for mmap & radix sort respectively.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].