All Projects → learnedsystems → LearnedSort

learnedsystems / LearnedSort

Licence: GPL-3.0 license
Learned Sort: a model-enhanced sorting algorithm

Programming Languages

C++
36643 projects - #6 most used programming language
shell
77523 projects
python
139335 projects - #7 most used programming language
CMake
9771 projects

Projects that are alternatives of or similar to LearnedSort

Sorting-Visualizer
📊 Sorting.Visualizer is a web app for visualizing a bunch of different sorting algorithms Like Selection Sort, Bubble Sort, Insertion Sort, Merge Sort, Quick Sort, Heap Sort With the functionality of (Speed Control) and (Array Size Control)...
Stars: ✭ 37 (-50%)
Mutual labels:  sorting-algorithms
algorithms
The All ▲lgorithms documentation website.
Stars: ✭ 114 (+54.05%)
Mutual labels:  sorting-algorithms
ultra-sort
DSL for SIMD Sorting on AVX2 & AVX512
Stars: ✭ 29 (-60.81%)
Mutual labels:  sorting-algorithms
SortingLab.jl
Faster sorting algorithms (sort and sortperm) for Julia
Stars: ✭ 20 (-72.97%)
Mutual labels:  sorting-algorithms
Mediarizer
Organise your photos and videos in chronological order fast and easy
Stars: ✭ 18 (-75.68%)
Mutual labels:  sorting-algorithms
Sorting-Visualiser
A mobile application that visualizes various sorting algorithms such as Bubble sort, selection sort, quick sort etc. The sorting process is visualized as the rearrangement of vertical lines of different lengths from shortest to tallest.
Stars: ✭ 32 (-56.76%)
Mutual labels:  sorting-algorithms
PySortDemo
Visualization of sorting algorithms, done in Python.
Stars: ✭ 21 (-71.62%)
Mutual labels:  sorting-algorithms
Sorting-Algorithms
sorting algorithms in python
Stars: ✭ 15 (-79.73%)
Mutual labels:  sorting-algorithms
bruceR
📦 BRoadly Useful Convenient and Efficient R functions that BRing Users Concise and Elegant R data analyses.
Stars: ✭ 110 (+48.65%)
Mutual labels:  linear-models
sorting-visualizer
A Sorting Algorithms Visualizer built using ReactJS
Stars: ✭ 38 (-48.65%)
Mutual labels:  sorting-algorithms
golang-interview-questions
golang 面试集锦
Stars: ✭ 42 (-43.24%)
Mutual labels:  sorting-algorithms
Algorithms-Java
A collection of common algorithms and data structures implemented in Java.
Stars: ✭ 141 (+90.54%)
Mutual labels:  sorting-algorithms
Learn-Data Structure-Algorithm-by-Javascript
Data Structure and Algorithm explanations with Implementations by Javascript
Stars: ✭ 55 (-25.68%)
Mutual labels:  sorting-algorithms
sorting-visualizer
Sorting Algorithms Visualizer
Stars: ✭ 429 (+479.73%)
Mutual labels:  sorting-algorithms
appliedstats
📊 Methods of Applied Statistics Course Textbook Repository
Stars: ✭ 134 (+81.08%)
Mutual labels:  linear-models
lua sort
Lua pure sort algorithm based on lib_table.c (from LuaJIT 2.1.0)
Stars: ✭ 21 (-71.62%)
Mutual labels:  sorting-algorithms
algocasts-sorting-algorithms
AlgoCasts 排序算法专题
Stars: ✭ 27 (-63.51%)
Mutual labels:  sorting-algorithms
Algorithms
Short explanations and implementations of different algorithms in multiple languages
Stars: ✭ 37 (-50%)
Mutual labels:  sorting-algorithms
programminginpython.com
This repo consists code of all the programs discussed at programminginpython.com website
Stars: ✭ 60 (-18.92%)
Mutual labels:  sorting-algorithms
ShiftSort
Sorting algorithm quicker than MergeSort, and is adaptive and stable.
Stars: ✭ 39 (-47.3%)
Mutual labels:  sorting-algorithms

LearnedSort

License: GPL v3 Build Status LGTM Grade LGTM Alerts

The official repository for LearnedSort, a model-enhanced sorting algorithm.

The repository contains the sources for LearnedSort as well as benchmarking code for both synthetic and real datasets.

Usage example:

#include "learned_sort.h"

int main(int argc, char *argv[]) {
  
    // Get some data
    vector<double> arr = {...}

    // Sort in ascending order
    learned_sort::sort(arr.begin(), arr.end());
}

Building Instructions

LearnedSort library

LearnedSort is distributed as part of a header-only library. Therefore, in order the use it, it is enough to include the header file in your code.

#include "learned_sort.h"

However, besides the LearnedSort implementation, this repository contains benchmarking and unit testing code. In order to execute those, follow the instructions below.

Building the Benchmarks

The code in this repository has only been tested on Linux systems, including RHEL 7 and Ubuntu 20.04. In order to build the benchmarks, you will need the following dependencies:

# Clone this repository
git clone --recursive https://github.com/learnedsystems/learned-sort.git

# Change into the code directory
cd learned-sort

# Run the compilation script
./compile.sh

Running the unit tests

Before you run the benchmarks, we recommend that you run the unit tests first. We use Google Test to perform unit testing on LearnedSort and other baseline sorting algorithms on various data distributions and data types. After downloading this repository, run the tests to make sure everything is working fine for your system setup. Should any of the tests fail, then GTest will display a summary of the errors that occurred, otherwise no output is displayed.

# Run all unit tests
./test.sh

Running the synthetic benchmarks

We use Google Benchmark to measure the running times for LearnedSort and other sorting algorithm baselines for comparison. The benchmarks will be run for various input size, and with enough iterations to provide a stable statistic. The output will show the total running time (column "Time") and the time spent in the CPU (column "CPU") in milliseconds. Each row displays the name of the algorithm that is being benchmarked, followed by the input size and statistic type. The benchmark will repeat a few times and it will report the mean, median, and standard deviation of the measurements. If you wish to change the number of repetitions (e.g., to 10), you can add the option --benchmark_repetitions=<num_reps> to the following command:

# Run the synthetic benchmarks
./synth_bench.sh

Customizing the synthetic benchmarks

This script will use the default synthetic datasets, which is an array of 200M double-precision, normally-distributed keys. You may customize the benchmarks by editing the top of the file src/main_synth.cc:

// NOTE: You may change the data type here
typedef double data_t;

// NOTE: You may change the distribution here
// For a list of supported distributions see src/utils.h
distr_t DATA_DISTR = NORMAL;

// NOTE: You may change the input size here
constexpr size_t INPUT_SZ = 50'000'000;

Running the real benchmarks

For the real benchmarks, it is first required that the datasets from Harvard Dataverse are fetched to this repository's tree, since they are not checked in Git. In order to do that, we provide a script that downloads the datasets, decompresses them, generates histograms of the data's distribution, and counts the number of unique keys in each dataset.

After the datasets have been successfully retrieved, you may run the real benchmarks.

# Download real datasets
./download_real_datasets.sh

# Run the real benchmarks
./real_bench.sh

Customizing the real benchmarks

The script uses the NYC/Pickup dataset by default, however, just like the synthetic benchmarks, you may customize the real benchmarks by editing the top of the file src/main_real.cc.

// NOTE: You may change the data type here
typedef long data_t;

// NOTE: You may change the dataset here
// For a list of real datasets see the `data/` directory.
// Make sure to specify the correct data type for the selected dataset.
const string DATASET = "NYC/Pickup";

For a list of possible values for the DATASET variable and their respective data types, please check out the data/ folder.

Benchmark results

In the following sections we give concrete performance numbers for a particular server-grade computer.

For a detailed list of benchmarks, refer to our papers:

Benchmark setup

  • CPU: Intel® Xeon® Gold 6150 CPU @ 2.70GHz
  • RAM: 376GB
  • OS: Linux Ubuntu 20.04, kernel 5.4.0-26-generic
  • CXX: GCC 9.3.0-10ubuntu2

Performance charts

The following chart displays the performance of LearnedSort and other sorting algorithms on a set of randomly generated input distributions containing 50M keys. The histograms in the vertical axis correspond to the shape of the distributions used for the benchmark.

Limitations

  • Currently it is not possible to sort records that contain payloads. We are in the works of adding support for tuples.
  • This implementation does not currently support string keys.

Known bugs

Refer to the Issues page for known bugs.

License & Credits

This work is licensed under the GNU General Public License v3.0 and any use in academic settings must cite the corresponding paper:

@inproceedings{10.1145/3318464.3389752,
author = {Kristo, Ani and Vaidya, Kapil and \c{C}etintemel, Ugur and Misra, Sanchit and Kraska, Tim},
title = {The Case for a Learned Sorting Algorithm},
year = {2020},
isbn = {9781450367356},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3318464.3389752},
doi = {10.1145/3318464.3389752},
booktitle = {Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
pages = {1001–1016},
numpages = {16},
keywords = {linear models, linear interpolation, learned algorithm, RMI, sorting algorithm, ML for systems, CDF, sorting},
location = {Portland, OR, USA},
series = {SIGMOD '20}
}
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].