fzqneo / ByteSlice

License: Apache-2.0
"Byteslice: Pushing the envelop of main memory data processing with a new storage layout" (SIGMOD'15)

ByteSlice is a main-memory data format for fixed-length unsigned integers and for attributes that can be encoded as such (e.g., age, datetime). It is designed primarily for highly efficient scans and lookups based on ordinal comparisons in column-store databases. The basic idea is to split each column value into its individual bytes and to store the bytes of the same significance from all values together in their own contiguous memory region.
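
To make the layout concrete, here is a minimal scalar sketch of splitting a column into byte slices. It is an illustration only, not the library's code: `ByteSlicedColumn` and `ToByteSlices` are hypothetical names, and the sketch assumes that values narrower than a whole number of bytes are padded with zero bits on the right (as the `kByteSlicePadRight` column type name suggests).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration type; not part of the ByteSlice library API.
struct ByteSlicedColumn {
  // slices[0] holds the most significant byte of every value,
  // slices[1] the next byte, and so on. Each slice is contiguous.
  std::vector<std::vector<uint8_t>> slices;
  size_t num_values = 0;
};

// Splits `values` (each `bit_width` bits wide, 1..32) into byte slices.
// Assumption: values are padded with zero bits on the right so that
// they occupy a whole number of bytes.
ByteSlicedColumn ToByteSlices(const std::vector<uint32_t>& values,
                              unsigned bit_width) {
  assert(bit_width >= 1 && bit_width <= 32);
  const unsigned num_bytes = (bit_width + 7) / 8;
  const unsigned pad_bits = num_bytes * 8 - bit_width;

  ByteSlicedColumn col;
  col.num_values = values.size();
  col.slices.assign(num_bytes, std::vector<uint8_t>(values.size()));

  for (size_t i = 0; i < values.size(); ++i) {
    // Left-align the value inside num_bytes bytes.
    const uint64_t padded = static_cast<uint64_t>(values[i]) << pad_bits;
    for (unsigned b = 0; b < num_bytes; ++b) {
      const unsigned shift = 8 * (num_bytes - 1 - b);
      col.slices[b][i] = static_cast<uint8_t>(padded >> shift);
    }
  }
  return col;
}
```

Under these assumptions, the 12-bit value 0xABC is padded to 0xABC0 and stored as 0xAB in the first slice and 0xC0 in the second.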

The implementation makes heavy use of Single-Instruction-Multiple-Data (SIMD) instruction sets on modern CPUs to achieve bare-metal processing speed. The scan algorithms are optimized to reduce the number of instructions, the memory footprint, branch mispredictions, and other performance-critical factors.
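
The following scalar sketch shows why a byte-sliced layout suits ordinal comparisons: a less-than predicate can often be decided from the most significant byte alone, so the lower-order slices are never read. This is a simplified illustration, not the library's SIMD implementation. `ScanLessThan` is a hypothetical name, and the sketch uses ordinary branches where an optimized scan would avoid them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns one flag per value: true if value < literal.
// slices[0] holds the most significant byte of every value; literal_bytes
// holds the literal's bytes in the same order (most significant first).
std::vector<bool> ScanLessThan(const std::vector<std::vector<uint8_t>>& slices,
                               const std::vector<uint8_t>& literal_bytes) {
  assert(slices.size() == literal_bytes.size());
  const size_t n = slices.empty() ? 0 : slices[0].size();
  std::vector<bool> result(n, false);

  for (size_t i = 0; i < n; ++i) {
    for (size_t b = 0; b < slices.size(); ++b) {
      const uint8_t v = slices[b][i];
      const uint8_t c = literal_bytes[b];
      if (v != c) {
        // The first differing byte decides the comparison.
        result[i] = v < c;
        break;  // Early stop: lower-order bytes are not read.
      }
    }
    // If all bytes are equal, value == literal, so the flag stays false.
  }
  return result;
}
```

A SIMD version would apply the same byte-wise logic to many values per instruction, which the byte-per-slice layout makes straightforward because each slice is a dense array of 8-bit lanes.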

Using the library

A quick glimpse:

// Create a column of two million 12-bit values in ByteSlice format
Column* column = new Column(ColumnType::kByteSlicePadRight, 12, 2*1024*1024);
// Prepare a bit vector to store scan results
BitVector* bitvector = new BitVector(column);
// Execute scan on the column with predicate value < 3
column->Scan(Comparator::kLess,
            3,
            bitvector,
            Bitwise::kSet);

Build from source

Clone

git clone --recursive https://github.com/fzqneo/ByteSlice.git

Or this after cloning without --recursive:

git submodule update --init --recursive

Build

You need CMake to generate the build scripts. The Makefile generator has been tested.

To generate a debug build:

mkdir debug
cd debug
cmake -DCMAKE_BUILD_TYPE=debug ..
make -j4

To generate a release build:

mkdir release
cd release
cmake -DCMAKE_BUILD_TYPE=release ..
make -j4

NOTE: The default build type is debug, which may not give optimal performance.

Running examples

Example programs are in the 'example/' directory.

example/example1 -s 10000000

To see a full list of options:

example/example1 -h

NOTE: The source code of the example programs shows how to use the library.

Multithreading

Multithreading is controlled by OpenMP environment variables (assuming you use GCC):

OMP_NUM_THREADS=2 ./example/example1

NOTE: The default number of threads depends on the system and is usually the number of cores. You may also want to set the thread affinity via GOMP_CPU_AFFINITY (assuming you use GCC).
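
As a generic illustration of how OMP_NUM_THREADS takes effect, the sketch below parallelizes a simple counting loop with OpenMP. It is not taken from the library: `CountLessThan` and `MaxThreads` are hypothetical helpers. When compiled with OpenMP enabled (`-fopenmp` on GCC), the loop is split across the threads allowed by OMP_NUM_THREADS; without OpenMP it runs serially and gives the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Counts the values that are smaller than `literal`.
size_t CountLessThan(const std::vector<uint32_t>& values, uint32_t literal) {
  long long count = 0;
  const long long n = static_cast<long long>(values.size());
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : count)
#endif
  for (long long i = 0; i < n; ++i) {
    if (values[static_cast<size_t>(i)] < literal) ++count;
  }
  // Sanity check: the count can never exceed the number of values.
  assert(count >= 0 && count <= n);
  return static_cast<size_t>(count);
}

// Upper bound on the number of threads a parallel region may use.
// With OpenMP this reflects OMP_NUM_THREADS when that variable is set.
int MaxThreads() {
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;
#endif
}
```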

Running tests

make check

To build the tests without running them:

make check-build

Documentation (work in progress)

You need doxygen to generate documentation in HTML and LaTeX.

 doxygen

File structure

  • example/ - Example programs

  • third-party/ - Third-party libraries

  • src/ - ByteSlice library source files

  • tests/ - Unit tests written with the GoogleTest framework

Run examples in Docker

A compiled release build is available in the Docker image zf01/byteslice. You need Docker installed to use it.

Run with default parameters:

docker run --rm zf01/byteslice

Run with custom parameters:

docker run --rm -it zf01/byteslice /bin/bash
OMP_NUM_THREADS=1 /root/ByteSlice/release/example/example1 -s 16000000 -b 17

Build Docker image from source

# Run inside the project directory
docker build -t byteslice .

Citing this work

Ziqiang Feng, Eric Lo, Ben Kao, and Wenjian Xu. "Byteslice: Pushing the envelop of main memory data processing with a new storage layout." In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 31-46. ACM, 2015.

Download: http://dl.acm.org/citation.cfm?id=2747642

BibTeX:

@inproceedings{Feng:2015:BPE:2723372.2747642,
 author = {Feng, Ziqiang and Lo, Eric and Kao, Ben and Xu, Wenjian},
 title = {ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout},
 booktitle = {Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data},
 series = {SIGMOD '15},
 year = {2015},
 isbn = {978-1-4503-2758-9},
 location = {Melbourne, Victoria, Australia},
 pages = {31--46},
 numpages = {16},
 url = {http://doi.acm.org/10.1145/2723372.2747642},
 doi = {10.1145/2723372.2747642},
 acmid = {2747642},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {column store, main memory, olap, simd, storage layout},
} 

Contact

Ziqiang Feng ( zf at cs dot cmu dot edu )

Platform requirements

  1. C++ compiler supporting C++11, OpenMP and AVX2
  2. CPU with AVX2 instruction set extension

Tested platform

This package has been tested with the following configuration:

  • Linux 3.13.0-66-generic (64-bit)
  • Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
  • g++ 4.9.3

Known issues

  1. posix_memalign() is used in some files, causing compilation failure on Windows.
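
One possible workaround, sketched here and not part of the library, is to wrap aligned allocation behind a small portable helper. `AlignedAlloc` and `AlignedFree` are hypothetical names. The Windows branch uses the `_aligned_malloc` and `_aligned_free` functions from the Microsoft C runtime, and the other branch keeps `posix_memalign()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#ifdef _WIN32
#include <malloc.h>
#endif

// Allocates `size` bytes aligned to `alignment`, which must be a power of
// two and a multiple of sizeof(void*). Returns nullptr on failure.
void* AlignedAlloc(std::size_t alignment, std::size_t size) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
#ifdef _WIN32
  return _aligned_malloc(size, alignment);
#else
  void* ptr = nullptr;
  if (posix_memalign(&ptr, alignment, size) != 0) return nullptr;
  return ptr;
#endif
}

// Frees memory obtained from AlignedAlloc.
void AlignedFree(void* ptr) {
#ifdef _WIN32
  _aligned_free(ptr);  // Memory from _aligned_malloc must not go to free().
#else
  std::free(ptr);
#endif
}
```

For example, `AlignedAlloc(32, 1024)` returns a 32-byte-aligned buffer suitable for aligned 256-bit AVX2 loads and stores.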