cswinter / LocustDB

Licence: other
Massively parallel, high performance analytics database that will rapidly devour all of your data.

Programming Languages

rust
11053 projects

Projects that are alternatives to or similar to LocustDB

Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stars: ✭ 4,581 (+266.48%)
Mutual labels:  analytics, database
Tensorbase
TensorBase BE is building a high performance, cloud neutral bigdata warehouse for SMEs fully in Rust.
Stars: ✭ 440 (-64.8%)
Mutual labels:  analytics, database
Crate
CrateDB is a distributed SQL database that makes it simple to store and analyze massive amounts of data in real-time.
Stars: ✭ 3,254 (+160.32%)
Mutual labels:  analytics, database
React Native Firebase
🔥 A well-tested feature-rich modular Firebase implementation for React Native. Supports both iOS & Android platforms for all Firebase services.
Stars: ✭ 9,674 (+673.92%)
Mutual labels:  analytics, database
Nsdb
Natural Series Database
Stars: ✭ 49 (-96.08%)
Mutual labels:  analytics, database
Duckdb
DuckDB is an in-process SQL OLAP Database Management System
Stars: ✭ 4,014 (+221.12%)
Mutual labels:  analytics, database
Clickhouse Native Jdbc
ClickHouse Native Protocol JDBC implementation
Stars: ✭ 310 (-75.2%)
Mutual labels:  analytics, database
Reddit Detective
Play detective on Reddit: Discover political disinformation campaigns, secret influencers and more
Stars: ✭ 129 (-89.68%)
Mutual labels:  analytics, database
Skyalt
Accessible database and analytics. Organize and learn from data without engineers.
Stars: ✭ 40 (-96.8%)
Mutual labels:  analytics, database
Metabase
The simplest, fastest way to get business intelligence and analytics to everyone in your company 😋
Stars: ✭ 26,803 (+2044.24%)
Mutual labels:  analytics, database
Querytree
Data reporting and visualization for your app
Stars: ✭ 230 (-81.6%)
Mutual labels:  analytics, database
Eventql
Distributed "massively parallel" SQL query engine
Stars: ✭ 1,121 (-10.32%)
Mutual labels:  analytics, database
Web Database Analytics
Web scraping and related analytics using Python tools
Stars: ✭ 175 (-86%)
Mutual labels:  analytics, database
Aresdb
A GPU-powered real-time analytics storage and query engine.
Stars: ✭ 2,814 (+125.12%)
Mutual labels:  analytics, database
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-88%)
Mutual labels:  analytics, database
Concourse
Distributed database warehouse for transactions, search and analytics across time.
Stars: ✭ 310 (-75.2%)
Mutual labels:  analytics, database
Databazel
The analytical and reporting solution for MongoDB
Stars: ✭ 118 (-90.56%)
Mutual labels:  analytics, database
Gpdb
Greenplum Database - Massively Parallel PostgreSQL for Analytics. An open-source massively parallel data platform for analytics, machine learning and AI.
Stars: ✭ 4,928 (+294.24%)
Mutual labels:  analytics, database
Data Science Best Resources
Carefully curated resource links for data science in one place
Stars: ✭ 1,104 (-11.68%)
Mutual labels:  analytics, database
Shorty
🔗 A URL shortening service built using Flask and MySQL
Stars: ✭ 78 (-93.76%)
Mutual labels:  analytics, database

LocustDB

An experimental analytics database aiming to set a new standard for query performance and storage efficiency on commodity hardware. See How to Analyze Billions of Records per Second on a Single Desktop PC and How to Read 100s of Millions of Records per Second from a Single Disk for an overview of current capabilities.

Usage

Download the latest binary release, which can be run from the command line on most x64 Linux systems, including Windows Subsystem for Linux. For example, to load the file test_data/nyc-taxi.csv.gz in this repository and start the REPL, run:

./locustdb --load test_data/nyc-taxi.csv.gz --trips

When loading .csv or .csv.gz files with --load, the first line of each file is assumed to be a header containing the names for all columns. The type of each column will be derived automatically, but this might break for columns that contain a mixture of numbers/strings/empty entries.
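The inference rule described above can be sketched roughly as follows: a column of parseable integers stays integer, any unparsable entry demotes the column to string, and empty entries make it nullable. The names and exact rules here are illustrative, not LocustDB's actual implementation.

```rust
// Hypothetical sketch of per-column type inference for CSV data.
// Not LocustDB internals -- just the rule the text describes.

#[derive(Debug, PartialEq)]
enum ColType {
    Int,
    NullableInt,
    Str,
    NullableStr,
}

fn infer_type(values: &[&str]) -> ColType {
    let mut has_null = false;
    let mut all_int = true;
    for v in values {
        if v.is_empty() {
            has_null = true;
        } else if v.parse::<i64>().is_err() {
            all_int = false;
        }
    }
    match (all_int, has_null) {
        (true, false) => ColType::Int,
        (true, true) => ColType::NullableInt,
        (false, false) => ColType::Str,
        (false, true) => ColType::NullableStr,
    }
}

fn main() {
    assert_eq!(infer_type(&["1", "2", "3"]), ColType::Int);
    assert_eq!(infer_type(&["1", "", "3"]), ColType::NullableInt);
    // A mixed column falls back to string -- the failure mode the text warns about.
    assert_eq!(infer_type(&["1", "abc", "3"]), ColType::Str);
}
```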

To persist data to disk in LocustDB's internal storage format (which allows fast queries from disk after the initial load), specify the storage location with --db-path. When creating or opening a persistent database, LocustDB opens a large number of files and may crash if the limit on the number of open files is too low. On Linux, you can check the current limit with ulimit -n and raise it with e.g. ulimit -n 4096.

The --trips flag configures the ingestion schema for loading the 1.46 billion taxi ride dataset, which can be downloaded here.

For additional usage info, invoke with --help:

$ ./locustdb --help
LocustDB 0.2.1
Clemens Winter <[email protected]>
Massively parallel, high performance analytics database that will rapidly devour all of your data.

USAGE:
    locustdb [FLAGS] [OPTIONS]

FLAGS:
    -h, --help             Prints help information
        --mem-lz4          Keep data cached in memory lz4 encoded. Decreases memory usage and query speeds.
        --reduced-trips    Set ingestion schema for select set of columns from nyc taxi ride dataset
        --seq-disk-read    Improves performance on HDD, can hurt performance on SSD.
        --trips            Set ingestion schema for nyc taxi ride dataset
    -V, --version          Prints version information

OPTIONS:
        --db-path <PATH>           Path to data directory
        --load <FILES>             Load .csv or .csv.gz files into the database
        --mem-limit-tables <GB>    Limit for in-memory size of tables in GiB [default: 8]
        --partition-size <ROWS>    Number of rows per partition when loading new data [default: 65536]
        --readahead <MB>           How much data to load at a time when reading from disk during queries in MiB
                                   [default: 256]
        --schema <SCHEMA>          Comma separated list specifying the types and (optionally) names of all columns in
                                   files specified by `--load` option.
                                   Valid types: `s`, `string`, `i`, `integer`, `ns` (nullable string), `ni` (nullable
                                   integer)
                                   Example schema without column names: `int,string,string,string,int`
                                   Example schema with column names: `name:s,age:i,country:s`
        --table <NAME>             Name for the table populated with --load [default: default]
        --threads <INTEGER>        Number of worker threads. [default: number of cores (12)]
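The `--schema` format above (comma-separated entries, each either `type` or `name:type`) can be parsed with a few lines of code. The sketch below is illustrative only and is not LocustDB's own parser; the `ColSpec` type is made up for this example.

```rust
// Illustrative parser for the `--schema` string format:
// e.g. "name:s,age:i,country:s" or "int,string,string".

#[derive(Debug, PartialEq)]
struct ColSpec {
    name: Option<String>,
    ty: String, // one of s/string/i/integer/ns/ni
}

fn parse_schema(schema: &str) -> Result<Vec<ColSpec>, String> {
    schema
        .split(',')
        .map(|entry| {
            // An entry is either `type` or `name:type`.
            let (name, ty) = match entry.split_once(':') {
                Some((n, t)) => (Some(n.to_string()), t),
                None => (None, entry),
            };
            match ty {
                "s" | "string" | "i" | "integer" | "ns" | "ni" => Ok(ColSpec {
                    name,
                    ty: ty.to_string(),
                }),
                other => Err(format!("unknown column type `{}`", other)),
            }
        })
        .collect()
}

fn main() {
    let cols = parse_schema("name:s,age:i,country:s").unwrap();
    assert_eq!(cols.len(), 3);
    assert_eq!(cols[0].name.as_deref(), Some("name"));
    assert_eq!(cols[1].ty, "i");
    // Entries without names are allowed too.
    assert!(parse_schema("int,string,string").is_ok());
    // Unknown types are rejected.
    assert!(parse_schema("age:float").is_err());
}
```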

Goals

A vision for LocustDB.

Fast

Query performance for analytics workloads is best-in-class on commodity hardware, both for data cached in memory and for data read from disk.

Cost-efficient

LocustDB automatically achieves spectacular compression ratios, has minimal indexing overhead, and requires fewer machines than any other system to store the same amount of data. The trade-off between performance and storage efficiency is configurable.

Low latency

New data is available for queries within seconds.

Scalable

LocustDB scales seamlessly from a single machine to large clusters.

Flexible and easy to use

LocustDB should be usable with minimal configuration or schema setup as:

  • a highly available distributed analytics system continuously ingesting data and executing queries
  • a commandline tool/repl for loading and analysing data from CSV files
  • an embedded database/query engine included in other Rust programs via cargo

Non-goals

Until LocustDB is production ready, these are distractions at best, if not wholly incompatible with the main goals.

Strong consistency and durability guarantees

  • small amounts of data may be lost during ingestion
  • when a node is unavailable, queries may return incomplete results
  • results returned by queries may not represent a consistent snapshot

High QPS

LocustDB does not efficiently execute queries inserting or operating on small amounts of data.

Full SQL support

  • All data is append only and can only be deleted/expired in bulk.
  • LocustDB does not support queries that cannot be evaluated independently by each node (large joins, complex subqueries, precise set sizes, precise top n).
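The restriction on queries that each node cannot evaluate independently can be seen with a toy example: group-by counts merge losslessly across nodes, but a precise "most frequent key" cannot be recovered from each node's local winner alone. This is a conceptual sketch, not LocustDB code.

```rust
// Toy illustration of which queries distribute across nodes.
// Each "node" holds a slice of the data and reports only local results.

use std::collections::HashMap;

// Per-node counts by key.
fn local_counts(data: &[&str]) -> HashMap<String, usize> {
    let mut m = HashMap::new();
    for k in data {
        *m.entry(k.to_string()) += 0; // placeholder removed below
    }
    m
}

fn main() {}
```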

Support for cost-inefficient or specialised hardware

LocustDB does not run on GPUs.

Compiling from source

  1. Install Rust: rustup.rs
  2. Clone the repository
git clone https://github.com/cswinter/LocustDB.git
cd LocustDB
  3. Compile with --release for optimal performance:
cargo run --release --bin repl -- --load test_data/nyc-taxi.csv.gz --reduced-trips

Running tests or benchmarks

cargo test

cargo bench

Storage backend

LocustDB has support for persisting data to disk and running queries on data stored on disk. This feature is disabled by default, and has to be enabled explicitly by passing --features "enable_rocksdb" to cargo during compilation. The storage backend uses RocksDB, a somewhat complex C++ dependency that has to be compiled from source and requires gcc and various libraries to be available. You will have to manually install those on your system; instructions can be found here. You may also have to install various other random tools until compilation succeeds.

LZ4

Compile with --features "enable_lz4" to enable an additional lz4 compression pass, which can significantly reduce data size both on disk and in memory, at the cost of slightly slower in-memory queries.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].