varchar-io / nebula

Licence: Apache-2.0 license
A distributed block-based data storage and compute engine

Programming Languages

C++
36643 projects - #6 most used programming language
CMake
9771 projects

Projects that are alternatives to or similar to nebula

Pachyderm
Reproducible Data Science at Scale!
Stars: ✭ 5,305 (+4077.17%)
Mutual labels:  distributed-systems, big-data, data-analysis
Hazelcast
Open-source distributed computation and storage platform
Stars: ✭ 4,662 (+3570.87%)
Mutual labels:  distributed-systems, big-data, distributed-computing
ripple
Simple shared surface streaming application
Stars: ✭ 17 (-86.61%)
Mutual labels:  distributed-systems, real-time, distributed-computing
tutorial
Tutorials to help you build your first Swim app
Stars: ✭ 27 (-78.74%)
Mutual labels:  distributed-systems, real-time, distributed-computing
Distributedsystems
My Distributed Systems references
Stars: ✭ 67 (-47.24%)
Mutual labels:  distributed-systems, distributed-computing
Distributedsystem Series
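📚 Distributed infrastructure explained in depth yet accessibly: Linux and operating systems | Distributed systems | Distributed computing | Databases | Networking | Virtualization and orchestration | Big data and cloud computing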
Stars: ✭ 1,092 (+759.84%)
Mutual labels:  distributed-systems, distributed-computing
Genie
Distributed Big Data Orchestration Service
Stars: ✭ 1,544 (+1115.75%)
Mutual labels:  distributed-systems, big-data
Qix
Machine Learning, Deep Learning, PostgreSQL, Distributed System, Node.Js, Golang
Stars: ✭ 13,740 (+10718.9%)
Mutual labels:  distributed-systems, distributed-computing
Distributed Consensus Reading List
A long list of academic papers on the topic of distributed consensus
Stars: ✭ 803 (+532.28%)
Mutual labels:  distributed-systems, distributed-computing
Scalecube Cluster
ScaleCube Cluster is a lightweight Java VM implementation of SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol. It features cluster membership, failure detection, and a gossip protocol library.
Stars: ✭ 119 (-6.3%)
Mutual labels:  distributed-systems, distributed-computing
Swellrt
SwellRT main project. Server, JavaScript and Java clients
Stars: ✭ 205 (+61.42%)
Mutual labels:  distributed-systems, real-time
Protoactor Dotnet
Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin
Stars: ✭ 1,070 (+742.52%)
Mutual labels:  distributed-systems, distributed-computing
Awesome Scalability
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
Stars: ✭ 36,688 (+28788.19%)
Mutual labels:  distributed-systems, big-data
Parapet
A purely functional library to build distributed and event-driven systems
Stars: ✭ 106 (-16.54%)
Mutual labels:  distributed-systems, distributed-computing
Construct
JavaScript Digital Organisms simulator
Stars: ✭ 17 (-86.61%)
Mutual labels:  distributed-systems, distributed-computing
Orleans.clustering.kubernetes
Orleans Membership provider for Kubernetes
Stars: ✭ 140 (+10.24%)
Mutual labels:  distributed-systems, distributed-computing
dislib
The Distributed Computing library for python implemented using PyCOMPSs programming model for HPC.
Stars: ✭ 39 (-69.29%)
Mutual labels:  big-data, distributed-computing
Gosiris
An actor framework for Go
Stars: ✭ 222 (+74.8%)
Mutual labels:  distributed-systems, distributed-computing
pyspark-algorithms
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (-43.31%)
Mutual labels:  big-data, distributed-computing
Titanoboa
Titanoboa makes complex workflows easy. It is a low-code workflow orchestration platform for JVM - distributed, highly scalable and fault tolerant.
Stars: ✭ 787 (+519.69%)
Mutual labels:  distributed-systems, big-data

Nebula

Extremely-fast Interactive Real-Time Analytics

Nebula is an extremely-fast end-to-end interactive big data analytics solution. Nebula is designed as a high-performance columnar data storage and tabular OLAP engine.

What is Nebula?

  • Extremely fast data analytics system with access control.
  • Distributed cache tier for tabular data.
  • Unified service API for any source (files, streaming, services, etc.).

Nebula can run on

  • Local box
  • VM cluster
  • Kubernetes

Design documents, internals, and stories will be shared in the project docs (under construction).

A Simple Story

In short, read this story and see whether it interests you:

  1. You have some data: files on cloud storage, a stream (e.g. Kafka), or even just a bunch of CSV files on GitHub. Pretty much any source will do.
  2. You deploy a Nebula cluster: a single box, a few EC2 machines on AWS, or a Kubernetes cluster. Nebula has no external dependencies, just a couple of binaries (or Docker images), so it is easy to maintain.
  3. You add a table definition to the cluster config file. Right away, you have these available:
    • A web UI where you can slice and dice your data for interactive visualization. You can also write scripts to transform your data on the server side.
    • A REST API that you can build your own application on.

Highlight - visualize your real-time streaming from Kafka

[Demo video]

Sounds interesting? Continue to read...

Introduction

With Nebula, you could easily:

Transform a column and aggregate by it with filters

To learn more, check out these resources:

  1. 10-minute quick tutorial video
  2. Nebula presentation slides

Get Started

Run an example instance with sample data locally

  1. clone the repo: git clone https://github.com/varchar-io/nebula.git
  2. build the latest code: cd nebula && ./build.sh
  3. launch services: ./run.sh (the script uses the test config file build/configs/test.yml, which you can modify to connect your own data)
  4. once everything is up and running, explore the Nebula UI in a browser: http://localhost:8088

Run an example instance with sample data on Kubernetes

Deploy a single-node k8s cluster on your local box. Assuming your current kubectl context points to that cluster, just run:

  • apply: kubectl apply -f deploy/k8s/nebula.yaml
  • forward: kubectl port-forward nebula/server 8088:8088
  • explore: http://localhost:8088

Build Source & Test

The whole repo can be built on either macOS or Linux. Just run ./build.sh.

After the source builds successfully, the binaries can be found in the ./build directory. You can then launch a simple cluster of "server" + "one worker" + "web server" like this:

  • launch node: ~/nebula/build% ./NodeServer
  • launch server: ~/nebula/build% ./NebulaServer --CLS_CONF configs/test.yml
  • launch web server: ~/nebula/src/service/http/nebula% NS_ADDR=localhost:9190 NODE_PORT=8081 node node.js

If everything goes as expected, now you should be able to explore and query the sample data from its UI at http://localhost:8081

Bird's-Eye View

Overview

Common Scenarios

As the previous section on running the sample locally shows, every Nebula data table is defined by a YAML section in the cluster config file (configs/test.yml in the example). Each use case demonstrated here is a table definition that you can copy into configs/test.yml and run in that test setup. Just replace the sample values, such as the schema and file location, with those of your own data.

CASE-1: Static Data Analytics

Configure a data source on permanent storage (a file system) and run analytics on it. AWS S3 and Azure Blob Storage are commonly used storage systems, with support for file formats such as CSV, Parquet, and ORC. These file formats and storage systems are frequently used in modern big data ecosystems.

For example, this simple config lets you analyze data stored on S3 with Nebula:

seattle.calls:
  retention:
    max-mb: 40000
    max-hr: 0
  schema: "ROW<cad:long, clearence:string, type:string, priority:int, init_type:string, final_type:string, queue_time:string, arrive_time:string, precinct:string, sector:string, beat:string>"
  data: s3
  loader: Swap
  source: s3://nebula/seattle_calls.10k.tsv
  backup: s3://nebula/n202/
  format: csv
  csv:
    hasHeader: true
    delimiter: ","
  time:
    type: column
    column: queue_time
    pattern: "%m/%d/%Y %H:%M:%S"
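
The time section above tells Nebula to read each row's event time from the queue_time column using a strptime-style pattern. As a rough illustration of what that pattern means, the sketch below parses one such string into Unix seconds. The function name parseQueueTime is hypothetical and not part of Nebula, and treating the value as UTC is an assumption, since the config does not state a time zone.

```javascript
// Hypothetical helper, not part of Nebula.
// Parses strings shaped like "%m/%d/%Y %H:%M:%S" into Unix seconds.
// Assumption: the timestamp is interpreted as UTC.
function parseQueueTime(text) {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4}) (\d{1,2}):(\d{2}):(\d{2})$/.exec(text);
  if (match === null) {
    throw new Error(`value does not match %m/%d/%Y %H:%M:%S: ${text}`);
  }
  const [month, day, year, hour, minute, second] = match.slice(1).map(Number);
  // Date.UTC expects a zero-based month and returns milliseconds.
  return Date.UTC(year, month - 1, day, hour, minute, second) / 1000;
}

console.log(parseQueueTime("08/16/2019 22:23:14")); // 1565994194
```

Rows whose queue_time does not match the pattern raise an error in this sketch; how Nebula itself treats such rows is not covered here.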

CASE-2: Realtime Data Analytics

Connect Nebula to a real-time data source such as Kafka, with data in Thrift or JSON format, and run real-time data analytics.

For example, this config section asks Nebula to connect to one Kafka topic for real-time code profiling:

  k.pinterest-code:
    retention:
      max-mb: 200000
      max-hr: 48
    schema: "ROW<service:string, host:string, tag:string, lang:string, stack:string>"
    data: kafka
    loader: Streaming
    source: <brokers>
    backup: s3://nebula/n116/
    format: json
    kafka:
      topic: <topic>
    columns:
      service:
        dict: true
      host:
        dict: true
      tag:
        dict: true
      lang:
        dict: true
    time:
      # Kafka injects a time column when the type is set to "provided"
      type: provided
    settings:
      batch: 500

CASE-3: Ephemeral Data Analytics

Define a template in Nebula and load data through the Nebula API so that the data lives for a specific period. Run analytics on Nebula to serve queries during this ephemeral data's lifetime.
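
The lifetime idea can be pictured with a small in-memory model. This is only a sketch under assumptions: the class EphemeralTable, its methods, and the time-to-live parameter are hypothetical and do not reflect the Nebula API, and the clock is passed in explicitly to keep the example deterministic.

```javascript
// Hypothetical model of an ephemeral table, not the Nebula API.
// Rows stay queryable only until their load time plus a time-to-live (TTL).
class EphemeralTable {
  constructor(ttlSeconds) {
    this.ttlSeconds = ttlSeconds;
    this.batches = []; // each batch: { loadedAt, rows }
  }

  load(rows, nowSeconds) {
    this.batches.push({ loadedAt: nowSeconds, rows });
  }

  // Drop expired batches, then return every row that is still alive.
  query(nowSeconds) {
    this.batches = this.batches.filter(
      (batch) => nowSeconds - batch.loadedAt < this.ttlSeconds
    );
    return this.batches.flatMap((batch) => batch.rows);
  }
}

const table = new EphemeralTable(3600); // data lives for one hour
table.load([{ id: 1 }, { id: 2 }], 0);
table.load([{ id: 3 }], 1800);
console.log(table.query(2000).length); // 3
console.log(table.query(4000).length); // 1
console.log(table.query(6000).length); // 0
```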

CASE-4: Sparse Storage

Break input data down into a huge number of small data cubes living on Nebula nodes. A simple predicate (filter) will then usually prune away most of the data to scan, giving very low latency in your analytics.
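
To make the pruning idea concrete, here is a minimal sketch that groups rows into blocks by a partition column and scans only the block a filter can match. It is an illustrative model, not Nebula's storage code; buildBlocks, scanPartition, and the sample rows are all made up for the example.

```javascript
// Illustrative model, not Nebula's storage code.
// Rows are grouped into blocks keyed by a partition column so that a
// filter on that column can skip whole blocks without scanning them.
function buildBlocks(rows, partitionColumn) {
  const blocks = new Map();
  for (const row of rows) {
    const key = row[partitionColumn];
    if (!blocks.has(key)) {
      blocks.set(key, []);
    }
    blocks.get(key).push(row);
  }
  return blocks;
}

// Scan only the block whose key matches, then apply the remaining predicate.
function scanPartition(blocks, key, predicate) {
  const block = blocks.get(key) || [];
  return {
    scannedRows: block.length,
    matches: block.filter(predicate),
  };
}

const sampleRows = [
  { id: 1, tag: "a", value: 3 },
  { id: 2, tag: "b", value: 9 },
  { id: 3, tag: "a", value: 8 },
  { id: 4, tag: "c", value: 7 },
  { id: 5, tag: "b", value: 1 },
  { id: 6, tag: "a", value: 6 },
];
const blocks = buildBlocks(sampleRows, "tag");
const result = scanPartition(blocks, "a", (row) => row.value > 5);
console.log(result.scannedRows, "of", sampleRows.length, "rows scanned"); // 3 of 6 rows scanned
console.log(result.matches.map((row) => row.id)); // [ 3, 6 ]
```

With many more partition values, the fraction of rows touched by a single-value filter shrinks accordingly.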

For example, configure internal partitioning that leverages sparse storage for very fast pruning of queries targeting a specific dimension. (This example also demonstrates how to set up column-level access control: the access group and access action for specific columns.)

  nebula.test:
    retention:
      # max 10G RAM assignment
      max-mb: 10000
      # max 10 days assignment
      max-hr: 240
    schema: "ROW<id:int, event:string, tag:string, items:list<string>, flag:bool, value:tinyint>"
    data: custom
    loader: NebulaTest
    source: ""
    backup: s3://nebula/n100/
    format: none
    # NOTE: reference only, column properties defined here will not take effect
    # because they are overwritten/decided by definition of TestTable.h
    columns:
      id:
        bloom_filter: true
      event:
        access:
          read:
            groups: ["nebula-users"]
            action: mask
      tag:
        partition:
          values: ["a", "b", "c"]
          chunk: 1
    time:
      type: static
      # get it from linux by "date +%s"
      value: 1565994194
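
The access block on the event column names a group (nebula-users) and an action (mask). One plausible reading, stated here as an assumption rather than documented behavior, is that readers outside the group see a masked value instead of the real one. The sketch below models that reading; applyReadAccess and the "***" placeholder are hypothetical.

```javascript
// Hypothetical model of column-level read access, not Nebula's implementation.
// The rule mirrors the "event" column in the config above.
const accessRules = {
  event: { read: { groups: ["nebula-users"], action: "mask" } },
};

// Assumption: a masked value is shown as a fixed placeholder string.
const MASK = "***";

function applyReadAccess(row, userGroups, rules) {
  const visible = {};
  for (const [column, value] of Object.entries(row)) {
    const rule = rules[column] ? rules[column].read : undefined;
    // Columns without a rule are readable by everyone.
    const allowed =
      rule === undefined ||
      rule.groups.some((group) => userGroups.includes(group));
    if (allowed) {
      visible[column] = value;
    } else if (rule.action === "mask") {
      visible[column] = MASK;
    } else {
      throw new Error(`unsupported access action: ${rule.action}`);
    }
  }
  return visible;
}

const sampleRow = { id: 7, event: "checkout", tag: "a" };
console.log(applyReadAccess(sampleRow, ["nebula-users"], accessRules));
console.log(applyReadAccess(sampleRow, ["guests"], accessRules));
```

The first call returns the row unchanged; the second returns the same row with event replaced by the placeholder.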

SDK: Nebula Is Programmable

Through the great QuickJS project, Nebula supports full ES6 programming in its simple UI code editor. Below is a code snippet that generates a pie chart from SQL-like query code written in JS.

The demo video at the top of the page shows how the Nebula client SDK is used, with tables and charts generated in milliseconds!

    // define a customized column
    const colx = () => nebula.column("value") % 20;
    nebula.apply("colx", nebula.Type.INT, colx);

    // query a data set stored in HTTPS or S3
    nebula
        .source("nebula.test")
        .time("2020-08-16", "2020-08-26")
        .select("colx", count("id"))
        .where(and(gt("id", 5), eq("flag", true)))
        .sortby(nebula.Sort.DESC)
        .limit(10)
        .run();
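
To show what the query above computes, here is a plain JavaScript model of the same steps run on a small in-memory array. It does not use the Nebula SDK. The sample rows and the function countByColx are invented for illustration, the time range is ignored, and sorting by the count in descending order is an assumption about what sortby(nebula.Sort.DESC) applies to.

```javascript
// Plain-JavaScript model of the query above; it does not call the Nebula SDK.
const rows = [
  { id: 3, flag: true, value: 21 },
  { id: 6, flag: true, value: 41 },
  { id: 7, flag: true, value: 1 },
  { id: 8, flag: false, value: 1 },
  { id: 9, flag: true, value: 25 },
  { id: 10, flag: true, value: 45 },
  { id: 11, flag: true, value: 5 },
];

function countByColx(data, limit) {
  const counts = new Map();
  for (const row of data) {
    // where(and(gt("id", 5), eq("flag", true)))
    if (!(row.id > 5 && row.flag === true)) continue;
    // custom column: colx = value % 20
    const colx = row.value % 20;
    // count("id") grouped by colx
    counts.set(colx, (counts.get(colx) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([colx, count]) => ({ colx, count }))
    .sort((a, b) => b.count - a.count) // assumed: descending by count
    .slice(0, limit);
}

console.log(countByColx(rows, 10));
// [ { colx: 5, count: 3 }, { colx: 1, count: 2 } ]
```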

Open source

Open source is wonderful - it is the reason we can build software and innovate on top of the work of others. Without these great open source projects, Nebula would not be possible:

Many others are used by Nebula:

  • common tools (glog/gflags/gtest/yaml-cpp/fmt/leveldb)
  • serde (msgpack/rapidjson/rdkafka)
  • algos (xxhash/roaring bitmap/zstd/lz4)
  • ...

Adoptions

Pinterest
