License: Apache-2.0

DataFusion: Modern Distributed Compute Platform implemented in Rust

DataFusion is an attempt at building a modern distributed compute platform in Rust, leveraging Apache Arrow as the memory model.

NOTE: DataFusion was donated to the Apache Arrow project in February 2019. Source is here.

See my article How To Build a Modern Distributed Compute Platform to learn about the design and my motivation for building this. The TL;DR is that this project is a great way to learn about building a query engine, but it is still early and not yet usable for real-world work.

Status

The current code supports single-threaded execution of limited SQL queries (projection, selection, and aggregates) against CSV files. Parquet files will be supported shortly.

To use DataFusion as a crate dependency, add the following to your Cargo.toml:

[dependencies]
datafusion = "0.6.0"

Here is a brief example of running a SQL query against a CSV file; see the examples directory for complete examples.

// NOTE: import paths below assume datafusion 0.6.0 and the arrow crate
// version it depends on; they may differ in other versions.
extern crate arrow;
extern crate datafusion;

use std::cell::RefCell;
use std::rc::Rc;
use std::sync::Arc;

use arrow::array::{BinaryArray, Float64Array};
use arrow::datatypes::{DataType, Field, Schema};
use datafusion::exec::*;

fn main() {
    // create local execution context
    let mut ctx = ExecutionContext::new();

    // define schema for data source (csv file)
    let schema = Arc::new(Schema::new(vec![
        Field::new("city", DataType::Utf8, false),
        Field::new("lat", DataType::Float64, false),
        Field::new("lng", DataType::Float64, false),
    ]));

    // register csv file with the execution context
    let csv_datasource = CsvDataSource::new("test/data/uk_cities.csv", schema.clone(), 1024);
    ctx.register_datasource("cities", Rc::new(RefCell::new(csv_datasource)));

    // simple projection and selection
    let sql = "SELECT city, lat, lng FROM cities WHERE lat > 51.0 AND lat < 53.0";

    // execute the query
    let relation = ctx.sql(&sql).unwrap();

    // display the relation
    let mut results = relation.borrow_mut();

    while let Some(batch) = results.next().unwrap() {

        println!(
            "RecordBatch has {} rows and {} columns",
            batch.num_rows(),
            batch.num_columns()
        );

        let city = batch
            .column(0)
            .as_any()
            .downcast_ref::<BinaryArray>()
            .unwrap();

        let lat = batch
            .column(1)
            .as_any()
            .downcast_ref::<Float64Array>()
            .unwrap();

        let lng = batch
            .column(2)
            .as_any()
            .downcast_ref::<Float64Array>()
            .unwrap();

        for i in 0..batch.num_rows() {
            let city_name: String = String::from_utf8(city.get_value(i).to_vec()).unwrap();

            println!(
                "City: {}, Latitude: {}, Longitude: {}",
                city_name,
                lat.value(i),
                lng.value(i),
            );
        }
    }
}

Roadmap

See ROADMAP.md for the full roadmap.

Prerequisites

  • Rust nightly (required by the parquet-rs crate)

Building DataFusion

See BUILDING.md.

Gitter

There is a Gitter channel where you can ask questions about the project or make feature suggestions.

Contributing

Contributors are welcome! Please see CONTRIBUTING.md for details.
