
Amadeus

Harmonious distributed data processing & analysis in Rust


📖 Docs | 🌐 Home | 💬 Chat

Amadeus provides:

  • Distributed streams: like Rayon's parallel iterators, but distributed across a cluster.
  • Data connectors: to work with CSV, JSON, Parquet, Postgres, S3 and more.
  • ETL and Data Science tooling: focused on stream processing & analysis.

Amadeus is a batteries-included, low-level reusable building block for the Rust Distributed Computing and Big Data ecosystems.
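The "write once, run at any data scale" idea is easiest to see in miniature. The following is a plain standard-library sketch (no Amadeus APIs, purely conceptual) of the scatter-gather pattern that a parallel or distributed stream automates: partition the data, process partitions concurrently, combine the results.

```rust
use std::thread;

// Conceptual sketch: split a workload across worker threads and
// combine the partial results, the way a parallel/distributed
// stream splits work across cores or cluster processes.
fn parallel_sum(data: Vec<u64>, workers: usize) -> u64 {
    let workers = workers.max(1);
    let chunk = ((data.len() + workers - 1) / workers).max(1);
    let handles: Vec<_> = data
        .chunks(chunk)
        .map(|part| {
            let part = part.to_vec(); // each worker owns its partition
            thread::spawn(move || part.iter().map(|x| x * x).sum::<u64>())
        })
        .collect();
    // Gather: join the workers and combine their partial sums.
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let total = parallel_sum((1..=10).collect(), 4);
    println!("{}", total); // sum of squares 1..=10 = 385
}
```

A distributed stream applies the same shape of computation, except the partitions may live on other machines and the "threads" are cluster processes.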

Principles

  • Fearless: no data races, no unsafe, and lossless data canonicalization.
  • Make distributed computing trivial: running distributed should be as easy and performant as running locally.
  • Data is gradually typed: for maximum performance when the schema is known, and flexibility when it's not.
  • Simplicity: keep interfaces and implementations as simple and reliable as possible.
  • Reliability: minimize unhandled errors (including OOM), and only surface errors that couldn't be handled internally.

Why Amadeus?

Clean & Scalable applications

By design, Amadeus encourages you to write clean and reusable code that works, regardless of data scale, locally or distributed across a cluster. Write once, run at any data scale.

Community

We aim to create a community that is welcoming and helpful to anyone who is interested! Come join us on our Zulip chat to:

  • get Amadeus working for your use case;
  • discuss direction for the project;
  • find good issues to get started with.

Compatibility out of the box

Amadeus has deep, pluggable integration with various file formats, databases and interfaces:

Data format       Source   Destination
CSV               ✔        ✔
JSON              ✔        ✔
XML               👐       –
Parquet           ✔        🔨
Avro              🔨       –
PostgreSQL        ✔        🔨
HDF5              👐       –
Redshift          👐       –
CloudFront Logs   ✔        –
Common Crawl      ✔        –
S3                ✔        🔨
HDFS              👐       👐

✔ = Working
🔨 = Work in Progress
👐 = Requested: check out the issue for how to help!

Performance

Amadeus is routinely benchmarked and provisional results are very promising:

  • A 1.5x to 17x speedup reading Parquet data compared to the official Apache Arrow parquet crate with these benchmarks.

Runs Everywhere

Amadeus is a library that can be used on its own as a parallel threadpool, or with Constellation as a distributed cluster.

Constellation is a framework for process distribution and communication, and has backends for a bare cluster (Linux or macOS), a managed Kubernetes cluster, and more in the pipeline.

Examples

This example reads the Parquet partitions from an S3 bucket and prints the 100 URLs with the most distinct request IPs.

use amadeus::prelude::*;
use amadeus::data::{IpAddr, Url};
use std::error::Error;

#[derive(Data, Clone, PartialEq, Debug)]
struct LogLine {
    uri: Option<Url>,
    requestip: Option<IpAddr>,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let pool = ThreadPool::new(None, None)?;

    let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(
        AwsRegion::UsEast1,
        "us-east-1.data-analytics",
        "cflogworkshop/optimized/cf-accesslogs/",
        AwsCredentials::Anonymous,
    )))
    .await?;

    let top_pages = rows
        .par_stream()
        .map(|row: Result<LogLine, _>| {
            let row = row.unwrap();
            (row.uri, row.requestip)
        })
        .most_distinct(&pool, 100, 0.99, 0.002, 0.0808)
        .await;

    println!("{:#?}", top_pages);
    Ok(())
}
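The `most_distinct(&pool, 100, 0.99, 0.002, 0.0808)` call above asks for the 100 keys with the most distinct associated values; the remaining arguments are accuracy/probability tolerances for the underlying streaming sketches, which let the answer be computed in bounded memory. As a mental model, here is an exact, in-memory equivalent of the same analytic using only std collections (an illustrative sketch, not the Amadeus API):

```rust
use std::collections::{HashMap, HashSet};

// Exact (non-streaming) version of "top-n keys by distinct values":
// count the distinct values per key with a HashSet, then sort.
// `most_distinct` computes the same answer approximately, in bounded
// memory, using probabilistic sketches.
fn top_n_by_distinct<K, V>(pairs: impl IntoIterator<Item = (K, V)>, n: usize) -> Vec<(K, usize)>
where
    K: std::hash::Hash + Eq,
    V: std::hash::Hash + Eq,
{
    let mut sets: HashMap<K, HashSet<V>> = HashMap::new();
    for (k, v) in pairs {
        sets.entry(k).or_default().insert(v);
    }
    let mut counts: Vec<(K, usize)> =
        sets.into_iter().map(|(k, s)| (k, s.len())).collect();
    counts.sort_by(|a, b| b.1.cmp(&a.1)); // most distinct values first
    counts.truncate(n);
    counts
}

fn main() {
    let logs = vec![
        ("/a", "1.1.1.1"),
        ("/a", "2.2.2.2"),
        ("/b", "1.1.1.1"),
        ("/a", "1.1.1.1"), // duplicate IP for /a, not counted twice
    ];
    println!("{:?}", top_n_by_distinct(logs, 1)); // [("/a", 2)]
}
```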

This example is statically typed, so faster, and it also goes one analytics step further: it ranks URLs by the number of distinct IPs logged rather than by raw frequency.

Here is the same example with the data dynamically typed, printing the 100 most frequently occurring URLs:

use amadeus::prelude::*;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let pool = ThreadPool::new(None, None)?;

    let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(
        AwsRegion::UsEast1,
        "us-east-1.data-analytics",
        "cflogworkshop/optimized/cf-accesslogs/",
        AwsCredentials::Anonymous,
    )))
    .await?;

    let top_pages = rows
        .par_stream()
        .map(|row: Result<Value, _>| {
            let row = row.ok()?.into_group().ok()?;
            row.get("uri")?.clone().into_url().ok()
        })
        .filter(|row| row.is_some())
        .map(Option::unwrap)
        .most_frequent(&pool, 100, 0.99, 0.002)
        .await;

    println!("{:#?}", top_pages);
    Ok(())
}
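The dynamic version works because `Value` is a self-describing sum type whose downcasts (`into_group`, `into_url`) can fail at runtime rather than at compile time. A much-simplified model of that gradual-typing idea follows; this is an illustrative stand-in, not Amadeus's actual `Value` type:

```rust
use std::collections::HashMap;

// Simplified stand-in for a self-describing value type: the data
// carries its own shape, and downcasts return Result so a mismatch
// surfaces at runtime instead of compile time.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Url(String),
    Group(HashMap<String, Value>),
}

impl Value {
    fn into_group(self) -> Result<HashMap<String, Value>, Value> {
        match self {
            Value::Group(g) => Ok(g),
            other => Err(other),
        }
    }
    fn into_url(self) -> Result<String, Value> {
        match self {
            Value::Url(u) => Ok(u),
            other => Err(other),
        }
    }
}

fn main() {
    let mut fields = HashMap::new();
    fields.insert("uri".to_owned(), Value::Url("/index.html".to_owned()));
    let row = Value::Group(fields);

    // Mirrors the `.into_group() … get("uri") … into_url()` chain above.
    let uri = row
        .into_group()
        .ok()
        .and_then(|g| g.get("uri").cloned())
        .and_then(|v| v.into_url().ok());
    println!("{:?}", uri); // Some("/index.html")
}
```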

What about loading this data into Postgres? This would create and populate a table called "accesslogs".

use amadeus::prelude::*;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let pool = ThreadPool::new(None, None)?;

    let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(
        AwsRegion::UsEast1,
        "us-east-1.data-analytics",
        "cflogworkshop/optimized/cf-accesslogs/",
        AwsCredentials::Anonymous,
    )))
    .await?;

    // Note: this isn't yet implemented!
    rows.par_stream()
        .pipe(Postgres::new("127.0.0.1", PostgresTable::new("accesslogs")));

    Ok(())
}

Running Distributed

Operations can run on a parallel threadpool or on a distributed process pool.

Amadeus uses the Constellation framework for process distribution and communication. Constellation has backends for a bare cluster (Linux or macOS), and a managed Kubernetes cluster.

use amadeus::dist::prelude::*;
use amadeus::data::{IpAddr, Url};
use constellation::*;
use std::error::Error;

#[derive(Data, Clone, PartialEq, Debug)]
struct LogLine {
    uri: Option<Url>,
    requestip: Option<IpAddr>,
}

fn main() -> Result<(), Box<dyn Error>> {
    init(Resources::default());

    // #[tokio::main] isn't supported yet so unfortunately setting up the Runtime must be done explicitly
    tokio::runtime::Builder::new()
        .threaded_scheduler()
        .enable_all()
        .build()
        .unwrap()
        .block_on(async {
            let pool = ProcessPool::new(None, None, None, Resources::default())?;

            let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(
                AwsRegion::UsEast1,
                "us-east-1.data-analytics",
                "cflogworkshop/optimized/cf-accesslogs/",
                AwsCredentials::Anonymous,
            )))
            .await?;

            let top_pages = rows
                .dist_stream()
                .map(FnMut!(|row: Result<LogLine, _>| {
                    let row = row.unwrap();
                    (row.uri, row.requestip)
                }))
                .most_distinct(&pool, 100, 0.99, 0.002, 0.0808)
                .await;

            println!("{:#?}", top_pages);
            Ok(())
        })
}

Getting started

todo

Examples

Take a look at the various examples.

Contribution

Amadeus is an open source project! If you'd like to contribute, check out the list of “good first issues”. These are all (or should be) issues that are suitable for getting started, and they generally include a detailed set of instructions for what to do. Please ask questions and ping us on our Zulip chat if anything is unclear!

License

Licensed under Apache License, Version 2.0, (LICENSE.txt or http://www.apache.org/licenses/LICENSE-2.0).

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be licensed as above, without any additional terms or conditions.
