
apache / Beam

License: apache-2.0
Apache Beam is a unified programming model for Batch and Streaming

Programming Languages

python
139335 projects - #7 most used programming language
java
68154 projects - #9 most used programming language
go
31211 projects - #10 most used programming language
groovy
2714 projects
dart
5743 projects
shell
77523 projects

Projects that are alternatives of or similar to Beam

Scio
A Scala API for Apache Beam and Google Cloud Dataflow.
Stars: ✭ 2,247 (-56.36%)
Mutual labels:  batch, streaming, beam
Materialize
Materialize lets you ask questions of your live data, which it answers and then maintains for you as your data continue to change. The moment you need a refreshed answer, you can get it in milliseconds. Materialize is designed to help you interactively explore your streaming data, perform data warehousing analytics against live relational data, or just increase the freshness and reduce the load of your dashboard and monitoring tasks.
Stars: ✭ 3,341 (-35.11%)
Mutual labels:  sql, streaming
Calcite
Apache Calcite
Stars: ✭ 2,816 (-45.31%)
Mutual labels:  sql, big-data
openmessaging.github.io
OpenMessaging homepage
Stars: ✭ 12 (-99.77%)
Mutual labels:  streaming, batch
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-97.09%)
Mutual labels:  sql, big-data
Presto
The official home of the Presto distributed SQL query engine for big data
Stars: ✭ 12,957 (+151.64%)
Mutual labels:  sql, big-data
beam-site
Apache Beam Site
Stars: ✭ 28 (-99.46%)
Mutual labels:  big-data, beam
Maha
A framework for rapid reporting API development, with out-of-the-box support for high-cardinality dimension lookups with Druid.
Stars: ✭ 101 (-98.04%)
Mutual labels:  sql, big-data
Crate
CrateDB is a distributed SQL database that makes it simple to store and analyze massive amounts of data in real-time.
Stars: ✭ 3,254 (-36.8%)
Mutual labels:  sql, big-data
Sylph
Stream computing platform for bigdata
Stars: ✭ 362 (-92.97%)
Mutual labels:  sql, big-data
Metorikku
A simplified, lightweight ETL Framework based on Apache Spark
Stars: ✭ 361 (-92.99%)
Mutual labels:  sql, big-data
Efcore.bulkextensions
Entity Framework Core Bulk Batch Extensions for Insert Update Delete Read (CRUD), Truncate and SaveChanges operations on SQL Server, PostgreSQL, SQLite
Stars: ✭ 2,295 (-55.43%)
Mutual labels:  sql, batch
Join Monster Graphql Tools Adapter
Use Join Monster to fetch your data with Apollo Server.
Stars: ✭ 130 (-97.48%)
Mutual labels:  sql, batch
Presto Go Client
A Presto client for the Go programming language.
Stars: ✭ 183 (-96.45%)
Mutual labels:  sql, big-data
Calcite Avatica
Mirror of Apache Calcite - Avatica
Stars: ✭ 130 (-97.48%)
Mutual labels:  sql, big-data
Clickhouse
ClickHouse® is a free analytics DBMS for big data
Stars: ✭ 21,089 (+309.57%)
Mutual labels:  sql, big-data
Ignite
Apache Ignite
Stars: ✭ 4,027 (-21.79%)
Mutual labels:  sql, big-data
Spark Website
Apache Spark Website
Stars: ✭ 75 (-98.54%)
Mutual labels:  sql, big-data
Fiflow
flink-sql: a platform for running SQL and building data flows on Flink, based on Apache Flink 1.10.0
Stars: ✭ 100 (-98.06%)
Mutual labels:  sql, streaming
Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stars: ✭ 4,581 (-11.03%)
Mutual labels:  sql, big-data

Apache Beam

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends, including Apache Flink, Apache Spark, Google Cloud Dataflow, and Hazelcast Jet.

Status

[Badges: Maven version, PyPI version, Python coverage, and build/test status; see the rendered README for live badges.]

Post-commit tests status (on master branch)

[Badge matrix: per-runner build status for the Go, Java, Python, and XLang SDKs across the Dataflow, Flink, Samza, Spark, and Twister2 runners; see the rendered README for current results.]

Overview

Beam provides a general approach to expressing embarrassingly parallel data processing pipelines and supports three categories of users, each of which has relatively disparate backgrounds and needs.

  1. End Users: Writing pipelines with an existing SDK and running them on an existing runner. These users want to focus on writing their application logic and have everything else just work.
  2. SDK Writers: Developing a Beam SDK targeted at a specific user community (Java, Python, Scala, Go, R, graphical, etc.). These users are language geeks and would prefer to be shielded from all the details of various runners and their implementations.
  3. Runner Writers: Adapting an execution environment for distributed processing to run programs written against the Beam Model. These users would prefer to be shielded from the details of multiple SDKs.

The Beam Model

The model behind Beam evolved from a number of internal Google data processing projects, including MapReduce, FlumeJava, and MillWheel. This model was originally known as the “Dataflow Model”.

To learn more about the Beam Model (though still under the original name of Dataflow), see the World Beyond Batch: Streaming 101 and Streaming 102 posts on O’Reilly’s Radar site, and the VLDB 2015 paper.

The key concepts in the Beam programming model are:

  • PCollection: represents a collection of data, which could be bounded or unbounded in size.
  • PTransform: represents a computation that transforms input PCollections into output PCollections.
  • Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
  • PipelineRunner: specifies where and how the pipeline should execute.
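
A minimal sketch of how these concepts fit together, using the Python SDK (the transform labels and sample data are illustrative; apache-beam must be installed):

  import apache_beam as beam
  from apache_beam.options.pipeline_options import PipelineOptions

  # The PipelineRunner is chosen via options; DirectRunner executes locally.
  options = PipelineOptions(runner='DirectRunner')

  # The Pipeline manages the DAG of PTransforms and PCollections.
  with beam.Pipeline(options=options) as p:
      lines = p | 'Create' >> beam.Create(['hello world', 'hello beam'])  # a bounded PCollection
      counts = (
          lines
          | 'Split' >> beam.FlatMap(str.split)           # PTransform: line -> words
          | 'PairWithOne' >> beam.Map(lambda w: (w, 1))  # PTransform: word -> (word, 1)
          | 'CountPerWord' >> beam.CombinePerKey(sum)    # PTransform: aggregate per key
      )
      counts | 'Print' >> beam.Map(print)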

SDKs

Beam supports multiple language-specific SDKs for writing pipelines against the Beam Model.

Currently, this repository contains SDKs for Java, Python, and Go.

Have ideas for new SDKs or DSLs? See the JIRA.

Runners

Beam supports executing programs on multiple distributed processing backends through PipelineRunners. Currently, the following PipelineRunners are available:

  • The DirectRunner runs the pipeline on your local machine.
  • The DataflowRunner submits the pipeline to Google Cloud Dataflow.
  • The FlinkRunner runs the pipeline on an Apache Flink cluster. The code has been donated from dataArtisans/flink-dataflow and is now part of Beam.
  • The SparkRunner runs the pipeline on an Apache Spark cluster. The code has been donated from cloudera/spark-dataflow and is now part of Beam.
  • The JetRunner runs the pipeline on a Hazelcast Jet cluster. The code has been donated from hazelcast/hazelcast-jet and is now part of Beam.
  • The Twister2Runner runs the pipeline on a Twister2 cluster. The code has been donated from DSC-SPIDAL/twister2 and is now part of Beam.
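
Switching runners is typically a matter of pipeline options rather than pipeline code. Below is a sketch using the Python SDK; the Google Cloud project, region, and bucket are placeholder values, and each remote runner needs backend-specific options along these lines:

  from apache_beam.options.pipeline_options import PipelineOptions

  # Local execution on this machine:
  local_options = PipelineOptions(runner='DirectRunner')

  # Submission to Google Cloud Dataflow (all values below are placeholders):
  dataflow_options = PipelineOptions(
      runner='DataflowRunner',
      project='my-gcp-project',            # placeholder GCP project id
      region='us-central1',
      temp_location='gs://my-bucket/tmp',  # placeholder staging bucket
  )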

Have ideas for new Runners? See the JIRA.

Getting Started

To learn how to write Beam pipelines, read the Quickstart for Java, Python, or Go, available on our website.
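
As a quick local check after installing the Python SDK, the WordCount example bundled in this repository can also be invoked from Python. This sketch assumes the example's run() entry point, which accepts command-line-style flags; the input and output paths are placeholders:

  # Runs the canonical WordCount example shipped with the Python SDK on the
  # default DirectRunner; point --input at any local text file.
  from apache_beam.examples import wordcount

  wordcount.run(argv=['--input', '/tmp/input.txt', '--output', '/tmp/counts'])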

Contact Us

To get involved in Apache Beam, see the contribution guide, which also covers building and testing Beam itself.

More Information

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].