
spotify / Scio

Licence: apache-2.0
A Scala API for Apache Beam and Google Cloud Dataflow.


Projects that are alternatives to or similar to Scio

bigflow
A Python framework for data processing on GCP.
Stars: ✭ 96 (-95.73%)
Mutual labels:  bigquery, beam, dataflow
Beam
Apache Beam is a unified programming model for Batch and Streaming
Stars: ✭ 5,149 (+129.15%)
Mutual labels:  batch, streaming, beam
bigquery-to-datastore
Export a whole BigQuery table to Google Datastore with Apache Beam/Google Dataflow
Stars: ✭ 56 (-97.51%)
Mutual labels:  bigquery, beam, google-cloud
Gcp Variant Transforms
GCP Variant Transforms
Stars: ✭ 100 (-95.55%)
Mutual labels:  dataflow, bigquery, beam
Onyx
Distributed, masterless, high performance, fault tolerant data processing
Stars: ✭ 2,019 (-10.15%)
Mutual labels:  batch, data, streaming
Openmessaging Java
OpenMessaging Runtime Interface for Java
Stars: ✭ 685 (-69.51%)
Mutual labels:  batch, streaming
Raftlib
The RaftLib C++ library, streaming/dataflow concurrency via C++ iostream-like operators
Stars: ✭ 717 (-68.09%)
Mutual labels:  dataflow, streaming
Pothosblocks
A collection of core processing blocks
Stars: ✭ 7 (-99.69%)
Mutual labels:  dataflow, streaming
Pycm
Multi-class confusion matrix library in Python
Stars: ✭ 1,076 (-52.11%)
Mutual labels:  data, ml
Featran
A Scala feature transformation library for data science and machine learning
Stars: ✭ 420 (-81.31%)
Mutual labels:  data, ml
Attention Ocr
A Tensorflow model for text recognition (CNN + seq2seq with visual attention) available as a Python package and compatible with Google Cloud ML Engine.
Stars: ✭ 844 (-62.44%)
Mutual labels:  google-cloud, ml
Athenax
SQL-based streaming analytics platform at scale
Stars: ✭ 1,178 (-47.57%)
Mutual labels:  data, streaming
Streaming Readings
Papers and reading materials related to streaming systems
Stars: ✭ 554 (-75.34%)
Mutual labels:  dataflow, streaming
Awesome Ai Ml Dl
Awesome Artificial Intelligence, Machine Learning and Deep Learning as we learn it. Study notes and a curated list of awesome resources of such topics.
Stars: ✭ 831 (-63.02%)
Mutual labels:  data, ml
Lexpredict Lexnlp
LexNLP by LexPredict
Stars: ✭ 439 (-80.46%)
Mutual labels:  data, ml
Ethereum Etl
Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
Stars: ✭ 956 (-57.45%)
Mutual labels:  google-cloud, bigquery
Codesearchnet
Datasets, tools, and benchmarks for representation learning of code.
Stars: ✭ 1,378 (-38.67%)
Mutual labels:  data, ml
Spark Bigquery Connector
BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Stars: ✭ 126 (-94.39%)
Mutual labels:  google-cloud, bigquery
Tesseract
A set of libraries for rapidly developing Pipeline driven micro/macroservices.
Stars: ✭ 20 (-99.11%)
Mutual labels:  data, dataflow
firehose
Firehose is an extensible, no-code, and cloud-native service to load real-time streaming data from Kafka to data stores, data lakes, and analytical storage systems.
Stars: ✭ 213 (-90.52%)
Mutual labels:  bigquery, streaming

Scio


Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o] Verb: I can, know, understand, have knowledge.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow inspired by Apache Spark and Scalding.

Scio 0.3.0 and future versions depend on Apache Beam (org.apache.beam) while earlier versions depend on Google Cloud Dataflow SDK (com.google.cloud.dataflow). See this page for a list of breaking changes.

Features

  • Scala API close to that of Spark and Scalding core APIs
  • Unified batch and streaming programming model
  • Fully managed service*
  • Integration with Google Cloud products: Cloud Storage, BigQuery, Pub/Sub, Datastore, Bigtable
  • JDBC, TensorFlow TFRecords, Cassandra, Elasticsearch and Parquet I/O
  • Interactive mode with Scio REPL
  • Type safe BigQuery
  • Integration with Algebird and Breeze
  • Pipeline orchestration with Scala Futures
  • Distributed cache

* provided by Google Cloud Dataflow

Quick Start

Download and install the Java Development Kit (JDK) version 8.

Install sbt.

Use our giter8 template to quickly create a new Scio job repository:

sbt new spotify/scio.g8

Switch to the new repo (default scio-job) and build it:

cd scio-job
sbt stage

Run the included word count example:

target/universal/stage/bin/scio-job --output=wc

List result files and inspect content:

ls -l wc
cat wc/part-00000-of-00004.txt
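For orientation, a word count job of the kind run above could look roughly like the following. This is a minimal sketch, not necessarily the code generated by the template: the object name, the `tokenize` helper, and the default input path are assumptions made for illustration, and `sc.run().waitUntilFinish()` assumes a Scio release that provides `run()` (0.8 or later).

```scala
import com.spotify.scio._

object WordCount {
  // Hypothetical helper: split a line into words, dropping empty tokens.
  def tokenize(line: String): Seq[String] =
    line.split("[^a-zA-Z']+").filter(_.nonEmpty).toSeq

  def main(cmdlineArgs: Array[String]): Unit = {
    // Parse Beam pipeline options and job-specific arguments from the command line.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    // Assumed default input (a public Beam sample file); --output is required.
    val input = args.getOrElse("input", "gs://apache-beam-samples/shakespeare/kinglear.txt")
    val output = args("output")

    sc.textFile(input)                               // read lines of text
      .flatMap(tokenize)                             // one element per word
      .countByValue                                  // (word, count) pairs
      .map { case (word, count) => s"$word: $count" } // format as text
      .saveAsTextFile(output)                        // write sharded part files

    // Submit the pipeline and block until it completes.
    sc.run().waitUntilFinish()
  }
}
```

The sharded `part-*` files listed by `ls -l wc` are what `saveAsTextFile` writes; the number of shards is chosen by the runner unless set explicitly.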

Documentation

Getting Started is the best place to start with Scio. If you are new to Apache Beam and distributed data processing, read the Beam Programming Guide first for a detailed explanation of the Beam programming model and concepts. If you have experience with other Scala data processing libraries, see this comparison between Scio, Scalding and Spark. Finally, see this document about the relationship between Scio, Beam and Dataflow.

Example Scio pipelines and tests can be found under scio-examples. Many of them are direct ports of Beam's Java examples. See this page for some of them with side-by-side explanations. Also see Big Data Rosetta Code for common data processing code snippets in Scio, Scalding and Spark.
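As a rough illustration of what a pipeline test can look like, the sketch below uses `PipelineSpec` from scio-test to run a small in-memory pipeline and assert on its output. The class name and the sample data are made up for this example; it assumes scio-test and ScalaTest are on the test classpath.

```scala
import com.spotify.scio.testing.PipelineSpec

// Hypothetical test class, for illustration only.
class WordCountLogicTest extends PipelineSpec {
  "word counting" should "count each distinct word" in {
    // runWithContext builds a test ScioContext, runs the body, then executes the pipeline.
    runWithContext { sc =>
      val counts = sc
        .parallelize(Seq("a b", "a")) // in-memory input lines
        .flatMap(_.split(" "))        // one element per word
        .countByValue                 // (word, count) pairs

      // The assertion ignores element order, which is not defined for an SCollection.
      counts should containInAnyOrder(Seq(("a", 2L), ("b", 1L)))
    }
  }
}
```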

Artifacts

Scio includes the following artifacts:

  • scio-core: core library
  • scio-test: test utilities, add to your project as a "test" dependency
  • scio-avro: add-on for Avro, can also be used standalone
  • scio-google-cloud-platform: add-on for Google Cloud IOs: BigQuery, Bigtable, Pub/Sub, Datastore, Spanner
  • scio-cassandra*: add-ons for Cassandra
  • scio-elasticsearch*: add-ons for Elasticsearch
  • scio-extra: extra utilities for working with collections, Breeze, etc., best effort support
  • scio-jdbc: add-on for JDBC IO
  • scio-parquet: add-on for Parquet
  • scio-tensorflow: add-on for TensorFlow TFRecords IO and prediction
  • scio-redis: add-on for Redis
  • scio-smb: add-on for Sort Merge Bucket operations
  • scio-repl: extension of the Scala REPL with Scio specific operations
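A project typically pulls in only the artifacts it needs. The `build.sbt` fragment below is a sketch of how that might look with sbt; the version is left as a placeholder on purpose, and the particular selection of artifacts is only an example.

```scala
// build.sbt (sketch)
// Assumption: replace the placeholder with an actual Scio release listed on Maven Central.
val scioVersion = "<scio-version>"

libraryDependencies ++= Seq(
  // Core API, needed by every Scio job
  "com.spotify" %% "scio-core" % scioVersion,
  // Optional add-on, here the Google Cloud IOs
  "com.spotify" %% "scio-google-cloud-platform" % scioVersion,
  // Test utilities, scoped to the test configuration as noted in the list above
  "com.spotify" %% "scio-test" % scioVersion % Test
)
```

A Beam runner dependency is also needed to execute a job; it is omitted here because its version has to match the Beam version of the chosen Scio release.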

License

Copyright 2021 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
