
GoogleCloudPlatform / DataflowTemplates

License: Apache-2.0
Google-provided Cloud Dataflow template pipelines for solving simple in-Cloud data tasks

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives to or similar to DataflowTemplates

go-bqloader
bqloader is a simple ETL framework to load data from Cloud Storage into BigQuery.
Stars: ✭ 16 (-97.35%)
Mutual labels:  bigquery, google-cloud-storage
Dataflow Tutorial
Cloud Dataflow Tutorial for Beginners
Stars: ✭ 17 (-97.18%)
Mutual labels:  google-cloud-storage, bigquery
benji
📁 This library is a Scala reactive DSL for object storage (e.g. S3/Amazon, S3/CEPH, Google Cloud Storage).
Stars: ✭ 18 (-97.01%)
Mutual labels:  google-cloud-storage
Bigquery Python
Simple Python client for interacting with Google BigQuery.
Stars: ✭ 397 (-34.16%)
Mutual labels:  bigquery
Wal E
Continuous Archiving for Postgres
Stars: ✭ 3,313 (+449.42%)
Mutual labels:  google-cloud-storage
Nodejs Bigquery
Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.
Stars: ✭ 268 (-55.56%)
Mutual labels:  bigquery
Goofys
a high-performance, POSIX-ish Amazon S3 file system written in Go
Stars: ✭ 3,932 (+552.07%)
Mutual labels:  google-cloud-storage
firehose
Firehose is an extensible, no-code, and cloud-native service to load real-time streaming data from Kafka to data stores, data lakes, and analytical storage systems.
Stars: ✭ 213 (-64.68%)
Mutual labels:  bigquery
Graphql Engine
Blazing fast, instant realtime GraphQL APIs on your DB with fine grained access control, also trigger webhooks on database events.
Stars: ✭ 24,845 (+4020.23%)
Mutual labels:  bigquery
Almanac.httparchive.org
HTTP Archive's annual "State of the Web" report made by the web community
Stars: ✭ 310 (-48.59%)
Mutual labels:  bigquery
Franchise
🍟 a notebook sql client. what you get when you have a lot of sequels.
Stars: ✭ 3,823 (+534%)
Mutual labels:  bigquery
Pypinfo
Easily view PyPI download statistics via Google's BigQuery.
Stars: ✭ 295 (-51.08%)
Mutual labels:  bigquery
Flydrive
☁️ Flexible and Fluent framework-agnostic driver based system to manage storage in Node.js
Stars: ✭ 275 (-54.39%)
Mutual labels:  google-cloud-storage
Sqlpad
Web-based SQL editor run in your own private cloud. Supports MySQL, Postgres, SQL Server, Vertica, Crate, ClickHouse, Trino, Presto, SAP HANA, Cassandra, Snowflake, BigQuery, SQLite, and more with ODBC
Stars: ✭ 4,113 (+582.09%)
Mutual labels:  bigquery
snowplow-bigquery-loader
Loads Snowplow enriched events into Google BigQuery
Stars: ✭ 15 (-97.51%)
Mutual labels:  bigquery
Laravel Google Cloud Storage
A Google Cloud Storage filesystem for Laravel
Stars: ✭ 415 (-31.18%)
Mutual labels:  google-cloud-storage
storage
Go package for abstracting local, in-memory, and remote (Google Cloud Storage/S3) filesystems
Stars: ✭ 49 (-91.87%)
Mutual labels:  google-cloud-storage
Issue Label Bot
Code For The Issue Label Bot, an App that automatically labels issues using machine learning, available on the GitHub Marketplace. This is also code for the blog article: "How to automate tasks on GitHub with machine learning for fun and profit"
Stars: ✭ 292 (-51.58%)
Mutual labels:  bigquery
Bigquery Utils
Useful scripts, udfs, views, and other utilities for migration and data warehouse operations in BigQuery.
Stars: ✭ 338 (-43.95%)
Mutual labels:  bigquery
Kopia
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
Stars: ✭ 507 (-15.92%)
Mutual labels:  google-cloud-storage

Google Cloud Dataflow Template Pipelines

These Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines.

Google is providing this collection of pre-implemented Dataflow templates as a reference and to provide easy customization for developers wanting to extend their functionality.


Template Pipelines

[Table of available template pipelines; templates marked with * support user-defined functions (UDFs).]

For documentation on each template's usage and parameters, please see the official docs.

Getting Started

Requirements

  • Java 8
  • Maven 3

Building the Project

Build the entire project using the Maven compile command.

mvn clean compile

Creating a Template File

Dataflow templates can be created using a Maven command which builds the project and stages the template file on Google Cloud Storage. Any parameters passed at template build time cannot be overridden at execution time.

mvn compile exec:java \
    -Dexec.mainClass=com.google.cloud.teleport.templates.<template-class> \
    -Dexec.cleanupDaemonThreads=false \
    -Dexec.args=" \
    --project=<project-id> \
    --stagingLocation=gs://<bucket-name>/staging \
    --tempLocation=gs://<bucket-name>/temp \
    --templateLocation=gs://<bucket-name>/templates/<template-name>.json \
    --runner=DataflowRunner"
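
For example, staging the Pub/Sub to BigQuery template from this repository might look like the following sketch (my-project and my-bucket are placeholder values):

mvn compile exec:java \
    -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery \
    -Dexec.cleanupDaemonThreads=false \
    -Dexec.args=" \
    --project=my-project \
    --stagingLocation=gs://my-bucket/staging \
    --tempLocation=gs://my-bucket/temp \
    --templateLocation=gs://my-bucket/templates/PubSubToBigQuery.json \
    --runner=DataflowRunner"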

Executing a Template File

Once the template is staged on Google Cloud Storage, it can be executed using the gcloud CLI. The runtime parameters required by the template are passed in the parameters field as a comma-separated list of paramName=paramValue pairs.

gcloud dataflow jobs run <job-name> \
    --gcs-location=<template-location> \
    --zone=<zone> \
    --parameters <parameters>
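
As a concrete sketch, a staged Pub/Sub to BigQuery template could be launched as follows, assuming the inputTopic and outputTableSpec parameters that template expects (all resource names below are placeholders):

gcloud dataflow jobs run pubsub-to-bq-example \
    --gcs-location=gs://my-bucket/templates/PubSubToBigQuery.json \
    --zone=us-central1-f \
    --parameters inputTopic=projects/my-project/topics/my-topic,outputTableSpec=my-project:my_dataset.my_table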

Using UDFs

User-defined functions (UDFs) allow you to customize a template's functionality by providing a short JavaScript function, without having to maintain the entire codebase. This is useful in situations where you'd like to rename fields, filter values, or even transform data formats before output to the destination. All UDFs are executed by providing the payload of the element as a string to the JavaScript function. You can then use JavaScript's built-in JSON parser or other system functions to transform the data prior to the pipeline's output. The return statement of a UDF specifies the payload to pass forward in the pipeline and should always be a string value. If no value is returned or the function returns undefined, the incoming record will be filtered from the output.

UDF Function Specification

Template              | UDF Input Type | Input Description                                | UDF Output Type | Output Description
Datastore Bulk Delete | String         | A JSON string of the entity                      | String          | A JSON string of the entity to delete; filter entities by returning undefined
Datastore to Pub/Sub  | String         | A JSON string of the entity                      | String          | The payload to publish to Pub/Sub
Datastore to GCS Text | String         | A JSON string of the entity                      | String          | A single line within the output file
GCS Text to BigQuery  | String         | A single line within the input file              | String          | A JSON string which matches the destination table's schema
Pub/Sub to BigQuery   | String         | A string representation of the incoming payload  | String          | A JSON string which matches the destination table's schema
Pub/Sub to Datastore  | String         | A string representation of the incoming payload  | String          | A JSON string of the entity to write to Datastore
Pub/Sub to Splunk     | String         | A string representation of the incoming payload  | String          | The event data to be sent to the Splunk HEC events endpoint; must be a string or a stringified JSON object
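
To attach a UDF to a job, the JavaScript file is first staged on Google Cloud Storage and then referenced at execution time; for the templates in this repository this is typically done through the javascriptTextTransformGcsPath and javascriptTextTransformFunctionName parameters. A sketch with placeholder bucket and job names:

# Stage the UDF, then reference it when running the template.
gsutil cp transform.js gs://my-bucket/udfs/transform.js

gcloud dataflow jobs run pubsub-to-bq-with-udf \
    --gcs-location=gs://my-bucket/templates/PubSubToBigQuery.json \
    --zone=us-central1-f \
    --parameters inputTopic=projects/my-project/topics/my-topic,outputTableSpec=my-project:my_dataset.my_table,javascriptTextTransformGcsPath=gs://my-bucket/udfs/transform.js,javascriptTextTransformFunctionName=transform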

UDF Examples

Adding fields

/**
 * A transform which adds a field to the incoming data.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  obj.dataFeed = "Real-time Transactions";
  obj.dataSource = "POS";
  return JSON.stringify(obj);
}

Filtering records

/**
 * A transform function which only accepts 42 as the answer to life.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  // only output objects which have an answer to life of 42.
  if (obj.hasOwnProperty('answerToLife') && obj.answerToLife === 42) {
    return JSON.stringify(obj);
  }
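  // No return statement beyond this point: records that fail the check
  // implicitly return undefined and are filtered from the output.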
}