yu-iskw / bigquery-to-datastore

Licence: other
Export a whole BigQuery table to Google Datastore with Apache Beam/Google Dataflow

bigquery-to-datastore

This tool exports a BigQuery table to a Google Datastore kind using Apache Beam on top of Google Dataflow.

The input table must not contain duplicated rows whose key values are the same, because Apache Beam's DatastoreIO doesn't allow us to write the same key more than once at a time.
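Since duplicated key values are a problem for DatastoreIO, it can help to check a sample of key values before running the export. The following is a minimal sketch of such a check, not code from bigquery-to-datastore: the class name DuplicateKeyCheck and the method findDuplicateKeys are hypothetical, and the key values are assumed to have been read into a list of strings already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of bigquery-to-datastore.
final class DuplicateKeyCheck {

  // Returns the key values that occur more than once, in first-seen order.
  static List<String> findDuplicateKeys(List<String> keyValues) {
    Set<String> seen = new LinkedHashSet<>();
    Set<String> duplicates = new LinkedHashSet<>();
    for (String key : keyValues) {
      // Set.add returns false when the value is already present.
      if (!seen.add(key)) {
        duplicates.add(key);
      }
    }
    return new ArrayList<>(duplicates);
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList("a", "b", "a", "c", "b", "a");
    System.out.println(findDuplicateKeys(keys)); // prints [a, b]
  }
}
```

An empty result means every key value in the sample is unique. For a whole table, the same check is better expressed as a query on BigQuery itself.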

Data Pipeline

Requirements

  • Maven
  • Java 1.8+
  • Google Cloud Platform account

Usage

Required arguments

  • --project: Google Cloud Project
  • --inputBigQueryDataset: Input BigQuery dataset ID
  • --inputBigQueryTable: Input BigQuery table ID
  • --keyColumn: BigQuery column name for a key of Google Datastore kind
  • --outputDatastoreNamespace: Output Google Datastore namespace
  • --outputDatastoreKind: Output Google Datastore kind
  • --tempLocation: The Cloud Storage path to use for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.
  • --gcpTempLocation: A GCS path for storing temporary files in GCP.
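
The required arguments above can be checked before a job is submitted. The sketch below is only an illustration and is not how bigquery-to-datastore itself validates its options: the class RequiredArgsCheck and its methods are hypothetical, and it assumes every option is passed in the --name=value form. The gs:// check is applied to --tempLocation only, because that is the only prefix rule stated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-flight check; not part of bigquery-to-datastore.
final class RequiredArgsCheck {

  // Names of the required arguments listed in this README.
  static final List<String> REQUIRED = Arrays.asList(
      "project", "inputBigQueryDataset", "inputBigQueryTable", "keyColumn",
      "outputDatastoreNamespace", "outputDatastoreKind",
      "tempLocation", "gcpTempLocation");

  // Collects "--name=value" arguments into a map; anything else is skipped.
  static Map<String, String> parse(String[] args) {
    Map<String, String> options = new HashMap<>();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (arg.startsWith("--") && eq > 2) {
        options.put(arg.substring(2, eq), arg.substring(eq + 1));
      }
    }
    return options;
  }

  // Returns one message per problem; an empty list means the check passed.
  static List<String> validate(Map<String, String> options) {
    List<String> problems = new ArrayList<>();
    for (String name : REQUIRED) {
      String value = options.get(name);
      if (value == null || value.isEmpty()) {
        problems.add("missing --" + name);
      }
    }
    String temp = options.get("tempLocation");
    if (temp != null && !temp.isEmpty() && !temp.startsWith("gs://")) {
      problems.add("--tempLocation must start with gs://");
    }
    return problems;
  }

  public static void main(String[] args) {
    for (String problem : validate(parse(args))) {
      System.out.println(problem);
    }
  }
}
```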

Optional arguments

  • --runner: Apache Beam runner.
    • When you don't set this option, it will run on your local machine, not Google Dataflow.
    • e.g. DataflowRunner
  • --parentPaths: Output Google Datastore parent path(s)
    • e.g. Parent1:p1,Parent2:p2 ==> KEY('Parent1', 'p1', 'Parent2', 'p2')
  • --indexedColumns: Indexed columns on Google Datastore.
    • e.g. col1,col2,col3 ==> col1, col2 and col3 are indexed on Google Datastore.
  • --numWorkers: The number of workers when you run it on top of Google Dataflow.
  • --workerMachineType: Google Dataflow worker instance type
    • e.g. n1-standard-1, n1-standard-4
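
To make the --parentPaths and --indexedColumns formats concrete, here is a minimal sketch of how such values could be split into their parts. It is an assumption about the format based only on the examples above, not the parser used by bigquery-to-datastore; the class OptionParsing and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parsers for illustration; not part of bigquery-to-datastore.
final class OptionParsing {

  // "Parent1:p1,Parent2:p2" -> [Parent1, p1, Parent2, p2],
  // the flat kind/name sequence of KEY('Parent1', 'p1', 'Parent2', 'p2').
  static List<String> parseParentPaths(String raw) {
    List<String> path = new ArrayList<>();
    if (raw == null || raw.trim().isEmpty()) {
      return path;
    }
    for (String pair : raw.split(",")) {
      String[] parts = pair.trim().split(":", 2);
      if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
        throw new IllegalArgumentException("Expected Kind:name but got '" + pair + "'");
      }
      path.add(parts[0]);
      path.add(parts[1]);
    }
    return path;
  }

  // "col1,col2,col3" -> [col1, col2, col3]; blank entries are dropped.
  static List<String> parseIndexedColumns(String raw) {
    List<String> columns = new ArrayList<>();
    if (raw == null) {
      return columns;
    }
    for (String column : raw.split(",")) {
      String trimmed = column.trim();
      if (!trimmed.isEmpty()) {
        columns.add(trimmed);
      }
    }
    return columns;
  }

  public static void main(String[] args) {
    System.out.println(parseParentPaths("Parent1:p1,Parent2:p2")); // prints [Parent1, p1, Parent2, p2]
    System.out.println(parseIndexedColumns("col1,col2,col3"));     // prints [col1, col2, col3]
  }
}
```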

Example to run on Google Dataflow

# compile
mvn clean package

# Run bigquery-to-datastore via the compiled JAR file
java -cp $(pwd)/target/bigquery-to-datastore-bundled-0.7.0.jar \
  com.github.yuiskw.beam.BigQuery2Datastore \
  --project=your-gcp-project \
  --runner=DataflowRunner \
  --inputBigQueryDataset=test_dataset \
  --inputBigQueryTable=test_table \
  --outputDatastoreNamespace=test_namespace \
  --outputDatastoreKind=TestKind \
  --parentPaths=Parent1:p1,Parent2:p2 \
  --keyColumn=id \
  --indexedColumns=col1,col2,col3 \
  --tempLocation=gs://test_bucket/test-log/ \
  --gcpTempLocation=gs://test_bucket/test-log/

How to run

How to build and run it with java

# compile
mvn clean package
# or
make package

# run
java -cp $(pwd)/target/bigquery-to-datastore-bundled-0.7.0.jar com.github.yuiskw.beam.BigQuery2Datastore --help
# or
./bin/bigquery-to-datastore --help

How to run it on docker

We also offer Docker images for this project in yuiskw/bigquery-to-datastore - Docker Hub. There are several Docker images, one per supported Apache Beam version.

docker run yuiskw/bigquery-to-datastore:0.7.0-beam-2.16.0 --help

How to install it with homebrew

You can install it with homebrew from yu-iskw/homebrew-bigquery-to-datastore.

# install
brew install yu-iskw/bigquery-to-datastore/bigquery-to-datastore

# show help
./bin/bigquery-to-datastore --help

Type conversions between BigQuery and Google Datastore

The table below describes the type conversions between BigQuery and Google Datastore. Since Datastore unfortunately doesn't have any data type for time, bigquery-to-datastore ignores BigQuery columns whose data type is TIME.

BigQuery     Datastore
---------    ---------
BOOLEAN      bool
INTEGER      int
DOUBLE       double
STRING       string
TIMESTAMP    timestamp
DATE         timestamp
TIME         ignored: Google Datastore doesn't have time type.
RECORD       array
STRUCT       Entity
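
The table above can also be read as a simple lookup. The sketch below encodes exactly the rows of the table and nothing more. It is not the conversion code of bigquery-to-datastore: the class TypeConversionTable and the method datastoreTypeFor are hypothetical, and the returned strings are just the labels from the table, not Datastore API types.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup mirroring the README table; not part of bigquery-to-datastore.
final class TypeConversionTable {

  private static final Map<String, String> MAPPING;

  static {
    Map<String, String> m = new LinkedHashMap<>();
    m.put("BOOLEAN", "bool");
    m.put("INTEGER", "int");
    m.put("DOUBLE", "double");
    m.put("STRING", "string");
    m.put("TIMESTAMP", "timestamp");
    m.put("DATE", "timestamp");
    m.put("RECORD", "array");
    m.put("STRUCT", "Entity");
    // TIME is deliberately absent: such columns are ignored.
    MAPPING = Collections.unmodifiableMap(m);
  }

  // Returns the Datastore type label, or empty for TIME and for any type not in the table.
  static Optional<String> datastoreTypeFor(String bigQueryType) {
    return Optional.ofNullable(MAPPING.get(bigQueryType.toUpperCase(Locale.ROOT)));
  }

  public static void main(String[] args) {
    System.out.println(datastoreTypeFor("DATE"));           // prints Optional[timestamp]
    System.out.println(datastoreTypeFor("TIME").isPresent()); // prints false
  }
}
```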

Note

As you probably know, Google Datastore doesn't have a feature like MySQL's UPDATE. Since DatastoreIO.Write upserts the given input entities, it simply overwrites an entity whether or not it already exists. If the data for one entity comes from multiple separate sources, we have to combine it on BigQuery beforehand.
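
The effect of upserting can be illustrated with an in-memory stand-in. This is only a sketch of the semantics described in this note, using a plain map instead of Datastore. The class UpsertSketch and its methods are hypothetical, and the merge is done in Java purely for illustration, whereas in practice the rows would be combined on BigQuery.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of upsert semantics; it does not talk to Datastore.
final class UpsertSketch {

  // Stand-in for a Datastore kind: key -> properties.
  private final Map<String, Map<String, Object>> entities = new HashMap<>();

  // Upsert replaces the whole entity stored under the key, if any.
  void upsert(String key, Map<String, Object> properties) {
    entities.put(key, new HashMap<>(properties));
  }

  Map<String, Object> get(String key) {
    return entities.get(key);
  }

  // Combines two partial rows for the same key; values in second win on conflict.
  static Map<String, Object> merge(Map<String, Object> first, Map<String, Object> second) {
    Map<String, Object> merged = new HashMap<>(first);
    merged.putAll(second);
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Object> nameRow = new HashMap<>();
    nameRow.put("name", "alice");
    Map<String, Object> ageRow = new HashMap<>();
    ageRow.put("age", 30);

    UpsertSketch store = new UpsertSketch();
    // Writing the partial rows separately: the second write replaces the first.
    store.upsert("id-1", nameRow);
    store.upsert("id-1", ageRow);
    System.out.println(store.get("id-1").containsKey("name")); // prints false

    // Combining the rows first keeps both properties.
    store.upsert("id-1", merge(nameRow, ageRow));
    System.out.println(store.get("id-1").containsKey("name")); // prints true
  }
}
```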

License

Copyright (c) 2017 Yu Ishikawa.