Mercari Dataflow Template

The Mercari Dataflow Template enables you to run various pipelines without writing any code, simply by defining a configuration file.

The Mercari Dataflow Template is implemented as a FlexTemplate for Cloud Dataflow. Pipelines are assembled from the defined configuration file and executed as Cloud Dataflow jobs.

See the documentation for usage details.

Usage Example

Write the following JSON file and upload it to GCS (suppose you upload it to gs://example/config.json).

This configuration file stores the results of a BigQuery query in the specified Spanner table.

{
  "sources": [
    {
      "name": "bigquery",
      "module": "bigquery",
      "parameters": {
        "query": "SELECT * FROM `myproject.mydataset.mytable`"
      }
    }
  ],
  "sinks": [
    {
      "name": "spanner",
      "module": "spanner",
      "input": "bigquery",
      "parameters": {
        "projectId": "myproject",
        "instanceId": "myinstance",
        "databaseId": "mydatabase",
        "table": "mytable"
      }
    }
  ]
}

Assuming you have deployed the Mercari Dataflow Template to gs://example/template, run the following command.

gcloud dataflow flex-template run bigquery-to-spanner \
  --template-file-gcs-location=gs://example/template \
  --parameters=config=gs://example/config.json

The Dataflow job will start, and you can check its execution status in the Cloud Console.
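
You can also check the job status from the command line; a minimal sketch using standard gcloud commands (the region value is an example, use the region the job runs in):

# List Dataflow jobs and their current states
gcloud dataflow jobs list --region=us-central1

# Show details for a specific job (replace {job_id} with an ID from the list)
gcloud dataflow jobs describe {job_id} --region=us-central1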

Deploy Template

The Mercari Dataflow Template is used as a FlexTemplate, so it must be deployed following the FlexTemplate creation steps.

Requirements

Push the template container image to Container Registry

The first step is to build the source code and register it as a container image in Container Registry.

The following command builds a FlexTemplate container from the source code and uploads it to Container Registry.

mvn clean package -DskipTests -Dimage=gcr.io/{deploy_project}/{template_repo_name}

Upload the template file

The next step is to generate, from the container image, a template file that can be used to launch jobs, and upload it to GCS.

The following command generates the template file and uploads it to GCS.

gcloud dataflow flex-template build gs://{path/to/template_file} \
  --image "gcr.io/{deploy_project}/{template_repo_name}" \
  --sdk-language "JAVA"

Run a Dataflow job from the template file

You can run a Dataflow job from the template file either with the gcloud command or via the REST API.

  • gcloud command

You can run the template by specifying the GCS path of the uploaded config file.

gsutil cp config.json gs://{path/to/config.json}

gcloud dataflow flex-template run {job_name} \
  --template-file-gcs-location=gs://{path/to/template_file} \
  --parameters=config=gs://{path/to/config.json}

  • REST API

You can also run the template via the REST API.

PROJECT_ID=[PROJECT_ID]
REGION=[REGION]
CONFIG="$(cat examples/xxx.json)"

curl -X POST -H "Content-Type: application/json"  -H "Authorization: Bearer $(gcloud auth print-access-token)" "https://dataflow.googleapis.com/v1b3/projects/${PROJECT_ID}/locations/${REGION}/flexTemplates:launch" -d "{
  'launchParameter': {
    'jobName': 'myJobName',
    'containerSpecGcsPath': 'gs://{path/to/template_file}',
    'parameters': {
      'config': '$(echo "$CONFIG")',
      'stagingLocation': 'gs://{path/to/staging}'
    },
    'environment': {
      'tempLocation': 'gs://{path/to/temp}'
    }
  }
}"

(The tempLocation and stagingLocation options are optional. If not specified, a bucket named dataflow-staging-{region}-{project_no} will be created automatically and used.)
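
If you launch with gcloud rather than the REST API, the same locations can be passed on the command line; a sketch, assuming your gcloud version supports the --temp-location and --staging-location flags for flex-template run:

gcloud dataflow flex-template run {job_name} \
  --template-file-gcs-location=gs://{path/to/template_file} \
  --parameters=config=gs://{path/to/config.json} \
  --temp-location=gs://{path/to/temp} \
  --staging-location=gs://{path/to/staging}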

Run Template in streaming mode

To run the template in streaming mode, specify streaming=true in the parameters.

gcloud dataflow flex-template run {job_name} \
  --template-file-gcs-location=gs://{path/to/template_file} \
  --parameters=config=gs://{path/to/config.json} \
  --parameters=streaming=true
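
As an illustration of what a streaming configuration might contain, the sketch below reads from Cloud Pub/Sub and writes to BigQuery. The module and parameter names here are assumptions for illustration only; consult the module documentation for the actual schema.

# Hypothetical streaming config: Pub/Sub source -> BigQuery sink
# (module and parameter names are assumptions; check the module docs)
cat > streaming_config.json <<'EOF'
{
  "sources": [
    {
      "name": "input",
      "module": "pubsub",
      "parameters": {
        "subscription": "projects/myproject/subscriptions/mysubscription"
      }
    }
  ],
  "sinks": [
    {
      "name": "output",
      "module": "bigquery",
      "input": "input",
      "parameters": {
        "table": "myproject.mydataset.mytable"
      }
    }
  ]
}
EOF

gsutil cp streaming_config.json gs://{path/to/config.json}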

Deploy Docker image for local pipeline

You can also run pipelines locally. This is useful when you want to process small amounts of data quickly.

First, build and register the container image for local execution.

# Generate MDT jar file.
mvn clean package -DskipTests -Dimage=gcr.io/{deploy_project}/{template_repo_name}

# Create Docker image for local run
docker build --tag=gcr.io/{deploy_project}/{repo_name_local} .

# If you need to push the image to GCR,
# you can do so with the following commands
gcloud auth configure-docker
docker push gcr.io/{deploy_project}/{repo_name_local}

Run Pipeline locally

For local execution, run the following command to grant the necessary permissions:

gcloud auth application-default login

The following is an example of running the pipeline locally. The gcloud configuration directory (which contains the Application Default Credentials created above) and the directory holding the config file are mounted so the container can access them. The other arguments (such as project and config) are the same as for normal execution.

If you want to run in streaming mode, specify streaming=true as an argument, just as in normal execution.

Mac OS

docker run \
  -v ~/.config/gcloud:/mnt/gcloud:ro \
  -v /{your_work_dir}:/mnt/config:ro \
  --rm gcr.io/{deploy_project}/{repo_name_local} \
  --project={project} \
  --config=/mnt/config/{my_config}.json

Windows OS

docker run ^
  -v C:\Users\{YourUserName}\AppData\Roaming\gcloud:/mnt/gcloud:ro ^
  -v C:\Users\{YourWorkingDirPath}\:/mnt/config:ro ^
  --rm gcr.io/{deploy_project}/{repo_name_local} ^
  --project={project} ^
  --config=/mnt/config/{MyConfig}.json

Note: If you use the BigQuery module locally, you will need to specify the tempLocation argument (see the sketch below).
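
For example, a local run that uses the BigQuery module might look like the following sketch, based on the Mac OS command above (the temp bucket path is a placeholder):

docker run \
  -v ~/.config/gcloud:/mnt/gcloud:ro \
  -v /{your_work_dir}:/mnt/config:ro \
  --rm gcr.io/{deploy_project}/{repo_name_local} \
  --project={project} \
  --config=/mnt/config/{my_config}.json \
  --tempLocation=gs://{path/to/temp}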

Committers

Contribution

Please read the CLA carefully before submitting your contribution to Mercari. Under any circumstances, by submitting your contribution, you are deemed to accept and agree to be bound by the terms and conditions of the CLA.

https://www.mercari.com/cla/

License

Copyright 2022 Mercari, Inc.

Licensed under the MIT License.
