googlegenomics / gcp-variant-transforms

License: Apache-2.0

Programming language: Python

Projects that are alternatives to or similar to GCP Variant Transforms

Scio
A Scala API for Apache Beam and Google Cloud Dataflow.
Stars: ✭ 2,247 (+2147%)
Mutual labels:  dataflow, bigquery, beam
bigflow
A Python framework for data processing on GCP.
Stars: ✭ 96 (-4%)
Mutual labels:  bigquery, beam, dataflow
DataflowTemplates
Convenient Dataflow pipelines for transforming data between cloud data sources
Stars: ✭ 22 (-78%)
Mutual labels:  bigquery, dataflow
bigquery-data-lineage
Reference implementation for real-time Data Lineage tracking for BigQuery using Audit Logs, ZetaSQL and Dataflow.
Stars: ✭ 112 (+12%)
Mutual labels:  bigquery, dataflow
bigquery-to-datastore
Export a whole BigQuery table to Google Datastore with Apache Beam/Google Dataflow
Stars: ✭ 56 (-44%)
Mutual labels:  bigquery, beam
Fgbase
Ready-send coordination layer on top of goroutines.
Stars: ✭ 45 (-55%)
Mutual labels:  dataflow
Vue Dataflow Editor
Vue 2 dataflow graph editor
Stars: ✭ 73 (-27%)
Mutual labels:  dataflow
Datashare Toolkit
DIY commercial datasets on Google Cloud Platform
Stars: ✭ 41 (-59%)
Mutual labels:  bigquery
Gvpm
Gradient-domain Volumetric Photon Density Estimation, SIGGRAPH 2018
Stars: ✭ 32 (-68%)
Mutual labels:  beam
Embulk Output Bigquery
Embulk output plugin to load/insert data into Google BigQuery
Stars: ✭ 99 (-1%)
Mutual labels:  bigquery
Ethereum Etl Airflow
Airflow DAGs for exporting, loading, and parsing Ethereum blockchain data. Which datasets would you like added to Ethereum ETL? Vote here: https://blockchain-etl.convas.io.
Stars: ✭ 89 (-11%)
Mutual labels:  bigquery
Linq To Bigquery
LINQ to BigQuery is a C# LINQ provider for Google BigQuery. It also enables a desktop GUI client via LINQPad and a plug-in driver.
Stars: ✭ 69 (-31%)
Mutual labels:  bigquery
Ddlparse
Parse DDL and convert it to BigQuery JSON schema and DDL statements
Stars: ✭ 52 (-48%)
Mutual labels:  bigquery
Google Cloud Eclipse
Google Cloud Platform plugin for Eclipse
Stars: ✭ 75 (-25%)
Mutual labels:  dataflow
Arcon
Runtime for writing streaming applications in Rust.
Stars: ✭ 44 (-56%)
Mutual labels:  dataflow
Magnolify
A collection of Magnolia add-on modules
Stars: ✭ 81 (-19%)
Mutual labels:  bigquery
Maestro
An analytical cost model evaluating DNN mappings (dataflows and tiling).
Stars: ✭ 35 (-65%)
Mutual labels:  dataflow
Sql Runner
Run templatable playbooks of SQL scripts in series and parallel on Redshift, PostgreSQL, BigQuery and Snowflake
Stars: ✭ 68 (-32%)
Mutual labels:  bigquery
Goflow
Flow-based and dataflow programming library for Go (golang)
Stars: ✭ 1,276 (+1176%)
Mutual labels:  dataflow
Toubkal
Fully reactive programming for Node.js and the browser
Stars: ✭ 67 (-33%)
Mutual labels:  dataflow

GCP Variant Transforms

Overview

This is a tool for transforming and processing VCF files in a scalable manner. It is built on Apache Beam and runs on Dataflow on Google Cloud Platform.

It can be used to load VCF files directly into BigQuery, and it scales to hundreds of thousands of files, millions of samples, and billions of records. Additionally, it provides a preprocessing step that validates VCF files so that inconsistencies can be identified easily.

Please see the project documentation for more information.

Prerequisites

  1. Follow the getting started instructions on the Google Cloud page.
  2. Enable the Genomics, Compute Engine, Cloud Storage, and Dataflow APIs.
  3. Create a new BigQuery dataset by visiting the BigQuery web UI, clicking the down arrow icon next to your project name in the navigation, and clicking Create new dataset. (A command-line alternative is sketched just after this list.)
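
If you prefer the command line, here is a minimal sketch of steps 2 and 3 using the gcloud and bq CLIs. The API service IDs below are assumptions that can vary between releases; confirm them with gcloud services list --available.

gcloud services enable \
  genomics.googleapis.com \
  compute.googleapis.com \
  storage.googleapis.com \
  dataflow.googleapis.com

bq mk --dataset GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET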

Loading VCF files to BigQuery

Using Docker

The easiest way to run the VCF to BigQuery pipeline is to use the docker image, as it has the binaries and all dependencies pre-installed. Please ensure you have the latest gcloud tool by running gcloud components update.

Use the following command to get the latest version of Variant Transforms.

docker pull gcr.io/cloud-lifesciences/gcp-variant-transforms

Run the script below and replace the following parameters:

  • Dataflow's required inputs:
    • GOOGLE_CLOUD_PROJECT: This is your project ID that contains the BigQuery dataset.
    • GOOGLE_CLOUD_REGION: You must choose a geographic region for Cloud Dataflow to process your data, for example: us-west1. For more information please refer to Setting Regions.
    • GOOGLE_CLOUD_LOCATION: You may choose a geographic location for the Cloud Life Sciences API to orchestrate the job from. This is not where the data will be processed, but where some operation metadata will be stored. This can be the same as or different from the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in us-central1. See the list of Currently Available Locations.
    • TEMP_LOCATION: This can be any folder in Google Cloud Storage that your project has write access to. It's used to store temporary files and logs from the pipeline.
  • INPUT_PATTERN: A location in Google Cloud Storage where the VCF files are stored. You may specify a single file or provide a pattern to load multiple files at once. Please refer to the Variant Merging documentation if you want to merge samples across files. The pipeline supports gzip, bzip, and uncompressed VCF formats. However, it runs slower for compressed files as they cannot be sharded.
  • OUTPUT_TABLE: The full path to a BigQuery table to store the output.
#!/bin/bash
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
GOOGLE_CLOUD_LOCATION=GOOGLE_CLOUD_LOCATION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_PATTERN=gs://BUCKET/*.vcf
OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE

COMMAND="vcf_to_bq \
  --input_pattern ${INPUT_PATTERN} \
  --output_table ${OUTPUT_TABLE} \
  --job_name vcf-to-bigquery \
  --runner DataflowRunner"

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --location "${GOOGLE_CLOUD_LOCATION}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

--project, --region, and --temp_location are required inputs. You must set all of them, unless your project and region default values are set in your local gcloud configuration. You may set the default project and region using the following commands:

gcloud config set project GOOGLE_CLOUD_PROJECT
gcloud config set compute/region REGION
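
You can confirm the active defaults with the standard gcloud command:

gcloud config list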

The underlying pipeline uses Cloud Dataflow. You can navigate to the Dataflow Console to see a more detailed view of the pipeline (e.g. the number of records being processed, the number of workers, and more detailed error logs).
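
As a quick alternative to the console, and assuming a reasonably recent gcloud release, you can also inspect jobs from the command line (JOB_ID is the identifier printed when the pipeline starts, or shown by the list command):

gcloud dataflow jobs list --region "${GOOGLE_CLOUD_REGION}" --status active
gcloud dataflow jobs show JOB_ID --region "${GOOGLE_CLOUD_REGION}"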

Running from GitHub

In addition to using the docker image, you may run the pipeline directly from source. First install git, python, pip, and virtualenv:

sudo apt-get install -y git python3-pip python3-venv python3.7-venv python-dev build-essential

Note that Python 3.8 is not yet supported, so ensure you are using Python 3.7.
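
A quick sanity check before creating the virtual environment, assuming python3 points at the interpreter you intend to use:

python3 --version  # should report Python 3.7.x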

Create and activate a virtual environment, clone the repo, and install the pip packages:

python3 -m venv venv3
source venv3/bin/activate
git clone https://github.com/googlegenomics/gcp-variant-transforms.git
cd gcp-variant-transforms
python -m pip install --upgrade pip
python -m pip install --upgrade wheel
python -m pip install --upgrade .
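
As a quick smoke test that the installation succeeded (assuming the module exposes the usual argparse --help):

python -m gcp_variant_transforms.vcf_to_bq --help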

You may use the DirectRunner (aka local runner) for small files (e.g. around 10,000 records) or the DataflowRunner for larger files. Files should be stored on Google Cloud Storage if using Dataflow, but may be stored locally for the DirectRunner.

Example command for DirectRunner:

python -m gcp_variant_transforms.vcf_to_bq \
  --input_pattern gcp_variant_transforms/testing/data/vcf/valid-4.0.vcf \
  --output_table GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE \
  --job_name vcf-to-bigquery-direct-runner \
  --temp_location "${TEMP_LOCATION}"

Example command for DataflowRunner:

python -m gcp_variant_transforms.vcf_to_bq \
  --input_pattern gs://BUCKET/*.vcf \
  --output_table GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE \
  --job_name vcf-to-bigquery \
  --setup_file ./setup.py \
  --runner DataflowRunner \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}"

Running VCF files preprocessor

The VCF files preprocessor validates datasets so that inconsistencies can be identified easily. It can be used as a standalone validator to check the validity of VCF files, or as a helper tool for the VCF to BigQuery pipeline. Please refer to the VCF files preprocessor documentation for more details.
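
As a sketch, the preprocessor is invoked like vcf_to_bq above; the module name and flags below follow the preprocessor documentation, but treat them as assumptions and confirm them against the docs:

python -m gcp_variant_transforms.vcf_to_bq_preprocess \
  --input_pattern gs://BUCKET/*.vcf \
  --report_path gs://BUCKET/report.tsv \
  --report_all_conflicts true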

Running BigQuery to VCF

The BigQuery to VCF pipeline exports variants stored in BigQuery to a single VCF file. Please refer to the BigQuery to VCF pipeline documentation for more details.
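
As a sketch, mirroring the vcf_to_bq commands above (the module name and flags are taken from the pipeline's documentation; treat them as assumptions and verify before use):

python -m gcp_variant_transforms.bq_to_vcf \
  --input_table GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE \
  --output_file gs://BUCKET/output.vcf \
  --job_name bq-to-vcf \
  --temp_location "${TEMP_LOCATION}"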

Additional topics and development documentation are available in the project's GitHub repository.
