miraisolutions / sparkbq

License: GPL-3.0
Sparklyr extension package to connect to Google BigQuery

Programming Languages

R

Projects that are alternatives of or similar to sparkbq

mleap
R Interface to MLeap
Stars: ✭ 24 (+50%)
Mutual labels:  sparklyr
sparklygraphs
Old repo for R interface for GraphFrames
Stars: ✭ 13 (-18.75%)
Mutual labels:  sparklyr
logica
Logica is a logic programming language that compiles to StandardSQL and runs on Google BigQuery.
Stars: ✭ 1,469 (+9081.25%)
Mutual labels:  bigquery
tag-manager
Website analytics, JavaScript error tracking + analytics, tag manager, data ingest endpoint creation (tracking pixels). GDPR + CCPA compliant.
Stars: ✭ 279 (+1643.75%)
Mutual labels:  bigquery
dekart
GIS Visualisation for Amazon Athena and BigQuery
Stars: ✭ 131 (+718.75%)
Mutual labels:  bigquery
hive-bigquery-storage-handler
Hive Storage Handler for interoperability between BigQuery and Apache Hive
Stars: ✭ 16 (+0%)
Mutual labels:  bigquery
hive_compared_bq
hive_compared_bq compares/validates 2 (SQL like) tables, and graphically shows the rows/columns that are different.
Stars: ✭ 27 (+68.75%)
Mutual labels:  bigquery
DataflowTemplates
Convenient Dataflow pipelines for transforming data between cloud data sources
Stars: ✭ 22 (+37.5%)
Mutual labels:  bigquery
spark.sas7bdat
Read in SAS data in parallel into Apache Spark
Stars: ✭ 25 (+56.25%)
Mutual labels:  sparklyr
objectiv-analytics
Powerful product analytics for data teams, with full control over data & models.
Stars: ✭ 399 (+2393.75%)
Mutual labels:  bigquery
bigquery-data-lineage
Reference implementation for real-time Data Lineage tracking for BigQuery using Audit Logs, ZetaSQL and Dataflow.
Stars: ✭ 112 (+600%)
Mutual labels:  bigquery
scalikejdbc-bigquery
ScalikeJDBC extension for Google BigQuery
Stars: ✭ 18 (+12.5%)
Mutual labels:  bigquery
bigflow
A Python framework for data processing on GCP.
Stars: ✭ 96 (+500%)
Mutual labels:  bigquery
firestore-to-bigquery-export
NPM package for copying and converting Cloud Firestore data to BigQuery.
Stars: ✭ 26 (+62.5%)
Mutual labels:  bigquery
amplitude-bigquery
Export your events from Amplitude to Google BigQuery/Google Cloud Storage
Stars: ✭ 28 (+75%)
Mutual labels:  bigquery
managed_ml_systems_and_iot
Managed Machine Learning Systems and Internet of Things Live Lesson
Stars: ✭ 35 (+118.75%)
Mutual labels:  bigquery
spark-on-k8s-gcp-examples
Example Spark applications that run on Kubernetes and access GCP products, e.g., GCS, BigQuery, and Cloud PubSub
Stars: ✭ 36 (+125%)
Mutual labels:  bigquery
graphframes
R Interface for GraphFrames
Stars: ✭ 36 (+125%)
Mutual labels:  sparklyr
starlake
Starlake is a Spark Based On Premise and Cloud ELT/ETL Framework for Batch & Stream Processing
Stars: ✭ 16 (+0%)
Mutual labels:  bigquery
polygon-etl
ETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub
Stars: ✭ 53 (+231.25%)
Mutual labels:  bigquery

sparkbq: Google BigQuery Support for sparklyr

sparkbq is a sparklyr extension package providing integration with Google BigQuery. It builds on top of spark-bigquery, which provides a Google BigQuery data source for Apache Spark.

Version Information

You can install the released version of sparkbq from CRAN via

install.packages("sparkbq")

or the latest development version through

devtools::install_github("miraisolutions/sparkbq", ref = "develop")

The following table provides an overview of the supported versions of Apache Spark, Scala, and Google Dataproc:

sparkbq  spark-bigquery  Apache Spark     Scala  Google Dataproc
0.1.x    0.1.0           2.2.x and 2.3.x  2.11   1.2.x and 1.3.x
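
To run locally, install and connect against a supported Spark version via sparklyr. A minimal sketch, assuming a local Spark 2.3.x installation (the exact patch version is illustrative):

library(sparklyr)

# Install a Spark version supported by sparkbq 0.1.x (see the table above)
spark_install(version = "2.3.2")

# Connect to that specific local Spark installation
sc <- spark_connect(master = "local[*]", version = "2.3.2")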

sparkbq is based on the Spark package spark-bigquery, which is available in a separate GitHub repository.

Example Usage

library(sparklyr)
library(sparkbq)
library(dplyr)

# Use the default Spark configuration and connect to a local Spark instance
config <- spark_config()

sc <- spark_connect(master = "local[*]", config = config)

# Set Google BigQuery default settings
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "direct"
)

# Reading the public shakespeare data table
# https://cloud.google.com/bigquery/public-data/
# https://cloud.google.com/bigquery/sample-tables
hamlet <- 
  spark_read_bigquery(
    sc,
    name = "hamlet",
    projectId = "bigquery-public-data",
    datasetId = "samples",
    tableId = "shakespeare") %>%
  filter(corpus == "hamlet") # NOTE: predicate pushdown to BigQuery!
  
# Retrieve results into a local tibble
hamlet %>% collect()

# Write result into "mysamples" dataset in our BigQuery (billing) project
spark_write_bigquery(
  hamlet,
  datasetId = "mysamples",
  tableId = "hamlet",
  mode = "overwrite")
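
When finished, close the Spark connection as usual with sparklyr:

spark_disconnect(sc)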

Authentication

When running outside of Google Cloud, you need to specify a service account JSON key file. The key file can be passed as the parameter serviceAccountKeyFile to bigquery_defaults, or directly to spark_read_bigquery and spark_write_bigquery.
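
For example, to pass the key file directly when reading (the path is a placeholder):

spark_read_bigquery(
  sc,
  name = "shakespeare",
  projectId = "bigquery-public-data",
  datasetId = "samples",
  tableId = "shakespeare",
  serviceAccountKeyFile = "/path/to/your/service_account_keyfile.json"
)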

Alternatively, you can set the environment variable GOOGLE_APPLICATION_CREDENTIALS, e.g. export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service_account_keyfile.json (see https://cloud.google.com/docs/authentication/getting-started for more information). Make sure the variable is set before starting the R session.
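
One way to ensure the variable is set before the R session starts is to define it in ~/.Renviron, which R reads at startup (the path is a placeholder):

# in ~/.Renviron
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service_account_keyfile.json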

When running on Google Cloud, e.g. on Google Cloud Dataproc, application default credentials (ADC) may be used, in which case it is not necessary to specify a service account key file.

Further Information

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].