
GoogleCloudDataproc / hive-bigquery-storage-handler

License: Apache-2.0
Hive Storage Handler for interoperability between BigQuery and Apache Hive

Programming Languages

Java
68154 projects - #9 most used programming language
Dockerfile
14818 projects

Projects that are alternatives to or similar to hive-bigquery-storage-handler

hive-jdbc-driver
An alternative to the "hive standalone" jar for connecting Java applications to Apache Hive via JDBC
Stars: ✭ 31 (+93.75%)
Mutual labels:  hive, hadoop, apache
Hive
Apache Hive
Stars: ✭ 4,031 (+25093.75%)
Mutual labels:  hive, hadoop, apache
Hive Jdbc Uber Jar
Hive JDBC "uber" or "standalone" jar based on the latest Apache Hive version
Stars: ✭ 188 (+1075%)
Mutual labels:  hive, hadoop, apache
Datashare Toolkit
DIY commercial datasets on Google Cloud Platform
Stars: ✭ 41 (+156.25%)
Mutual labels:  bigquery, gcp
gcp-ml
Google Cloud Platform Machine Learning Samples
Stars: ✭ 31 (+93.75%)
Mutual labels:  bigquery, gcp
snowplow-bigquery-loader
Loads Snowplow enriched events into Google BigQuery
Stars: ✭ 15 (-6.25%)
Mutual labels:  bigquery, gcp
polygon-etl
ETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub
Stars: ✭ 53 (+231.25%)
Mutual labels:  bigquery, gcp
Bitcoin Etl
ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ
Stars: ✭ 174 (+987.5%)
Mutual labels:  bigquery, gcp
Ethereum Etl Airflow
Airflow DAGs for exporting, loading, and parsing the Ethereum blockchain data. What datasets do you want to be added to Ethereum ETL? Vote here: https://blockchain-etl.convas.io.
Stars: ✭ 89 (+456.25%)
Mutual labels:  bigquery, gcp
Hadoop Connectors
Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.
Stars: ✭ 218 (+1262.5%)
Mutual labels:  bigquery, hadoop
yarn-prometheus-exporter
Export Hadoop YARN (resource-manager) metrics in prometheus format
Stars: ✭ 44 (+175%)
Mutual labels:  hadoop, apache
argon
Campaign Manager 360 and Display & Video 360 Reports to BigQuery connector
Stars: ✭ 31 (+93.75%)
Mutual labels:  bigquery, gcp
etlflow
EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for writing various different tasks, jobs on GCP and AWS.
Stars: ✭ 38 (+137.5%)
Mutual labels:  bigquery, gcp
Ethereum Etl
Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
Stars: ✭ 956 (+5875%)
Mutual labels:  bigquery, gcp
iris3
An upgraded and improved version of the Iris automatic GCP-labeling project
Stars: ✭ 38 (+137.5%)
Mutual labels:  bigquery, gcp
Cube.js
📊 Cube — Open-Source Analytics API for Building Data Apps
Stars: ✭ 11,983 (+74793.75%)
Mutual labels:  bigquery, hive
kafka-connect-fs
Kafka Connect FileSystem Connector
Stars: ✭ 107 (+568.75%)
Mutual labels:  hadoop, gcp
dpkb
A collection of big-data resources, covering distributed storage engines, distributed compute engines, data warehouse construction, and more. Keywords: Hadoop, HBase, ES, Kudu, Hive, Presto, Spark, Flink, Kylin, ClickHouse
Stars: ✭ 123 (+668.75%)
Mutual labels:  hive, hadoop
hive to es
A small tool for syncing Hive data warehouse data to Elasticsearch
Stars: ✭ 21 (+31.25%)
Mutual labels:  hive, hadoop
Facebook Hive Udfs
Facebook's Hive UDFs
Stars: ✭ 213 (+1231.25%)
Mutual labels:  hive, hadoop

Hive-BigQuery StorageHandler [No Longer Maintained]

This is a Hive StorageHandler plugin that enables Hive to interact with BigQuery. It allows you to keep your existing pipelines while moving your data to BigQuery. It uses the high-throughput BigQuery Storage API to read data and the BigQuery API to write data.

The following steps are performed on a Dataproc cluster in Google Cloud Platform. If you need to run them on your own cluster, you will need to set up the Google Cloud SDK and the Google Cloud Storage connector for Hadoop.

Getting the StorageHandler

  1. Check it out from GitHub.
  2. Build it with the new Google Hadoop BigQuery connector:
git clone https://github.com/GoogleCloudPlatform/hive-bigquery-storage-handler
cd hive-bigquery-storage-handler
mvn clean install
  3. Deploy hive-bigquery-storage-handler-1.0-shaded.jar.

Using the StorageHandler to access BigQuery

  1. Enable the BigQuery Storage API. Follow these instructions and check the pricing details.

  2. Copy the compiled jar to a Google Cloud Storage bucket that can be accessed by your Hive cluster.

  3. Open the Hive CLI and load the jar as shown below:

hive> add jar gs://<Jar location>/hive-bigquery-storage-handler-1.0-shaded.jar;

  4. Verify the jar is loaded successfully:

hive> list jars;

At this point you can operate Hive just as you normally would.

Creating BigQuery tables

If you already have a BigQuery table, here is how you can define a Hive table that refers to it:

CREATE TABLE bq_test (word_count BIGINT, word STRING)
STORED BY 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
TBLPROPERTIES (
  'bq.dataset'='<BigQuery dataset name>',
  'bq.table'='<BigQuery table name>',
  'mapred.bq.project.id'='<Your Project ID>',
  'mapred.bq.temp.gcs.path'='gs://<Bucket name>/<Temporary path>',
  'mapred.bq.gcs.bucket'='<Cloud Storage Bucket name>'
);

You will need to provide the following table properties:

Property                 Value
bq.dataset               BigQuery dataset id (optional if the Hive database name matches the BigQuery dataset name)
bq.table                 BigQuery table name (optional if the Hive table name matches the BigQuery table name)
mapred.bq.project.id     Your project id
mapred.bq.temp.gcs.path  Temporary file location in a GCS bucket
mapred.bq.gcs.bucket     Temporary GCS bucket name
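
Once defined, the table can be queried like any other Hive table. Below is a minimal read sketch against the bq_test table above; writes use plain INSERT statements (see the Issues section for a Tez-related caveat):

-- Reads are streamed via the BigQuery Storage API.
SELECT word, word_count
FROM bq_test
ORDER BY word_count DESC
LIMIT 10;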

Data Type Mapping

BigQuery   Hive       Description
INTEGER    BIGINT     Signed 8-byte integer
FLOAT      DOUBLE     8-byte double-precision floating point number
DATE       DATE       Format is YYYY-[M]M-[D]D; the supported range is 0001-01-01 to 9999-12-31
TIMESTAMP  TIMESTAMP  An absolute point in time since the Unix epoch, with millisecond precision in Hive versus microsecond precision in BigQuery
BOOLEAN    BOOLEAN    Boolean values, represented by the keywords TRUE and FALSE
STRING     STRING     Variable-length character data
BYTES      BINARY     Variable-length binary data
REPEATED   ARRAY      Repeated values
RECORD     STRUCT     Nested structures
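
To make the mapping concrete, here is a hedged DDL sketch (all table, column, project, and bucket names are hypothetical) that annotates each Hive column with the BigQuery type it maps to. A STRUCT column is omitted here because it additionally requires an explicit Avro schema, as described under Issues below:

CREATE TABLE bq_type_demo (
  id      BIGINT,        -- BigQuery INTEGER
  score   DOUBLE,        -- BigQuery FLOAT
  born_on DATE,          -- BigQuery DATE
  seen_at TIMESTAMP,     -- BigQuery TIMESTAMP (microseconds in BigQuery, milliseconds in Hive)
  active  BOOLEAN,       -- BigQuery BOOLEAN
  name    STRING,        -- BigQuery STRING
  raw     BINARY,        -- BigQuery BYTES
  tags    ARRAY<STRING>  -- BigQuery REPEATED STRING
)
STORED BY 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
TBLPROPERTIES (
  'bq.dataset'='my_dataset',
  'bq.table'='bq_type_demo',
  'mapred.bq.project.id'='my-project',
  'mapred.bq.temp.gcs.path'='gs://my-bucket/tmp',
  'mapred.bq.gcs.bucket'='my-bucket'
);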

Filtering

The new API allows column pruning and predicate filtering, so that you only read the data you are interested in.

Column Pruning

Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.
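
For example, a query that touches a single column streams only that column from BigQuery. A sketch against the bq_test table defined earlier:

-- Only the 'word' column is streamed from BigQuery;
-- 'word_count' is pruned and never read.
SELECT word
FROM bq_test
LIMIT 100;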

Predicate Filtering

The Storage API supports arbitrary pushdown of predicate filters. To enable predicate pushdown, ensure that hive.optimize.ppd is set to true.
Filters on all primitive-type columns are pushed down to the storage layer, improving read performance. Predicate pushdown is not supported on complex types such as arrays and structs; for example, a filter like address.city = "Sunnyvale" will not be pushed down to BigQuery.
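
A hedged sketch, again against the bq_test table: with pushdown enabled, the filter on the primitive word column is evaluated by the storage layer rather than in Hive.

SET hive.optimize.ppd=true;

-- Filter on a primitive column: pushed down to BigQuery's storage layer.
SELECT word, word_count
FROM bq_test
WHERE word = 'hamlet';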

Caveats

  1. Ensure that the table exists in BigQuery and that column names are always lowercase.
  2. A timestamp column in Hive is interpreted as timezoneless and stored as an offset from the Unix epoch with millisecond precision.
    To display it in a human-readable format, the from_unixtime UDF can be used:
    from_unixtime(cast(cast(<timestampcolumn> as bigint)/1000 as bigint), 'yyyy-MM-dd hh:mm:ss')
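
Putting the second caveat together, here is a sketch that renders such a timestamp in human-readable form; the bq_events table and its event_time TIMESTAMP column are hypothetical:

-- event_time is stored as milliseconds since the Unix epoch,
-- so divide by 1000 before formatting with from_unixtime.
SELECT from_unixtime(
         cast(cast(event_time as bigint) / 1000 as bigint),
         'yyyy-MM-dd hh:mm:ss'
       ) AS event_time_readable
FROM bq_events;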

Issues

  1. Writing to BigQuery will fail when using Apache Tez as the execution engine. As a workaround, set hive.execution.engine=mr to use MapReduce instead (see the sketch after this list).
  2. The STRUCT type is not supported unless an Avro schema is explicitly specified using either the avro.schema.literal or avro.schema.url table property. The example below covers all supported types, with the schema defined explicitly. Note: if the table does not need a STRUCT column, specifying the schema is optional.
     CREATE TABLE dbname.alltypeswithSchema (
       currenttimestamp TIMESTAMP,
       currentdate DATE,
       userid BIGINT,
       sessionid STRING,
       skills ARRAY<STRING>,
       eventduration DOUBLE,
       eventcount BIGINT,
       is_latest BOOLEAN,
       keyset BINARY,
       addresses ARRAY<STRUCT<status: STRING, street: STRING, city: STRING, state: STRING, zip: BIGINT>>
     )
     STORED BY 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
     TBLPROPERTIES (
       'bq.dataset'='bqdataset',
       'bq.table'='bqtable',
       'mapred.bq.project.id'='bqproject',
       'mapred.bq.temp.gcs.path'='gs://bucketname/prefix',
       'mapred.bq.gcs.bucket'='bucketname',
       'avro.schema.literal'='{
         "type": "record",
         "name": "alltypesnonnull",
         "fields": [
           {"name": "currenttimestamp", "type": ["null", {"type": "long", "logicalType": "timestamp-micros"}], "default": null},
           {"name": "currentdate", "type": {"type": "int", "logicalType": "date"}, "default": -1},
           {"name": "userid", "type": "long", "doc": "User identifier.", "default": -1},
           {"name": "sessionid", "type": ["null", "string"], "default": null},
           {"name": "skills", "type": ["null", {"type": "array", "items": "string"}], "default": null},
           {"name": "eventduration", "type": ["null", "double"], "default": null},
           {"name": "eventcount", "type": ["null", "long"], "default": null},
           {"name": "is_latest", "type": ["null", "boolean"], "default": null},
           {"name": "keyset", "type": ["null", "bytes"], "default": null},
           {"name": "addresses", "type": ["null", {"type": "array", "items": {
             "type": "record", "name": "__s_0", "fields": [
               {"name": "status", "type": "string"},
               {"name": "street", "type": "string"},
               {"name": "city", "type": "string"},
               {"name": "state", "type": "string"},
               {"name": "zip", "type": "long"}
             ]}}], "default": null}
         ]
       }'
     );
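
For the Tez write failure in item 1, a minimal workaround sketch (bq_test as defined earlier; source_words is a hypothetical Hive table with a word column):

-- Switch this session to MapReduce before writing to BigQuery.
SET hive.execution.engine=mr;

INSERT INTO TABLE bq_test
SELECT count(*) AS word_count, word
FROM source_words
GROUP BY word;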