ExpediaGroup / hiveberg

Licence: Apache-2.0
Demonstration of a Hive Input Format for Iceberg

Programming Languages

java

Projects that are alternatives to, or similar to, hiveberg

smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Stars: ✭ 79 (+259.09%)
Mutual labels:  hive, data-lake
DaFlow
Apache Spark based data flow (ETL) framework that supports multiple read and write destinations of different types and multiple categories of transformation rules.
Stars: ✭ 24 (+9.09%)
Mutual labels:  hive
data-profiling
A set of scripts to pull metadata and data profiling metrics from relational database systems.
Stars: ✭ 57 (+159.09%)
Mutual labels:  hive
hive-cube
Self-service data export and monitoring platform based on the Hive data warehouse. https://hc.smartloli.org
Stars: ✭ 34 (+54.55%)
Mutual labels:  hive
hadoopoffice
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
Stars: ✭ 56 (+154.55%)
Mutual labels:  hive
HiveJdbcStorageHandler
No description or website provided.
Stars: ✭ 21 (-4.55%)
Mutual labels:  hive
hiveql-parser
HiveQL Parser. Parse HiveQL code and print AST in JSON format if success, else print well formed syntax error message.
Stars: ✭ 25 (+13.64%)
Mutual labels:  hive
BigDataTools
Tools for big data.
Stars: ✭ 36 (+63.64%)
Mutual labels:  hive
Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Stars: ✭ 52 (+136.36%)
Mutual labels:  iceberg
liquibase-impala
Liquibase extension to add Impala Database support
Stars: ✭ 23 (+4.55%)
Mutual labels:  hive
hadoop-etl-udfs
The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL
Stars: ✭ 17 (-22.73%)
Mutual labels:  hive
databricks-dbapi
DBAPI and SQLAlchemy dialect for Databricks Workspace and SQL Analytics clusters
Stars: ✭ 21 (-4.55%)
Mutual labels:  hive
aaocp
A big data project for analyzing user behaviour logs.
Stars: ✭ 53 (+140.91%)
Mutual labels:  hive
simple-ddl-parser
Simple DDL Parser to parse SQL (HQL, TSQL, AWS Redshift, BigQuery, Snowflake and other dialects) ddl files to json/python dict with full information about columns: types, defaults, primary keys, etc. & table properties, types, domains, etc.
Stars: ✭ 76 (+245.45%)
Mutual labels:  hive
herd-mdl
Herd-MDL, a turnkey managed data lake in the cloud. See https://finraos.github.io/herd-mdl/ for more information.
Stars: ✭ 11 (-50%)
Mutual labels:  data-lake
xxhadoop
Data analysis using Hadoop/Spark/Storm/ElasticSearch/machine learning, etc. These are my daily notes/code/demos. Don't fork, just star!
Stars: ✭ 37 (+68.18%)
Mutual labels:  hive
beekeeper
Service for automatically managing and cleaning up unreferenced data
Stars: ✭ 43 (+95.45%)
Mutual labels:  hive
hive-jdbc-driver
An alternative to the "hive standalone" jar for connecting Java applications to Apache Hive via JDBC
Stars: ✭ 31 (+40.91%)
Mutual labels:  hive
web-click-flow
Offline log analysis of website clickstream.
Stars: ✭ 14 (-36.36%)
Mutual labels:  hive
waggle-dance
Hive federation service. Enables disparate tables to be concurrently accessed across multiple Hive deployments.
Stars: ✭ 194 (+781.82%)
Mutual labels:  hive

Hive integration with Iceberg

Note

This project is now archived - the majority of the code from this project has been merged into the Apache Iceberg codebase where it has been available from release 0.10.0 onwards. Documentation can be found at Using Hive with Iceberg. We highly recommend using the features in Iceberg itself and contributing any related changes there.

Overview

This project demonstrates how Hive can be integrated with Iceberg. For now it provides a Hive InputFormat that can be used to read data from an Iceberg table; a unit test demonstrates its use.

There are a number of dependency clashes with libraries used by Iceberg (e.g. Guava, Avro), which mean we need to shade the upstream Iceberg artifacts. To use Hiveberg, you first need to check out our fork of Iceberg from https://github.com/ExpediaGroup/incubator-iceberg/tree/build-hiveberg-modules and then run:

./gradlew publishToMavenLocal

Ultimately we would like to contribute this Hive Input Format to Iceberg so this would no longer be required.

Features

IcebergInputFormat

To use the IcebergInputFormat you need to create a Hive table with the following DDL:

CREATE TABLE source_db.table_a
  ROW FORMAT SERDE 'com.expediagroup.hiveberg.IcebergSerDe'
  STORED AS
    INPUTFORMAT 'com.expediagroup.hiveberg.IcebergInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 'path_to_iceberg_table_location'
  TBLPROPERTIES('iceberg.catalog' = ...);

You must ensure that your Hive table has the same name as your Iceberg table. The InputFormat uses the name property from the Hive table to load the correct Iceberg table for the read. For example, if you created the Hive table as shown above, then your Iceberg table must be created using a TableIdentifier as follows, where both table names match:

TableIdentifier id = TableIdentifier.parse("source_db.table_a");
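
For example, if the table was created through a HadoopCatalog (see Setting TBLPROPERTIES below), a minimal sketch of creating it with a matching identifier looks like this; the schema and warehouse path are illustrative, not part of Hiveberg:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

// Hypothetical schema and warehouse location, for illustration only.
Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.LongType.get()),
    Types.NestedField.optional(2, "data", Types.StringType.get()));

HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "/path/to/warehouse");

// The identifier must match the Hive table name used in the DDL above.
TableIdentifier id = TableIdentifier.parse("source_db.table_a");
Table table = catalog.createTable(id, schema);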

Setting TBLPROPERTIES

You need to set one additional table property when using Hiveberg:

  • If you used HadoopTables to create your original Iceberg table, set this property (see the sketch after this list):
    TBLPROPERTIES('iceberg.catalog'='hadoop.tables')
  • If you used HadoopCatalog to create your original Iceberg table, set this property:
    TBLPROPERTIES('iceberg.catalog'='hadoop.catalog')
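
For the HadoopTables case, the table is identified by its path rather than by name; a minimal sketch, with an illustrative location:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

// Hypothetical schema and location, for illustration only.
Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.LongType.get()),
    Types.NestedField.optional(2, "data", Types.StringType.get()));

HadoopTables tables = new HadoopTables(new Configuration());

// With HadoopTables the path itself identifies the table; point the
// LOCATION clause of the Hive DDL at this same path.
Table table = tables.create(schema, "/path/to/iceberg/table");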

IcebergStorageHandler

This is implemented as a simplified option for creating Hiveberg tables. The Hive DDL should instead look like:

CREATE TABLE source_db.table_a
  STORED BY 'com.expediagroup.hiveberg.IcebergStorageHandler'
  LOCATION 'path_to_iceberg_data_warehouse'
  TBLPROPERTIES('iceberg.catalog' = ...);

Include the same TBLPROPERTIES as described in the Setting TBLPROPERTIES section, depending on how you've created your Iceberg table.

Predicate Pushdown

Pushdown of the Hive SQL WHERE clause has been implemented so that filters are pushed down to the Iceberg TableScan level as well as to the Parquet and ORC readers. Note: predicate pushdown to the Iceberg table scan is only activated when using the IcebergStorageHandler.
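
At the Iceberg level, a pushed-down filter becomes an expression on the table scan. A minimal sketch using the Iceberg Java API, where the column name and value are illustrative and table is an org.apache.iceberg.Table loaded as in the earlier sketches:

import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;

// A clause such as WHERE id = 5 arrives at the scan as an expression,
// allowing Iceberg to skip files whose statistics rule out matches.
TableScan scan = table.newScan()
    .filter(Expressions.equal("id", 5L));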

Column Projection

The IcebergInputFormat projects the columns referenced in the Hive SQL SELECT clause down to the Iceberg readers to reduce the number of columns read.
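
The Iceberg-level counterpart is a projection on the table scan; a minimal sketch, with illustrative column names and table as before:

import java.util.Arrays;
import org.apache.iceberg.TableScan;

// A query such as SELECT id, data FROM table_a reads only these columns.
TableScan scan = table.newScan()
    .select(Arrays.asList("id", "data"));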

Time Travel and System Tables

You can view snapshot metadata from your table by creating a __snapshots system table linked to your data table. To do this:

  1. Create your regular table as outlined above.
  2. Create another table where the Hive DDL looks similar to:
CREATE TABLE source_db.table_a__snapshots
  STORED BY 'com.expediagroup.hiveberg.IcebergStorageHandler'
  LOCATION 'path_to_original_data_table'
  TBLPROPERTIES ('iceberg.catalog'='hadoop.catalog');

Notes

  • It is important that the table name ends with __snapshots as the InputFormat uses this to determine when to load the snapshot metadata table instead of the regular data table.
  • If you want to create a regular data table with a name that ends in __snapshots but do not want to load the snapshots table, you can override this default behaviour by setting iceberg.snapshots.table=false in the TBLPROPERTIES.

Running a SELECT * FROM table_a__snapshots query returns the table's snapshot metadata, with results similar to the following:

committed_at  | snapshot_id         | parent_id           | operation | manifest_list       | summary
1588341995546 | 6023596325564622918 | null                | append    | /var/folders/sg/... | {"added-data-files":"1","added-records":"3","changed-partition-count":"1","total-records":"3","total-data-files":"1"}
1588857355026 | 544685312650116825  | 6023596325564622918 | append    | /var/folders/sg/... | {"added-data-files":"1","added-records":"3","changed-partition-count":"1","total-records":"6","total-data-files":"2"}

Time travel is only available when using the IcebergStorageHandler.

Specific snapshots can be selected from your Iceberg table using a provided virtual column that exposes the snapshot id. The default name for this column is snapshot__id, but you can rename it by setting a table property like so:

TBLPROPERTIES ('iceberg.hive.snapshot.virtual.column.name' = 'new_column_name')

To achieve time travel and query an older snapshot, you can execute a Hive query similar to:

SELECT * FROM table_a WHERE snapshot__id = 1234567890 
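
For comparison, the equivalent snapshot-pinned read through the Iceberg Java API fixes the scan to a snapshot id; a minimal sketch, where the id is the example value from the query above and table is loaded as in the earlier sketches:

import org.apache.iceberg.TableScan;

// Pin the scan to a specific snapshot instead of the current table state.
TableScan scan = table.newScan()
    .useSnapshot(1234567890L);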

Legal

This project is available under the Apache 2.0 License.

Copyright 2021 Expedia, Inc.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].