
cartershanklin / hive-scd-examples

Licence: other
How to manage Slowly Changing Dimensions with Apache Hive

Programming Languages: PLSQL, Python, shell

Managing Slowly Changing Dimensions (SCDs) with Apache Hive

This project provides sample datasets and scripts that demonstrate how to manage Slowly Changing Dimensions (SCDs) with Apache Hive's ACID MERGE capabilities. ACID MERGE applies all updates atomically, ensures that readers see either all of the updates or none of them, and handles failure scenarios, so application developers do not have to build these guarantees themselves.

Also included is data that simulates a full data dump from a source system, followed by another data dump taken later.

The objective is to merge the data using different slowly changing dimension strategies.

These examples cover Type 1, Type 2 and Type 3 updates.
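
To make the three update styles concrete, here is a minimal in-memory sketch in plain Python. It is not the repository's Hive SQL: the `Row` layout, the `email` attribute and the function names are hypothetical, chosen only to show how each strategy treats a changed value when a later dump is merged into an earlier one.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


# Hypothetical row layout; the repository's real table schemas may differ.
@dataclass(frozen=True)
class Row:
    key: int
    email: str
    prev_email: Optional[str] = None  # used only by Type 3
    valid_from: Optional[str] = None  # used only by Type 2
    valid_to: Optional[str] = None    # None marks the current version (Type 2)


def scd_type1(target: List[Row], source: Dict[int, str]) -> List[Row]:
    """Type 1: overwrite changed values in place; no history is kept."""
    out = [replace(r, email=source.get(r.key, r.email)) for r in target]
    known = {r.key for r in target}
    out += [Row(k, e) for k, e in source.items() if k not in known]
    return out


def scd_type2(target: List[Row], source: Dict[int, str], load_date: str) -> List[Row]:
    """Type 2: close the current version of a changed row and append a new version."""
    current = {r.key: r for r in target if r.valid_to is None}
    out = []
    for r in target:
        changed = r.valid_to is None and r.key in source and source[r.key] != r.email
        out.append(replace(r, valid_to=load_date) if changed else r)
    for k, e in source.items():
        cur = current.get(k)
        if cur is None or cur.email != e:
            out.append(Row(k, e, valid_from=load_date))
    return out


def scd_type3(target: List[Row], source: Dict[int, str]) -> List[Row]:
    """Type 3: keep exactly one previous value in a dedicated column."""
    out = []
    for r in target:
        new = source.get(r.key, r.email)
        out.append(replace(r, email=new, prev_email=r.email) if new != r.email else r)
    known = {r.key for r in target}
    out += [Row(k, e) for k, e in source.items() if k not in known]
    return out
```

Calling each function with the same earlier dump and later dump shows the trade-off: Type 1 leaves one row per key with the latest value, Type 2 grows by one row for every change, and Type 3 keeps one row per key plus a single prior value.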

Procedure

SCD Strategies

Requirements

Instructions

  • Clone this repository onto your Hadoop cluster
  • Run load_data.sh to stage data into HDFS
  • From the Hive CLI or beeline, run hive_type1_scd.sql, hive_type2_scd.sql and hive_type3_scd.sql
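
The steps after cloning could be scripted roughly as follows. This is a sketch under assumptions: the JDBC URL is a placeholder you would supply, `load_data.sh` is assumed to run under `bash`, and the `build_commands` helper is hypothetical. It uses beeline's `-u` (connection URL) and `-f` (script file) options.

```python
from typing import List

# Script names as listed in the instructions above.
SCD_SCRIPTS = ["hive_type1_scd.sql", "hive_type2_scd.sql", "hive_type3_scd.sql"]


def build_commands(jdbc_url: str) -> List[List[str]]:
    """Return argv lists: stage the data first, then run each SCD script."""
    # Assumption: load_data.sh can be run with bash from the repository root.
    commands = [["bash", "load_data.sh"]]
    for script in SCD_SCRIPTS:
        # beeline: -u is the JDBC connection URL, -f executes a script file.
        commands.append(["beeline", "-u", jdbc_url, "-f", script])
    return commands
```

From the root of the cloned repository, each returned list can be passed to `subprocess.run(cmd, check=True)` in order, for example with a URL such as `jdbc:hive2://localhost:10000` if HiveServer2 is listening there.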