ETL
Extract - Transform - Load
Summary
Extract, Transform, Load (ETL) refers to a process in database usage, and especially in data warehousing. This repository contains a starter kit for ETL-related work.
Features and Limitations
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
This starter kit focuses mainly on ETL work and can be expanded into an independent ETL framework for different client data sources. It contains a basic implementation and project structure as follows:
- Common Module: This will contain all the common jobs and helper classes for the ETL framework. Currently two Scalding helper classes are implemented (a Hadoop job runner and MapReduceConfig); a minimal Scalding job sketch is shown after this list.
- DataModel Module: This will contain all the big-data schema related code, for example Avro, ORC, Thrift, etc. Currently a sample Avro clickstream raw-data schema has been implemented; a sketch of such a schema appears after this list.
- SampleClient Module: This will contain independent data processing jobs which depend on Common and DataModel.
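To give an idea of the kind of job the Common module's helpers are meant to support, here is a minimal Scalding sketch. The job name, the input/output paths, and the line-count aggregation are hypothetical, and HadoopRunner/MapReduceConfig are assumed to handle submission and Hadoop configuration for jobs like this one.

```scala
import com.twitter.scalding._

// Hypothetical job for illustration: counts occurrences of each distinct input line.
// The Common module's HadoopRunner / MapReduceConfig are assumed to take care of
// submitting and configuring jobs like this one.
class ClickCountJob(args: Args) extends Job(args) {

  // Read raw clickstream lines from the path passed as --input
  TypedPipe.from(TextLine(args("input")))
    // Group identical lines together and count them
    .groupBy(identity)
    .size
    // Write (line, count) pairs as TSV to the path passed as --output
    .write(TypedTsv[(String, Long)](args("output")))
}
```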
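For the DataModel module, the following sketch shows roughly what an Avro clickstream schema, and a record built against it, could look like. The field names are hypothetical; the actual ClickStreamRecord.avsc in this repository may define different fields.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericRecord, GenericRecordBuilder}

// Hypothetical schema for illustration only; the real ClickStreamRecord.avsc
// under DataModel/schema/com/etl/datamodel may define different fields.
object ClickStreamRecordExample {

  val schemaJson: String =
    """{
      |  "type": "record",
      |  "name": "ClickStreamRecord",
      |  "namespace": "com.etl.datamodel",
      |  "fields": [
      |    {"name": "userId",    "type": "string"},
      |    {"name": "url",       "type": "string"},
      |    {"name": "timestamp", "type": "long"}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    // Parse the schema and build a single record against it
    val schema = new Schema.Parser().parse(schemaJson)
    val record: GenericRecord = new GenericRecordBuilder(schema)
      .set("userId", "u-123")
      .set("url", "/home")
      .set("timestamp", System.currentTimeMillis())
      .build()
    println(record)
  }
}
```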
Since this repository is meant to provide only the structure, the different types of sample jobs are not implemented. Feel free to modify it and implement whatever batch/streaming jobs your requirements call for (Spark, Hive, Pig, etc.); a minimal Spark batch sketch is shown below.
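As one example of a batch job you might add, here is a minimal Spark sketch in Scala. The input/output paths and the "url" column are assumptions for illustration and are not part of this repository; a job like this would typically sit under the SampleClient module next to ClickStreamAggregates.scala.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical Spark batch job: aggregates page views per URL.
// Paths and the "url" column are assumptions for illustration.
object ClickStreamBatchJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("clickstream-aggregates")
      .getOrCreate()

    // Read raw clickstream events (assumed to be stored as Parquet)
    val events = spark.read.parquet("hdfs:///data/raw/clickstream")

    // Count events per URL
    val pageViews = events.groupBy("url").count()

    // Write the aggregates back to HDFS
    pageViews.write.mode("overwrite").parquet("hdfs:///data/agg/page_views")

    spark.stop()
  }
}
```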
Installation
Make sure you have installed:
- JDK 1.8+
- Scala 2.10.*
- Gradle 2.2+
This starter kit uses the latest version of the LinkedIn Gradle Hadoop plugin, which supports only the Gradle 2 series. If you would like to use an older Gradle version, you have to downgrade the LinkedIn Gradle Hadoop plugin.
Directory Layout
.
├── Common                         --> common module which can contain helper classes
│   ├── build.gradle               --> build script specific to the common module
│   └── src                        --> source package directory for the common module
│       └── main
│           ├── java
│           ├── resources
│           └── scala
│               └── com
│                   └── etl
│                       └── utils
│                           ├── HadoopRunner.scala
│                           └── MapReduceConfig.scala
├── DataModel                      --> schema level module (eg: avro, thrift, json etc)
│   ├── build.gradle               --> build script specific to the datamodel module
│   ├── schema                     --> data schema files
│   │   └── com
│   │       └── etl
│   │           └── datamodel
│   │               └── ClickStreamRecord.avsc --> click stream record avro schema
│   ├── src                        --> source package directory for the datamodel module
│   │   └── main
│   │       ├── java
│   │       ├── resources
│   │       └── scala
│   └── target                     --> auto generated code (eg: from avro, thrift etc)
│       └── generated-sources
│           └── main
│               └── java
│                   └── com
│                       └── etl
│                           └── datamodel
│                               └── ClickStreamRecord.java --> auto generated code from the click stream record avro schema
├── SampleClient                   --> separate module for client-specific ETL jobs
│   ├── build.gradle               --> build script for the client-specific module
│   ├── src                        --> source package directory for the client-specific module
│   │   └── main
│   │       ├── java
│   │       ├── resources
│   │       └── scala
│   │           └── com
│   │               └── sampleclient
│   │                   └── jobs
│   │                       └── ClickStreamAggregates.scala --> clickstream aggregates jobs
│   └── workflow                   --> hadoop job flow groovy script folder
│       ├── flow.gradle            --> gradle script to generate hadoop job flows (eg: Azkaban)
│       └── jobs.gradle            --> gradle script for hadoop jobs (eg: Azkaban)
├── build.gradle                   --> build script for the root module
├── gradle                         --> gradle folder which contains all the build script files
│   ├── artifacts.gradle           --> artifact file for the ETL project
│   ├── buildscript.gradle         --> groovy script containing plugins, task classes, and other classes available to the project
│   ├── dependencies.gradle        --> dependencies for the ETL project
│   ├── enviroments.groovy         --> configuration for prod and dev environments
│   ├── repositories.gradle        --> all the dependency repository locations
│   └── workflows.gradle           --> root workflow gradle file containing configuration and custom build tasks
├── gradlew
├── gradlew.bat
└── settings.gradle                --> settings for sub modules
Key Package Links
This starter kit is built on a few popular libraries, with sample code for each. Choose the technology that best suits your requirements.
Using The Project
Note: This guide has only been tested on Mac OS X and may assume tools that are specific to it. If you are working on another OS, substitutes may be needed but should be available.
Step 1 - Build the Project:
- Run:
gradle clean build
Once the build completes, you will find the build artifacts used in the next step, including the Azkaban flow package "etl-starter-kit-sampleclient.zip" and the fat Hadoop jar.
Step 2 - Upload the Azkaban Flow
Upload "etl-starter-kit-sampleclient.zip" to Azkaban. After deploying the fat Hadoop jar, you're ready to run the flow.
Notable Frameworks for ETL work:
- Scalding
- Pig
- Hive
- Apache Spark (Batch/Stream data processing)
- Apache Flink (Batch/Stream data processing)
License
MIT © Renien