Renien / ETL-Starter-Kit

License: MIT
๐Ÿ“ Extract, Transform, Load (ETL) ๐Ÿ‘ท refers to a process in database usage and especially in data warehousing. This repository contains a starter kit featuring ETL related work.

Programming Languages: scala, groovy

Projects that are alternatives of or similar to ETL-Starter-Kit

Bigdata Notes
Big data beginner's guide ⭐
Stars: ✭ 10,991 (+52238.1%)
Mutual labels:  hive, bigdata, azkaban
God Of Bigdata
Focused on big data study and interviews; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...
Stars: ✭ 6,008 (+28509.52%)
Mutual labels:  hive, bigdata, azkaban
Bigdata practice
Big data analysis and visualization practice
Stars: ✭ 166 (+690.48%)
Mutual labels:  hive, bigdata
DGFraud-TF2
A Deep Graph-based Toolbox for Fraud Detection in TensorFlow 2.X
Stars: ✭ 84 (+300%)
Mutual labels:  datascience, datamining
logparser
Easy parsing of Apache HTTPD and NGINX access logs with Java, Hadoop, Hive, Pig, Flink, Beam, Storm, Drill, ...
Stars: ✭ 139 (+561.9%)
Mutual labels:  hive, pig
Apache Spark Hands On
Educational notes and hands-on problems with solutions for the Hadoop ecosystem
Stars: ✭ 74 (+252.38%)
Mutual labels:  hive, bigdata
Hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Stars: ✭ 126 (+500%)
Mutual labels:  hive, bigdata
bigdata-doc
Big data study notes, learning roadmap, and organized technical case studies.
Stars: ✭ 37 (+76.19%)
Mutual labels:  hive, bigdata
Datafaker
Datafaker is a large-scale test data and flow test data generation tool. Datafaker fakes data and inserts to varied data sources. Test data generation tool.
Stars: ✭ 327 (+1457.14%)
Mutual labels:  hive, bigdata
gan deeplearning4j
Automatic feature engineering using Generative Adversarial Networks using Deeplearning4j and Apache Spark.
Stars: ✭ 19 (-9.52%)
Mutual labels:  bigdata, datascience
dockerfiles
Multi docker container images for main Big Data Tools. (Hadoop, Spark, Kafka, HBase, Cassandra, Zookeeper, Zeppelin, Drill, Flink, Hive, Hue, Mesos, ... )
Stars: ✭ 29 (+38.1%)
Mutual labels:  hive, bigdata
common-datax
A general-purpose data synchronization microservice based on DataX; one RESTful interface handles all common data synchronization
Stars: ✭ 51 (+142.86%)
Mutual labels:  hive, azkaban
Pyetl
Python ETL framework
Stars: ✭ 33 (+57.14%)
Mutual labels:  hive, etl-framework
Bigdataguide
Big data learning from scratch, including videos for every learning stage and interview materials
Stars: ✭ 817 (+3790.48%)
Mutual labels:  hive, bigdata
DaFlow
Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types as well as multiple categories of transformation rules.
Stars: ✭ 24 (+14.29%)
Mutual labels:  hive, etl-framework
TiBigData
TiDB connectors for Flink/Hive/Presto
Stars: ✭ 192 (+814.29%)
Mutual labels:  hive, bigdata
litemall-dw
A big data project based on the open-source Litemall e-commerce project, covering front-end tracking (openresty+lua) and back-end tracking, a five-layer data warehouse, real-time computation, and user profiling. The platform runs on CDH 6.3.2 (scripted with vagrant+ansible) and also includes an Azkaban workflow.
Stars: ✭ 36 (+71.43%)
Mutual labels:  hive, azkaban
qwery
A SQL-like language for performing ETL transformations.
Stars: ✭ 28 (+33.33%)
Mutual labels:  hive, etl-framework
the-apache-ignite-book
All code samples, scripts and more in-depth examples for The Apache Ignite Book. Includes Apache Ignite 2.6 or above.
Stars: ✭ 65 (+209.52%)
Mutual labels:  hive, bigdata
hadoopoffice
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
Stars: ✭ 56 (+166.67%)
Mutual labels:  hive, bigdata

ETL

Extract - Transform - Load


Summary

Extract, Transform, Load (ETL) refers to a process in database usage and especially in data warehousing. This repository contains a starter kit featuring ETL-related work.

Features and Limitations

(lambda-etl architecture diagram)

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.

This starter kit package mainly focuses on ETL-related work and allows expansion into an independent ETL framework for different client data sources. It contains a basic implementation and project structure as follows:

  • Common Module – This will contain all the common jobs and helper classes for the ETL framework. Currently two Scalding helper classes are implemented (a Hadoop job runner and MapReduceConfig); a minimal sketch follows this list.

  • DataModel Module – This will contain all the BigData schema-related code, for example Avro, ORC, Thrift, etc. Currently a sample Avro clickstream raw-data schema has been implemented.

  • SampleClient Module – This will contain independent data-processing jobs that depend on Common and DataModel.
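
To make the Common module concrete, here is a minimal, hypothetical sketch of what a Scalding job-runner helper might look like. The object name HadoopRunner matches the file in the directory layout below, but the body shown here is illustrative, not the repository's actual code:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding.Tool

object HadoopRunner {
  // Hypothetical helper: runs a Scalding job class on the cluster,
  // forwarding any remaining command-line arguments to the job.
  def run(jobClassName: String, jobArgs: Array[String]): Int = {
    val conf = new Configuration()
    // Scalding's Tool expects the job class name first, followed by a mode
    // flag such as --hdfs (run on Hadoop) or --local (in-memory testing).
    ToolRunner.run(conf, new Tool, Array(jobClassName, "--hdfs") ++ jobArgs)
  }
}

Centralizing this in Common keeps per-client modules free of Hadoop bootstrapping concerns.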

Since this repository provides only the structure, different types of sample jobs are not implemented. Feel free to modify it and implement different types of batch/streaming jobs (Spark, Hive, Pig, etc.) based on your requirements.
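
As an illustration of such a job, here is a minimal Scalding batch job sketch. All names here are assumptions for illustration: the ClickStreamCounts class, the TSV field names, and the input/output arguments are not part of this repository:

import com.twitter.scalding._

// Hypothetical batch job in the spirit of ClickStreamAggregates: counts
// events per URL from raw tab-separated clickstream data.
class ClickStreamCounts(args: Args) extends Job(args) {
  Tsv(args("input"), ('userId, 'url, 'timestamp))
    .groupBy('url) { _.size('hits) }   // number of events per URL
    .write(Tsv(args("output")))
}

A job like this would typically be launched through a runner such as the sketch above (or Scalding's own Tool), with --input and --output arguments pointing at HDFS paths.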

Installation

Make sure you have installed:

  • JDK 1.8+
  • Scala 2.10.*
  • Gradle 2.2+

This starter kit package uses the latest version of the LinkedIn Gradle Hadoop plugin, which supports only the Gradle 2 series. If you would like to use an older version of Gradle, you will have to downgrade the LinkedIn Gradle Hadoop plugin.

Directory Layout

.
├── Common                                                  --> common module which can contain helper classes
│   ├── build.gradle                                        --> build script specific to the Common module
│   └── src                                                 --> source package directory for the Common module
│       └── main
│           ├── java
│           ├── resources
│           └── scala
│               └── com
│                   └── etl
│                       └── utils
│                           ├── HadoopRunner.scala
│                           └── MapReduceConfig.scala
├── DataModel                                               --> schema-level module (eg: avro, thrift, json etc)
│   ├── build.gradle                                        --> build script specific to the DataModel module
│   ├── schema                                              --> data schema files
│   │   └── com
│   │       └── etl
│   │           └── datamodel
│   │               └── ClickStreamRecord.avsc              --> clickstream record Avro schema
│   ├── src                                                 --> source package directory for the DataModel module
│   │   └── main
│   │       ├── java
│   │       ├── resources
│   │       └── scala
│   └── target                                              --> auto-generated code (eg from avro, thrift etc)
│       └── generated-sources
│           └── main
│               └── java
│                   └── com
│                       └── etl
│                           └── datamodel
│                               └── ClickStreamRecord.java  --> auto-generated code from the clickstream record Avro schema
├── SampleClient                                            --> separate module for client-specific ETL jobs
│   ├── build.gradle                                        --> build script for the client-specific module
│   ├── src                                                 --> source package directory for the client-specific module
│   │   └── main
│   │       ├── java
│   │       ├── resources
│   │       └── scala
│   │           └── com
│   │               └── sampleclient
│   │                   └── jobs
│   │                       └── ClickStreamAggregates.scala --> clickstream aggregates jobs
│   └── workflow                                            --> hadoop job flow groovy script folder
│       ├── flow.gradle                                     --> gradle script to generate hadoop job flows (eg: Azkaban)
│       └── jobs.gradle                                     --> gradle script for hadoop jobs (eg: Azkaban)
├── build.gradle                                            --> build script for the root module
├── gradle                                                  --> gradle folder containing all the build script files
│   ├── artifacts.gradle                                    --> artifact file for the ETL project
│   ├── buildscript.gradle                                  --> groovy script with plugins, task classes, and other classes available to the project
│   ├── dependencies.gradle                                 --> dependencies for the ETL project
│   ├── enviroments.groovy                                  --> configuration for the prod and dev environments
│   ├── repositories.gradle                                 --> repository locations for all dependencies
│   └── workflows.gradle                                    --> root workflow gradle file containing configuration and custom build tasks
├── gradlew
├── gradlew.bat
└── settings.gradle                                         --> settings for sub-modules

Key Package Links

This starter kit is built on a few popular libraries, with sample code. Choose the technology that suits your requirements.

Using The Project

Note: This guide has only been tested on Mac OS X and may assume tools that are specific to it. If you are working on another OS, substitutes may be needed, but they should be readily available.

Step 1 – Build the Project:

  • Run gradle clean build

Once you build the project you will find the following files:

(etl-build-files screenshot)

Step 2 – Upload Azkaban Flow

Upload 'etl-starter-kit-sampleclient.zip' to Azkaban. After deploying the fat Hadoop jar you're ready to run the flow.

(sample-client-azkaban-flow screenshot)

Notable Frameworks for ETL work

License

MIT © Renien
