
vim89 / datalake-etl-pipeline

License: Apache-2.0
A simplified ETL process on Hadoop using Apache Spark. Provides a complete ETL pipeline for a data lake, including SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.

Programming Languages

python

Projects that are alternatives of or similar to datalake-etl-pipeline

DaFlow
Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-38.46%)
Mutual labels:  apache-spark, hadoop, etl, etl-framework, etl-pipeline
Hydrograph
A visual ETL development and debugging tool for big data
Stars: ✭ 144 (+269.23%)
Mutual labels:  big-data, apache-spark, etl, etl-framework
SparkProgrammingInScala
Apache Spark Course Material
Stars: ✭ 57 (+46.15%)
Mutual labels:  big-data, apache-spark, datalake, spark-sql
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+184.62%)
Mutual labels:  big-data, apache-spark, hadoop, pyspark
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+284.62%)
Mutual labels:  big-data, apache-spark, hadoop, pyspark
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (-12.82%)
Mutual labels:  big-data, hadoop, pyspark, spark-sql
vixtract
www.vixtract.ru
Stars: ✭ 40 (+2.56%)
Mutual labels:  etl, etl-framework, etl-pipeline, etl-components
DIRECT
DIRECT, the Data Integration Run-time Execution Control Tool, is a data logistics framework that can be used to monitor, log, audit and control data integration / ETL processes.
Stars: ✭ 20 (-48.72%)
Mutual labels:  etl, etl-framework, etl-pipeline
leaflet heatmap
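A simple visualization of phone call data for Huzhou. Assuming the data volume is too large to draw the heatmap directly in the browser, the heatmap rendering step is moved offline for computation and analysis. After the data is computed in parallel with Apache Spark, the heatmap is also drawn with Apache Spark, and leafletjs then loads the OpenStreetMap layer and the heatmap layer to achieve good interactivity. The drawing is currently implemented with Apache Spark, but the parallel computation is slower than single-machine computation, perhaps because Apache Spark is not well suited to this kind of computation or because I did not design the algorithm well. The Apache Spark heatmap drawing and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git.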
Stars: ✭ 13 (-66.67%)
Mutual labels:  big-data, apache-spark, hadoop
mmtf-workshop-2018
Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (+28.21%)
Mutual labels:  big-data, apache-spark, pyspark
Metorikku
A simplified, lightweight ETL Framework based on Apache Spark
Stars: ✭ 361 (+825.64%)
Mutual labels:  big-data, etl, etl-framework
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+7333.33%)
Mutual labels:  big-data, apache-spark, pyspark
spark-twitter-sentiment-analysis
Sentiment Analysis of a Twitter Topic with Spark Structured Streaming
Stars: ✭ 55 (+41.03%)
Mutual labels:  apache-spark, pyspark, spark-sql
AirflowETL
Blog post on ETL pipelines with Airflow
Stars: ✭ 20 (-48.72%)
Mutual labels:  etl, data-pipeline, etl-pipeline
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (+194.87%)
Mutual labels:  big-data, apache-spark, pyspark
Eel Sdk
Big Data Toolkit for the JVM
Stars: ✭ 140 (+258.97%)
Mutual labels:  big-data, hadoop, etl
Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stars: ✭ 4,581 (+11646.15%)
Mutual labels:  big-data, hadoop, datalake
Movies-Analytics-in-Spark-and-Scala
Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.
Stars: ✭ 47 (+20.51%)
Mutual labels:  big-data, hadoop, spark-sql
SynapseML
Simple and Distributed Machine Learning
Stars: ✭ 3,355 (+8502.56%)
Mutual labels:  big-data, apache-spark, pyspark
Griffon Vm
Griffon Data Science Virtual Machine
Stars: ✭ 128 (+228.21%)
Mutual labels:  big-data, apache-spark, hadoop

Datalake ETL Pipeline

Data transformation simplified for any Data platform.

Features: The package provides a complete ETL process -

  1. Uses metadata, transformation & data model information to design the ETL pipeline
  2. Builds Spark SQL and Spark DataFrames for the target transformation
  3. Builds source & target Hive DDLs
  4. Validates DataFrames, extends core classes, defines DataFrame transformations, and provides UDF SQL functions.
  5. Supports the following fundamental transformations for the ETL pipeline -
    • Filters on source & target DataFrames
    • Grouping and aggregations on source & target DataFrames
    • Heavily nested queries / DataFrames
  6. Includes a parser for complex, heavily nested XML, JSON, Parquet & ORC data, to any level of nesting
  7. Has unit test cases designed at the function/method level and measures source code coverage
  8. Has information about deploying to higher environments
  9. Has API documentation for customization & enhancement
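
As a rough illustration of items 1 and 3 above, the sketch below derives a Hive DDL statement from column metadata. It is not the project's API: `ColumnMeta`, `build_hive_ddl`, and the table name, columns and location are hypothetical names chosen for the example, and it uses plain Python string building so that it runs without a Spark or Hive installation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ColumnMeta:
    """Hypothetical metadata record describing one column of a target table."""
    name: str
    hive_type: str
    comment: str = ""


def build_hive_ddl(table: str, columns: List[ColumnMeta],
                   location: str, stored_as: str = "PARQUET") -> str:
    """Render a CREATE EXTERNAL TABLE statement from column metadata."""
    if not columns:
        raise ValueError("at least one column is required")
    col_lines = []
    for col in columns:
        line = f"  `{col.name}` {col.hive_type}"
        if col.comment:
            # Escape single quotes so the comment stays one string literal.
            escaped = col.comment.replace("'", "\\'")
            line += f" COMMENT '{escaped}'"
        col_lines.append(line)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        + ",\n".join(col_lines)
        + f"\n)\nSTORED AS {stored_as}\nLOCATION '{location}'"
    )


# Assumed example metadata; a real pipeline would load this from its
# metadata / data model definitions.
ddl = build_hive_ddl(
    "sales.orders",
    [ColumnMeta("order_id", "BIGINT", "primary key"),
     ColumnMeta("amount", "DECIMAL(18,2)")],
    location="/data/lake/sales/orders",
)
print(ddl)
```

Keeping the column definitions as data rather than hand-written SQL means the same metadata could, in principle, drive both the source and the target DDL.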

Enhancements: In progress -

  1. Integrate audit and logging - define error codes, log process failures, and audit progress & runtime information
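
Since this enhancement is still in progress, the following is only one possible shape for it, not the project's design. It is a minimal sketch using Python's standard `logging` module; the `EtlError` codes, the `run_step` helper and the audit record fields are all hypothetical.

```python
import logging
import time
from enum import Enum


class EtlError(Enum):
    """Hypothetical error codes; the project does not define these."""
    SOURCE_READ_FAILED = "ETL001"
    TRANSFORM_FAILED = "ETL002"
    TARGET_WRITE_FAILED = "ETL003"


logger = logging.getLogger("etl.audit")


def run_step(name, func, error_code, audit):
    """Run one pipeline step and record its status, error code and runtime."""
    start = time.perf_counter()
    record = {"step": name, "status": "SUCCESS", "error_code": None}
    try:
        return func()
    except Exception:
        record["status"] = "FAILED"
        record["error_code"] = error_code.value
        # Log the failure with its traceback, then let the caller decide.
        logger.exception("%s failed with %s", name, error_code.value)
        raise
    finally:
        # Runs on success and on failure, so every step is audited.
        record["seconds"] = time.perf_counter() - start
        audit.append(record)


audit_log = []
rows = run_step("read_source", lambda: [1, 2, 3],
                EtlError.SOURCE_READ_FAILED, audit_log)
try:
    run_step("transform", lambda: 1 / 0,
             EtlError.TRANSFORM_FAILED, audit_log)
except ZeroDivisionError:
    pass  # the failure is already captured in audit_log
```

Appending the record in `finally` keeps progress and runtime information available even for steps that fail.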