vim89 / datalake-etl-pipeline

License: Apache-2.0

A simplified ETL process on Hadoop using Apache Spark. Provides a complete ETL pipeline for a data lake, including SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.

Programming Languages

python

139335 projects - #7 most used programming language

Projects that are alternatives of or similar to datalake-etl-pipeline

DaFlow

An Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.

Stars: ✭ 24 (-38.46%)

Mutual labels: apache-spark, hadoop, etl, etl-framework, etl-pipeline

Hydrograph

A visual ETL development and debugging tool for big data

Stars: ✭ 144 (+269.23%)

Mutual labels: big-data, apache-spark, etl, etl-framework

SparkProgrammingInScala

Apache Spark Course Material

Stars: ✭ 57 (+46.15%)

Mutual labels: big-data, apache-spark, datalake, spark-sql

aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Stars: ✭ 111 (+184.62%)

Mutual labels: big-data, apache-spark, hadoop, pyspark

Spark With Python

Fundamentals of Spark with Python (using PySpark), code examples

Stars: ✭ 150 (+284.62%)

Mutual labels: big-data, apache-spark, hadoop, pyspark

big data

A collection of tutorials on Hadoop, MapReduce, Spark, Docker

Stars: ✭ 34 (-12.82%)

Mutual labels: big-data, hadoop, pyspark, spark-sql

vixtract

www.vixtract.ru

Stars: ✭ 40 (+2.56%)

Mutual labels: etl, etl-framework, etl-pipeline, etl-components

DIRECT

DIRECT, the Data Integration Run-time Execution Control Tool, is a data logistics framework that can be used to monitor, log, audit and control data integration / ETL processes.

Stars: ✭ 20 (-48.72%)

Mutual labels: etl, etl-framework, etl-pipeline

leaflet heatmap

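A simple visualization of call data from Huzhou. Assuming the data volume is too large to draw the heatmap directly in the browser, the heatmap rendering step is moved offline for computation and analysis. After the data is computed in parallel with Apache Spark, the heatmap is also drawn with Apache Spark, and leafletjs then loads the OpenStreetMap layer and the heatmap layer to give good interactivity. The rendering is currently implemented with Apache Spark; either Apache Spark is not well suited to this kind of computation or I did not design the algorithm well, because the parallel computation is slower than the single-machine computation. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .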

Stars: ✭ 13 (-66.67%)

Mutual labels: big-data, apache-spark, hadoop

mmtf-workshop-2018

Structural Bioinformatics Training Workshop & Hackathon 2018

Stars: ✭ 50 (+28.21%)

Mutual labels: big-data, apache-spark, pyspark

Metorikku

A simplified, lightweight ETL Framework based on Apache Spark

Stars: ✭ 361 (+825.64%)

Mutual labels: big-data, etl, etl-framework

Mmlspark

Simple and Distributed Machine Learning

Stars: ✭ 2,899 (+7333.33%)

Mutual labels: big-data, apache-spark, pyspark

spark-twitter-sentiment-analysis

Sentiment Analysis of a Twitter Topic with Spark Structured Streaming

Stars: ✭ 55 (+41.03%)

Mutual labels: apache-spark, pyspark, spark-sql

AirflowETL

Blog post on ETL pipelines with Airflow

Stars: ✭ 20 (-48.72%)

Mutual labels: etl, data-pipeline, etl-pipeline

pyspark-cheatsheet

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

Stars: ✭ 115 (+194.87%)

Mutual labels: big-data, apache-spark, pyspark

Eel Sdk

Big Data Toolkit for the JVM

Stars: ✭ 140 (+258.97%)

Mutual labels: big-data, hadoop, etl

Trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Stars: ✭ 4,581 (+11646.15%)

Mutual labels: big-data, hadoop, datalake

Movies-Analytics-in-Spark-and-Scala

Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.

Stars: ✭ 47 (+20.51%)

Mutual labels: big-data, hadoop, spark-sql

SynapseML

Simple and Distributed Machine Learning

Stars: ✭ 3,355 (+8502.56%)

Mutual labels: big-data, apache-spark, pyspark

Griffon Vm

Griffon Data Science Virtual Machine

Stars: ✭ 128 (+228.21%)

Mutual labels: big-data, apache-spark, hadoop


Datalake ETL Pipeline

Data transformation simplified for any Data platform.

Features: The package covers the complete ETL process:

Uses metadata, transformation, and data model information to design the ETL pipeline
Builds the target transformation as Spark SQL and Spark DataFrames
Builds source and target Hive DDLs
Validates DataFrames, extends core classes, defines DataFrame transformations, and provides UDF SQL functions
Supports the following fundamental transformations for the ETL pipeline:
- Filters on source and target DataFrames
- Grouping and aggregations on source and target DataFrames
- Heavily nested queries / DataFrames
Parses complex, heavily nested XML, JSON, Parquet, and ORC data to the nth level of nesting
Includes unit test cases at the function/method level and measures source code coverage
Includes information about deploying to higher environments
Includes API documentation for customization and enhancement
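
To make the metadata-driven idea concrete, here is a minimal sketch in plain Python of how table metadata could be turned into a Hive DDL statement and a grouped Spark SQL query. It is not the project's actual API: the names `Column`, `TableMeta`, `build_hive_ddl`, and `build_select_sql` are hypothetical, and the metadata layout is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Column:
    """Hypothetical column descriptor: a name plus a Hive type string."""
    name: str
    dtype: str  # e.g. "string", "bigint", "decimal(10,2)"


@dataclass
class TableMeta:
    """Hypothetical table metadata used to generate DDL."""
    database: str
    table: str
    columns: List[Column]
    partition_by: Optional[List[Column]] = None
    stored_as: str = "PARQUET"


def build_hive_ddl(meta: TableMeta) -> str:
    """Render a CREATE TABLE statement from the metadata."""
    cols = ",\n".join(f"  `{c.name}` {c.dtype}" for c in meta.columns)
    ddl = f"CREATE TABLE IF NOT EXISTS `{meta.database}`.`{meta.table}` (\n{cols}\n)"
    if meta.partition_by:
        parts = ", ".join(f"`{c.name}` {c.dtype}" for c in meta.partition_by)
        ddl += f"\nPARTITIONED BY ({parts})"
    ddl += f"\nSTORED AS {meta.stored_as}"
    return ddl


def build_select_sql(
    source: str,
    group_by: List[str],
    aggregations: Dict[str, str],
    where: Optional[str] = None,
) -> str:
    """Render a filter + group-by + aggregate query.

    `aggregations` maps an output alias to an aggregate expression,
    e.g. {"total": "SUM(amount)"}.
    """
    select_items = list(group_by) + [
        f"{expr} AS {alias}" for alias, expr in aggregations.items()
    ]
    sql = f"SELECT {', '.join(select_items)} FROM {source}"
    if where:
        sql += f" WHERE {where}"
    if group_by:
        sql += f" GROUP BY {', '.join(group_by)}"
    return sql


if __name__ == "__main__":
    meta = TableMeta(
        database="sales_db",
        table="orders",
        columns=[Column("order_id", "bigint"), Column("amount", "decimal(10,2)")],
        partition_by=[Column("order_date", "string")],
    )
    print(build_hive_ddl(meta))
    print(build_select_sql("sales_db.orders", ["order_date"],
                           {"total": "SUM(amount)"}, where="amount > 0"))
```

In a Spark job, the generated strings could then be passed to `spark.sql(...)` on an existing `SparkSession`. The sketch does no escaping or validation of the metadata, so it should only be fed trusted input.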

Enhancements (in progress):

Integrate audit and logging: define error codes, log process failures, and audit progress and runtime information
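
Since this enhancement is still in progress, the following is only a sketch of what error codes and step-level auditing might look like, using the Python standard library. The `ErrorCode` values, the `audited_step` helper, and the audit record layout are all hypothetical and are not taken from the project.

```python
import logging
import time
from contextlib import contextmanager
from enum import Enum

logger = logging.getLogger("etl.audit")


class ErrorCode(Enum):
    """Hypothetical error codes for the main ETL stages."""
    SOURCE_READ_FAILED = "E001"
    TRANSFORM_FAILED = "E002"
    TARGET_WRITE_FAILED = "E003"


@contextmanager
def audited_step(name, error_code, audit_log):
    """Record status, error code, and runtime of one pipeline step.

    Appends a dict to `audit_log`; on failure the exception is logged
    with the given error code and then re-raised.
    """
    start = time.perf_counter()
    record = {"step": name, "status": "RUNNING", "error_code": None}
    audit_log.append(record)
    try:
        yield record
        record["status"] = "SUCCESS"
    except Exception as exc:
        record["status"] = "FAILED"
        record["error_code"] = error_code.value
        logger.error("%s failed [%s]: %s", name, error_code.value, exc)
        raise
    finally:
        record["runtime_seconds"] = time.perf_counter() - start


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    audit_log = []
    with audited_step("read_source", ErrorCode.SOURCE_READ_FAILED, audit_log) as rec:
        rec["rows"] = 3  # a step can attach its own progress information
    print(audit_log)
```

Keeping the audit trail as a plain list of dicts makes it straightforward to persist later, for example by writing it to an audit table at the end of a run.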

