Top datalake open source projects

Simplified ETL process in Hadoop using Apache Spark, with a complete ETL pipeline for a data lake. Includes SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.

✭ 39

python big-data spark apache-spark hadoop etl xml xml-parsing pyspark data-pipeline datalake hadoop-mapreduce spark-sql etl-framework hadoop-hdfs etl-pipeline etl-components

zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

apiary-data-lake

Terraform scripts for deploying Apiary Data Lake

✭ 15

HCL python Smarty shell apiary datalake

dlink

Dinky is an out-of-the-box, one-stop real-time computing platform dedicated to the construction and practice of unified streaming and batch processing and of a unified data lake and data warehouse. Based on Apache Flink, Dinky can connect to many big data frameworks, including OLAP engines and data lakes.

✭ 1,535

java typescript Less javascript Dockerfile EJS sql olap flink datawarehouse datalake dlink flinksql flinkcdc real-time-computing-platform
