967 open-source projects that are alternatives to or similar to datalake-etl-pipeline

An Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.

Stars: ✭ 24 (-38.46%)

Mutual labels: apache-spark, hadoop, etl, etl-framework, etl-pipeline

aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Stars: ✭ 111 (+184.62%)

Mutual labels: big-data, apache-spark, hadoop, pyspark

Spark With Python

Fundamentals of Spark with Python (using PySpark), code examples

Stars: ✭ 150 (+284.62%)

Mutual labels: big-data, apache-spark, hadoop, pyspark

SparkProgrammingInScala

Apache Spark Course Material

Stars: ✭ 57 (+46.15%)

Mutual labels: big-data, apache-spark, datalake, spark-sql

vixtract

www.vixtract.ru

Stars: ✭ 40 (+2.56%)

Mutual labels: etl, etl-framework, etl-pipeline, etl-components

Hydrograph

A visual ETL development and debugging tool for big data

Stars: ✭ 144 (+269.23%)

Mutual labels: big-data, apache-spark, etl, etl-framework

big data

A collection of tutorials on Hadoop, MapReduce, Spark, Docker

Stars: ✭ 34 (-12.82%)

Mutual labels: big-data, hadoop, pyspark, spark-sql

basin

Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser

Stars: ✭ 25 (-35.9%)

Mutual labels: hadoop, etl, pyspark

etlflow

EtlFlow is an ecosystem of functional libraries in Scala, based on ZIO, for writing various tasks and jobs on GCP and AWS.

Stars: ✭ 38 (-2.56%)

Mutual labels: etl, etl-framework, etl-pipeline

Metorikku

A simplified, lightweight ETL Framework based on Apache Spark

Stars: ✭ 361 (+825.64%)

Mutual labels: big-data, etl, etl-framework

sparkucx

A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing the UCX communication layer

Stars: ✭ 32 (-17.95%)

Mutual labels: big-data, apache-spark, hadoop

pyspark-cheatsheet

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

Stars: ✭ 115 (+194.87%)

Mutual labels: big-data, apache-spark, pyspark

Griffon Vm

Griffon Data Science Virtual Machine

Stars: ✭ 128 (+228.21%)

Mutual labels: big-data, apache-spark, hadoop

Eel Sdk

Big Data Toolkit for the JVM

Stars: ✭ 140 (+258.97%)

Mutual labels: big-data, hadoop, etl

DIRECT

DIRECT, the Data Integration Run-time Execution Control Tool, is a data logistics framework that can be used to monitor, log, audit and control data integration / ETL processes.

Stars: ✭ 20 (-48.72%)

Mutual labels: etl, etl-framework, etl-pipeline

hamilton

A scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.

Stars: ✭ 612 (+1469.23%)

Mutual labels: etl, etl-framework, etl-pipeline

redis-connect-dist

Real-Time Event Streaming & Change Data Capture

Stars: ✭ 21 (-46.15%)

Mutual labels: etl, etl-framework, etl-pipeline

Sparkrdma

RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark

Stars: ✭ 215 (+451.28%)

Mutual labels: big-data, apache-spark, hadoop

Waterdrop

Production Ready Data Integration Product, documentation：

Stars: ✭ 1,856 (+4658.97%)

Mutual labels: hadoop, etl-framework, etl-pipeline

SynapseML

Simple and Distributed Machine Learning

Stars: ✭ 3,355 (+8502.56%)

Mutual labels: big-data, apache-spark, pyspark

Bigdata Playground

A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

Stars: ✭ 177 (+353.85%)

Mutual labels: big-data, apache-spark, hadoop

Butterfree

A tool for building feature stores.

Stars: ✭ 126 (+223.08%)

Mutual labels: etl, pyspark, etl-framework

Movies-Analytics-in-Spark-and-Scala

Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.

Stars: ✭ 47 (+20.51%)

Mutual labels: big-data, hadoop, spark-sql

csvplus

csvplus extends the standard Go encoding/csv package with fluent interface, lazy stream operations, indices and joins.

Stars: ✭ 67 (+71.79%)

Mutual labels: etl, etl-framework, etl-pipeline

AirflowETL

Blog post on ETL pipelines with Airflow

Stars: ✭ 20 (-48.72%)

Mutual labels: etl, data-pipeline, etl-pipeline

leaflet heatmap

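A simple visualization of call data for Huzhou. Assuming the data volume is too large to draw a heatmap directly in the browser, the heatmap rendering step is moved offline for computation and analysis. After the data is computed in parallel with Apache Spark, the heatmap is also rendered with Apache Spark, and leafletjs then loads the OpenStreetMap layer and the heatmap layer for good interactivity. The rendering is currently implemented with Apache Spark, but the parallel computation is slower than single-machine computation, perhaps because Apache Spark is not well suited to this kind of computation or because I did not design the algorithm well. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .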

Stars: ✭ 13 (-66.67%)

Mutual labels: big-data, apache-spark, hadoop

spark-twitter-sentiment-analysis

Sentiment Analysis of a Twitter Topic with Spark Structured Streaming

Stars: ✭ 55 (+41.03%)

Mutual labels: apache-spark, pyspark, spark-sql

Mmlspark

Simple and Distributed Machine Learning

Stars: ✭ 2,899 (+7333.33%)

Mutual labels: big-data, apache-spark, pyspark

mmtf-workshop-2018

Structural Bioinformatics Training Workshop & Hackathon 2018

Stars: ✭ 50 (+28.21%)

Mutual labels: big-data, apache-spark, pyspark

Trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Stars: ✭ 4,581 (+11646.15%)

Mutual labels: big-data, hadoop, datalake

Orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

Stars: ✭ 389 (+897.44%)

Mutual labels: big-data, hadoop

Ignite

Apache Ignite

Stars: ✭ 4,027 (+10225.64%)

Mutual labels: big-data, hadoop

Kafka Connect Hdfs

Kafka Connect HDFS connector

Stars: ✭ 400 (+925.64%)

Mutual labels: big-data, hadoop

H2o 3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Stars: ✭ 5,656 (+14402.56%)

Mutual labels: big-data, hadoop

Hive

Apache Hive

Stars: ✭ 4,031 (+10235.9%)

Mutual labels: big-data, hadoop

Data Science Ipython Notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Stars: ✭ 22,048 (+56433.33%)

Mutual labels: big-data, hadoop

Hadoop For Geoevent

ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.

Stars: ✭ 5 (-87.18%)

Mutual labels: big-data, hadoop

seatunnel-example

Examples of seatunnel plugin development.

Stars: ✭ 27 (-30.77%)

Mutual labels: etl-framework, etl-pipeline

Pyspark Setup Demo

Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks

Stars: ✭ 24 (-38.46%)

Mutual labels: big-data, pyspark

Moosefs

MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)

Stars: ✭ 1,025 (+2528.21%)

Mutual labels: big-data, hadoop

Setl

A simple Spark-powered ETL framework that just works 🍺

Stars: ✭ 79 (+102.56%)

Mutual labels: big-data, etl

Bandar Log

Monitoring tool to measure flow throughput of data sources and processing components that are part of Data Ingestion and ETL pipelines.

Stars: ✭ 19 (-51.28%)

Mutual labels: big-data, etl

Docker Spark Cluster

A Spark cluster setup running on Docker containers

Stars: ✭ 57 (+46.15%)

Mutual labels: big-data, hadoop

Bitcoin Value Predictor

[NOT MAINTAINED] Predicting Bitcoin price using time series analysis and sentiment analysis of tweets on Bitcoin

Stars: ✭ 91 (+133.33%)

Mutual labels: big-data, pyspark

Drill

Apache Drill is a distributed MPP query layer for self-describing data

Stars: ✭ 1,619 (+4051.28%)

Mutual labels: big-data, hadoop

Asakusafw

Asakusa Framework

Stars: ✭ 114 (+192.31%)

Mutual labels: big-data, hadoop

BETL-old

BETL. Meta data driven ETL generation using T-SQL

Stars: ✭ 17 (-56.41%)

Mutual labels: etl, etl-framework

Scala Spark Tutorial

Project for James' Apache Spark with Scala course

Stars: ✭ 121 (+210.26%)

Mutual labels: big-data, apache-spark

Bigdata Notes

A Beginner's Guide to Big Data ⭐

Stars: ✭ 10,991 (+28082.05%)

Mutual labels: big-data, hadoop

Hdfs Shell

HDFS Shell is an HDFS manipulation tool to work with functions integrated in Hadoop DFS

Stars: ✭ 117 (+200%)

Mutual labels: big-data, hadoop

zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

Stars: ✭ 655 (+1579.49%)

Mutual labels: etl, datalake

Calcite Avatica

Mirror of Apache Calcite - Avatica

Stars: ✭ 130 (+233.33%)

Mutual labels: big-data, hadoop

Ozone

Scalable, redundant, and distributed object store for Apache Hadoop

Stars: ✭ 330 (+746.15%)

Mutual labels: big-data, hadoop

Spark Py Notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Stars: ✭ 1,338 (+3330.77%)

Mutual labels: big-data, pyspark

Gaffer

A large-scale entity and relation database supporting aggregation of properties

Stars: ✭ 1,642 (+4110.26%)

Mutual labels: big-data, hadoop

Spark On Lambda

Apache Spark on AWS Lambda

Stars: ✭ 137 (+251.28%)

Mutual labels: big-data, apache-spark

Data-pipeline-project

Data pipeline project

Stars: ✭ 18 (-53.85%)

Mutual labels: hadoop, data-pipeline

Calcite

Apache Calcite

Stars: ✭ 2,816 (+7120.51%)

Mutual labels: big-data, hadoop

pyspark-algorithms

PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2

Stars: ✭ 72 (+84.62%)

Mutual labels: big-data, pyspark

Presto

The official home of the Presto distributed SQL query engine for big data

Stars: ✭ 12,957 (+33123.08%)

Mutual labels: big-data, hadoop

1-60 of 967 similar projects
