
967 Open source projects that are alternatives of or similar to datalake-etl-pipeline

DaFlow
Apache Spark-based data flow (ETL) framework that supports multiple read sources and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-38.46%)
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+184.62%)
Mutual labels:  big-data, apache-spark, hadoop, pyspark
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+284.62%)
Mutual labels:  big-data, apache-spark, hadoop, pyspark
SparkProgrammingInScala
Apache Spark Course Material
Stars: ✭ 57 (+46.15%)
Mutual labels:  big-data, apache-spark, datalake, spark-sql
vixtract
www.vixtract.ru
Stars: ✭ 40 (+2.56%)
Hydrograph
A visual ETL development and debugging tool for big data
Stars: ✭ 144 (+269.23%)
Mutual labels:  big-data, apache-spark, etl, etl-framework
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (-12.82%)
Mutual labels:  big-data, hadoop, pyspark, spark-sql
basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (-35.9%)
Mutual labels:  hadoop, etl, pyspark
etlflow
EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for writing various tasks and jobs on GCP and AWS.
Stars: ✭ 38 (-2.56%)
Mutual labels:  etl, etl-framework, etl-pipeline
Metorikku
A simplified, lightweight ETL Framework based on Apache Spark
Stars: ✭ 361 (+825.64%)
Mutual labels:  big-data, etl, etl-framework
sparkucx
A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer
Stars: ✭ 32 (-17.95%)
Mutual labels:  big-data, apache-spark, hadoop
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (+194.87%)
Mutual labels:  big-data, apache-spark, pyspark
Griffon Vm
Griffon Data Science Virtual Machine
Stars: ✭ 128 (+228.21%)
Mutual labels:  big-data, apache-spark, hadoop
Eel Sdk
Big Data Toolkit for the JVM
Stars: ✭ 140 (+258.97%)
Mutual labels:  big-data, hadoop, etl
DIRECT
DIRECT, the Data Integration Run-time Execution Control Tool, is a data logistics framework that can be used to monitor, log, audit and control data integration / ETL processes.
Stars: ✭ 20 (-48.72%)
Mutual labels:  etl, etl-framework, etl-pipeline
hamilton
A scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.
Stars: ✭ 612 (+1469.23%)
Mutual labels:  etl, etl-framework, etl-pipeline
redis-connect-dist
Real-Time Event Streaming & Change Data Capture
Stars: ✭ 21 (-46.15%)
Mutual labels:  etl, etl-framework, etl-pipeline
Sparkrdma
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (+451.28%)
Mutual labels:  big-data, apache-spark, hadoop
Waterdrop
Production Ready Data Integration Product, documentation:
Stars: ✭ 1,856 (+4658.97%)
Mutual labels:  hadoop, etl-framework, etl-pipeline
SynapseML
Simple and Distributed Machine Learning
Stars: ✭ 3,355 (+8502.56%)
Mutual labels:  big-data, apache-spark, pyspark
Bigdata Playground
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (+353.85%)
Mutual labels:  big-data, apache-spark, hadoop
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (+223.08%)
Mutual labels:  etl, pyspark, etl-framework
Movies-Analytics-in-Spark-and-Scala
Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.
Stars: ✭ 47 (+20.51%)
Mutual labels:  big-data, hadoop, spark-sql
csvplus
csvplus extends the standard Go encoding/csv package with fluent interface, lazy stream operations, indices and joins.
Stars: ✭ 67 (+71.79%)
Mutual labels:  etl, etl-framework, etl-pipeline
AirflowETL
Blog post on ETL pipelines with Airflow
Stars: ✭ 20 (-48.72%)
Mutual labels:  etl, data-pipeline, etl-pipeline
leaflet heatmap
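A simple visualization of Huzhou call data. Assuming the data volume is too large to draw a heatmap directly in the browser, the heatmap rendering step is moved to offline computation and analysis. The data is first processed in parallel with Apache Spark, the heatmap is then rendered with Apache Spark, and leafletjs loads the OpenStreetMap layer and the heatmap layer to give good interactivity. The rendering is currently implemented in Apache Spark; either because Apache Spark is not well suited to this kind of computation or because I did not design the algorithm well, the parallel version is slower than single-machine computation. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .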
Stars: ✭ 13 (-66.67%)
Mutual labels:  big-data, apache-spark, hadoop
spark-twitter-sentiment-analysis
Sentiment Analysis of a Twitter Topic with Spark Structured Streaming
Stars: ✭ 55 (+41.03%)
Mutual labels:  apache-spark, pyspark, spark-sql
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+7333.33%)
Mutual labels:  big-data, apache-spark, pyspark
mmtf-workshop-2018
Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (+28.21%)
Mutual labels:  big-data, apache-spark, pyspark
Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stars: ✭ 4,581 (+11646.15%)
Mutual labels:  big-data, hadoop, datalake
Orc
Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
Stars: ✭ 389 (+897.44%)
Mutual labels:  big-data, hadoop
Ignite
Apache Ignite
Stars: ✭ 4,027 (+10225.64%)
Mutual labels:  big-data, hadoop
Kafka Connect Hdfs
Kafka Connect HDFS connector
Stars: ✭ 400 (+925.64%)
Mutual labels:  big-data, hadoop
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+14402.56%)
Mutual labels:  big-data, hadoop
Hive
Apache Hive
Stars: ✭ 4,031 (+10235.9%)
Mutual labels:  big-data, hadoop
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+56433.33%)
Mutual labels:  big-data, hadoop
Hadoop For Geoevent
ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.
Stars: ✭ 5 (-87.18%)
Mutual labels:  big-data, hadoop
seatunnel-example
seatunnel plugin developing examples.
Stars: ✭ 27 (-30.77%)
Mutual labels:  etl-framework, etl-pipeline
Pyspark Setup Demo
Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks
Stars: ✭ 24 (-38.46%)
Mutual labels:  big-data, pyspark
Moosefs
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
Stars: ✭ 1,025 (+2528.21%)
Mutual labels:  big-data, hadoop
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+102.56%)
Mutual labels:  big-data, etl
Bandar Log
Monitoring tool to measure flow throughput of data sources and processing components that are part of Data Ingestion and ETL pipelines.
Stars: ✭ 19 (-51.28%)
Mutual labels:  big-data, etl
Docker Spark Cluster
A Spark cluster setup running on Docker containers
Stars: ✭ 57 (+46.15%)
Mutual labels:  big-data, hadoop
Bitcoin Value Predictor
[NOT MAINTAINED] Predicting Bitcoin price using time series analysis and sentiment analysis of tweets on Bitcoin
Stars: ✭ 91 (+133.33%)
Mutual labels:  big-data, pyspark
Drill
Apache Drill is a distributed MPP query layer for self describing data
Stars: ✭ 1,619 (+4051.28%)
Mutual labels:  big-data, hadoop
Asakusafw
Asakusa Framework
Stars: ✭ 114 (+192.31%)
Mutual labels:  big-data, hadoop
BETL-old
BETL. Meta data driven ETL generation using T-SQL
Stars: ✭ 17 (-56.41%)
Mutual labels:  etl, etl-framework
Scala Spark Tutorial
Project for James' Apache Spark with Scala course
Stars: ✭ 121 (+210.26%)
Mutual labels:  big-data, apache-spark
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+28082.05%)
Mutual labels:  big-data, hadoop
Hdfs Shell
HDFS Shell is an HDFS manipulation tool to work with functions integrated in Hadoop DFS
Stars: ✭ 117 (+200%)
Mutual labels:  big-data, hadoop
zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+1579.49%)
Mutual labels:  etl, datalake
Calcite Avatica
Mirror of Apache Calcite - Avatica
Stars: ✭ 130 (+233.33%)
Mutual labels:  big-data, hadoop
Ozone
Scalable, redundant, and distributed object store for Apache Hadoop
Stars: ✭ 330 (+746.15%)
Mutual labels:  big-data, hadoop
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+3330.77%)
Mutual labels:  big-data, pyspark
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+4110.26%)
Mutual labels:  big-data, hadoop
Spark On Lambda
Apache Spark on AWS Lambda
Stars: ✭ 137 (+251.28%)
Mutual labels:  big-data, apache-spark
Data-pipeline-project
Data pipeline project
Stars: ✭ 18 (-53.85%)
Mutual labels:  hadoop, data-pipeline
Calcite
Apache Calcite
Stars: ✭ 2,816 (+7120.51%)
Mutual labels:  big-data, hadoop
pyspark-algorithms
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (+84.62%)
Mutual labels:  big-data, pyspark
Presto
The official home of the Presto distributed SQL query engine for big data
Stars: ✭ 12,957 (+33123.08%)
Mutual labels:  big-data, hadoop