Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, along with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.

Stars: ✭ 39 (-97.9%)

Mutual labels: hadoop, etl-framework, etl-pipeline

DaFlow

Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.

Stars: ✭ 24 (-98.71%)

Mutual labels: hadoop, etl-framework, etl-pipeline

Dataspherestudio

DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

Stars: ✭ 1,195 (-35.61%)

Mutual labels: spark, hadoop, flink

Metorikku

A simplified, lightweight ETL Framework based on Apache Spark

Stars: ✭ 361 (-80.55%)

Mutual labels: spark, etl-framework

Wedatasphere

WeDataSphere is a financial-grade, one-stop open-source suite for big data platforms. The source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!

Stars: ✭ 372 (-79.96%)

Mutual labels: spark, hadoop

Iceberg

Iceberg is a table format for large, slow-moving tabular data

Stars: ✭ 393 (-78.83%)

Mutual labels: spark, hadoop

Learningspark

Scala examples for learning to use Spark

Stars: ✭ 421 (-77.32%)

Mutual labels: spark, spark-streaming

Big data architect skills

Skills a big data architect should master

Stars: ✭ 400 (-78.45%)

Mutual labels: spark, hadoop

Featran

A Scala feature transformation library for data science and machine learning

Stars: ✭ 420 (-77.37%)

Mutual labels: spark, flink

Flink Learning

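Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, principles, hands-on practice, performance tuning, and source code analysis. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, Table API & SQL, and more, plus case studies of large-scale production Flink projects (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). You are welcome to support my column "大数据实时计算引擎 Flink 实战与性能优化" (Big Data Real-Time Compute Engine: Flink in Practice and Performance Optimization).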

Stars: ✭ 11,378 (+513.04%)

Mutual labels: spark, flink

Bigdata Notes

Getting-started guide to big data ⭐

Stars: ✭ 10,991 (+492.19%)

Mutual labels: spark, hadoop

Cdap

An open source framework for building data analytic applications.

Stars: ✭ 509 (-72.58%)

Mutual labels: spark, spark-streaming

Hops Examples

Examples for Deep Learning/Feature Store/Spark/Flink/Hive/Kafka jobs and Jupyter notebooks on Hops

Stars: ✭ 84 (-95.47%)

Mutual labels: spark, flink

Hadoop cookbook

Cookbook to install Hadoop 2.0+ using Chef

Stars: ✭ 82 (-95.58%)

Mutual labels: spark, hadoop

H2o 3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Stars: ✭ 5,656 (+204.74%)

Mutual labels: spark, hadoop

Streaming Readings

Papers and reading materials related to streaming systems

Stars: ✭ 554 (-70.15%)

Mutual labels: flink, spark-streaming

Zeppelin

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

Stars: ✭ 5,513 (+197.04%)

Mutual labels: spark, flink

Useractionanalyzeplatform

Big data platform for e-commerce user behavior analysis

Stars: ✭ 645 (-65.25%)

Mutual labels: spark, hadoop

Sylph

Stream computing platform for bigdata

Stars: ✭ 362 (-80.5%)

Mutual labels: flink, spark-streaming

Ytk Learn

Ytk-learn is a distributed machine learning library that implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).

Stars: ✭ 337 (-81.84%)

Mutual labels: spark, hadoop

Bigdl

Building Large-Scale AI Applications for Distributed Big Data

Stars: ✭ 3,813 (+105.44%)

Mutual labels: spark, hadoop

Coolplayspark

Coolplay Spark: Spark source code analysis, Spark libraries, and more

Stars: ✭ 3,318 (+78.77%)

Mutual labels: spark, spark-streaming

Marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

Stars: ✭ 414 (-77.69%)

Mutual labels: spark, hadoop

Devops Python Tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

Stars: ✭ 406 (-78.12%)

Mutual labels: spark, hadoop

Spline

Data Lineage Tracking And Visualization Solution

Stars: ✭ 306 (-83.51%)

Mutual labels: spark, hadoop

Pdf

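Programming e-books and books, covering C, C#, Docker, Elasticsearch, Git, Hadoop, HeadFirst, Java, Javascript, jvm, Kafka, Linux, Maven, MongoDB, MyBatis, MySQL, Netty, Nginx, Python, RabbitMQ, Redis, Scala, Solr, Spark, Spring, SpringBoot, SpringCloud, TCPIP, Tomcat, Zookeeper, artificial intelligence, big data, concurrent programming, databases, data mining, new interview questions, architecture design, algorithms, computer science, design patterns, software testing, refactoring and optimization, and more categories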

Stars: ✭ 12,009 (+547.04%)

Mutual labels: spark, hadoop

Bdp Dataplatform

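Data platform for big data ecosystem solutions: a big data solution built on big data, data platforms, microservices, machine learning, an online store, automated operations, DevOps, a container deployment platform, and data platform collection, storage, computation, development, and application building.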

Stars: ✭ 456 (-75.43%)

Mutual labels: spark, flink

Sparta

Real Time Analytics and Data Pipelines based on Spark Streaming

Stars: ✭ 513 (-72.36%)

Mutual labels: spark, spark-streaming

Data Science Ipython Notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Stars: ✭ 22,048 (+1087.93%)

Mutual labels: spark, hadoop

Spark States

Custom state store providers for Apache Spark

Stars: ✭ 83 (-95.53%)

Mutual labels: spark, spark-streaming

Alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud

Stars: ✭ 5,379 (+189.82%)

Mutual labels: spark, hadoop

Elasticluster

Create clusters of VMs on the cloud and configure them with Ansible.

Stars: ✭ 298 (-83.94%)

Mutual labels: spark, hadoop

Kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

Stars: ✭ 916 (-50.65%)

Mutual labels: spark, hadoop

Mobius

C# and F# language binding and extensions to Apache Spark

Stars: ✭ 929 (-49.95%)

Mutual labels: spark, spark-streaming

Dockerfiles

50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu

Stars: ✭ 847 (-54.36%)

Mutual labels: spark, hadoop

Interview Questions Collection

Interview questions organized by knowledge area, including C++, Java, Hadoop, machine learning, and more

Stars: ✭ 21 (-98.87%)

Mutual labels: spark, hadoop

Weblogsanalysissystem

A big data platform for analyzing web access logs

Stars: ✭ 37 (-98.01%)

Mutual labels: spark, hadoop

Real Time Stream Processing Engine

This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.