Amazon S3 Find and Forget is a solution for handling data erasure requests against data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR).

Stars: ✭ 115 (-92.9%)

Mutual labels: big-data, parquet

Bigdata docker

Big Data Ecosystem Docker

Stars: ✭ 161 (-90.06%)

Mutual labels: hive, hadoop

Linkis

Linkis helps you easily connect to various back-end computation/storage engines (Spark, Python, TiDB...) and exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.

Stars: ✭ 2,323 (+43.48%)

Mutual labels: hive, jdbc

Wedatasphere

WeDataSphere is a financial-grade, one-stop open-source suite for big data platforms. The source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!

Stars: ✭ 372 (-77.02%)

Mutual labels: hive, hadoop

Parquetviewer

Simple Windows desktop application for viewing & querying Apache Parquet files

Stars: ✭ 145 (-91.04%)

Mutual labels: big-data, parquet

Bigdl

Building Large-Scale AI Applications for Distributed Big Data

Stars: ✭ 3,813 (+135.52%)

Mutual labels: big-data, hadoop

Orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

Stars: ✭ 389 (-75.97%)

Mutual labels: big-data, hadoop

God Of Bigdata

Focused on big data learning and interview preparation; start your journey to big data mastery. Flink/Spark/Hadoop/Hbase/Hive...

Stars: ✭ 6,008 (+271.09%)

Mutual labels: hive, hadoop

Gimel

Big Data Processing Framework - Unified Data API or SQL on Any Storage

Stars: ✭ 216 (-86.66%)

Mutual labels: big-data, jdbc

Awkward 0.x

Manipulate arrays of complex data structures as easily as Numpy.

Stars: ✭ 216 (-86.66%)

Mutual labels: big-data, parquet

smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

Stars: ✭ 79 (-95.12%)

Mutual labels: hive, hadoop

Sparkrdma

RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark

Stars: ✭ 215 (-86.72%)

Mutual labels: big-data, hadoop

Iceberg

Iceberg is a table format for large, slow-moving tabular data

Stars: ✭ 393 (-75.73%)

Mutual labels: hadoop, parquet

Data Science Ipython Notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Stars: ✭ 22,048 (+1261.83%)

Mutual labels: big-data, hadoop

H2o 3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Stars: ✭ 5,656 (+249.35%)

Mutual labels: big-data, hadoop

hive to es

A small tool for syncing Hive data warehouse data to Elasticsearch

Stars: ✭ 21 (-98.7%)

Mutual labels: hive, hadoop

hive-bigquery-storage-handler

Hive Storage Handler for interoperability between BigQuery and Apache Hive

Stars: ✭ 16 (-99.01%)

Mutual labels: hive, hadoop

Helicalinsight

Helical Insight is the world’s first Open Source Business Intelligence framework, which helps you make sense of your data and make well-informed decisions.

Stars: ✭ 214 (-86.78%)

Mutual labels: big-data, hive

terraform-aws-kinesis-firehose

This code creates a Kinesis Firehose in AWS to send CloudWatch log data to S3.

Stars: ✭ 25 (-98.46%)

Mutual labels: big-data, parquet

datalake-etl-pipeline

Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations

Stars: ✭ 39 (-97.59%)

Mutual labels: big-data, hadoop

xxhadoop

Data analysis using Hadoop/Spark/Storm/ElasticSearch/MachineLearning etc. These are my daily notes/code/demos. Don't fork, just star!

Stars: ✭ 37 (-97.71%)

Mutual labels: hive, hadoop

sparkucx

A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer

Stars: ✭ 32 (-98.02%)

Mutual labels: big-data, hadoop

Hadoop For Geoevent

ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.

Stars: ✭ 5 (-99.69%)

Mutual labels: big-data, hadoop

Bigdataguide

Big data learning: learn big data from scratch, including study videos and interview materials for each stage of learning

Stars: ✭ 817 (-49.54%)

Mutual labels: hive, hadoop

Szt Bigdata

Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟

Stars: ✭ 826 (-48.98%)

Mutual labels: hive, hadoop

Movies-Analytics-in-Spark-and-Scala

Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.

Stars: ✭ 47 (-97.1%)

Mutual labels: big-data, hadoop

cobra-policytool

Manage Apache Atlas and Ranger configuration for your Hadoop environment.

Stars: ✭ 16 (-99.01%)

Mutual labels: hive, hadoop

aaocp

A big data project that analyzes user behavior logs

Stars: ✭ 53 (-96.73%)

Mutual labels: hive, hadoop

GooglePlay-Web-Crawler

MapReduce project using Hadoop, Nutch, AWS EMR, Pig, Tez, Hive

Stars: ✭ 18 (-98.89%)

Mutual labels: hive, hadoop

TitanDataOperationSystem

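The best big data project. Titan Data Operation System is a full-stack, closed-loop system: a web system for data visualization, log ingestion via flume-kafka-flume, a data warehouse designed in Hive, Spark code for transforming between warehouse tables and migrating ADS-layer tables to MySQL, and Azkaban for scheduling periodic jobs. Technologies used: Java/Scala, Hadoop, Spark, Hive, Kafka, Flume, Azkaban, SpringBoot, Bootstrap, Echart, etc.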

Stars: ✭ 62 (-96.17%)

Mutual labels: hive, hadoop

big data

A collection of tutorials on Hadoop, MapReduce, Spark, Docker

Stars: ✭ 34 (-97.9%)

Mutual labels: big-data, hadoop

HiveJdbcStorageHandler

No description or website provided.

Stars: ✭ 21 (-98.7%)

Mutual labels: hive, jdbc

spark-acid

ACID Data Source for Apache Spark based on Hive ACID

Stars: ✭ 91 (-94.38%)

Mutual labels: big-data, hive

Addax

Addax is an open-source universal ETL tool that supports most RDBMS and NoSQL databases, helping you transfer data from any one place to another.

Stars: ✭ 615 (-62.01%)

Mutual labels: hive, hadoop

incubator-linkis

Stars: ✭ 2,459 (+51.88%)

Mutual labels: hive, jdbc

hadoop-data-ingestion-tool

OLAP and ETL of Big Data

Stars: ✭ 17 (-98.95%)

Mutual labels: big-data, hadoop

Parquet Dotnet

🏐 Apache Parquet for modern .NET

Stars: ✭ 276 (-82.95%)

Mutual labels: big-data, parquet

bigdata-fun

A complete (distributed) BigData stack, running in containers

Stars: ✭ 14 (-99.14%)

Mutual labels: big-data, hadoop

Moosefs

MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)

Stars: ✭ 1,025 (-36.69%)

Mutual labels: big-data, hadoop

implyr

SQL backend to dplyr for Impala

Stars: ✭ 74 (-95.43%)

Mutual labels: hadoop, jdbc

Tez

Apache Tez

Stars: ✭ 313 (-80.67%)

Mutual labels: big-data, hadoop

Parquet Format

Apache Parquet

Stars: ✭ 800 (-50.59%)

Mutual labels: big-data, parquet

Spark

Apache Spark - A unified analytics engine for large-scale data processing

Stars: ✭ 31,618 (+1852.93%)

Mutual labels: big-data, jdbc

Docker Spark Cluster

A Spark cluster setup running on Docker containers

Stars: ✭ 57 (-96.48%)

Mutual labels: big-data, hadoop

Lychee

The most complete and powerful data-binding library and persistence infra for Kotlin 1.3, Android & Splitties Views DSL, JavaFX & TornadoFX, JSON, JDBC & SQLite, SharedPreferences.

Stars: ✭ 102 (-93.7%)

Mutual labels: jdbc

Dataengineeringproject

Example end to end data engineering project.

Stars: ✭ 82 (-94.94%)

Mutual labels: big-data

Bigdata File Viewer

A cross-platform (Windows, Mac, Linux) desktop application for viewing common big data binary formats such as Parquet, ORC, Avro, etc. Supports the local file system, HDFS, AWS S3, Azure Blob Storage, etc.

Stars: ✭ 86 (-94.69%)

Mutual labels: parquet

Genie

Distributed Big Data Orchestration Service

Stars: ✭ 1,544 (-4.63%)

Mutual labels: big-data

Pyhive

Python interface to Hive and Presto. 🐝

Stars: ✭ 1,378 (-14.89%)

Mutual labels: hive

Hops Examples

Examples for Deep Learning/Feature Store/Spark/Flink/Hive/Kafka jobs and Jupyter notebooks on Hops