Top 231 hadoop open source projects

Hadoop utility to compact small files

✭ 18

java hadoop hdfs hadoop-filesystem hadoop-smallfiles smallfiles hdfs-compaction

WASP is a framework for building complex real-time big data applications. It relies on a Kappa/Lambda-style architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze it through complex pipelines, this is the framework for you.

✭ 19

scala shell java Dockerfile XSLT javascript elasticsearch kafka akka spark yarn hadoop solr jdbc hbase spark-streaming hdfs parquet

presto

Teradata Distribution of Presto -- A Distributed SQL Query Engine for Big Data

✭ 91

cloud sql hadoop teradata

hadoop-ecosystem

Visualizations of the Hadoop Ecosystem

✭ 20

shell visualization hadoop

liquibase-impala

Liquibase extension to add Impala Database support

✭ 23

java hive hadoop impala database-migrations liquibase

hadoop-etl-udfs

The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL

✭ 17

java python hive hadoop parquet udf exasol hcatalog user-defined-function exasol-integration

memex-gate

General Architecture for Text Engineering

✭ 47

python shell java information-retrieval hadoop entities named-entities named-entity-recognition lexicon memex gate behemoth federal court text-engine

sparkucx

A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing the UCX communication layer

✭ 32

scala java shell big-data spark apache-spark hadoop hpc rdma

hadoopoffice

HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)

✭ 56

java scala shell spark hive hadoop excel bigdata office poi flink hadoop-ecosystem hadoopoffice analyze-office-documents

rastercube

rastercube is a Python library for big data analysis of georeferenced time series data (e.g. MODIS NDVI)

✭ 15

python shell ruby data big-data spark hadoop geospatial

learning-spark

Tidy up Spark and Hadoop tutorials.

✭ 28

java shell r data-science spark hadoop bigdata

oci-cloudera

Terraform module to deploy Cloudera on Oracle Cloud Infrastructure (OCI)

✭ 20

python shell HCL cloud spark hadoop terraform cloudera oracle oci cdp cdh dsw edh partner-led

jmx_exporter-cloudera-hadoop

Prometheus jmx_exporter configurations for Cloudera Hadoop

✭ 33

hadoop exporter prometheus prometheus-exporter cdh cdh5 jmx-exporter

skein

A tool and library for easily deploying applications on Apache YARN

✭ 128

python java HTML shell hadoop deployment cluster hdfs apache-yarn

xxhadoop

Data analysis using Hadoop, Spark, Storm, Elasticsearch, machine learning, etc. These are my daily notes, code, and demos. Don't fork, just star!

✭ 37

java scala shell elasticsearch kafka spark hive hadoop storm hbase zookeeper spark-streaming mr hadoop-rpc

disq

A library for manipulating bioinformatics sequencing formats in Apache Spark

✭ 29

java shell spark hadoop genomics ngs htsjdk

corc

An ORC File Scheme for the Cascading data processing platform.

✭ 14

java hadoop cascading orc-files

BigInsights-on-Apache-Hadoop

Example projects for 'BigInsights for Apache Hadoop' on IBM Bluemix

✭ 21

spark hive hadoop hbase spark-streaming ibm-bluemix oozie ambari zeppelin webhdfs knox biginsights bigsql

pyspark-ML-in-Colab

Pyspark in Google Colab: A simple machine learning (Linear Regression) model

✭ 32

Jupyter Notebook spark hadoop machine-learning-algorithms pyspark regression-models rdd colab-notebook

disk

A distributed network disk (cloud drive) system built on Hadoop, HBase, and Spring Boot

✭ 53

javascript CSS HTML java hadoop hbase springboot

big-data-exploration

[Archive] Intern project - Big Data Exploration using MongoDB - This Repository is NOT a supported MongoDB product

✭ 43

javascript CSS coffeescript python java HTML mongodb hadoop datasets

LogAnalyzeHelper

Data-cleaning program for a forum log analysis system (includes an IP rule library, UDF development, MapReduce programs, and log data)

✭ 33

java hadoop

datalake-etl-pipeline

Simplified ETL process in Hadoop using Apache Spark. Provides a complete ETL pipeline for a data lake, including SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations

✭ 39

python big-data spark apache-spark hadoop etl xml xml-parsing pyspark data-pipeline datalake hadoop-mapreduce spark-sql etl-framework hadoop-hdfs etl-pipeline etl-components

qs-hadoop

Learning the big data ecosystem

✭ 18

java scala elasticsearch spark hadoop storm bigdata spark-streaming mapreduce

dockerfiles

Multiple Docker container images for the main big data tools (Hadoop, Spark, Kafka, HBase, Cassandra, ZooKeeper, Zeppelin, Drill, Flink, Hive, Hue, Mesos, ...)

✭ 29

shell Dockerfile python Makefile Batchfile XSLT javascript dockerfile kafka spark cassandra hive hadoop docker-image bigdata hbase zookeeper mesos hue flink zeppelin drill

the-apache-ignite-book

All code samples, scripts and more in-depth examples for The Apache Ignite Book. Includes Apache Ignite 2.6 or above

✭ 65

java streaming memoization sql spark hive hadoop spring-data bigdata hibernate distributed-database ignite nosql-database in-memory-database streaming-data gridgain hibernate-ogm in-memory-computations in-memory-caching

hive-bigquery-storage-handler

Hive Storage Handler for interoperability between BigQuery and Apache Hive

✭ 16

java Dockerfile bigquery google hive hadoop gcp apache

iis

Information Inference Service of the OpenAIRE system

✭ 16

java HTML PigLatin python scala shell HiveQL text-mining data-mining big-data spark hadoop iis openaire information-inference data-processing-system

hive_to_es

A small tool for syncing Hive data warehouse data to Elasticsearch

✭ 21

python hive hadoop impala hdfs python-hive

HDFS-Netdisc

A Hadoop-based distributed cloud storage system 🌴

✭ 56

java hadoop hdfs netdisk hadoop-filesystem hdfs-netdisc

smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

✭ 79

scala java shell spark hive hadoop transform-data data-lake data-pipelines deltalake smart-data-lake

learning-hadoop-and-spark

Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning

✭ 146

HTML java python TeX r scala emr spark apache-spark hadoop mapreduce wordcount dataproc learning-hadoop

bigdata-doc

Big data study notes, learning roadmaps, and a collection of technical case studies.

✭ 37

shell python kafka hive hadoop bigdata hdfs mapreduce flink

openPDC

Open Source Phasor Data Concentrator

webhdfs

Node.js WebHDFS REST API client

✭ 88

javascript hadoop webhdfs node-webhdfs

dpkb

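A collection of big data material, covering distributed storage engines, distributed compute engines, data warehouse construction, and more. Keywords: Hadoop, HBase, ES, Kudu, Hive, Presto, Spark, Flink, Kylin, ClickHouse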

✭ 123

spark presto hive hadoop hbase flink

TonY

TonY is a framework to natively run deep learning frameworks on Apache Hadoop.

✭ 687

java python shell CSS HTML machine-learning deep-learning hadoop tensorflow horovod hadoop-yarn

yarn-prometheus-exporter

Export Hadoop YARN (resource-manager) metrics in Prometheus format

✭ 44

go Makefile Dockerfile yarn hadoop metrics exporter apache prometheus resource-manager yarn-hadoop-cluster apache-hadoop

gomrjob

gomrjob - a Go Framework for Hadoop Map Reduce Jobs

✭ 39

go hadoop mapreduce mrjob dataproc

teraslice

Scalable data processing pipelines in JavaScript

✭ 48

typescript javascript PEG.js shell Handlebars python elasticsearch json kafka hadoop hdfs

JavaFramework

A simple Java framework designed for easily developing Spring-based Java programs. Supports big data and metadata management, and includes a common Elasticsearch query tool, among other things.

✭ 16

java SCSS Less javascript HTML CSS spring hadoop microservice javaframework metadata-management

beanszoo

Distributed Java micro-services using ZooKeeper

✭ 12

java distributed-systems yarn hadoop rpc dis

orion

Management and automation platform for Stateful Distributed Systems

✭ 77

java javascript kafka hadoop hbase cluster-management

hadoop-ansible

Install a Hadoop cluster with Ansible

✭ 35

shell ansible hadoop

RecommendationEngine

Source code and dataset for paper "CBMR: An optimized MapReduce for item‐based collaborative filtering recommendation algorithm with empirical analysis"

✭ 43

java python shell hadoop collaborative-filtering recommendation-engine

ambari-hdp-docker

Dockerfiles and Docker Compose for HDP 2.6 with Blueprints

✭ 23

shell docker dockerfile hadoop blueprint ambari hdp ambari-blueprints

kafka-connect-fs

Kafka Connect FileSystem Connector

✭ 107