Top 231 hadoop open source projects

550+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Kafka, Docker, APIs, Hadoop, SQL, PostgreSQL, MySQL, Hive, Impala, Travis CI, Jenkins, Concourse, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LDAP, Code/Build Linting, pkg mgmt for Linux, Mac, Python, Perl, Ruby, NodeJS, Golang, Advanced dotfiles: .bashrc, .vimrc, .gitconfig, .screenrc, .tmux.conf, .psqlrc ...

✭ 226

python java ruby shell golang perl bash docker api aws mysql postgresql devops kafka spotify jenkins hadoop gcp

Hadoop Attack Library

A collection of pentest tools and resources targeting Hadoop environments

✭ 228

python pentest hadoop bigdata

Luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.

✭ 15,226

python javascript HTML hadoop scheduling orchestration-framework luigi

Hadoop Connectors

Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.

✭ 218

java hadoop bigquery

Sparkrdma

RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark

✭ 215

java scala spark big-data hadoop bigdata apache-spark

Calcite

Apache Calcite

✭ 2,816

java kotlin HTML SCSS FreeMarker shell sql big-data geospatial hadoop calcite

Facebook Hive Udfs

Facebook's Hive UDFs

✭ 213

java hadoop hive

Shifu

An end-to-end machine learning and data mining framework on Hadoop

✭ 207

java machine-learning neural-network pipeline hadoop bigdata random-forest

Javaorbigdata Interview

A compilation of interview knowledge points for Java developers and big data developers

✭ 203

java spark interview hadoop bigdata storm

Recommendsys

Recommendation project (real-time and offline recommendation)

✭ 198

java kafka hadoop storm

Awesome Learning

Practice source code repository: https://github.com/jast90/bigdata. Search for Jast on WeChat and follow the official account for the latest technical posts 😯.

✭ 197

java awesome book hadoop bigdata

Nutch

Apache Nutch is an extensible and scalable web crawler

✭ 2,277

java HTML shell XSLT Rich Text Format Dockerfile hadoop apache crawling web-crawler nutch

Hive Jdbc Uber Jar

Hive JDBC "uber" or "standalone" jar based on the latest Apache Hive version

✭ 188

java driver hadoop apache jdbc hive

Bigdata Playground

A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

✭ 177

python typescript scala nodejs machine-learning docker angular graphql mongodb kafka big-data hadoop apache-spark twitter-api hbase avro parquet spark-streaming

Deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learni…

✭ 12,277

java C++ Cuda kotlin javascript c artificial-intelligence gpu spark deeplearning intellij hadoop linear-algebra deeplearning4j neural-nets dl4j matrix-library

Big Whale

Scheduling of offline tasks such as Spark and Flink jobs, and monitoring of real-time tasks

✭ 163

java spark hadoop flink

Bigdata docker

Big Data Ecosystem Docker

✭ 161

vba jupyter-notebook mysql spark hadoop zookeeper hive mongo hbase hdfs hue presto

Presto

The official home of the Presto distributed SQL query engine for big data

✭ 12,957

java javascript shell ANTLR HTML CSS sql big-data hadoop hive presto

Hadoop Common

Mirror of Apache Hadoop common

✭ 155

java hadoop

Movie recommend

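A Spark-based movie recommendation system, comprising a crawler project, a web site, an admin back-end system, and the Spark recommendation system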

✭ 2,092

java scala mysql nginx hadoop scrapy hive spark-streaming ssm-maven spark-mllib

Hadoop Hdfs

Mirror of Apache Hadoop HDFS

✭ 152

java hadoop

Spark With Python

Fundamentals of Spark with Python (using PySpark), code examples

✭ 150

python jupyter-notebook machine-learning database sql spark analytics big-data hadoop apache parallel-computing distributed-computing apache-spark dataframe pyspark hdfs

Hadoop

Apache Hadoop

✭ 12,177

java C++ c javascript shell HTML hadoop

Parquet Rs

Apache Parquet implementation in Rust

✭ 144

rust hadoop parquet

Eel Sdk

Big Data Toolkit for the JVM

✭ 140

scala kafka big-data hadoop etl hive parquet

Xlearning

AI on Hadoop

✭ 1,709

java shell tensorflow ai deeplearning caffe yarn hadoop mxnet machinelearning

Hbaseclient

HBase client data management software

✭ 135

java hadoop hbase

Aliyun Emapreduce Datasources

Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.

✭ 132

scala kafka spark hadoop aliyun

Calcite Avatica

Mirror of Apache Calcite - Avatica

✭ 130

java sql big-data geospatial hadoop

Gaffer

A large-scale entity and relation database supporting aggregation of properties

✭ 1,642

java javascript graph spark big-data hadoop graph-database hbase parquet accumulo aggregation

Airflow Pipeline

An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR

✭ 128

python docker spark hadoop airflow

Spydra

Ephemeral Hadoop clusters using Google Compute Platform

✭ 128

java hadoop google-cloud

Griffon Vm

Griffon Data Science Virtual Machine

✭ 128

python ruby scala r jupyter-notebook database mysql data-science elasticsearch big-data node-js virtual-machine hadoop apache-spark

Hadoopcryptoledger

Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive

✭ 126

java blockchain ethereum bitcoin spark hadoop bigdata flink hive

Parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.

✭ 125

scala aws hadoop akka reader streams parquet writer akka-streams

Dynamometer

A tool for scale and performance testing of HDFS with a specific focus on the NameNode.

✭ 122

java testing testing-tools hadoop performance-analysis scale performance-testing performance-metrics hdfs

Hdfs Shell

HDFS Shell is an HDFS manipulation tool to work with functions integrated in Hadoop DFS

✭ 117

java shell linux cli big-data hadoop hdfs

Ibis

A pandas-like deferred expression system, with first-class SQL support

✭ 1,630

python C++ pandas hadoop hdfs spark impala ibis

Datax

DataX is an open source universal ETL tool that supports Cassandra, ClickHouse, DBF, Hive, InfluxDB, Kudu, MySQL, Oracle, Presto(Trino), PostgreSQL, SQL Server

✭ 116

java database mysql hadoop oracle influxdb etl sqlserver hive clickhouse

Asakusafw

Asakusa Framework

✭ 114

java framework big-data hadoop batch mapreduce batch-processing data-flow

Tensorflowonyarn

Support TensorFlow on YARN

✭ 114

java deep-learning tensorflow yarn hadoop

Parquet Go

Go package to read and write parquet files. parquet is a file format to store nested data structures in a flat columnar data format. It can be used in the Hadoop ecosystem and with tools such as Presto and AWS Athena.

✭ 114

go golang hadoop parquet presto

Xlearning Xdml

Extremely distributed machine learning

✭ 113

scala machine-learning ai spark distributed hadoop

Avro Hadoop Starter

Example MapReduce jobs in Java, Hive, Pig, and Hadoop Streaming that work on Avro data.

✭ 110

java hadoop hive avro mapreduce

Introtohadoopandmr udacity course

🐘 Source code for assignments of Udacity course "Introduction to Hadoop and MapReduce"

✭ 110

python java hadoop mooc

Waterdrop

Production Ready Data Integration Product, documentation:

✭ 1,856

java scala shell spark hadoop flink spark-streaming etl-framework sql-engine etl-pipeline

Haproxy Configs

80+ HAProxy Configs for Hadoop, Big Data, NoSQL, Docker, Elasticsearch, SolrCloud, HBase, MySQL, PostgreSQL, Apache Drill, Hive, Presto, Impala, Hue, ZooKeeper, SSH, RabbitMQ, Redis, Riak, Cloudera, OpenTSDB, InfluxDB, Prometheus, Kibana, Graphite, Rancher etc.

✭ 106