Top 164 bigdata open source projects

Reddit sse stream

A Server-Sent Events (SSE) stream that delivers Reddit comments and submissions to a client in near real time.

✭ 39

python flask stream reddit bigdata sse

Optimus

🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

✭ 986

jupyter-notebook machine-learning data-science spark data-analysis bigdata pyspark data-cleaning data-wrangling

Autocrawler

Google, Naver multiprocess image web crawler (Selenium)

✭ 957

python deep-learning google crawler selenium bigdata customizable thread chromedriver

Aws Auto Terminate Idle Emr

A framework that uses AWS CloudWatch and AWS Lambda, with a Python script built on Boto3, to terminate AWS EMR clusters that have been idle for a specified period of time.

✭ 21

python aws automation serverless aws-lambda bigdata etl cloudformation amazon-web-services aws-cloudformation

Panther

Detect threats with log data and improve cloud security posture

✭ 885

python go typescript react security aws graphql serverless bigdata etl security-automation compliance

Spark Streaming Monitoring With Lightning

Plot live stats from an Apache Spark application as graphs using Lightning-viz

✭ 15

scala realtime bigdata apache-spark monitoring-tool spark-streaming

Bigdata Interview

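🎯 🌟 [Big data interview questions] A collection of big data interview questions gathered from around the web, together with the author's own answer summaries. Currently covers interview knowledge for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.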

✭ 857

kafka spark interview interview-questions yarn hadoop bigdata flink hbase hdfs mapreduce

Mobius

C# and F# language binding and extensions to Apache Spark

✭ 929

csharp fsharp dataset spark streaming bigdata apache-spark dataframe spark-streaming mapreduce

10 Weeks

10 weeks of technology exploration

✭ 22

bigdata mobile-development cloud-computing cms-framework javascript-framework webservice blockchain-technology

Hadoop For Geoevent

ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.

✭ 5

java server big-data hadoop bigdata hdfs transport arcgis connector

Bigdataguide

Big data learning: learn big data from scratch, with study videos and interview materials for each stage of learning.

✭ 817

java scala kafka spark hadoop zookeeper bigdata flink hive hbase

Kube Batch

A batch scheduler for Kubernetes, aimed at high-performance workloads, e.g. AI/ML, BigData, HPC

✭ 804

go machine-learning kubernetes bigdata hpc

Coding Now

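Study notes, plus eBooks read, video resources, and blogs, websites, and tools the author has collected and considers worthwhile. Covers the major big data components, Python machine learning and data analysis, Linux, operating systems, algorithms, networking, and more.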

✭ 750

python java linux spark notes bigdata coding

Gearpump

Lightweight real-time big data streaming engine over Akka

✭ 745

scala bigdata akka stream-processing

Spark Movie Lens

An online movie recommender using Spark, Python Flask, and the MovieLens dataset

✭ 745

python jupyter-notebook flask spark big-data bigdata

Vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualize and explore big tabular data at a billion rows per second 🚀

✭ 6,793

python javascript C++ HTML PHP shell machine-learning visualization bigdata machinelearning dataframe tabular-data hdf5 memory-mapped-file

Running Elasticsearch Fun Profit

A book about running Elasticsearch

✭ 664

documentation elasticsearch ebook bigdata sysadmin

Bigartm

Fast topic modeling platform

✭ 563

python machine-learning bigdata text-mining topic-modeling python-api

Cds

Data syncing in golang for ClickHouse.

✭ 501

go golang bigdata clickhouse kafka-consumer

Bigslice

A serverless cluster computing system for the Go programming language

✭ 469

go golang cluster bigdata etl machinelearning mapreduce

Tensorbase

TensorBase BE is building a high performance, cloud neutral bigdata warehouse for SMEs fully in Rust.

✭ 440

rust database data analytics high-performance infrastructure bigdata engineering modern

Bigdataie

A compilation of big data blogs, written-test questions, tutorials, projects, and interview experiences.

✭ 445

java spark bigdata offer

God Of Bigdata

Focused on big data learning and interviews; the road to becoming a big data god starts here. Flink/Spark/Hadoop/Hbase/Hive...

✭ 6,008

kafka spark hadoop zookeeper bigdata flink hive hbase hdfs flume azkaban

Circosjs

d3 library to build circular graphs

✭ 436

javascript bioinformatics big-data bigdata d3js circular

Cortx

CORTX Community Object Storage is 100% open source object storage uniquely optimized for mass capacity storage devices.

✭ 426

jupyter-notebook hacktoberfest open-source opensource distributed-systems storage big-data s3 bigdata distributed-storage object-storage s3-storage

Big data architect skills

The skills a big data architect should master

✭ 400

spark analytics hadoop bigdata skills

Sidekick

High Performance HTTP Sidecar Load Balancer

✭ 366

go kubernetes proxy spark bigdata load-balancer

Jigsaw

Jigsaw (七巧板, "tangram") provides a set of web components based on Angular 5/8/9+. Its main purpose is to help application developers build complex, highly interactive, user-friendly web pages. Jigsaw supports the development of all applications in ZTE's Big Data product.

✭ 354

typescript angular component bigdata webui

Datawave

DataWave is an ingest/query framework that leverages Apache Accumulo to provide fast, secure data access.

✭ 347

java hacktoberfest bigdata

Api.rss

RSS as RESTful. This service allows you to transform an RSS feed into an awesome API.

✭ 340

ruby machine-learning api rails elasticsearch rest-api rss bigdata feed semantic-web rss-feed sidekiq

Datafaker

Datafaker is a generation tool for large-scale test data and flow test data. It fakes data and inserts it into various data sources.

✭ 327

python mysql testing postgresql kafka oracle bigdata hive hbase faker

Uproot3

ROOT I/O in pure Python and NumPy.

✭ 312

python python3 numpy big-data analysis bigdata root hep file-format

Spline

Data Lineage Tracking And Visualization Solution

✭ 306

scala visualization spark tracking hadoop bigdata

Janusgraph.cn

The Chinese community for the distributed graph database JanusGraph; everything about JanusGraph

✭ 273

graph bigdata gremlin

Arvados

An open source platform for managing and analyzing biomedical big data

✭ 274

python go ruby docker aws cloud azure workflow bioinformatics cluster gcp genomics bigdata workflow-engine

Ldetool

Code generator for fast log file parsers

✭ 273

go parsing bigdata

Docker Spark Cluster

A simple Spark standalone cluster for your testing environment purposes

✭ 261

dockerfile docker-compose spark developer-tools bigdata

Big Data Rosetta Code

Code snippets for solving common big data problems on various platforms. Inspired by Rosetta Code

✭ 254

scala spark bigdata

DetEdit

A graphical user interface for annotating and editing events detected in long-term acoustic monitoring data

✭ 20

matlab bigdata data-visualization annotation-tool acoustic-monitoring long-term-monitoring classification-tool species-classification mbarc

jigsaw-seed

This is the seed project for the component library Jigsaw-七巧板 (https://github.com/rdkmaster/jigsaw). It is recommended that all new apps start from this project as their seed.

✭ 17

typescript HTML javascript CSS SCSS angular component bigdata webui jigsaw webcomponent

NiFi-Rule-engine-processor

Drools processor for Apache NiFi

✭ 34

java json big-data rule-engine bigdata drools nifi apache-nifi rules-engine nifi-processors nifi-processor big-data-projects matrixbi

leaflet heatmap

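A simple visualization of call data from Huzhou. Assuming the data volume is too large to draw a heatmap directly in the browser, the heatmap-drawing step is moved to offline computation and analysis. Apache Spark computes the data in parallel and then renders the heatmap, after which leafletjs loads the OpenStreetMap layer and the heatmap layer to achieve good interactivity. The rendering is currently implemented with Apache Spark; perhaps because Apache Spark is not well suited to this kind of computation, or because I did not design the algorithm well, the parallel computation is slower than single-machine computation. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .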

✭ 13

visualization css d3 map big-data html5 dataviz spark apache-spark hadoop heatmap leaflet bigdata data-visualization hdfs data-analysis javscript d3js tilelayer datavisualization

proteic

Streaming and static data visualization for the modern web.

✭ 37

typescript javascript CSS shell visualization d3 charts dataviz bigdata data-visualization

Spark-and-Kafka IoT-Data-Processing-and-Analytics

Final Project for IoT: Big Data Processing and Analytics class. Analyzing U.S nationwide temperature from IoT sensors in real-time

✭ 42

python kafka bigdata pyspark spark-streaming iot-sensors

yuzhouwan

Code Library for My Blog

✭ 39

java javascript Jupyter Notebook CSS scala c elasticsearch algorithm spark hadoop tensorflow bigdata hbase zookeeper nio druid

data processing course

Some class materials for a data processing course using PySpark

✭ 50

python ruby shell Makefile Dockerfile HTML Jupyter Notebook course spark bigdata stream-processing pyspark apache-beam data-processing

centurion

Kotlin Bigdata Toolkit

✭ 320

kotlin java shell bigdata parquet orc

learning notes

Study notes

✭ 18

Jupyter Notebook go Makefile shell Dockerfile algorithms notes bigdata msq

big data

A collection of tutorials on Hadoop, MapReduce, Spark, Docker

✭ 34

Jupyter Notebook docker big-data spark hadoop bigdata jupyter-notebook pyspark mapreduce spark-sql testdfsio mapreduce-bash

v6.dooring.public

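A large-screen data visualization solution that provides a visual editing engine, helping individuals or companies easily customize their own large-screen visualization applications.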

✭ 323

typescript CSS Less react nodejs big-data bigdata webgl2 low-code big-data-analytics antv lowcode dooring

pulsar-user-group-loc-cn

Workspace for China local user group.

✭ 19

streaming messaging bigdata apachepulsar

datasphere-service

An open source dataworks platform

✭ 20

java typescript ANTLR microservice bigdata cloud-native datamanagement data-governance bigdatacloud-api middle-office

room-renting

Scrape housing listings from Anjuke (安居客) with Python and visualize them with Amap (高德地图)

✭ 16

python HTML bigdata data-visualization scrawler

ETL-Starter-Kit

📁 Extract, Transform, Load (ETL) 👷 refers to a process in database usage and especially in data warehousing. This repository contains a starter kit featuring ETL-related work.

✭ 21