Top 369 big-data open source projects

CS Book
🔥 Latest computer science e-books. Offers downloads of the latest technical e-books. "All I want is to out-compete every one of you, or be out-competed by you!"
spark-records
Bulletproof Apache Spark jobs with fast root cause analysis of failures.
RemoteShuffleService
Celeborn provides an elastic and high-performance service for shuffle and spilled data.
IoT-system-PLC-data-to-InfluxDB
This project aims to provide free software to fetch data from PLCs (Siemens S7-300/400/1200/1500) and store it. The stack used is completely open source. I used InfluxDB as data storage, so the application follows the Big Data paradigm.
spark-root
Apache Spark Data Source for ROOT File Format
nebula
A distributed, fast open-source graph database featuring horizontal scalability and high availability
img2dataset
Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.
lcbo-api
A crawler and API server for Liquor Control Board of Ontario retail data
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
FlameStream
Distributed stream processing model and its implementation
ngm
swissgeol.ch gives you insight into geoscientific data, above and below the surface.
predictionio-template-ecom-recommender
PredictionIO E-Commerce Recommendation Engine Template (Scala-based parallelized engine)
FIW KRT
Families In the Wild: A Kinship Recognition Toolbox.
HadoopDedup
🍉 Large-scale deduplication of massive data sets, based on Hadoop and HBase
big-data-engineering-indonesia
A curated list of big data engineering tools, resources and communities.
awesome-tools
A curated list of awesome tools and libraries for specific domains
merkle-db
High-scalability analytics database built on immutable merkle-trees
javaer-mind
Mind maps for Java programmers' advanced learning
metriql
The metrics layer for your data. Join us at https://metriql.com/slack
dislib
The Distributed Computing Library for Python, implemented using the PyCOMPSs programming model for HPC.
awesome-coder-resources
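A refueling station on your programming journey! [Continuously updated. Stars are welcome, and do come back often.] [Contents: programming, learning, and reading resources; open source projects; interview questions; websites; books; blogs; tutorials; and more]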
cdp-service
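A CDP data platform that helps enterprises fully understand their customers and deliver precise, individually personalized marketing.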
Quantitative-Big-Imaging-2018
(Latest semester at https://github.com/kmader/Quantitative-Big-Imaging-2019) The material for the Quantitative Big Imaging course at ETHZ for the Spring Semester 2018
sgd
An R package for large-scale estimation with stochastic gradient descent
mmtf-spark
Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.
Clustering4Ever
C4E, a JVM-friendly library written in Scala for both local and distributed (Spark) clustering.
twitter-archive-reader
Full-featured TypeScript Twitter archive reader and browser
bullet-core
Bullet is a streaming query engine that can be plugged into any singular data stream using a Stream Processing framework like Apache Storm, Spark or Flink.