Top 369 big-data open source projects

Verticapy
VerticaPy is a Python library that exposes scikit-like functionality for conducting data science projects on data stored in Vertica, taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.
Attic Lens
Mirror of Apache Lens
Ymcache
YMCache is a lightweight object caching solution for iOS and Mac OS X that is designed for highly parallel access scenarios.
Docker Spark Cluster
A Spark cluster setup running on Docker containers
Kibble 1
Apache Kibble - a tool to collect, aggregate and visualize data about any software project
Lifion Kinesis
A native Node.js producer and consumer library for Amazon Kinesis Data Streams
Macro ml
Course Website on Macroeconomic Analysis with Machine Learning and Big Data
Oodt
Mirror of Apache OODT
Datumbox Framework
Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.
Traildb
TrailDB is an efficient tool for storing and querying series of events
Moosefs
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
Attaca
Robust, distributed version control for large files.
Egads
A Java package to automatically detect anomalies in large scale time-series data
Metrics
Measure behavior of Java applications
Skymap
High-throughput gene to knowledge mapping through massive integration of public sequencing data.
Qcportal
A client interface to the QCArchive Project (read-only image of QCFractal)
Spark
Apache Spark - A unified analytics engine for large-scale data processing
K8s Ingress Claim
An admission control policy that safeguards against accidental duplicate claiming of Hosts/Domains.
Phoenix
Mirror of Apache Phoenix
Dremio Oss
Dremio - the missing link in modern data
Sparkjni
A heterogeneous Apache Spark framework.
Accumulo
Apache Accumulo
Dataflowjavasdk
Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
Pretzel
JavaScript full-stack framework for Big Data visualisation and analysis
Pyspark Setup Demo
Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks
Bandar Log
Monitoring tool to measure flow throughput of data sources and processing components that are part of Data Ingestion and ETL pipelines.
Hadoop For Geoevent
ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.
Sqoop
Mirror of Apache Sqoop
Parquet Format
Apache Parquet
Titanoboa
Titanoboa makes complex workflows easy. It is a low-code workflow orchestration platform for JVM - distributed, highly scalable and fault tolerant.
Rakam Api
📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)
Spark Movie Lens
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
Cython
The most widely used Python to C compiler
Samza
Mirror of Apache Samza
Data Science Career
Repository of career resources for Data Science, Machine Learning, Big Data and Business Analytics
Sdc
Intel® Scalable Dataframe Compiler for Pandas*
Kafka Streams
Equivalent to kafka-streams 🐙 for Node.js ✨🐢🚀✨
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Oozie
Mirror of Apache Oozie
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Giraph
Mirror of Apache Giraph
Scanner
Efficient video analysis at scale
Nipype
Workflows and interfaces for neuroimaging packages
Couchdb
Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability
Thrill
Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++
Arkime
Arkime (formerly Moloch) is an open source, large scale, full packet capturing, indexing, and database system.
Beam
Apache Beam is a unified programming model for Batch and Streaming
Onlinestats.jl
Single-pass algorithms for statistics
Magellan
Geospatial Data Analytics on Spark