Top 369 big-data open source projects

VerticaPy is a Python library that exposes scikit-like functionality for conducting data science projects on data stored in Vertica, taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.

✭ 59

python machine-learning data-science data-visualization big-data

Attic Lens

Mirror of Apache Lens

✭ 58

java big-data lens

Ymcache

YMCache is a lightweight object caching solution for iOS and Mac OS X that is designed for highly parallel access scenarios.

✭ 58

ios macos mobile big-data caching

Docker Spark Cluster

A Spark cluster setup running on Docker containers

✭ 57

shell scala docker docker-image spark big-data hadoop

Kibble 1

Apache Kibble - a tool to collect, aggregate and visualize data about any software project

✭ 54

python visualization open-source big-data

Lifion Kinesis

A native Node.js producer and consumer library for Amazon Kinesis Data Streams

✭ 54

javascript aws client big-data amazon consumer

Macro ml

Course Website on Macroeconomic Analysis with Machine Learning and Big Data

✭ 53

machine-learning big-data

Oodt

Mirror of Apache OODT

✭ 52

java big-data

Datumbox Framework

Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.

✭ 1,063

java machine-learning nlp data-science statistics big-data

Trck

Query engine for TrailDB

✭ 48

c compiler big-data state-machine time-series-analysis data-analytics multicore

Traildb

TrailDB is an efficient tool for storing and querying series of events

✭ 1,029

c database big-data time-series data-analytics

Moosefs

MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)

✭ 1,025

c storage big-data filesystem clustering hadoop distributed-computing posix fuse scalability snapshot high-availability distributed-storage file-system

Couchdb Couch

Mirror of Apache CouchDB

✭ 43

javascript erlang cplusplus database http cloud big-data content couchdb

Attaca

Robust, distributed version control for large files.

✭ 41

rust data-science big-data distributed version-control

Egads

A Java package to automatically detect anomalies in large scale time-series data

✭ 997

java big-data time-series

Analysispreservation.cern.ch

Source code for the CERN Analysis Preservation portal

✭ 37

javascript python hacktoberfest flask big-data json-schema reproducible-research reproducible-science

Esper Tv

Esper instance for TV news analysis

✭ 37

jupyter-notebook docker video visualization big-data google-cloud

Metrics

Measure behavior of Java applications

✭ 35

java metrics big-data

Predictionio Template Text Classifier

Text Classification Engine

✭ 30

scala big-data

Skymap

High-throughput gene to knowledge mapping through massive integration of public sequencing data.

✭ 29

jupyter-notebook big-data

Qcportal

A client interface to the QCArchive Project (read-only image of QCFractal)

✭ 29

python analytics big-data quantum-chemistry

Awesome Scalability

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

Spark

Apache Spark - A unified analytics engine for large-scale data processing

✭ 31,618

python java scala r jupyter-notebook hiveql sql spark big-data jdbc

K8s Ingress Claim

An admission control policy that safeguards against accidental duplicate claiming of Hosts/Domains.

✭ 14

go golang kubernetes big-data ingress

Phoenix

Mirror of Apache Phoenix

✭ 867

java database sql big-data phoenix

Dremio Oss

Dremio - the missing link in modern data

✭ 862

java ui analytics big-data data-analytics

Sparkjni

A heterogeneous Apache Spark framework.

✭ 11

java spark big-data

Accumulo

Apache Accumulo

✭ 857

java hacktoberfest big-data

Hazelcast Jet

Distributed Stream and Batch Processing

✭ 855

java hacktoberfest kafka big-data stream-processing low-latency batch-processing

Dataflowjavasdk

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.

✭ 854

data-science data-analysis big-data data-mining data-processing

Autodl

Automated Deep Learning without ANY human intervention. 1st Solution for AutoDL [email protected]

✭ 854

python machine-learning pytorch tensorflow data-science artificial-intelligence ai deeplearning big-data resnet automl feature-engineering nas lightgbm automated-machine-learning

Pretzel

JavaScript full-stack framework for Big Data visualisation and analysis

✭ 26

javascript data-science open-source express data-visualization bioinformatics big-data expressjs ember emberjs

Pyspark Setup Demo

Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks

✭ 24

python jupyter-notebook docker jupyter big-data pyspark

Bandar Log

Monitoring tool to measure flow throughput of data sources and processing components that are part of Data Ingestion and ETL pipelines.

✭ 19

scala monitoring kafka big-data etl spark-streaming presto

Hadoop For Geoevent

ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.

✭ 5

java server big-data hadoop bigdata hdfs transport arcgis connector

Sqoop

Mirror of Apache Sqoop

✭ 817

java big-data

Parquet Format

Apache Parquet

✭ 800

java big-data parquet

Titanoboa

Titanoboa makes complex workflows easy. It is a low-code workflow orchestration platform for JVM - distributed, highly scalable and fault tolerant.

✭ 787

clojure workflow distributed-systems big-data distributed jvm workflow-engine service-bus

Rakam Api

📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)

✭ 772

java analytics big-data

Storm

Mirror of Apache Storm

✭ 6,297

java python html clojure c javascript big-data storm

Spark Movie Lens

An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

✭ 745

python jupyter-notebook flask spark big-data bigdata

Cython

The most widely used Python to C compiler

✭ 6,588

python c cython c++ emacs-lisp shell performance big-data cpython-extensions

Sciblog support

Support content for my blog

✭ 694

jupyter-notebook deep-learning machine-learning data-science artificial-intelligence neural-networks analytics big-data examples

Samza

Mirror of Apache Samza

✭ 676

java scala big-data

Data Science Career

Career resources repository for Data Science, Machine Learning, Big Data, and Business Analytics

✭ 630

machine-learning data-science analytics big-data resources business-intelligence

Sdc

Intel® Scalable Dataframe Compiler for Pandas*

✭ 623

python machine-learning numpy pandas big-data parallel-computing compilers

Kafka Streams

equivalent to kafka-streams 🐙 for nodejs ✨🐢🚀✨

✭ 613

typescript nodejs node kafka big-data stream-processing streams kafka-streams

H2o 3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

✭ 5,656