H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Stars: ✭ 5,656 (+3621.05%)

Mutual labels: data-science, spark, big-data

Data Science Ipython Notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Stars: ✭ 22,048 (+14405.26%)

Mutual labels: data-science, spark, big-data

Pyspark Example Project

Example project implementing best practices for PySpark ETL jobs and applications.

Stars: ✭ 633 (+316.45%)

Mutual labels: data-science, spark, data-engineering

Verticapy

VerticaPy is a Python library that exposes scikit-like functionality to conduct data science projects on data stored in Vertica, thus taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.

Stars: ✭ 59 (-61.18%)

Mutual labels: data-science, big-data

Data Science Cookbook

🎓 Jupyter notebooks from UFC data science course

Stars: ✭ 60 (-60.53%)

Mutual labels: data-science, spark

Spark.jl

Julia binding for Apache Spark

Stars: ✭ 153 (+0.66%)

Mutual labels: spark, big-data

Datasciencevm

Tools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)

Stars: ✭ 153 (+0.66%)

Mutual labels: data-science, big-data

Spark Doc Zh

Chinese edition of the official Apache Spark documentation

Stars: ✭ 1,126 (+640.79%)

Mutual labels: spark, big-data

Ensae teaching cs

Teaching materials in Python at @ENSAE

Stars: ✭ 69 (-54.61%)

Mutual labels: data-science, distributed-computing

My Journey In The Data Science World

📢 Ready to learn or review your knowledge!

Stars: ✭ 1,175 (+673.03%)

Mutual labels: data-science, big-data

Sayn

Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).

Stars: ✭ 79 (-48.03%)

Mutual labels: data-science, data-engineering

Dataengineeringproject

Example end to end data engineering project.

Stars: ✭ 82 (-46.05%)

Mutual labels: big-data, data-engineering

Applied Ml

📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.

Stars: ✭ 17,824 (+11626.32%)

Mutual labels: data-science, data-engineering

Logisland

Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink is on the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready-to-use processors, data sources, and sinks is available.

Stars: ✭ 97 (-36.18%)

Mutual labels: spark, big-data

Rumble

⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

Stars: ✭ 58 (-61.84%)

Mutual labels: data-science, spark

Waimak

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.

Stars: ✭ 60 (-60.53%)

Mutual labels: spark, data-engineering

Pwrake

Parallel Workflow extension for Rake, runs on multicores, clusters, clouds.

Stars: ✭ 57 (-62.5%)

Mutual labels: parallel-computing, distributed-computing

Labs

Research on distributed systems

Stars: ✭ 73 (-51.97%)

Mutual labels: spark, big-data

Danfojs

danfo.js is an open-source JavaScript library providing high-performance, intuitive, and easy-to-use data structures for manipulating and processing structured data.

Stars: ✭ 1,304 (+757.89%)

Mutual labels: dataframe, data-science

Bigdata Notes

Beginner's guide to big data ⭐

Stars: ✭ 10,991 (+7130.92%)

Mutual labels: spark, big-data

Parapet

A purely functional library to build distributed and event-driven systems

Stars: ✭ 106 (-30.26%)

Mutual labels: parallel-computing, distributed-computing

W2v

Word2Vec models with Twitter data using Spark. Blog:

Stars: ✭ 64 (-57.89%)

Mutual labels: data-science, spark

Big Data Engineering Coursera Yandex

Big Data for Data Engineers Coursera Specialization from Yandex

Stars: ✭ 71 (-53.29%)

Mutual labels: spark, big-data

Docker Spark Cluster

A Spark cluster setup running on Docker containers

Stars: ✭ 57 (-62.5%)

Mutual labels: spark, big-data

Spark Website

Apache Spark Website

Stars: ✭ 75 (-50.66%)

Mutual labels: spark, big-data

Drake

An R-focused pipeline toolkit for reproducibility and high-performance computing

Stars: ✭ 1,301 (+755.92%)

Mutual labels: data-science, high-performance-computing

Cookbook

The Data Engineering Cookbook

Stars: ✭ 9,829 (+6366.45%)

Mutual labels: big-data, data-engineering

Boinc

Open-source software for volunteer computing and grid computing.

Stars: ✭ 1,320 (+768.42%)

Mutual labels: distributed-computing, high-performance-computing

Vizuka

Explore high-dimensional datasets and how your algo handles specific regions.

Stars: ✭ 100 (-34.21%)

Mutual labels: data-science, big-data

Drake Examples

Example workflows for the drake R package

Stars: ✭ 57 (-62.5%)

Mutual labels: data-science, high-performance-computing

Bigdataclass

Two-day workshop that covers how to use R to interact with databases and Spark

Stars: ✭ 110 (-27.63%)

Mutual labels: spark, big-data

Spark R Notebooks

R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks

Stars: ✭ 109 (-28.29%)

Mutual labels: data-science, big-data

Elephas

Distributed Deep learning with Keras & Spark

Stars: ✭ 1,521 (+900.66%)

Mutual labels: spark, distributed-computing

Datacompy

Pandas and Spark DataFrame comparison for humans

Stars: ✭ 147 (-3.29%)

Mutual labels: data-science, spark

Pyspark Cheatsheet

🐍 Quick reference guide to common patterns & functions in PySpark.

Stars: ✭ 108 (-28.95%)

Mutual labels: data-science, spark

Python Bigdata

Data science and Big Data with Python

Stars: ✭ 112 (-26.32%)

Mutual labels: data-science, spark

Pythondata

Repo for code published on pythondata.com

Stars: ✭ 113 (-25.66%)

Mutual labels: data-science, big-data

Pyhpc Benchmarks

A suite of benchmarks to test the sequential CPU and GPU performance of the most popular high-performance libraries for Python.

Stars: ✭ 119 (-21.71%)

Mutual labels: parallel-computing, high-performance-computing

D6t Python

Accelerate data science

Stars: ✭ 118 (-22.37%)

Mutual labels: data-science, data-engineering

Opencoarrays

A parallel application binary interface for Fortran 2018 compilers.

Stars: ✭ 151 (-0.66%)

Mutual labels: parallel-computing, high-performance-computing

Superset

Apache Superset is a Data Visualization and Data Exploration Platform

Stars: ✭ 42,634 (+27948.68%)

Mutual labels: data-science, data-engineering

Aws Data Wrangler

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

Stars: ✭ 2,385 (+1469.08%)

Mutual labels: data-science, data-engineering

Cape Python

Collaborate on privacy-preserving policy for data science projects in Pandas and Apache Spark

Stars: ✭ 125 (-17.76%)

Mutual labels: data-science, spark

Batchtools

Tools for computation on batch systems

Stars: ✭ 127 (-16.45%)

Mutual labels: parallel-computing, high-performance-computing

Griffon Vm

Griffon Data Science Virtual Machine