H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+8601.54%)
Koalas
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (+4583.08%)
Benchm Ml
A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
Stars: ✭ 1,835 (+2723.08%)
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+33820%)
Geni
A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (+133.85%)
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+21.54%)
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+1958.46%)
Mydatascienceportfolio
Applying Data Science and Machine Learning to Solve Real World Business Problems
Stars: ✭ 227 (+249.23%)
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+70.77%)
Succinct
Enabling queries on compressed data.
Stars: ✭ 257 (+295.38%)
Data Science Live Book
An open source book to learn data science, data analysis and machine learning, suitable for all ages!
Stars: ✭ 193 (+196.92%)
Gwu data mining
Materials for GWU DNSC 6279 and DNSC 6290.
Stars: ✭ 217 (+233.85%)
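leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large for a browser to draw the heatmap directly, the heatmap-rendering step is moved offline. The data is first processed in parallel with Apache Spark, Apache Spark then renders the heatmap, and leafletjs loads the OpenStreetMap layer and the heatmap layer to give good interactivity. The rendering is currently implemented with Apache Spark; either Apache Spark is not well suited to this kind of computation or I did not design the algorithm well, because the parallel computation is slower than the single-machine computation. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .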
Stars: ✭ 13 (-80%)
bigdata-fun
A complete (distributed) BigData stack, running in containers
Stars: ✭ 14 (-78.46%)
Spark Notebook
Interactive and Reactive Data Science using Scala and Spark.
Stars: ✭ 3,081 (+4640%)
Delta
An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.
Stars: ✭ 3,903 (+5904.62%)
Sk Dist
Distributed scikit-learn meta-estimators in PySpark
Stars: ✭ 260 (+300%)
Bigdl
Building Large-Scale AI Applications for Distributed Big Data
Stars: ✭ 3,813 (+5766.15%)
Sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Stars: ✭ 362 (+456.92%)
Agile data code 2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Stars: ✭ 413 (+535.38%)
W2v
Word2Vec models with Twitter data using Spark. Blog:
Stars: ✭ 64 (-1.54%)
Interpretable machine learning with python
Examples of techniques for training interpretable ML models, explaining ML models, and debugging ML models for accuracy, discrimination, and security.
Stars: ✭ 530 (+715.38%)
Nipype
Workflows and interfaces for neuroimaging packages
Stars: ✭ 557 (+756.92%)
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Stars: ✭ 5,513 (+8381.54%)
Verticapy
VerticaPy is a Python library that exposes scikit-like functionality to conduct data science projects on data stored in Vertica, thus taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.
Stars: ✭ 59 (-9.23%)
Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-10.77%)
Spark Movie Lens
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
Stars: ✭ 745 (+1046.15%)
Sparkling Water
Sparkling Water provides H2O functionality inside a Spark cluster
Stars: ✭ 887 (+1264.62%)
spark-acid
ACID Data Source for Apache Spark based on Hive ACID
Stars: ✭ 91 (+40%)
awesome-AI-kubernetes
❄️ 🐳 Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc
Stars: ✭ 95 (+46.15%)
Datasciencevm
Tools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)
Stars: ✭ 153 (+135.38%)
Awesome H2o
A curated list of research, applications and projects built using the H2O Machine Learning platform
Stars: ✭ 293 (+350.77%)
Oie Resources
A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
Stars: ✭ 283 (+335.38%)
Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stars: ✭ 4,581 (+6947.69%)
Pulsar Spark
When Apache Pulsar meets Apache Spark
Stars: ✭ 55 (-15.38%)
Metorikku
A simplified, lightweight ETL Framework based on Apache Spark
Stars: ✭ 361 (+455.38%)
Datascience Ai Machinelearning Resources
Alex Castrounis' curated set of resources for artificial intelligence (AI), machine learning, data science, internet of things (IoT), and more.
Stars: ✭ 414 (+536.92%)
Magellan
Geo Spatial Data Analytics on Spark
Stars: ✭ 507 (+680%)
Courses
Quizzes and assignments from Coursera
Stars: ✭ 454 (+598.46%)
Pachyderm
Reproducible Data Science at Scale!
Stars: ✭ 5,305 (+8061.54%)
Dataflowjavasdk
Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
Stars: ✭ 854 (+1213.85%)
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+873.85%)
Data Science Career
Career resources repository for Data Science, Machine Learning, Big Data and Business Analytics
Stars: ✭ 630 (+869.23%)
Mli Resources
H2O.ai Machine Learning Interpretability Resources
Stars: ✭ 428 (+558.46%)
Sparkjni
A heterogeneous Apache Spark framework.
Stars: ✭ 11 (-83.08%)
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (+1416.92%)
Autodl
Automated Deep Learning without ANY human intervention. 1st Solution for AutoDL [email protected]
Stars: ✭ 854 (+1213.85%)
Vds
Verteego Data Suite
Stars: ✭ 9 (-86.15%)
Spark
Apache Spark - A unified analytics engine for large-scale data processing
Stars: ✭ 31,618 (+48543.08%)
Pretzel
JavaScript full-stack framework for Big Data visualisation and analysis
Stars: ✭ 26 (-60%)
Attaca
Robust, distributed version control for large files.
Stars: ✭ 41 (-36.92%)
Accelerator
The Accelerator is a tool for fast and reproducible processing of large amounts of data.
Stars: ✭ 137 (+110.77%)
Datacompy
Pandas and Spark DataFrame comparison for humans
Stars: ✭ 147 (+126.15%)
Tiledb Vcf
Efficient variant-call data storage and retrieval library using the TileDB storage library.
Stars: ✭ 26 (-60%)