MmlsparkSimple and Distributed Machine Learning
Stars: ✭ 2,899 (-23.97%)
GafferA large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (-56.94%)
H2o 3H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+48.33%)
Analytics ZooDistributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
Stars: ✭ 2,448 (-35.8%)
Spark With PythonFundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-96.07%)
Xlearning Xdmlextremely distributed machine learning
Stars: ✭ 113 (-97.04%)
SparkrdmaRDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (-94.36%)
Data Science Ipython NotebooksData science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+478.23%)
Hadoop cookbookCookbook to install Hadoop 2.0+ using Chef
Stars: ✭ 82 (-97.85%)
Docker Spark🚢 Docker image for Apache Spark
Stars: ✭ 78 (-97.95%)
Repository个人学习知识库涉及到数据仓库建模、实时计算、大数据、Java、算法等。
Stars: ✭ 92 (-97.59%)
Ytk LearnYtk-learn is a distributed machine learning library which implements most of popular machine learning algorithms(GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).
Stars: ✭ 337 (-91.16%)
HiveApache Hive
Stars: ✭ 4,031 (+5.72%)
SparklerSpark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Stars: ✭ 362 (-90.51%)
HadoopcryptoledgerHadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Stars: ✭ 126 (-96.7%)
FeastFeature Store for Machine Learning
Stars: ✭ 2,576 (-32.44%)
Airflow PipelineAn Airflow docker image preconfigured to work well with Spark and Hadoop/EMR
Stars: ✭ 128 (-96.64%)
Sparkling GraphSparklingGraph provides easy to use set of features that will give you ability to proces large scala graphs using Spark and GraphX.
Stars: ✭ 139 (-96.35%)
Spark.jlJulia binding for Apache Spark
Stars: ✭ 153 (-95.99%)
WedatasphereWeDataSphere is a financial level one-stop open-source suitcase for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!
Stars: ✭ 372 (-90.24%)
DataspherestudioDataSphereStudio is a one stop data application development& management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
Stars: ✭ 1,195 (-68.66%)
Apache Spark Hands OnEducational notes,Hands on problems w/ solutions for hadoop ecosystem
Stars: ✭ 74 (-98.06%)
SetlA simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-97.93%)
LabsResearch on distributed system
Stars: ✭ 73 (-98.09%)
LogislandScalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.
Stars: ✭ 97 (-97.46%)
Spark Py NotebooksApache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (-64.91%)
GimelBig Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (-94.34%)
VespaThe open big data serving engine. https://vespa.ai
Stars: ✭ 3,747 (-1.73%)
OzoneScalable, redundant, and distributed object store for Apache Hadoop
Stars: ✭ 330 (-91.35%)
WaterdropProduction Ready Data Integration Product, documentation:
Stars: ✭ 1,856 (-51.32%)
BigdataclassTwo-day workshop that covers how to use R to interact databases and Spark
Stars: ✭ 110 (-97.12%)
Devops RoadmapDevOps methodology & roadmap for a devops developer in 2019. Interesting books to learn new technologies.
Stars: ✭ 349 (-90.85%)
GeniA Clojure dataframe library that runs on Spark
Stars: ✭ 152 (-96.01%)
RsparklingRSparkling: Use H2O Sparkling Water from R (Spark + R + Machine Learning)
Stars: ✭ 65 (-98.3%)
Deeplearning4jSuite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learni…
Stars: ✭ 12,277 (+221.98%)
Data AcceleratorData Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (-93.52%)
XlearningAI on Hadoop
Stars: ✭ 1,709 (-55.18%)
DatasciencevmTools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)
Stars: ✭ 153 (-95.99%)
HyperspaceAn open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
Stars: ✭ 246 (-93.55%)
GeopysparkGeoTrellis for PySpark
Stars: ✭ 167 (-95.62%)
AutodlAutomated Deep Learning without ANY human intervention. 1'st Solution for AutoDL [email protected]
Stars: ✭ 854 (-77.6%)
KoalasKoalas: pandas API on Apache Spark
Stars: ✭ 3,044 (-20.17%)
Big WhaleSpark、Flink等离线任务的调度以及实时任务的监控
Stars: ✭ 163 (-95.73%)
TrinoOfficial repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stars: ✭ 4,581 (+20.14%)
SuccinctEnabling queries on compressed data.
Stars: ✭ 257 (-93.26%)
Oie ResourcesA curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
Stars: ✭ 283 (-92.58%)
CloudbreakA tool for provisioning and managing Apache Hadoop clusters in the cloud. Cloudbreak, as part of the Hortonworks Data Platform, makes it easy to provision, configure and elastically grow HDP clusters on cloud infrastructure. Cloudbreak can be used to provision Hadoop across cloud infrastructure providers including AWS, Azure, GCP and OpenStack.
Stars: ✭ 301 (-92.11%)
ElasticlusterCreate clusters of VMs on the cloud and configure them with Ansible.
Stars: ✭ 298 (-92.18%)
SplineData Lineage Tracking And Visualization Solution
Stars: ✭ 306 (-91.97%)
TezApache Tez
Stars: ✭ 313 (-91.79%)
WaimakWaimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Stars: ✭ 60 (-98.43%)