
databricks / Spark Sklearn

License: apache-2.0
(Deprecated) Scikit-learn integration package for Apache Spark

Programming Languages

python

Projects that are alternatives to or similar to Spark Sklearn

Sparkit Learn
PySpark + Scikit-learn = Sparkit-learn
Stars: ✭ 1,073 (+1.71%)
Mutual labels:  apache-spark, scikit-learn
Cheatsheets.pdf
📚 Various cheatsheets in PDF
Stars: ✭ 159 (-84.93%)
Mutual labels:  apache-spark, scikit-learn
Openscoring
REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models
Stars: ✭ 536 (-49.19%)
Mutual labels:  apache-spark, scikit-learn
Prediciting Binary Options
Predicting forex binary options using time series data and machine learning
Stars: ✭ 33 (-96.87%)
Mutual labels:  scikit-learn
Mlcourse.ai
Open Machine Learning Course
Stars: ✭ 7,963 (+654.79%)
Mutual labels:  scikit-learn
Cvpr paper search tool
Automatic paper clustering and search tool using fastText from Facebook Research
Stars: ✭ 43 (-95.92%)
Mutual labels:  scikit-learn
Apache Spark Internals
The Internals of Apache Spark
Stars: ✭ 1,045 (-0.95%)
Mutual labels:  apache-spark
Machine Learning Alpine
Alpine Container for Machine Learning
Stars: ✭ 30 (-97.16%)
Mutual labels:  scikit-learn
Data Science Complete Tutorial
For extensive instructor led learning
Stars: ✭ 1,027 (-2.65%)
Mutual labels:  scikit-learn
Spark Examples
Spark examples
Stars: ✭ 41 (-96.11%)
Mutual labels:  apache-spark
The Hello World Of Machine Learning
Learn to build a basic machine learning model from scratch with this repo and tutorial series.
Stars: ✭ 41 (-96.11%)
Mutual labels:  scikit-learn
Machinelearningcourse
A collection of notebooks of my Machine Learning class written in python 3
Stars: ✭ 35 (-96.68%)
Mutual labels:  scikit-learn
Spark Tda
SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.
Stars: ✭ 45 (-95.73%)
Mutual labels:  apache-spark
The Deep Learning With Keras Workshop
An Interactive Approach to Understanding Deep Learning with Keras
Stars: ✭ 34 (-96.78%)
Mutual labels:  scikit-learn
Iml
Course "Introduction to Machine Learning" (Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University)
Stars: ✭ 46 (-95.64%)
Mutual labels:  scikit-learn
Mljar Supervised
Automated Machine Learning Pipeline with Feature Engineering and Hyper-Parameters Tuning 🚀
Stars: ✭ 961 (-8.91%)
Mutual labels:  scikit-learn
Machine Learning
Notebooks with examples for machine learning
Stars: ✭ 45 (-95.73%)
Mutual labels:  scikit-learn
Computer Vision
Computer vision sabbatical study materials
Stars: ✭ 39 (-96.3%)
Mutual labels:  scikit-learn
Dblink
Distributed Bayesian Entity Resolution in Apache Spark
Stars: ✭ 38 (-96.4%)
Mutual labels:  apache-spark
Sklearn Porter
Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
Stars: ✭ 1,014 (-3.89%)
Mutual labels:  scikit-learn

Deprecation

This project is deprecated. We now recommend using scikit-learn and the `Joblib Apache Spark Backend <https://github.com/joblib/joblib-spark>`_ to distribute scikit-learn hyperparameter tuning tasks on a Spark cluster.

You need pyspark>=2.4.4 and scikit-learn>=0.21 to use the Joblib Apache Spark Backend; the backend itself can be installed using pip:

.. code:: bash

    pip install joblibspark

The following example shows how to distribute GridSearchCV on a Spark cluster using joblibspark; the same applies to RandomizedSearchCV, and a variant is sketched after the example.

.. code:: python

    from sklearn import svm, datasets
    from sklearn.model_selection import GridSearchCV
    from sklearn.utils import parallel_backend
    from joblibspark import register_spark

    register_spark()  # register the Spark backend with joblib

    iris = datasets.load_iris()
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svr = svm.SVC(gamma='auto')

    clf = GridSearchCV(svr, parameters, cv=5)

    with parallel_backend('spark', n_jobs=3):
        clf.fit(iris.data, iris.target)
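
The same pattern works with randomized search; below is a minimal sketch using RandomizedSearchCV, assuming ``register_spark()`` has already been called as above. The ``scipy.stats.uniform`` distribution for ``C`` is an illustrative choice, not taken from the original README.

.. code:: python

    from scipy.stats import uniform
    from sklearn import svm, datasets
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.utils import parallel_backend

    iris = datasets.load_iris()
    # Sample C uniformly from [1, 10]; the kernel is drawn from the list.
    distributions = {'kernel': ['linear', 'rbf'], 'C': uniform(1, 9)}
    clf = RandomizedSearchCV(svm.SVC(gamma='auto'), distributions, n_iter=5, cv=5)

    with parallel_backend('spark', n_jobs=3):
        clf.fit(iris.data, iris.target)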

Scikit-learn integration package for Apache Spark

This package contains some tools to integrate the `Spark computing framework <https://spark.apache.org/>`_ with the popular `scikit-learn machine learning library <https://scikit-learn.org/stable/>`_. Among other things, it can:

  • train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the `multicore implementation <https://pythonhosted.org/joblib/parallel.html>`_ included by default in scikit-learn
  • convert Spark's DataFrames seamlessly into NumPy ndarrays or sparse matrices (see the sketch after this list)
  • (experimental) distribute SciPy's sparse matrices as a dataset of sparse vectors
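
As a concept sketch of the DataFrame conversion above, here is how a small Spark DataFrame can be viewed as a NumPy ndarray using plain PySpark. This is not the package's own helper (which this README does not show), and it assumes a SparkSession named ``spark`` is available:

.. code:: python

    import numpy as np

    # Build a tiny Spark DataFrame, collect it to the driver via pandas,
    # and view it as a NumPy ndarray. Only sensible for small datasets.
    df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["x", "y"])
    arr = np.asarray(df.toPandas())  # ndarray of shape (2, 2)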

It focuses on problems that have a small amount of data and that can be run in parallel. For small datasets, it distributes the search for estimator parameters (GridSearchCV in scikit-learn), using Spark. For datasets that do not fit in memory, we recommend using the distributed implementation in `Spark MLlib <https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html>`_.

This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).
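
To make the contrast concrete, here is a minimal Spark MLlib sketch in which the training of a single model is itself distributed (it assumes a SparkSession named ``spark``). spark-sklearn, by contrast, keeps each model on a single worker and distributes the set of models being evaluated:

.. code:: python

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    # A toy training set in the (features, label) layout that MLlib expects.
    train = spark.createDataFrame(
        [(Vectors.dense([0.0, 1.1]), 0.0),
         (Vectors.dense([2.0, 1.0]), 1.0)],
        ["features", "label"])

    # fit() distributes the training of this one model across the cluster.
    model = LogisticRegression(maxIter=10).fit(train)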

Installation

This package is available on PyPI:

::

    pip install spark-sklearn

This project is also available as a `Spark package <https://spark-packages.org/package/databricks/spark-sklearn>`_.

The developer version has the following requirements:

  • scikit-learn 0.18 or 0.19. Later versions may work, but the tests are currently incompatible with 0.20.
  • Spark >= 2.1.1. Spark may be downloaded from the `Spark website <https://spark.apache.org/>`_. In order to use this package, you need to use the pyspark interpreter or another Spark-compliant Python interpreter. See the `Spark guide <https://spark.apache.org/docs/latest/programming-guide.html#overview>`_ for more details.
  • `nose <https://nose.readthedocs.org>`_ (testing dependency only)
  • pandas, if using the pandas integration or testing. pandas==0.18 has been tested.

If you want to use the developer version, just make sure the python/ subdirectory is on the PYTHONPATH when launching the pyspark interpreter:

::

    PYTHONPATH=$PYTHONPATH:./python $SPARK_HOME/bin/pyspark

You can run the tests directly:

::

    cd python && ./run-tests.sh

This requires the environment variable SPARK_HOME to point to your local copy of Spark.
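
For example (the installation path below is a placeholder; adjust it to wherever Spark lives on your machine):

::

    export SPARK_HOME=/path/to/spark
    cd python && ./run-tests.sh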

Example

Here is a simple example that runs a grid search with Spark. See the `Installation <#installation>`_ section on how to install the package.

.. code:: python

    from sklearn import svm, datasets
    from spark_sklearn import GridSearchCV

    iris = datasets.load_iris()
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svr = svm.SVC(gamma='auto')

    # `sc` is the SparkContext, which is predefined when running inside the
    # pyspark shell.
    clf = GridSearchCV(sc, svr, parameters)
    clf.fit(iris.data, iris.target)

This classifier can be used as a drop-in replacement for any scikit-learn classifier, with the same API.
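
Because the fitted object follows the scikit-learn API, the usual attributes and methods should be available after fitting; a brief sketch continuing the example above:

.. code:: python

    # The fitted search object behaves like a regular scikit-learn estimator.
    predictions = clf.predict(iris.data)  # delegates to the best model found
    print(clf.best_params_)               # e.g. {'C': 1, 'kernel': 'linear'}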

Documentation

`API documentation <http://databricks.github.io/spark-sklearn-docs>`_ is currently hosted on GitHub Pages. To build the docs yourself, see the instructions in ``docs/``.

.. image:: https://travis-ci.org/databricks/spark-sklearn.svg?branch=master
   :target: https://travis-ci.org/databricks/spark-sklearn
