MLeap: Deploy ML Pipelines to Production

MLeap Logo

Gitter Build Status Maven Central

Deploying machine learning data pipelines and algorithms should not be a time-consuming or difficult task. MLeap allows data scientists and engineers to deploy machine learning pipelines from Spark and Scikit-learn to a portable format and execution engine.

Documentation

Documentation is available at mleap-docs.combust.ml.

Read Serializing a Spark ML Pipeline and Scoring with MLeap to gain a full sense of what is possible.

Introduction

Using the MLeap execution engine and serialization format, we provide a performant, portable and easy-to-integrate production library for machine learning data pipelines and algorithms.

For portability, we build our software on the JVM and only use serialization formats that are widely-adopted.

We also provide a high level of integration with existing technologies.

Our goals for this project are:

  1. Allow Researchers/Data Scientists and Engineers to continue to build data pipelines and train algorithms with Spark and Scikit-Learn
  2. Extend Spark/Scikit/TensorFlow by providing ML Pipelines serialization/deserialization to/from a common framework (Bundle.ML)
  3. Use MLeap Runtime to execute your pipeline and algorithm without dependencies on Spark or Scikit-learn (numpy, pandas, etc.)

Overview

  1. Core execution engine implemented in Scala
  2. Spark, PySpark and Scikit-Learn support
  3. Export a model with Scikit-learn or Spark and execute it using the MLeap Runtime (without dependencies on the Spark Context, or sklearn/numpy/pandas/etc)
  4. Choose from 2 portable serialization formats (JSON, Protobuf)
  5. Implement your own custom data types and transformers for use with MLeap data frames and transformer pipelines
  6. Extensive test coverage with full parity tests for Spark and MLeap pipelines
  7. Optional Spark transformer extension to extend Spark's default transformer offerings

Unified Runtime

Requirements

MLeap is built against Scala 2.11 and Java 8. Because we depend heavily on Typesafe Config, we only support Java 8 at the moment.

MLeap/Spark Version

Choose the right version of the mleap-spark module to export your pipeline. The serialization format is backwards compatible between different versions of MLeap. So if you export a pipeline using MLeap 0.11.0 and Spark 2.1, you can still load that pipeline using MLeap runtime version 0.12.0.

MLeap Version   Spark Version
0.16.0          2.4.5
0.15.0          2.4
0.14.0          2.4
0.13.0          2.3
0.12.0          2.3
0.11.0          2.2
0.11.0          2.1
0.11.0          2.0
0.10.3          2.2
0.10.3          2.1
0.10.3          2.0

Please see the release notes for changes (especially breaking changes) included with each release.
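
The backward-compatibility rule above can be expressed as a small check (a sketch; `can_load` and the version parsing are illustrative, and any breaking changes called out in the release notes take precedence):

```python
# Sketch of the compatibility rule: a bundle exported with MLeap version X
# can be loaded by any MLeap runtime with version >= X, but not vice versa.
def can_load(export_version: str, runtime_version: str) -> bool:
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(runtime_version) >= parse(export_version)

# The example from the text: exported with MLeap 0.11.0, loaded with 0.12.0
print(can_load("0.11.0", "0.12.0"))  # True
```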

Setup

Link with Maven or SBT

SBT

libraryDependencies += "ml.combust.mleap" %% "mleap-runtime" % "0.16.0"

Maven

<dependency>
    <groupId>ml.combust.mleap</groupId>
    <artifactId>mleap-runtime_2.11</artifactId>
    <version>0.16.0</version>
</dependency>

For Spark Integration

SBT

libraryDependencies += "ml.combust.mleap" %% "mleap-spark" % "0.16.0"

Maven

<dependency>
    <groupId>ml.combust.mleap</groupId>
    <artifactId>mleap-spark_2.11</artifactId>
    <version>0.16.0</version>
</dependency>

Spark Packages

$ bin/spark-shell --packages ml.combust.mleap:mleap-spark_2.11:0.16.0

PySpark Integration

Install MLeap from PyPI

$ pip install mleap

Using the Library

For more complete examples, see our other Git repository: MLeap Demos

Create and Export a Spark Pipeline

The first step is to create our pipeline in Spark. For our example we will manually build a simple Spark ML pipeline.

import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.bundle.SparkBundleContext
import org.apache.spark.ml.feature.{Binarizer, StringIndexer}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import resource._

  val datasetName = "./examples/spark-demo.csv"

  val dataframe: DataFrame = spark.sqlContext.read.format("csv")
    .option("header", true)
    .load(datasetName)
    .withColumn("test_double", col("test_double").cast("double"))

  // Use out-of-the-box Spark transformers like you normally would
  val stringIndexer = new StringIndexer().
    setInputCol("test_string").
    setOutputCol("test_index")

  val binarizer = new Binarizer().
    setThreshold(0.5).
    setInputCol("test_double").
    setOutputCol("test_bin")

  val pipelineEstimator = new Pipeline()
    .setStages(Array(stringIndexer, binarizer))

  val pipeline = pipelineEstimator.fit(dataframe)

  // then serialize pipeline
  val sbc = SparkBundleContext().withDataset(pipeline.transform(dataframe))
  for(bf <- managed(BundleFile("jar:file:/tmp/simple-spark-pipeline.zip"))) {
    pipeline.writeBundle.save(bf)(sbc).get
  }

The dataset used for training can be found here

Spark pipelines are not meant to be run outside of Spark. They require a DataFrame and therefore a SparkContext to run. These are expensive data structures and libraries to include in a project. With MLeap, there is no dependency on Spark to execute a pipeline. MLeap dependencies are lightweight and we use fast data structures to execute your ML pipelines.

PySpark Integration

Import the MLeap library in your PySpark job

import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

See the PySpark Integration section of python/README.md for more details.

Create and Export a Scikit-Learn Pipeline

import pandas as pd

from mleap.sklearn.pipeline import Pipeline
from mleap.sklearn.preprocessing.data import FeatureExtractor, LabelEncoder, ReshapeArrayToN1
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame(['a', 'b', 'c'], columns=['col_a'])

categorical_features = ['col_a']

feature_extractor_tf = FeatureExtractor(input_scalars=categorical_features,
                                        output_vector='imputed_features',
                                        output_vector_items=categorical_features)

# Label Encoder for x1
label_encoder_tf = LabelEncoder(input_features=feature_extractor_tf.output_vector_items,
                                output_features='{}_label_le'.format(categorical_features[0]))

# Reshape the output of the LabelEncoder to N-by-1 array
reshape_le_tf = ReshapeArrayToN1()

# One Hot Encoder for x1
one_hot_encoder_tf = OneHotEncoder(sparse=False)
one_hot_encoder_tf.mlinit(prior_tf=label_encoder_tf,
                          output_features='{}_label_one_hot_encoded'.format(categorical_features[0]))

one_hot_encoder_pipeline_x0 = Pipeline([
                                         (feature_extractor_tf.name, feature_extractor_tf),
                                         (label_encoder_tf.name, label_encoder_tf),
                                         (reshape_le_tf.name, reshape_le_tf),
                                         (one_hot_encoder_tf.name, one_hot_encoder_tf)
                                        ])

one_hot_encoder_pipeline_x0.mlinit()
one_hot_encoder_pipeline_x0.fit_transform(data)
one_hot_encoder_pipeline_x0.serialize_to_bundle('/tmp', 'mleap-scikit-test-pipeline', init=True)

# array([[ 1.,  0.,  0.],
#        [ 0.,  1.,  0.],
#        [ 0.,  0.,  1.]])
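
The same encoding can be reproduced with plain scikit-learn, without the MLeap wrappers — a minimal sketch to show what the pipeline above computes (variable names here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

values = ['a', 'b', 'c']
# label-encode to integers, then reshape to an N-by-1 column,
# mirroring the LabelEncoder + ReshapeArrayToN1 steps of the pipeline
labels = LabelEncoder().fit_transform(values).reshape(-1, 1)
# one-hot encode; toarray() yields the dense matrix
encoded = OneHotEncoder().fit_transform(labels).toarray()
print(encoded)  # 3x3 identity matrix, matching the array shown above
```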

Load and Transform Using MLeap

Because we export Spark and Scikit-learn pipelines to a standard format, we can use either our Spark-trained pipeline or our Scikit-learn pipeline from the previous steps to demonstrate usage of MLeap in this section. The choice is yours!

import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._
// load the Spark pipeline we saved in the previous section
val bundle = (for(bundleFile <- managed(BundleFile("jar:file:/tmp/simple-spark-pipeline.zip"))) yield {
  bundleFile.loadMleapBundle().get
}).opt.get

// create a simple LeapFrame to transform
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}
import ml.combust.mleap.core.types._

// MLeap makes extensive use of monadic types like Try
val schema = StructType(StructField("test_string", ScalarType.String),
  StructField("test_double", ScalarType.Double)).get
val data = Seq(Row("hello", 0.6), Row("MLeap", 0.2))
val frame = DefaultLeapFrame(schema, data)

// transform the dataframe using our pipeline
val mleapPipeline = bundle.root
val frame2 = mleapPipeline.transform(frame).get
val data2 = frame2.dataset

// get data from the transformed rows and make some assertions
assert(data2(0).getDouble(2) == 1.0) // string indexer output
assert(data2(0).getDouble(3) == 1.0) // binarizer output

// the second row
assert(data2(1).getDouble(2) == 2.0)
assert(data2(1).getDouble(3) == 0.0)
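
The binarizer assertions above follow directly from the transformer's semantics — Spark's Binarizer emits 1.0 when the input exceeds the threshold and 0.0 otherwise. A plain-Python sketch of that rule applied to the two rows (the `binarize` helper is illustrative, not an MLeap API):

```python
def binarize(value: float, threshold: float = 0.5) -> float:
    # Spark/MLeap Binarizer rule: 1.0 if value > threshold, else 0.0
    return 1.0 if value > threshold else 0.0

# test_double values from the LeapFrame above
print(binarize(0.6))  # 1.0, matching the first-row assertion
print(binarize(0.2))  # 0.0, matching the second-row assertion
```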

Documentation

For more details, please see our documentation, where you can learn how to:

  1. Implement custom transformers that will work with Spark, MLeap and Scikit-learn
  2. Implement custom data types to transform with Spark and MLeap pipelines
  3. Transform with blazing fast speeds using optimized row-based transformers
  4. Serialize MLeap data frames to various formats like avro, json, and a custom binary format
  5. Implement new serialization formats for MLeap data frames
  6. Work through several demonstration pipelines which use real-world data to create predictive pipelines
  7. Review the supported Spark transformers
  8. Review the supported Scikit-learn transformers
  9. Explore the custom transformers provided by MLeap

Contributing

  • Write documentation
  • Write a tutorial/walkthrough for an interesting ML problem
  • Contribute an Estimator/Transformer from Spark
  • Use MLeap at your company and tell us what you think
  • Make a feature request or report a bug on GitHub
  • Make a pull request for an existing feature request or bug report
  • Join the discussion of how to get MLeap into Spark as a dependency. Talk with us on Gitter (see link at top of README.md)

Thank You

Thank you to Swoop for supporting the XGBoost integration.

Contact Information

License

See LICENSE and NOTICE file in this repository.

Copyright 2016 Combust, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
