blebreton / Spark Fm Parallelsgd

Licence: apache-2.0
Implementation of Factorization Machines on Spark using parallel stochastic gradient descent (python and scala)

Projects that are alternatives to, or similar to, Spark Fm Parallelsgd

Arima Lstm Hybrid Corrcoef Predict
Applied an ARIMA-LSTM hybrid model to predict future price correlation coefficients of two assets
Stars: ✭ 218 (-0.91%)
Mutual labels:  jupyter-notebook
Weightwatcher
The WeightWatcher tool for predicting the accuracy of Deep Neural Networks
Stars: ✭ 213 (-3.18%)
Mutual labels:  jupyter-notebook
Mirror
Visualisation tool for CNNs in pytorch
Stars: ✭ 219 (-0.45%)
Mutual labels:  jupyter-notebook
Ml Tutorial Experiment
Coding the Machine Learning Tutorial for Learning to Learn
Stars: ✭ 2,489 (+1031.36%)
Mutual labels:  jupyter-notebook
Gwu data mining
Materials for GWU DNSC 6279 and DNSC 6290.
Stars: ✭ 217 (-1.36%)
Mutual labels:  jupyter-notebook
Pytorch Deep Learning Template
A Pytorch Computer Vision template to quick start your next project! 🚀🚀
Stars: ✭ 220 (+0%)
Mutual labels:  jupyter-notebook
Tcdf
Temporal Causal Discovery Framework (PyTorch): discovering causal relationships between time series
Stars: ✭ 217 (-1.36%)
Mutual labels:  jupyter-notebook
Practical 1
Oxford Deep NLP 2017 course - Practical 1: word2vec
Stars: ✭ 220 (+0%)
Mutual labels:  jupyter-notebook
Python Awesome
Learn Python, Easy to learn, Awesome
Stars: ✭ 219 (-0.45%)
Mutual labels:  jupyter-notebook
Amazing Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.
Stars: ✭ 218 (-0.91%)
Mutual labels:  jupyter-notebook
Pixel level land classification
Tutorial demonstrating how to create a semantic segmentation (pixel-level classification) model to predict land cover from aerial imagery. This model can be used to identify newly developed or flooded land. Uses ground-truth labels and processed NAIP imagery provided by the Chesapeake Conservancy.
Stars: ✭ 217 (-1.36%)
Mutual labels:  jupyter-notebook
Research Paper Notes
Notes and Summaries on ML-related Research Papers (with optional implementations)
Stars: ✭ 218 (-0.91%)
Mutual labels:  jupyter-notebook
Hacktoberfest2020
A repo for new open source contributors to begin with open source contribution. Contribute and earn awesome swags.
Stars: ✭ 221 (+0.45%)
Mutual labels:  jupyter-notebook
50 Days Of Ml
A day to day plan for this challenge (50 Days of Machine Learning) . Covers both theoretical and practical aspects
Stars: ✭ 218 (-0.91%)
Mutual labels:  jupyter-notebook
Deform conv pytorch
PyTorch Implementation of Deformable Convolution
Stars: ✭ 217 (-1.36%)
Mutual labels:  jupyter-notebook
Medical Ai Course Materials
Medical AI course online lecture materials
Stars: ✭ 218 (-0.91%)
Mutual labels:  jupyter-notebook
Kitti Dataset
Visualising LIDAR data from KITTI dataset.
Stars: ✭ 217 (-1.36%)
Mutual labels:  jupyter-notebook
Vae Clustering
Unsupervised clustering with (Gaussian mixture) VAEs
Stars: ✭ 220 (+0%)
Mutual labels:  jupyter-notebook
Stock Prediction
Stock price prediction with recurrent neural network. The data is from the Chinese stock.
Stars: ✭ 219 (-0.45%)
Mutual labels:  jupyter-notebook
Edaviz
edaviz - Python library for Exploratory Data Analysis and Visualization in Jupyter Notebook or Jupyter Lab
Stars: ✭ 220 (+0%)
Mutual labels:  jupyter-notebook

FM on Spark with parallel SGD

Factorization Machines (FMs) are a general predictor introduced by Rendle in 2010 that can capture all single and pairwise interactions in a dataset. They can be applied to any real-valued feature vector and also perform well on highly sparse data. An extension of FMs, Field-aware Factorization Machines (FFMs), proved to be a successful method for predicting advertisement clicks in the Display Advertising Challenge on Kaggle.
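
For reference, the second-order FM model scores a feature vector x with n features as

    \hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j

where each feature i is assigned a k-dimensional latent vector v_i, so all pairwise interactions are captured with O(kn) parameters; k corresponds to the factorLength parameter in the tutorial below.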

I built a custom Spark implementation to use in Python and Scala. To make optimal use of parallel computing in Spark, I implemented parallel stochastic gradient descent to train the FMs. This is an alternative to the mini-batch SGD that is currently available in MLlib for training logistic regression models.
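
The scheme is essentially per-partition SGD with weight averaging: each partition runs plain SGD locally from the current weights, and the driver averages the per-partition results before the next round. The sketch below is a minimal illustration of that scheme in PySpark, not this repo's actual code; it assumes an RDD of (x, y) pairs with NumPy feature vectors, and a logistic-loss gradient stands in for the FM gradient.

    import numpy as np

    def gradient(w, x, y):
        # Logistic-loss gradient for a linear model (a stand-in for the FM gradient).
        # Labels y are in {-1, +1}; w and x are NumPy arrays.
        return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

    def train_parallel_sgd(sc, data, w0, iterations=50, iter_sgd=5, alpha=0.01):
        # Each round: broadcast the current weights, run iter_sgd local SGD epochs
        # independently in every partition, then average the resulting weights.
        w = w0
        n_parts = data.getNumPartitions()
        for _ in range(iterations):
            w_bc = sc.broadcast(w)

            def local_sgd(partition):
                points = list(partition)      # materialize: we make several passes
                w_local = w_bc.value.copy()
                for _ in range(iter_sgd):
                    for x, y in points:
                        w_local -= alpha * gradient(w_local, x, y)
                yield w_local

            w = data.mapPartitions(local_sgd).reduce(lambda a, b: a + b) / n_parts
        return w

This also shows why few partitions help: every extra partition adds another locally trained model to the average, and each local model sees less data per round.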

[Image: parallel-sgd]

This implementation shows impressive results in terms of speed and effectiveness.

I worked on this project during my summer internship at ING Netherlands in 2015. ING has strong teams of data scientists, and I thank them for their help during this project. I was also able to use a powerful cluster to test my code and train my models.

Tutorial

Here's a short tutorial on how to use the implementation in PySpark. (Note: the procedure is much the same in Scala; see below.) You may prefer to try it directly using the IPython notebook tutorial FMonSpark_demo_a9a.ipynb. You will need to download the a9a dataset first.

PySpark

  1. Import the script fm_parallel_sgd.py. You can do this by adding the following lines to your code:

    sc.addPyFile("spark-FM-parallelSGD/fm/fm_parallel_sgd.py")

    import fm_parallel_sgd as fm

    or by shipping the script directly when starting PySpark:

    pyspark --py-files spark-FM-parallelSGD/fm/fm_parallel_sgd.py

  2. Preprocess your data so that:

a) it is divided into train and test sets

b) it is an RDD of labeled points

  • Labels should be -1 or 1. If your data has 0/1 labels, transform them with the function fm.transform_data(data).
  • Features should be either SparseVector or DenseVector from the mllib.linalg library.
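
For example, a minimal preprocessing sketch for the a9a dataset (the local path "a9a" and the split seed are assumptions; adjust them to your setup):

    from pyspark.mllib.util import MLUtils

    # Load the a9a dataset in LIBSVM format as an RDD of LabeledPoints.
    data = MLUtils.loadLibSVMFile(sc, "a9a")

    # fm.transform_data converts 0/1 labels to -1/+1; skip this if your labels
    # are already -1/+1.
    data = fm.transform_data(data)

    # 80/20 train/test split.
    train, test = data.randomSplit([0.8, 0.2], seed=17)
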
  3. If you think it makes sense, take a (stratified) sample of your data using RDD.sample(). This is not done as part of the FM procedure.
  4. Check how many partitions your data has. Parallel SGD performs best with as few partitions as possible. Coalesce your data into 1 or 2 partitions per executor by using coalesce(nrPartitions) or repartition(nrPartitions) on your RDD, for example:
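
On a hypothetical cluster with 4 executors:

    # 1-2 partitions per executor keeps the number of locally trained models small.
    train = train.coalesce(8)  # coalesce avoids a full shuffle; repartition(8) forces one
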
  5. Call the function fm.trainFM_parallel_sgd(sc, train, params...). You can specify the following parameters (an example call follows the list):
  • iterations : Nr of iterations of parallel SGD. default=50
  • iter_sgd : Nr of iterations of SGD in each partition. default=5 (between 1 and 10 is better)
  • alpha : Learning rate of SGD. default=0.01
  • regParam : Regularization parameter. default=0.01
  • factorLength : Length of the weight vectors of the FMs. default=4
  • verbose : Whether to output evaluations on the train and validation sets after each iteration. (The code splits your dataset into train (80%) and validation (20%) sets.)
  • savingFilename : If set, the model is saved after each iteration, as a pickle file in your current folder.
  • evalTraining : An instance of the class evaluation, useful for plotting the evolution of the evaluation during training.
    • Create the instance before calling fm.trainFM_parallel_sgd!
    • You can set a modulo to evaluate the model after every #modulo iterations with instance.modulo.
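
A full call with every parameter spelled out might look like this (the values are simply the documented defaults; that the script accepts them as keyword arguments is an assumption):

    w = fm.trainFM_parallel_sgd(sc, train, iterations=50, iter_sgd=5, alpha=0.01,
                                regParam=0.01, factorLength=4, verbose=True)
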
  6. This returns a weight matrix w. If you want to store it for future use, you can use the function fm.saveModel(w, "path/to/store/model").
  7. To evaluate the performance of the model on the test set, call fm.evaluate(test, w). This returns the area under the precision-recall curve, the AUC of the ROC curve, the average logloss, the MSE, and the accuracy.
  8. To calculate the probabilities according to the model for a test set, call fm.predictFM(data, w). This returns an RDD of probability scores.
  9. To load a model that you saved, you can use the function fm.loadModel("path/to/store/model").
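
Putting steps 6-9 together (the model path is a hypothetical example):

    # Evaluate on the held-out test set: PR AUC, ROC AUC, average logloss, MSE, accuracy.
    metrics = fm.evaluate(test, w)

    # Probability scores for the test set, as an RDD.
    scores = fm.predictFM(test, w)

    # Persist the weight matrix and reload it later.
    fm.saveModel(w, "fm_a9a_model")
    w = fm.loadModel("fm_a9a_model")
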
Plots:
  1. You can plot the error (rtv_pr_auc, rtv_auc, logl, MSE) as a function of different learning rates by using fm.plotAlpha(sc, data, alpha_list, params…). alpha_list is a list of the learning rates you want to test. Training is done on 80% of the data; evaluation on the remaining 20%.
  2. You can do the same for the regularization parameter and the factor length with fm.plotRegParam(sc, data, regParam_list, params…) and fm.plotFactorLength(sc, data, factorLength_list, params…).
  3. You can plot a color map of the logloss for learning rate/regParam combinations by using fm.plotAlpha_regParam(sc, data, alpha_list, regParam_list, params…). The brighter the square, the lower the logloss. Training is done on 80% of the data; evaluation on the remaining 20%.
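
For example (the grids below are arbitrary illustrative values):

    # Error curves over a range of learning rates.
    fm.plotAlpha(sc, data, [0.001, 0.003, 0.01, 0.03, 0.1])

    # Logloss color map over learning rate / regularization combinations.
    fm.plotAlpha_regParam(sc, data, [0.001, 0.01, 0.1], [0.0001, 0.001, 0.01])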

Scala

  1. Load the file fm_parallel_sgd.scala. You can do this by adding the following line to your code:

    :load spark-FM-parallelSGD/fm/fm_parallel_sgd.scala

    or by running the script directly when starting Spark:

    spark-shell -i spark-FM-parallelSGD/fm/fm_parallel_sgd.scala

  2. Preprocess your data so that:

a) it is divided into train and test sets

b) it is an RDD of labeled points

  • Labels should be -1 or 1.
  • Features should be Vectors from mllib.linalg.
  3. If you think it makes sense, take a (stratified) sample of your data using RDD.sample(). This is not done as part of the FM procedure.
  4. Check how many partitions your data has. Parallel SGD performs best with as few partitions as possible. Coalesce your data into 1 or 2 partitions per executor by using coalesce(nrPartitions) or repartition(nrPartitions) on your RDD.
  5. Call the function fm.trainFM_parallel_sgd(train, params...). You can specify the following parameters:
  • iterations : Nr of iterations of parallel SGD. default=50
  • iter_sgd : Nr of iterations of SGD in each partition. default=5 (between 1 and 10 is better)
  • alpha : Learning rate of SGD. default=0.01
  • regParam : Regularization parameter. default=0.01
  • factorLength : Length of the weight vectors of the FMs. default=4
  • verbose : Whether to output evaluations on the train and validation sets after each iteration. (The code splits your dataset into train (80%) and validation (20%) sets.)
  6. This returns a weight matrix w.
  7. To evaluate the performance of the model on the test set, call fm.evaluate(test, w). This returns the average logloss.
  8. To calculate the probabilities according to the model for a test set, call fm.predictFM(data, w). This returns an RDD of probability scores.