blebreton / Spark Fm Parallelsgd

Licence: apache-2.0
Implementation of Factorization Machines on Spark using parallel stochastic gradient descent (python and scala)

Projects that are alternatives to, or similar to, Spark Fm Parallelsgd

Arima Lstm Hybrid Corrcoef Predict
Applied an ARIMA-LSTM hybrid model to predict future price correlation coefficients of two assets
Stars: ✭ 218 (-0.91%)
Mutual labels:  jupyter-notebook
Weightwatcher
The WeightWatcher tool for predicting the accuracy of Deep Neural Networks
Stars: ✭ 213 (-3.18%)
Mutual labels:  jupyter-notebook
Mirror
Visualisation tool for CNNs in pytorch
Stars: ✭ 219 (-0.45%)
Mutual labels:  jupyter-notebook
Ml Tutorial Experiment
Coding the Machine Learning Tutorial for Learning to Learn
Stars: ✭ 2,489 (+1031.36%)
Mutual labels:  jupyter-notebook
Gwu data mining
Materials for GWU DNSC 6279 and DNSC 6290.
Stars: ✭ 217 (-1.36%)
Mutual labels:  jupyter-notebook
Pytorch Deep Learning Template
A Pytorch Computer Vision template to quick start your next project! 🚀🚀
Stars: ✭ 220 (+0%)
Mutual labels:  jupyter-notebook
Tcdf
Temporal Causal Discovery Framework (PyTorch): discovering causal relationships between time series
Stars: ✭ 217 (-1.36%)
Mutual labels:  jupyter-notebook
Practical 1
Oxford Deep NLP 2017 course - Practical 1: word2vec
Stars: ✭ 220 (+0%)
Mutual labels:  jupyter-notebook
Python Awesome
Learn Python, Easy to learn, Awesome
Stars: ✭ 219 (-0.45%)
Mutual labels:  jupyter-notebook
Amazing Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.
Stars: ✭ 218 (-0.91%)
Mutual labels:  jupyter-notebook
Pixel level land classification
Tutorial demonstrating how to create a semantic segmentation (pixel-level classification) model to predict land cover from aerial imagery. This model can be used to identify newly developed or flooded land. Uses ground-truth labels and processed NAIP imagery provided by the Chesapeake Conservancy.
Stars: ✭ 217 (-1.36%)
Mutual labels:  jupyter-notebook
Research Paper Notes
Notes and Summaries on ML-related Research Papers (with optional implementations)
Stars: ✭ 218 (-0.91%)
Mutual labels:  jupyter-notebook
Hacktoberfest2020
A repo for new open source contributors to begin with open source contribution. Contribute and earn awesome swags.
Stars: ✭ 221 (+0.45%)
Mutual labels:  jupyter-notebook
50 Days Of Ml
A day to day plan for this challenge (50 Days of Machine Learning) . Covers both theoretical and practical aspects
Stars: ✭ 218 (-0.91%)
Mutual labels:  jupyter-notebook
Deform conv pytorch
PyTorch Implementation of Deformable Convolution
Stars: ✭ 217 (-1.36%)
Mutual labels:  jupyter-notebook
Medical Ai Course Materials
Medical AI course online lecture materials
Stars: ✭ 218 (-0.91%)
Mutual labels:  jupyter-notebook
Kitti Dataset
Visualising LIDAR data from KITTI dataset.
Stars: ✭ 217 (-1.36%)
Mutual labels:  jupyter-notebook
Vae Clustering
Unsupervised clustering with (Gaussian mixture) VAEs
Stars: ✭ 220 (+0%)
Mutual labels:  jupyter-notebook
Stock Prediction
Stock price prediction with recurrent neural network. The data is from the Chinese stock.
Stars: ✭ 219 (-0.45%)
Mutual labels:  jupyter-notebook
Edaviz
edaviz - Python library for Exploratory Data Analysis and Visualization in Jupyter Notebook or Jupyter Lab
Stars: ✭ 220 (+0%)
Mutual labels:  jupyter-notebook

FM on Spark with parallel SGD

Factorization Machines (FMs) are a general predictor introduced by Rendle in 2010 that can capture all single and pairwise interactions in a dataset. They can be applied to any real-valued feature vector and also perform well on highly sparse data. An extension of FMs, Field-aware Factorization Machines (FFMs), proved to be a successful method for predicting advertisement clicks in the Display Advertising Challenge on Kaggle.
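
For reference, the second-order FM model scores a feature vector x with n features as

    \hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j

where each feature i is assigned a k-dimensional latent vector v_i, so all pairwise interactions are captured with O(kn) parameters; k corresponds to the factorLength parameter in the tutorial below.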

I built a custom Spark implementation to use in Python and Scala. To make optimal use of parallel computing in Spark, I implemented parallel stochastic gradient descent to train the FMs. This is an alternative to the mini-batch SGD that is currently available in MLlib for training logistic regression models.
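
The scheme is essentially per-partition SGD with weight averaging: each partition runs plain SGD locally from the current weights, and the driver averages the per-partition results before the next round. The sketch below is a minimal illustration of that scheme in PySpark, not this repo's actual code; it assumes an RDD of (x, y) pairs with NumPy feature vectors, and a logistic-loss gradient stands in for the FM gradient.

    import numpy as np

    def gradient(w, x, y):
        # Logistic-loss gradient for a linear model (a stand-in for the FM gradient).
        # Labels y are in {-1, +1}; w and x are NumPy arrays.
        return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

    def train_parallel_sgd(sc, data, w0, iterations=50, iter_sgd=5, alpha=0.01):
        # Each round: broadcast the current weights, run iter_sgd local SGD epochs
        # independently in every partition, then average the resulting weights.
        w = w0
        n_parts = data.getNumPartitions()
        for _ in range(iterations):
            w_bc = sc.broadcast(w)

            def local_sgd(partition):
                points = list(partition)      # materialize: we make several passes
                w_local = w_bc.value.copy()
                for _ in range(iter_sgd):
                    for x, y in points:
                        w_local -= alpha * gradient(w_local, x, y)
                yield w_local

            w = data.mapPartitions(local_sgd).reduce(lambda a, b: a + b) / n_parts
        return w

This also shows why few partitions help: every extra partition adds another locally trained model to the average, and each local model sees less data per round.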

[Image: parallel-sgd]

This implementation shows impressive results in terms of speed and effectiveness.

I worked on this project during my summer internship at ING Netherlands in 2015. ING has strong teams of data scientists, and I thank them for their help during this project. I was also able to use a powerful cluster to test my code and train my models.

Tutorial

Here's a short tutorial on how to use the implementation in PySpark. (Note: the procedure is much the same in Scala; see below.) You may prefer to try it directly using the IPython notebook tutorial FMonSpark_demo_a9a.ipynb. You will need to download the a9a dataset first.

PySpark

  1. Import the script fm_parallel_sgd.py. You can do this by adding the following lines to your code:

    sc.addPyFile("spark-FM-parallelSGD/fm/fm_parallel_sgd.py")

    import fm_parallel_sgd as fm

    or by shipping the script directly when starting PySpark:

    pyspark --py-files spark-FM-parallelSGD/fm/fm_parallel_sgd.py

  2. Preprocess your data so that:

a) it is divided into train and test sets

b) it is an RDD of labeled points

  • Labels should be -1 or 1. If your data has 0/1 labels, transform them with the function fm.transform_data(data).
  • Features should be either SparseVector or DenseVector from the mllib.linalg library.
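
For example, a minimal preprocessing sketch for the a9a dataset (the local path "a9a" and the split seed are assumptions; adjust them to your setup):

    from pyspark.mllib.util import MLUtils

    # Load the a9a dataset in LIBSVM format as an RDD of LabeledPoints.
    data = MLUtils.loadLibSVMFile(sc, "a9a")

    # fm.transform_data converts 0/1 labels to -1/+1; skip this if your labels
    # are already -1/+1.
    data = fm.transform_data(data)

    # 80/20 train/test split.
    train, test = data.randomSplit([0.8, 0.2], seed=17)
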
  3. If you think it makes sense, take a (stratified) sample of your data using RDD.sample(). This is not done as part of the FM procedure.
  4. Check how many partitions your data has. Parallel SGD performs best with as few partitions as possible. Coalesce your data into 1 or 2 partitions per executor by using coalesce(nrPartitions) or repartition(nrPartitions) on your RDD, for example:
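
On a hypothetical cluster with 4 executors:

    # 1-2 partitions per executor keeps the number of locally trained models small.
    train = train.coalesce(8)  # coalesce avoids a full shuffle; repartition(8) forces one
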
  5. Call the function fm.trainFM_parallel_sgd(sc, train, params...). You can specify the following parameters (an example call follows the list):
  • iterations : Nr of iterations of parallel SGD. default=50
  • iter_sgd : Nr of iterations of SGD in each partition. default=5 (between 1 and 10 is better)
  • alpha : Learning rate of SGD. default=0.01
  • regParam : Regularization parameter. default=0.01
  • factorLength : Length of the weight vectors of the FMs. default=4
  • verbose : Whether to output evaluations on the train and validation sets after each iteration. (The code splits your dataset into train (80%) and validation (20%) sets.)
  • savingFilename : If set, the model is saved after each iteration, as a pickle file in your current folder.
  • evalTraining : An instance of the class evaluation, useful for plotting the evolution of the evaluation during training.
    • Create the instance before calling fm.trainFM_parallel_sgd!
    • You can set a modulo to evaluate the model after every #modulo iterations with instance.modulo.
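
A full call with every parameter spelled out might look like this (the values are simply the documented defaults; that the script accepts them as keyword arguments is an assumption):

    w = fm.trainFM_parallel_sgd(sc, train, iterations=50, iter_sgd=5, alpha=0.01,
                                regParam=0.01, factorLength=4, verbose=True)
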
  6. This returns a weight matrix w. If you want to store it for future use, you can use the function fm.saveModel(w, "path/to/store/model").
  7. To evaluate the performance of the model on the test set, call fm.evaluate(test, w). This returns the area under the precision-recall curve, the AUC of the ROC curve, the average logloss, the MSE, and the accuracy.
  8. To calculate the probabilities according to the model for a test set, call fm.predictFM(data, w). This returns an RDD of probability scores.
  9. To load a model that you saved, you can use the function fm.loadModel("path/to/store/model").
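
Putting steps 6-9 together (the model path is a hypothetical example):

    # Evaluate on the held-out test set: PR AUC, ROC AUC, average logloss, MSE, accuracy.
    metrics = fm.evaluate(test, w)

    # Probability scores for the test set, as an RDD.
    scores = fm.predictFM(test, w)

    # Persist the weight matrix and reload it later.
    fm.saveModel(w, "fm_a9a_model")
    w = fm.loadModel("fm_a9a_model")
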
Plots:
  1. You can plot the error (rtv_pr_auc, rtv_auc, logl, MSE) as a function of different learning rates by using fm.plotAlpha(sc, data, alpha_list, params…). alpha_list is a list of the learning rates you want to test. Training is done on 80% of the data; evaluation on the remaining 20%.
  2. You can do the same for the regularization parameter and the factor length with fm.plotRegParam(sc, data, regParam_list, params…) and fm.plotFactorLength(sc, data, factorLength_list, params…).
  3. You can plot a color map of the logloss for learning rate/regParam combinations by using fm.plotAlpha_regParam(sc, data, alpha_list, regParam_list, params…). The brighter the square, the lower the logloss. Training is done on 80% of the data; evaluation on the remaining 20%.
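
For example (the grids below are arbitrary illustrative values):

    # Error curves over a range of learning rates.
    fm.plotAlpha(sc, data, [0.001, 0.003, 0.01, 0.03, 0.1])

    # Logloss color map over learning rate / regularization combinations.
    fm.plotAlpha_regParam(sc, data, [0.001, 0.01, 0.1], [0.0001, 0.001, 0.01])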

Scala

  1. Load the file fm_parallel_sgd.scala. You can do this by adding the following line to your code:

    :load spark-FM-parallelSGD/fm/fm_parallel_sgd.scala

    or by running the script directly when starting Spark:

    spark-shell -i spark-FM-parallelSGD/fm/fm_parallel_sgd.scala

  2. Preprocess your data so that:

a) it is divided into train and test sets

b) it is an RDD of labeled points

  • Labels should be -1 or 1.
  • Features should be Vectors from mllib.linalg.
  3. If you think it makes sense, take a (stratified) sample of your data using RDD.sample(). This is not done as part of the FM procedure.
  4. Check how many partitions your data has. Parallel SGD performs best with as few partitions as possible. Coalesce your data into 1 or 2 partitions per executor by using coalesce(nrPartitions) or repartition(nrPartitions) on your RDD.
  5. Call the function fm.trainFM_parallel_sgd(train, params...). You can specify the following parameters:
  • iterations : Nr of iterations of parallel SGD. default=50
  • iter_sgd : Nr of iterations of SGD in each partition. default=5 (between 1 and 10 is better)
  • alpha : Learning rate of SGD. default=0.01
  • regParam : Regularization parameter. default=0.01
  • factorLength : Length of the weight vectors of the FMs. default=4
  • verbose : Whether to output evaluations on the train and validation sets after each iteration. (The code splits your dataset into train (80%) and validation (20%) sets.)
  6. This returns a weight matrix w.
  7. To evaluate the performance of the model on the test set, call fm.evaluate(test, w). This returns the average logloss.
  8. To calculate the probabilities according to the model for a test set, call fm.predictFM(data, w). This returns an RDD of probability scores.