Deep Learning regression with Keras and Spark
About the repository
The Spark folder of this repository was written using Databricks. If you want to replicate or continue the work, you can check out the free Databricks Community Edition.
The main goal of the repository is to use the Spark infrastructure of Databricks clusters to load and process the data from the Kaggle competition, and to train deep learning models in a distributed fashion.
What you will find
- Brief EDA of the data set. [link]
- Creation and usage of custom Spark pipelines. [link]
- Data preparation. [link]
- Model training. [link]
- Model prediction (test set). [link]
- Model evaluation (comparison of many different models). [link]
Store Item Demand Forecasting Challenge
Link to the Kaggle competition: https://www.kaggle.com/c/demand-forecasting-kernels-only
Datasets: https://www.kaggle.com/c/demand-forecasting-kernels-only/data
Overview
This competition is provided as a way to explore different time series techniques on a relatively simple and clean dataset.
You are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items at 10 different stores.
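To get a feel for the size of the task, the figures above can be turned into a quick back-of-the-envelope calculation. This is only a sketch: the store and item counts come from the overview, while the exact forecast window (1 January to 31 March 2018) is an assumption made for illustration.

```python
from datetime import date

# Figures taken from the competition overview.
N_STORES = 10
N_ITEMS = 50

# Assumption: the 3-month forecast horizon runs from 1 January to 31 March 2018.
HORIZON_START = date(2018, 1, 1)
HORIZON_END = date(2018, 3, 31)

# One time series per (store, item) pair.
n_series = N_STORES * N_ITEMS

# Number of daily forecasts per series, counting both end dates.
horizon_days = (HORIZON_END - HORIZON_START).days + 1

# Total number of values a model has to predict.
n_predictions = n_series * horizon_days

print(n_series, horizon_days, n_predictions)  # 500 90 45000
```

Under these assumptions there are 500 separate store-item series, so the choice between one pooled model and 500 individual models has a direct effect on training cost.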
What's the best way to deal with seasonality? Should stores be modeled separately, or can you pool them together? Does deep learning work better than ARIMA? Can either beat xgboost?
This is a great competition to explore different models and improve your skills in forecasting.
PySpark Dependencies:
Python Dependencies:
To-Do:
- Persistence of the pipeline classes needs to be fixed.
- Pipeline classes need to be revised.
- The data probably needs more feature extraction.
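
As a starting point for the feature extraction item, the following sketch shows two kinds of features that are commonly tried on daily sales data: calendar features derived from the date, and lagged sales values. It is plain Python and is not taken from the repository's pipelines; the function names `calendar_features` and `lagged` and the sample sales numbers are hypothetical. In Spark, the same ideas would likely map to functions such as `dayofweek`, `month`, and `lag` over a window.

```python
from datetime import date


def calendar_features(day: date) -> dict:
    """Return simple calendar features for one sales date (hypothetical helper)."""
    return {
        "day_of_week": day.weekday(),            # Monday = 0, Sunday = 6
        "is_weekend": int(day.weekday() >= 5),   # 1 for Saturday or Sunday
        "month": day.month,
        "day_of_year": day.timetuple().tm_yday,
        "year": day.year,
    }


def lagged(values: list, lag: int) -> list:
    """Shift a series by `lag` steps; positions without history become None."""
    return [values[i - lag] if i >= lag else None for i in range(len(values))]


# Made-up daily sales for a single store-item pair.
sales = [13, 11, 14, 13]

features = calendar_features(date(2017, 12, 31))
sales_lag_1 = lagged(sales, 1)

print(features)
print(sales_lag_1)  # [None, 13, 11, 14]
```

Lag features must only use values that are available at prediction time, so for a 3-month horizon the lag would need to be at least as long as the gap between the last known sale and the forecast date.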