
dnkirill / Allstate_capstone

Allstate Kaggle Competition ML Capstone Project

Projects that are alternatives to or similar to Allstate capstone

Deep Learning Keras Tensorflow
Introduction to Deep Neural Networks with Keras and Tensorflow
Stars: ✭ 2,868 (+3883.33%)
Mutual labels:  jupyter-notebook, tutorial, keras-tutorials, cudnn
Data Science Your Way
Ways of doing Data Science Engineering and Machine Learning in R and Python
Stars: ✭ 530 (+636.11%)
Mutual labels:  jupyter-notebook, data-science, notebook, tutorial
Deep Recommender System
Applications of deep learning in recommender systems, with a summary of relevant papers.
Stars: ✭ 657 (+812.5%)
Mutual labels:  kaggle, jupyter-notebook, data-science
Lambdaschooldatascience
Completed assignments and coding challenges from the Lambda School Data Science program.
Stars: ✭ 22 (-69.44%)
Mutual labels:  jupyter-notebook, data-science, notebook
Python Introducing Pandas
Introduction to pandas Treehouse course
Stars: ✭ 24 (-66.67%)
Mutual labels:  jupyter-notebook, data-science, tutorial
Dogs vs cats
Dogs vs. Cats (the Kaggle competition).
Stars: ✭ 570 (+691.67%)
Mutual labels:  kaggle, jupyter-notebook, keras-tutorials
Data Science Competitions
Goal of this repo is to provide the solutions of all Data Science Competitions(Kaggle, Data Hack, Machine Hack, Driven Data etc...).
Stars: ✭ 572 (+694.44%)
Mutual labels:  kaggle, data-science, xgboost
Har Keras Coreml
Human Activity Recognition (HAR) with Keras and CoreML
Stars: ✭ 23 (-68.06%)
Mutual labels:  jupyter-notebook, data-science, tutorial
Mli Resources
H2O.ai Machine Learning Interpretability Resources
Stars: ✭ 428 (+494.44%)
Mutual labels:  jupyter-notebook, data-science, xgboost
Machine Learning From Scratch
Succinct Machine Learning algorithm implementations from scratch in Python, solving real-world problems (Notebooks and Book). Examples of Logistic Regression, Linear Regression, Decision Trees, K-means clustering, Sentiment Analysis, Recommender Systems, Neural Networks and Reinforcement Learning.
Stars: ✭ 42 (-41.67%)
Mutual labels:  jupyter-notebook, data-science, notebook
Machinelearningcourse
A collection of notebooks of my Machine Learning class written in python 3
Stars: ✭ 35 (-51.39%)
Mutual labels:  kaggle, jupyter-notebook, data-science
Computervision Recipes
Best Practices, code samples, and documentation for Computer Vision.
Stars: ✭ 8,214 (+11308.33%)
Mutual labels:  jupyter-notebook, data-science, tutorial
Intro To Python
An intro to Python & programming for wanna-be data scientists
Stars: ✭ 536 (+644.44%)
Mutual labels:  jupyter-notebook, data-science, tutorial
Nteract
📘 The interactive computing suite for you! ✨
Stars: ✭ 5,713 (+7834.72%)
Mutual labels:  jupyter-notebook, data-science, notebook
Code search
Code For Medium Article: "How To Create Natural Language Semantic Search for Arbitrary Objects With Deep Learning"
Stars: ✭ 436 (+505.56%)
Mutual labels:  jupyter-notebook, data-science, tutorial
4th Place Home Credit Default Risk
Codes and dashboards for 4th place solution for Kaggle's Home Credit Default Risk competition
Stars: ✭ 23 (-68.06%)
Mutual labels:  kaggle, jupyter-notebook, data-science
Data Privacy For Data Scientists
A workshop on data privacy methods for data scientists.
Stars: ✭ 53 (-26.39%)
Mutual labels:  jupyter-notebook, data-science, tutorial
Pycon 2019 Tutorial
Data Science Best Practices with pandas
Stars: ✭ 410 (+469.44%)
Mutual labels:  jupyter-notebook, data-science, tutorial
Hands On Nltk Tutorial
The hands-on NLTK tutorial for NLP in Python
Stars: ✭ 419 (+481.94%)
Mutual labels:  jupyter-notebook, notebook, tutorial
Awesome Google Colab
Google Colaboratory Notebooks and Repositories (by @firmai)
Stars: ✭ 863 (+1098.61%)
Mutual labels:  jupyter-notebook, data-science, tutorial

Allstate Claims Severity Project

Structure

This project provides a sample solution to the Allstate Claims Severity competition on Kaggle. It was initially developed as the Capstone (final) project for the Udacity Machine Learning Nanodegree program, but, with slight modifications, it is now available for everyone. Rather than chasing the best Kaggle score, which requires good hardware and a considerable amount of time, this project aims to guide the reader through the process of training and optimizing two models (XGBoost and an MLP) and stacking their results using a linear regression. This structure makes it easy for beginners to grasp the basic concepts and then apply them in practice.

The project is divided into four parts, each described in a corresponding Jupyter notebook. Your feedback on improving the notebooks is welcome!

  • Part 1: Data Discovery — we get acquainted with Allstate's dataset and do basic data analysis: we collect summary statistics, plot a correlation matrix, and compare the train and test distributions.
  • Part 2: XGBoost model training and tuning — we tackle the regression problem with XGBoost, a powerful and popular gradient boosting library. We start with a simple model, develop a framework for hyper-parameter optimization, and train optimized models (a minimal tuning sketch follows this list).
  • Part 3: Multilayer Perceptron model training and tuning — we approach the same problem with feed-forward neural networks, using TensorFlow as the backend and Keras as the frontend. We introduce (and show by example) the concept of overfitting, use K-Fold cross-validation to compare the performance of our models, tune hyper-parameters (number of units, dropout rates, optimizers) via Hyperopt, and select the best model.
  • Part 4: Linear regression stacking and results validation — we combine the predictions of XGBoost and the MLP using a linear regression (a stacking sketch also follows below), examine the results, and test their statistical significance.
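
For a rough idea of how the Hyperopt-based tuning in Parts 2 and 3 works, here is a minimal, self-contained sketch; the search space, the synthetic data, and the recent scikit-learn/xgboost APIs are illustrative assumptions, not the notebooks' exact code:

    # Minimal Hyperopt sketch: illustrative, not the notebooks' exact code.
    import numpy as np
    import xgboost as xgb
    from hyperopt import fmin, tpe, hp, STATUS_OK
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    # Synthetic stand-in for the Allstate data.
    rng = np.random.RandomState(0)
    X, y = rng.rand(500, 10), rng.rand(500) * 1000
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    def objective(params):
        """Train one candidate model and report validation MAE to Hyperopt."""
        model = xgb.XGBRegressor(max_depth=int(params['max_depth']),
                                 learning_rate=params['learning_rate'],
                                 n_estimators=100)
        model.fit(X_tr, y_tr)
        mae = mean_absolute_error(y_val, model.predict(X_val))
        return {'loss': mae, 'status': STATUS_OK}

    # Illustrative search space; the notebooks tune more parameters than this.
    space = {'max_depth': hp.quniform('max_depth', 3, 10, 1),
             'learning_rate': hp.loguniform('learning_rate',
                                            np.log(0.01), np.log(0.3))}

    best = fmin(objective, space, algo=tpe.suggest, max_evals=20)
    print(best)

MAE is used as the loss because it is the competition's evaluation metric.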
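And a minimal sketch of the Part 4 idea, stacking two base models with a linear regression; scikit-learn models stand in for the actual XGBoost and Keras models, and a simple holdout split stands in for the notebook's out-of-fold scheme:

    # Minimal stacking sketch: fit two base models, then fit a linear
    # regression on their holdout predictions (scikit-learn stand-ins).
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.neural_network import MLPRegressor
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.rand(500, 10)
    y = X.sum(axis=1) + 0.1 * rng.randn(500)

    # Hold out part of the data for the stacker to avoid leakage.
    X_base, X_hold = X[:400], X[400:]
    y_base, y_hold = y[:400], y[400:]

    gbm = GradientBoostingRegressor(random_state=0).fit(X_base, y_base)
    mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000,
                       random_state=0).fit(X_base, y_base)

    # Level-2 features: base-model predictions on the holdout set.
    level2 = np.column_stack([gbm.predict(X_hold), mlp.predict(X_hold)])
    stacker = LinearRegression().fit(level2, y_hold)
    print('stacker weights:', stacker.coef_, 'intercept:', stacker.intercept_)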

You can also read the Capstone Report, which summarizes the implementation and the methodology of the whole project without going deep into the details.

Requirements

Dataset

The dataset needs to be downloaded separately (21 MB); just unzip it into the same directory as the notebooks. The dataset is available for free on Kaggle's competition page.
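
Once unzipped, the files load as usual. A minimal sketch, assuming the competition's standard train.csv/test.csv file names:

    import pandas as pd

    train = pd.read_csv('train.csv')  # features plus the 'loss' target column
    test = pd.read_csv('test.csv')    # the same features, without 'loss'
    print(train.shape, test.shape)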

Pretrained models

To get results quickly, the default option is to use pretrained models. At the beginning of the XGBoost and MLP notebooks there is a flag, USE_PRETRAINED = True, which can be set to False to enable full training. The default option (True) just loads ready-to-use models from the pretrained directory.
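
The flag logic itself is simple; a hypothetical sketch (the file name and training parameters are illustrative, not the repo's actual ones):

    import numpy as np
    import xgboost as xgb

    USE_PRETRAINED = True  # default: load a ready-to-use model

    if USE_PRETRAINED:
        booster = xgb.Booster()
        booster.load_model('pretrained/xgb_model.bin')  # illustrative file name
    else:
        # Stand-in for the notebook's actual training code.
        rng = np.random.RandomState(0)
        dtrain = xgb.DMatrix(rng.rand(100, 10), label=rng.rand(100))
        booster = xgb.train({'objective': 'reg:squarederror'}, dtrain,
                            num_boost_round=50)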

Software

This project uses the following software (if a version number is omitted, the latest version is recommended):

  • Python stack: python 2.7.12, numpy, scipy, sklearn, pandas, matplotlib, h5py.
  • XGBoost: a multi-threaded xgboost build should be compiled; the xgboost Python package is also required.
  • Deep Learning stack: CUDA 8.0.44, cuDNN 5.1, TensorFlow 0.11.0rc (compiled with GPU flags), Keras. A tiny sanity-check sketch follows this list.
  • Hyperopt for hyper-parameter optimization: the hyperopt and networkx Python packages, MongoDB 3.2, and the pymongo driver.
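
As a quick sanity check of the deep learning stack, here is a tiny Keras MLP of the kind tuned in Part 3; the layer sizes and dropout rate are illustrative, and the snippet assumes Keras 2 (the Keras 1.x used with TensorFlow 0.11 had slightly different argument names, e.g. nb_epoch instead of epochs):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    # Tiny MLP; sizes and dropout are illustrative, not the tuned values.
    model = Sequential([
        Dense(128, activation='relu', input_dim=10),
        Dropout(0.3),
        Dense(64, activation='relu'),
        Dense(1),  # single output for the regression target
    ])
    model.compile(optimizer='adam', loss='mae')

    rng = np.random.RandomState(0)
    model.fit(rng.rand(256, 10), rng.rand(256),
              epochs=2, batch_size=32, verbose=0)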

Guide to running this project

Using AWS instances (recommended)

Step 1. Launch an EC2 instance. The best option for running the project is to use AWS EC2 instances:

  • c4.8xlarge, a CPU-optimized instance, for XGBoost calculations (best for Part 2).
  • p2.xlarge, a GPU-optimized instance, for MLP and ensemble calculations (best for Parts 3 and 4). If you run an Ireland-based spot instance, the price will be about $0.15-0.20 per hour.

Please make sure you run Ubuntu 14.04. For the Ireland region you can use this AMI: ami-ed82e39e. Also, add 30 GB of EBS volume to your instance. Your security group should be configured to allow incoming connections on port 8888, which is used by Jupyter.

Step 2. Clone this project

sudo apt-get install git
cd ~; git clone https://github.com/dnkirill/allstate_capstone.git
cd allstate_capstone

Step 3. Deploy the configuration. Deployment scripts (XGBoost-only and XGBoost + MLP) for Ubuntu instances are provided in the config directory of this project.

Option 1: bootstrap_xgb_hyperopt.sh configures an instance (c4.8xlarge is recommended) for XGBoost and Hyperopt calculations. It installs the essential libraries: python 2.7.12, xgboost, mongodb, hyperopt, and the Python stack (numpy, scipy, pandas, sklearn, etc.). Run this if you don't plan to train the MLP or ensembles; the Part 1 and Part 2 notebooks don't require anything beyond the scope of this script.

Option 2 (full): bootstrap_all.sh is the full deployment script; in addition, it installs CUDA, cuDNN, TensorFlow, and Keras, which are required to run the Part 3 and Part 4 notebooks. Optional, but important: to speed up calculations, download the cuDNN library (tarball) into your home directory before running this script. cuDNN 5.1 works best with this configuration.

It takes about 20 minutes to configure the instance. After that, all packages are installed, the Jupyter server is ready, and you can connect to it in your browser at {instance_public_dns}:8888.

Using your own hardware

Of course, it's possible to run the project on your own machine. Install the software listed in the Requirements section above: the Python stack, a multi-threaded XGBoost build, the deep learning stack (only needed for Parts 3 and 4), and Hyperopt with MongoDB. Provided the basic Python packages (numpy, scipy, sklearn, pandas, matplotlib) are installed, this should be enough to run the project: start the Jupyter server and run the notebooks.
