
data-doctors / kaggle-house-prices-advanced-regression-techniques

Licence: other
Repository for the source code of the Kaggle competition "House Prices: Advanced Regression Techniques"

Programming Languages

Jupyter Notebook
11667 projects
Python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to kaggle-house-prices-advanced-regression-techniques

StoreItemDemand
(117th place - Top 26%) Deep learning using Keras and Spark for the "Store Item Demand Forecasting" Kaggle competition.
Stars: ✭ 24 (-35.14%)
Mutual labels:  regression, kaggle-competition
Recheck Web
recheck for web apps – change comparison tool with local Golden Masters, Git-like ignore syntax and "Unbreakable Selenium" tests.
Stars: ✭ 224 (+505.41%)
Mutual labels:  regression
Math Php
Powerful modern math library for PHP: features descriptive statistics and regressions; continuous and discrete probability distributions; linear algebra with matrices and vectors; numerical analysis; special mathematical functions; algebra
Stars: ✭ 2,009 (+5329.73%)
Mutual labels:  regression
Dynaml
Scala Library/REPL for Machine Learning Research
Stars: ✭ 195 (+427.03%)
Mutual labels:  regression
Machine learning
Study and implementation of the main Machine Learning algorithms in Jupyter Notebooks.
Stars: ✭ 161 (+335.14%)
Mutual labels:  regression
Lightautoml
LAMA - automatic model creation framework
Stars: ✭ 196 (+429.73%)
Mutual labels:  regression
Applied Ml
Code and Resources for "Applied Machine Learning"
Stars: ✭ 156 (+321.62%)
Mutual labels:  regression
Simple Statistics
simple statistics for node & browser javascript
Stars: ✭ 2,679 (+7140.54%)
Mutual labels:  regression
Deepfashion
Apparel detection using deep learning
Stars: ✭ 223 (+502.7%)
Mutual labels:  regression
Peroxide
Rust numeric library with R, MATLAB & Python syntax
Stars: ✭ 191 (+416.22%)
Mutual labels:  regression
Uci Ml Api
Simple API for UCI Machine Learning Dataset Repository (search, download, analyze)
Stars: ✭ 190 (+413.51%)
Mutual labels:  regression
Data Science Toolkit
Collection of stats, modeling, and data science tools in Python and R.
Stars: ✭ 169 (+356.76%)
Mutual labels:  regression
Morpheus Core
The foundational library of the Morpheus data science framework
Stars: ✭ 203 (+448.65%)
Mutual labels:  regression
Remixautoml
R package for automation of machine learning, forecasting, feature engineering, model evaluation, model interpretation, data generation, and recommenders.
Stars: ✭ 159 (+329.73%)
Mutual labels:  regression
Statistical Learning
Lecture slides and R sessions for Trevor Hastie and Rob Tibshirani's "Statistical Learning" Stanford course
Stars: ✭ 223 (+502.7%)
Mutual labels:  regression
Java Deep Learning Cookbook
Code for Java Deep Learning Cookbook
Stars: ✭ 156 (+321.62%)
Mutual labels:  regression
Correlation
🔗 Methods for Correlation Analysis
Stars: ✭ 192 (+418.92%)
Mutual labels:  regression
Image To 3d Bbox
Build a CNN network to predict 3D bounding box of car from 2D image.
Stars: ✭ 200 (+440.54%)
Mutual labels:  regression
Orange3
🍊 📊 💡 Orange: Interactive data analysis
Stars: ✭ 3,152 (+8418.92%)
Mutual labels:  regression
Margins
An R Port of Stata's 'margins' Command
Stars: ✭ 225 (+508.11%)
Mutual labels:  regression

kaggle-house-prices-advanced-regression-techniques

Repository for the source code of the Kaggle competition "House Prices: Advanced Regression Techniques"

Overview

Many factors influence the price a buyer is willing to pay for a house. Some are obvious; others are not. Either way, a rational approach facilitated by machine learning can be very useful in predicting house prices. A data set with 79 features (such as living area, number of rooms, and location) along with sale prices is provided for residential homes in Ames, Iowa. The challenge is to learn the relationship between the important features and the price, and to use it to predict the prices of a new set of houses.

Getting started

You can clone the repository from GitHub to your local machine using the following command (prerequisite: you need Git installed on your system):

$ git clone https://github.com/data-doctors/kaggle-house-prices-advanced-regression-techniques

Data

The data folder contains the original competition data.

Repository structure

01-eda: Exploratory data analysis

Plot distributions of the numerical features and examine their skewness
Plot the correlation matrix between the features
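For illustration, a minimal EDA sketch along these lines might look as follows; the file path matches the data folder above, but the specific plotting choices are assumptions rather than exactly what the notebooks do:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the training data provided in the data folder.
train = pd.read_csv("data/train.csv")

# Skewness of the numerical features (Id is just an index, so drop it).
numeric = train.select_dtypes(include="number").drop(columns=["Id"])
print(numeric.skew().sort_values(ascending=False).head(10))

# Distribution of the target.
sns.histplot(train["SalePrice"], kde=True)
plt.title("SalePrice distribution")
plt.show()

# Correlation matrix between the numerical features.
plt.figure(figsize=(12, 10))
sns.heatmap(numeric.corr(), cmap="coolwarm", center=0)
plt.title("Feature correlation matrix")
plt.show()
```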

02-cleaning: Cleaning and preprocessing of data

Remove skewness of the target variable
Handle missing values in categorical features
Handle missing values in numerical features
Feature selection
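A minimal sketch of these cleaning steps might look like the following; the log transform of the target is standard for this competition, while the specific imputation strategies ("None" for categoricals, column medians for numericals) are assumptions for illustration:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("data/train.csv")

# Remove skewness of the target by log-transforming SalePrice.
y = np.log1p(train["SalePrice"])
X = train.drop(columns=["Id", "SalePrice"])

# Categorical features: treat missing values as their own category.
cat_cols = X.select_dtypes(include="object").columns
X[cat_cols] = X[cat_cols].fillna("None")

# Numerical features: impute each column with its median.
num_cols = X.select_dtypes(include="number").columns
X[num_cols] = X[num_cols].fillna(X[num_cols].median())
```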

03-feature_engineering: Engineering new features

Some examples:

A total area feature was created by adding the basement area and the living area.
The bathroom counts were added together to create a new feature.
For numerical features with significant skewness, logarithms were taken to create new features.
Features that did not contribute significantly to predicting SalePrice were dropped.
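Continuing from the cleaning sketch above, these steps might look roughly like this; the column names are from the Ames data, but the 0.5 weighting of half baths and the 0.75 skewness threshold are assumptions, not taken from the repository:

```python
import numpy as np
from scipy.stats import skew

# Total area: basement area plus above-ground living area.
X["TotalArea"] = X["TotalBsmtSF"] + X["GrLivArea"]

# Combined bathroom count (weighting half baths by 0.5 is an assumption).
X["TotalBath"] = (X["FullBath"] + 0.5 * X["HalfBath"]
                  + X["BsmtFullBath"] + 0.5 * X["BsmtHalfBath"])

# Log-transform numerical features with significant skewness
# (the 0.75 threshold is an assumption).
num_cols = X.select_dtypes(include="number").columns
skewed = [c for c in num_cols if abs(skew(X[c])) > 0.75]
X[skewed] = np.log1p(X[skewed])
```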

04-modelling: Fitting different models on the cleaned data and predicting the house price on the test set

Training

Training all models (bulk training)

The hyperparameters of all the single models were optimized by maximizing the cross-validation score on the training set. To train all the models (kept in the models/tuning folder) in series, the following shell script can be executed:

```
$ ./run_all.sh
RMSE-xgb-CV(7)=0.15017262592+-0.0403780999289
RMSE-lgb-CV(7)=0.230416102431+-0.0982360336472
RMSE-rf-CV(7)=0.178752572944+-0.0495588133233
RMSE-et-CV(7)=0.177138296419+-0.0523244324721
RMSE-lasso-CV(7)=0.167043833535+-0.0590946122368
RMSE-ridge-CV(7)=0.16305872566+-0.0592719750453
RMSE-elasticnet-CV(7)=0.166431639245+-0.0591651827043
```

The optimized parameters were then plugged into the single models kept in the models/single folder.
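As an illustration of what one such tuning step might look like for a single model (the Ridge choice and the parameter grid here are assumptions; cv=7 matches the CV(7) scores above):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Maximize the cross-validation score over a parameter grid. RMSE is
# negated because scikit-learn always maximizes the scoring function.
grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0, 30.0, 100.0]},
    scoring="neg_root_mean_squared_error",
    cv=7,
)
grid.fit(X_encoded, y)  # X_encoded: hypothetical numerically encoded features
print(grid.best_params_, -grid.best_score_)
```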

Saving the submission of a single model

$ python models/single/model_xgb.py save
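The script itself is not reproduced here, but saving a submission for this competition typically amounts to something like the following sketch (model and X_test are hypothetical stand-ins for the fitted model and the encoded test features):

```python
import numpy as np
import pandas as pd

test = pd.read_csv("data/test.csv")

# Predict on the test set and invert the log transform of the target.
preds = np.expm1(model.predict(X_test))  # model, X_test: hypothetical

# The Id/SalePrice columns match the competition's submission format.
pd.DataFrame({"Id": test["Id"], "SalePrice": preds}).to_csv(
    "submission.csv", index=False)
```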

Scores

Best single models:

(CV: cross-validation RMSE on the training set; LB: Kaggle leaderboard RMSE)

Model                 | CV RMSE           | LB RMSE
DecisionTreeRegressor | 0.19013 ± 0.01304 | 0.18804
RandomForestRegressor | 0.14744 ± 0.00871 | 0.14623
ExtraTreesRegressor   | 0.13888 ± 0.01208 | 0.15194
XGBoost               | 0.12137 ± 0.01128 | 0.12317
LightGBM              | 0.20030 ± 0.01182 | 0.21416
Lasso                 | 0.11525 ± 0.01191 | 0.12091
Ridge                 | 0.11748 ± 0.01170 | 0.12263
ElasticNet            | 0.11364 ± 0.01677 | 0.11976
SVM                   | 0.19752 ± 0.01386 | 0.20416

Ensembling

We used 10 single models to predict the results individually. It is well established that stacking/blending the predictions of single models can improve the final result. Ideally, a few of the best-performing but least correlated models are selected for this purpose, rather than all of them.

Inside the 04-modelling/ensembling folder, the correlations and performances of the single models were explored using the corr-coeff notebook.
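A sketch of such a correlation check using cross-validated predictions (models is a hypothetical dict of name-to-estimator pairs; X and y as in the sketches above):

```python
import pandas as pd
from sklearn.model_selection import cross_val_predict

pred_df = pd.DataFrame({
    name: cross_val_predict(est, X, y, cv=7)
    for name, est in models.items()
})
# Pairs with correlation close to 1 add little diversity to a stack.
print(pred_df.corr())
```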

The five best-performing and least correlated models were selected and stacked together (using the 04-modelling/ensembling/stacking notebook) to make the final prediction.
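The notebook itself is not reproduced here, but a minimal out-of-fold stacking sketch might look like this (base_models is a hypothetical list of the five selected estimators; X, y, and X_test as above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions of the base models become the input
# features of a simple linear meta-model.
oof = np.column_stack(
    [cross_val_predict(m, X, y, cv=7) for m in base_models])
meta = LinearRegression().fit(oof, y)

# At prediction time, feed the base models' test predictions to the meta-model.
test_features = np.column_stack(
    [m.fit(X, y).predict(X_test) for m in base_models])
final_pred = np.expm1(meta.predict(test_features))  # invert the log target
```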

Acknowledgments

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It is an incredible alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.
