
rahulravindran0108 / Boston-House-Price-Prediction

Licence: other
This repository contains files for Udacity's Machine Learning Nanodegree Project: Boston House Price Prediction

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Boston-House-Price-Prediction

FlyingCarUdacity
πŸ›©οΈβš™οΈ 3D Planning, PID Control, Extended Kalman Filter for the Udacity Flying Car Nanodegree // FCND-Term1
Stars: ✭ 16 (-64.44%)
Mutual labels:  udacity-nanodegree
Introduction-to-Computer-Vision
Introduction to Computer Vision (Udacity Course)
Stars: ✭ 67 (+48.89%)
Mutual labels:  udacity-nanodegree
Udacity Nanodegrees
🎓 List of Udacity Nanodegree programs with links to the free courses in their curricula
Stars: ✭ 5,893 (+12995.56%)
Mutual labels:  udacity-nanodegree
UDACITY-Deep-Learning-Nanodegree-PROJECTS
These are the projects I did on my Udacity Deep Learning Nanodegree 🌟 💻 💻. 💥 🌈
Stars: ✭ 18 (-60%)
Mutual labels:  udacity-nanodegree
udacity-fsnd
Udacity Full Stack Web Developer Nanodegree program (FSND) course materials
Stars: ✭ 66 (+46.67%)
Mutual labels:  udacity-nanodegree
SAComputerVisionMachineLearning
Computer Vision and Machine Learning related projects of Udacity's Self-driving Car Nanodegree Program
Stars: ✭ 36 (-20%)
Mutual labels:  udacity-nanodegree
Udacity
This repo includes all the projects I have finished in the Udacity Nanodegree programs
Stars: ✭ 57 (+26.67%)
Mutual labels:  udacity-nanodegree
Nano-Degree-Projects
🎓 Udacity Nano Degree Android Projects. All Needed projects you can check out my work here. Submitted and accepted projects.
Stars: ✭ 68 (+51.11%)
Mutual labels:  udacity-nanodegree
Image-Classifier
Final Project of the Udacity AI Programming with Python Nanodegree
Stars: ✭ 63 (+40%)
Mutual labels:  udacity-nanodegree
MyReads
📚 MyReads is @udacity React Nanodegree Project. This is a book tracking app allows you to select and categorize books you have read, are currently reading, or want to read. The project emphasizes using React to build the application and provides an API server and client library that it should be persisted information as user's interacts with th…
Stars: ✭ 13 (-71.11%)
Mutual labels:  udacity-nanodegree
Self-Driving-Car-Steering-Simulator
The aim of this project is to allow a self driving car to steer autonomously in a virtual environment.
Stars: ✭ 15 (-66.67%)
Mutual labels:  udacity-nanodegree
Udacity-Advance-Lane-detection-of-the-road
Udacity Self-Driving Car Engineer Nanodegree Advanced Lane Finding Project. Identifying lanes using edge detection (Sobel operator, gradient of magnitude and direction, and HLS color space), camera calibration and unwarping (distortion correction and perspective transform), and polynomial fitting for the lanes.
Stars: ✭ 27 (-40%)
Mutual labels:  udacity-nanodegree
Deep-Learning
It contains the coursework and the practice I have done while learning Deep Learning.🚀 👨‍💻💥 🚩🌈
Stars: ✭ 21 (-53.33%)
Mutual labels:  udacity-nanodegree
FCND-Term1-P3-3D-Quadrotor-Controller
Udacity Flying Car Nanodegree - Term 1 - Project 3 - 3D Quadrotor Controller
Stars: ✭ 31 (-31.11%)
Mutual labels:  udacity-nanodegree
Intro-to-Self-Driving-Cars
Self driving cars of udacity
Stars: ✭ 44 (-2.22%)
Mutual labels:  udacity-nanodegree
Data-Analyst-Nanodegree
This repo consists of the projects that I completed as a part of the Udacity's Data Analyst Nanodegree's curriculum.
Stars: ✭ 13 (-71.11%)
Mutual labels:  udacity-nanodegree
udacity-iOS-nanodegrees
List of iOS Udacity Nanodegree programs with links to the free courses in their curricula
Stars: ✭ 52 (+15.56%)
Mutual labels:  udacity-nanodegree
Udacity-Computer-Vision-Nanodegree
📷 Computer Vision Nanodegree Repository
Stars: ✭ 34 (-24.44%)
Mutual labels:  udacity-nanodegree
udacity-cvnd-projects
My solutions to the projects assigned for the Udacity Computer Vision Nanodegree
Stars: ✭ 36 (-20%)
Mutual labels:  udacity-nanodegree
SUSE-Cloud-Native-Foundations-Scholarship
Udacity Suse Cloud Native Foundations Scholarship Course Walkthrough
Stars: ✭ 34 (-24.44%)
Mutual labels:  udacity-nanodegree

Boston Housing Dataset Report

Project Description

You want to be the best real estate agent out there. In order to compete with other agents in your area, you decide to use machine learning. You are going to use various statistical analysis tools to build the best model to predict the value of a given house. Your task is to find the best price your client can sell their house at. The best estimate comes from the model that generalizes best to unseen data.

For this assignment your client has a house with the following feature set: [11.95, 0.00, 18.100, 0, 0.6590, 5.6090, 90.00, 1.385, 24, 680.0, 20.20, 332.09, 12.13]. To get started, use the example scikit-learn implementation. You will have to modify the code slightly to get the file up and running.

Statistical Analysis and Data Exploration

Loading the dataset:

import numpy as np
from sklearn import datasets

city_data = datasets.load_boston()

# Get the labels and features from the housing data
housing_prices = city_data.target
housing_features = city_data.data

Let us now begin exploring the data using NumPy.

  • Number of data points (houses)?
number_of_houses = housing_features.shape[0]
print "number of houses:",number_of_houses

number of houses: 506


  • Number of features?
number_of_features = housing_features.shape[1]
print "number of features:",number_of_features

number of features: 13

  • Minimum and maximum housing prices?
max_price = np.max(housing_prices)
min_price = np.min(housing_prices)

print "max price of house:",max_price
print "min price of house:",min_price

max price of house: 50.0 min price of house: 5.0


  • Mean and median Boston housing prices?
mean_price = np.mean(housing_prices)
median_price = np.median(housing_prices)

print "mean price of house:",mean_price
print "median price of house:",median_price

mean price of house: 22.5328063241 median price of house: 21.2


  • Standard deviation?
standard_deviation = np.std(housing_prices)
print "standard deviation for prices of house:",standard_deviation

standard deviation for prices of house: 9.18801154528

Evaluating Model Performance

  • Which measure of model performance is best to use for predicting Boston housing data and analyzing the errors? Why do you think this measurement is most appropriate? Why might the other measurements not be appropriate here?

In a previous iteration of this work I used R^2, but since the project demands a measure of the error, I believe the mean squared error (MSE) gives us the best solution. The reason is that MSE emphasises large errors, unlike the mean absolute or median absolute error, and this is generally seen as desirable by statisticians. However, another important aspect for this project is which measure actually minimises the error the most, so intuition alone is not enough and we need some kind of graphical evidence. The candidate measures are listed below, followed by a quick numerical comparison.

  • Mean Absolute Error
  • Mean Squared Error
  • Median Absolute Error
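
As a quick illustration of why MSE emphasises large errors, here is a minimal sketch comparing the three metrics from sklearn.metrics on made-up values (not project results):

# Toy comparison of the three candidate metrics; the values are invented
# purely to show how a single large miss dominates the MSE.
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error)

y_true = [20.0, 22.0, 25.0, 30.0]
y_pred = [21.0, 22.5, 24.0, 40.0]   # one large miss of 10.0

print("MAE:  ", mean_absolute_error(y_true, y_pred))     # 3.125
print("MSE:  ", mean_squared_error(y_true, y_pred))      # 25.5625
print("MedAE:", median_absolute_error(y_true, y_pred))   # 1.0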

  • Why is it important to split the Boston housing data into training and testing data? What happens if you do not do this?

One of the biggest problems that could occur is overfitting or underfitting. If we did not split the dataset, we would be using all of our data for training, and there is a high chance that the algorithm becomes tailor-made to that particular data. As a result, performance on unseen data could drop sharply.

Some of the advantages that splitting the dataset gives us are as follows:

  • It allows us to check whether we have underfit or overfit the data.
  • It provides a sense of validation for our model.
  • It gives an estimate of the performance one can expect on unseen data, so we are better prepared for it (a minimal split is sketched after this list).
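
A minimal sketch of such a split; the 70/30 ratio and random_state are illustrative assumptions, not values specified by the project:

# Hold out 30% of the Boston data for testing; the split ratio and
# random_state here are illustrative assumptions only.
from sklearn import datasets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases

city_data = datasets.load_boston()   # load_boston requires scikit-learn < 1.2
X_train, X_test, y_train, y_test = train_test_split(
    city_data.data, city_data.target, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)   # (354, 13) (152, 13)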

  • What does grid search do and why might you want to use it?

Grid search helps with parameter tuning and with selecting an appropriate model based on the parameters we wish to tune. While doing so, it also uses cross-validation folds, which reduces the risk of overfitting or underfitting. It also offers the flexibility of supplying a customised scorer function, as sketched below.
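A sketch of how grid search might be applied here; the max_depth grid of 1-10, the MSE-based scorer and random_state are assumptions, not the project's exact settings:

# Tune max_depth by 3-fold grid search with an MSE-based scorer.
# The parameter grid, scorer and random_state are illustrative assumptions.
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer, mean_squared_error

city_data = datasets.load_boston()   # load_boston requires scikit-learn < 1.2

# Lower MSE is better, so the scorer is negated for GridSearchCV.
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid={'max_depth': list(range(1, 11))},
                    scoring=mse_scorer, cv=3)
grid.fit(city_data.data, city_data.target)
print(grid.best_params_)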


  • Why is cross validation useful and why might we use it with grid search?

The goal of cross-validation is to set aside a dataset to "test" the model during the training phase (i.e., the validation dataset), in order to limit problems like overfitting and to give insight into how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).

Grid search performs 3-fold cross-validation by default (in the scikit-learn version used here), so we can easily validate the optimized model it produces. Cross-validation helps ensure that we have a sufficiently good model and that we do not badly overfit or underfit the data. Since grid search combines parameter optimization with cross-validation to find the optimal parameters, we can be reasonably confident that there are no significant problems of underfitting or overfitting. A cross-validation sketch for a single candidate model follows below.
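
For illustration, a minimal sketch of 3-fold cross-validation for a single candidate tree; max_depth=5 is just an example value, not the project's choice:

# Score one candidate model with 3-fold cross-validation, reporting
# per-fold MSE; max_depth=5 is an arbitrary example value.
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

city_data = datasets.load_boston()   # load_boston requires scikit-learn < 1.2
scores = cross_val_score(DecisionTreeRegressor(max_depth=5, random_state=0),
                         city_data.data, city_data.target,
                         scoring='neg_mean_squared_error', cv=3)
print(-scores)          # MSE of each fold
print(-scores.mean())   # average MSE across folds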

Analyzing Model Performance

  • Look at all learning curve graphs provided. What is the general trend of training and testing error as training size increases?

As the training size increases, the error decreases and the model predicts better. This can be seen from the learning curves. As the max depth of the decision tree and the training size increase, the training error stays close to 0 while the testing error continues to decrease. Thus, the regression model improves as the training size increases.

This, however, is the general trend only at higher depths. At small depths, both the training and the testing error flatten out quite early, so increasing the training size beyond that point does little to reduce the testing error. (A sketch of how such learning curves can be computed follows below.)
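
A sketch of how one of these learning curves might be computed numerically: train on progressively larger subsets and record training and testing MSE. The depth, split and step size are illustrative assumptions:

# Learning-curve sketch: grow the training set in steps of 50 examples and
# record train/test MSE. Depth, split and step size are assumptions.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

city_data = datasets.load_boston()   # load_boston requires scikit-learn < 1.2
X_train, X_test, y_train, y_test = train_test_split(
    city_data.data, city_data.target, test_size=0.3, random_state=0)

for size in range(50, len(X_train) + 1, 50):
    model = DecisionTreeRegressor(max_depth=10, random_state=0)
    model.fit(X_train[:size], y_train[:size])
    train_err = mean_squared_error(y_train[:size], model.predict(X_train[:size]))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(size, round(train_err, 2), round(test_err, 2))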


  • Look at the learning curves for the decision tree regressor with max depth 1 and 10 (first and last learning curve graphs). When the model is fully trained does it suffer from either high bias/underfitting or high variance/overfitting?

Let's analyze the scenario where the max depth is 1:

In this case, we are underfitting the dataset: both the training and the testing error are high and flatten out after a while. Thus, there is a high amount of bias when using max depth = 1. This is somewhat intuitive, as we do not let the tree expand, which restricts how much it can learn. Hence, with a depth of 1 we cannot capture the complexity of the dataset.

Let's analyze the scenario where the max depth is 10:

Here, we are clearly overfitting. Following the same reasoning as above, since we allow the tree to grow to a depth of 10, it can split finely enough to fit the training data almost exactly. Thus, we see an error of 0 on the training dataset, while the testing error flattens out with a few occasional spikes.


  • Look at the model complexity graph. How do the training and test error relate to increasing model complexity? Based on this relationship, which model (max depth) best generalizes the dataset and why?

I think a max_depth of around 5 generalises the dataset the best. It does not overfit the way higher depths do, where the training error simply drops to 0; that would mean we are fitting the training data completely and there is a high chance of overfitting. The model complexity graph supports this conclusion (a numerical depth sweep is sketched below).
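
A sketch of a model-complexity sweep consistent with this discussion, varying max_depth from 1 to 10; the split and random_state are illustrative assumptions:

# Complexity sweep: train one tree per max_depth value and compare
# training vs. testing MSE. Split and random_state are assumptions.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

city_data = datasets.load_boston()   # load_boston requires scikit-learn < 1.2
X_train, X_test, y_train, y_test = train_test_split(
    city_data.data, city_data.target, test_size=0.3, random_state=0)

for depth in range(1, 11):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(depth, round(train_err, 2), round(test_err, 2))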

Model Prediction

  • The model predicts a housing price, with the tuned model parameters (max depth) reported by grid search. Note that, due to the small amount of randomization in the code, it is recommended to run the program several times to identify the most common/reasonable price and model complexity.
Final Model:

DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=3, min_samples_split=1,
           min_weight_fraction_leaf=0.0, random_state=None,
           splitter='best')

House: [11.95, 0.0, 18.1, 0, 0.659, 5.609, 90.0,
 1.385, 24, 680.0, 20.2, 332.09, 12.13]
Prediction: [ 20.96776316]

The final decision tree regressor uses the following optimized parameters (a sketch that refits the model and reproduces the prediction follows this list):

  • max_depth: 5
  • min_samples_leaf: 3
  • min_samples_split: 1
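
A sketch of refitting a tree with these parameters and predicting the client's house. Note that min_samples_split=1 is only accepted by old scikit-learn releases; current versions require at least 2, which is what the sketch uses, so the exact prediction may differ slightly from the value reported above:

# Refit a tree with (approximately) the reported parameters on the full
# dataset and predict the client's house. min_samples_split is set to 2
# because the reported value of 1 is rejected by current scikit-learn.
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor

city_data = datasets.load_boston()   # load_boston requires scikit-learn < 1.2
client_house = [[11.95, 0.00, 18.100, 0, 0.6590, 5.6090, 90.00,
                 1.385, 24, 680.0, 20.20, 332.09, 12.13]]

model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=3, min_samples_split=2)
model.fit(city_data.data, city_data.target)
print(model.predict(client_house))   # compare with the ~20.97 reported above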

  • Compare prediction to earlier statistics and make a case if you think it is a valid model.

The central tendency of the dataset, in terms of the mean and the median, is as follows: mean price of house: 22.5328063241; median price of house: 21.2.

Our prediction of roughly 20.97 for the given house is close to both of these values, which makes it credible. Therefore we can say that we have a model that fits the data to a reasonable extent.

However, I would say that the model still does not completely describe the variance in the dataset. If we plot the residuals, we get a cyclical pattern rather than random scatter. This suggests that a model of this form cannot fully generalise the data; we are underfitting the dataset. (A sketch of how such a residual plot can be produced follows the figure reference below.)

![Image of Residual Plot](residual_plot.png)
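
A sketch of how such a residual plot might be generated with matplotlib; the fitted model settings are illustrative assumptions, and the output file name matches the figure referenced above:

# Residual-plot sketch: predictions vs. residuals for a depth-5 tree fitted
# on the full dataset; model settings here are illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor

city_data = datasets.load_boston()   # load_boston requires scikit-learn < 1.2
X, y = city_data.data, city_data.target

model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=3).fit(X, y)
predicted = model.predict(X)
residuals = y - predicted

plt.scatter(predicted, residuals, s=10)
plt.axhline(0, color='red')
plt.xlabel('Predicted price')
plt.ylabel('Residual')
plt.savefig('residual_plot.png')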
