In this project, we will analyze house prices from Saratoga County, New York.

First we will load the required libraries and read in a dataset of house prices.

library(dplyr)
library(readr)
library(ggplot2)
library(knitr)
library(caret)

houses <- read_csv("http://tiny.cc/mosaic/SaratogaHouses.csv")
houses %>% head() %>% kable() 
 Price  Living.Area  Baths  Bedrooms  Fireplace  Acres  Age
142212         1982    1.0         3          N   2.00  133
134865         1676    1.5         3          Y   0.38   14
118007         1694    2.0         3          Y   0.96   15
138297         1800    1.0         2          Y   0.48   49
129470         2088    1.0         3          Y   1.84   29
206512         1456    2.0         3          N   0.98   10

Examining the first few rows shows what information the dataset provides: the sale price, living area, number of baths and bedrooms, whether the house has a fireplace, lot size in acres, and the age of the house.

Next we will recode the Fireplace column as numeric (0 for no fireplace, 1 for a fireplace) so that every column is numeric.

houses <- houses %>% mutate(Fireplace = ifelse(Fireplace == "N",0,1))
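
As a quick check (an extra step, not shown in the original output), we can confirm that every column is now numeric:

# verify that all columns are numeric after recoding Fireplace
sapply(houses, is.numeric)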

Next we will split the data into a training set and a testing set so we can evaluate some machine learning algorithms. We’ll use 80% of the data for training and 20% for testing.

set.seed(3)

index <- 1:nrow(houses)
testindex <- sample(index, trunc(length(index)/5))

train <- na.omit(houses[-testindex,]) 

test <- na.omit(houses[testindex,])
testsize <- nrow(test)
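
Since caret is already loaded, an equivalent split could also be made with createDataPartition, which stratifies the split on the response; a minimal sketch (the names part, train2, and test2 are introduced here only for illustration):

# alternative 80/20 split using caret, stratified on Price
part <- createDataPartition(houses$Price, p = 0.8, list = FALSE)
train2 <- na.omit(houses[part, ])
test2 <- na.omit(houses[-part, ])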

We’ll compare the models by their root mean squared error (RMSE) on the test set. First we’ll try the null model, which always predicts the mean of the response variable.

mod <- lm(Price ~ 1, data = train)

pred <- predict(mod, test)

mse <- sum( ( pred - test$Price )^2 ) / testsize

mse^(1/2)
## [1] 99636.08

So our typical prediction error is about $100,000, which is unacceptably large.
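
As a sanity check, the null model’s single fitted value is just the mean of Price in the training set, and since the same error computation recurs for every model below, it could be wrapped in a small helper (rmse is a name introduced here for illustration, not used elsewhere in the post):

# the null model predicts the training-set mean for every house
mean(train$Price)

# hypothetical helper for the repeated error computation
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
rmse(pred, test$Price)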

Next we’ll try a linear model.

mod <- lm(Price ~ ., data = train)

pred <- predict(mod, test)

mse <- sum( ( pred - test$Price )^2 ) / testsize

mse^(1/2)
## [1] 57189.04

This is a substantial improvement: the RMSE drops from about $100,000 to about $57,000.
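
To see which predictors drive the fit, we could also inspect the estimated coefficients, standard errors, and p-values (output not shown):

# coefficient table for the fitted linear model
coef(summary(mod))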

Now we’ll try a Conditional Inference Tree.

mod <- train(Price ~ ., data = train, method = "ctree")

pred <- predict(mod, test)

mse <- sum( ( pred - test$Price )^2 ) / testsize

mse^(1/2)
## [1] 63163.48

This is comparable to, though slightly worse than, the linear model. Finally we will plot the fitted tree.

plot(mod$finalModel)
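
Because the tree was fit through caret, the train object also records the tuning results and a variable importance ranking, which could be inspected with something like:

mod$bestTune   # tuning parameter selected by caret's resampling
varImp(mod)    # variable importance for the fitted tree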

Principal Component Analysis

Finally we’ll apply PCA (principal component analysis) and see whether it improves our results. First we will compute the principal components from the feature columns.

prin_comp <- prcomp(train %>% select(-Price), scale. = T)

std_dev <- prin_comp$sdev
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
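
Before plotting, the proportions can also be printed directly, for example:

# proportion of variance explained by each principal component
round(prop_varex, 3)
round(cumsum(prop_varex), 3)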

Next we will examine the cumulative proportion of variance explained for the principal components.

plot(cumsum(prop_varex), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")

Most of the variance is explained by the first two components. Next we will make a biplot of the first two principal components.

biplot(prin_comp, scale = 0)

Now, to apply some machine learning algorithms, we create PCA training and testing sets.

train.pca <- data.frame(Price = train$Price, prin_comp$x) 

test.pca <- as.data.frame( predict(prin_comp, newdata = test %>% select(-Price)) )
test.pca$Price <- test$Price

Among the models fit on the principal components, a linear model using only the first component produced the lowest test error.

mod <- lm(Price ~ PC1, data = train.pca)

pred <- predict(mod, test.pca)

mse <- sum( ( pred - test.pca$Price )^2 ) / testsize

mse^(1/2)
## [1] 63778.04
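
To back up the claim that one component is enough, the comparison could be repeated with increasing numbers of components; a rough sketch (output omitted):

# test RMSE as a function of the number of principal components used
for (k in 1:6) {
  f <- as.formula(paste("Price ~", paste(paste0("PC", 1:k), collapse = " + ")))
  m <- lm(f, data = train.pca)
  p <- predict(m, test.pca)
  cat(k, "components:", sqrt(mean((p - test.pca$Price)^2)), "\n")
}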

Note that even though PCA did not improve the error, it reduced the problem from six feature dimensions to a single principal component while achieving an RMSE similar to the full linear model.
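
For reference, the test-set RMSEs reported above can be collected into a single table:

# recap of the test-set RMSEs reported above
results <- data.frame(
  model = c("null", "linear", "ctree", "PCA (PC1) + linear"),
  rmse  = c(99636.08, 57189.04, 63163.48, 63778.04)
)
kable(results)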