In this project, we will analyze house prices from Saratoga County, New York.

First we will load the required packages and read in a dataset of house prices.

```
library(dplyr)
library(readr)
library(ggplot2)
library(knitr)
library(caret)
houses <- read_csv("http://tiny.cc/mosaic/SaratogaHouses.csv")
houses %>% head() %>% kable()
```

| Price | Living.Area | Baths | Bedrooms | Fireplace | Acres | Age |
|---|---|---|---|---|---|---|
| 142212 | 1982 | 1.0 | 3 | N | 2.00 | 133 |
| 134865 | 1676 | 1.5 | 3 | Y | 0.38 | 14 |
| 118007 | 1694 | 2.0 | 3 | Y | 0.96 | 15 |
| 138297 | 1800 | 1.0 | 2 | Y | 0.48 | 49 |
| 129470 | 2088 | 1.0 | 3 | Y | 1.84 | 29 |
| 206512 | 1456 | 2.0 | 3 | N | 0.98 | 10 |

The first rows show what the dataset provides for each house: its price, living area, number of bathrooms and bedrooms, whether it has a fireplace, its acreage, and its age.

Next we will convert `Fireplace`, the only non-numeric column, into a 0/1 indicator so that every column is numeric.

`houses <- houses %>% mutate(Fireplace = ifelse(Fireplace == "N",0,1))`
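As a quick illustration of what this recoding does, here is a minimal sketch on a made-up three-row data frame. The `toy` data frame and its values are hypothetical stand-ins, not part of the Saratoga data.

```r
# Toy data frame standing in for `houses` (values are made up for illustration)
toy <- data.frame(Price = c(100000, 150000, 200000),
                  Fireplace = c("N", "Y", "Y"))

# Same recoding rule as above: "N" becomes 0, anything else becomes 1
toy$Fireplace <- ifelse(toy$Fireplace == "N", 0, 1)

# Check that every column is now numeric
all_numeric <- all(sapply(toy, is.numeric))
all_numeric
```

Note that `ifelse` maps every value other than "N" to 1, so this rule is only safe when the column contains nothing but "N" and "Y".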

Next we will create a training set and testing set to test some machine learning algorithms. We’ll use 80% of the data for training and 20% for testing.

```
set.seed(3)
index <- 1:nrow(houses)
testindex <- sample(index, trunc(length(index)/5))
train <- na.omit(houses[-testindex,])
test <- na.omit(houses[testindex,])
testsize <- nrow(test)
```
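To check that this kind of split behaves as intended, the same indexing can be tried on a small synthetic data frame. This is a sketch only: `toy`, `toy_train`, and `toy_test` are hypothetical names and the data are random.

```r
set.seed(3)

# Synthetic stand-in for `houses`: 100 rows with an id column and a noise column
toy <- data.frame(id = 1:100, y = rnorm(100))

# Same split logic as above: hold out one fifth of the rows for testing
index <- 1:nrow(toy)
testindex <- sample(index, trunc(length(index) / 5))
toy_train <- toy[-testindex, ]
toy_test <- toy[testindex, ]

# The two sets should have 80 and 20 rows and share no rows
c(train = nrow(toy_train), test = nrow(toy_test))
overlap <- intersect(toy_train$id, toy_test$id)
length(overlap)
```

Because `sample` draws without replacement by default, no row is selected twice, and the negative index puts every remaining row in the training set.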

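We’ll compare the different models by root mean squared error (RMSE) on the test set, that is, the square root of the mean squared error, which is in the same units as Price. First we’ll try the null model (the model whose output is always the mean of the output variable).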

```
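mod <- lm(Price ~ 1, data = train)   # intercept-only model: always predicts the training mean
pred <- predict(mod, test)
mse <- sum( ( pred - test$Price )^2 ) / testsize   # mean squared error on the test set
mse^(1/2)   # square root of the MSE, in the same units as Price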
```

`## [1] 99636.08`

So our typical prediction error is about $100,000, which is unacceptable.
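Since the same error calculation is repeated for every model, it could be wrapped in a small helper. The function below is a sketch: the name `rmse` is our own and is not used elsewhere in this post.

```r
# Square root of the mean squared difference between predictions and actual values
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Small worked example: the errors are 1, 0 and -2, so the MSE is 5/3
example_error <- rmse(c(2, 4, 6), c(1, 4, 8))
example_error
```

With this helper, each evaluation could be written as `rmse(pred, test$Price)`, which gives the same value as the sum divided by `testsize` because `test` contains no missing rows after `na.omit`.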

Next we’ll try a linear model.

```
mod <- lm(Price ~ ., data = train)
pred <- predict(mod, test)
mse <- sum( ( pred - test$Price )^2 ) / testsize
mse^(1/2)
```

`## [1] 57189.04`

This is substantially better: the error drops from about $99,600 to about $57,200, a reduction of roughly 43%.

Now we’ll try a Conditional Inference Tree.

```
mod <- train(Price ~ ., data = train, method = "ctree")
pred <- predict(mod, test)
mse <- sum( ( pred - test$Price )^2 ) / testsize
mse^(1/2)
```

`## [1] 63163.48`

This is somewhat worse than the linear model (about $63,200 versus $57,200, roughly 10% higher). Next we will plot the resulting tree.

`plot(mod$finalModel)`

Finally we’ll apply PCA (principal component analysis) and see if that improves our results. First we will compute the principal components from the feature columns of the training set.

```
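# Standardize the features, then compute the principal components (training data only)
prin_comp <- prcomp(train %>% select(-Price), scale. = TRUE)
# Variance of each component and its share of the total variance
std_dev <- prin_comp$sdev
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)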
```
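To see what these quantities look like on data that ships with R, here is a sketch using the built-in `mtcars` dataset. It is a stand-in chosen for illustration, not the Saratoga data, and the four columns are an arbitrary choice.

```r
# Four numeric columns from the built-in mtcars dataset (arbitrary illustrative choice)
features <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Standardize and compute principal components
pc <- prcomp(features, scale. = TRUE)

# Variance of each component and its share of the total
pr_var <- pc$sdev^2
prop_varex <- pr_var / sum(pr_var)

# Cumulative share of variance explained by the first k components
cumsum(prop_varex)
```

With `scale. = TRUE` each standardized feature has variance 1, so the component variances sum to the number of features and the proportions sum to 1.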

Next we will examine the cumulative proportion of variance explained for the principal components.

```
plot(cumsum(prop_varex), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")
```

Most of the variance is explained by the first two components. Next we will plot the observations and the feature loadings on the first two principal components.

`biplot(prin_comp, scale = 0)`

To apply machine learning algorithms to the principal components, we create PCA training and testing sets.

```
train.pca <- data.frame(Price = train$Price, prin_comp$x)
test.pca <- as.data.frame( predict(prin_comp, newdata = test %>% select(-Price)) )
test.pca$Price <- test$Price
```
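The key point in this step is that the test set is projected with the centering, scaling, and rotation learned from the training set, rather than with its own. The sketch below checks this on the built-in `mtcars` dataset (an illustrative stand-in, with an arbitrary row split and column choice) by reproducing `predict` by hand.

```r
vars <- c("mpg", "disp", "hp", "wt")
tr <- mtcars[1:24, vars]    # arbitrary "training" rows
te <- mtcars[25:32, vars]   # arbitrary "test" rows

# Fit PCA on the training rows only
pc <- prcomp(tr, scale. = TRUE)

# Project the test rows with predict()
proj <- predict(pc, newdata = te)

# The same projection by hand, using the training center, scale and rotation
manual <- scale(te, center = pc$center, scale = pc$scale) %*% pc$rotation

# Largest absolute difference between the two projections
max_diff <- max(abs(proj - manual))
max_diff
```

Running `prcomp` separately on the test set would give different axes, so the training and test scores would not be comparable.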

Here we fit a linear model to just the first principal component and measure its error on the test set.

```
mod <- lm(Price ~ PC1, data = train.pca)
pred <- predict(mod, test.pca)
mse <- sum( ( pred - test.pca$Price )^2 ) / testsize
mse^(1/2)
```

`## [1] 63778.04`

Even though we failed to improve the error, we did reduce the problem from 6 dimensions to 1 and still obtained a comparable root mean squared error (about $63,800 versus $57,200 for the full linear model).
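One way to explore this trade-off is to fit a linear model on the first k components for each k and compare the test errors. The sketch below does this on the built-in `mtcars` dataset, predicting `mpg` from four features. The dataset, the column choice, and the helper `rmse` are assumptions made for illustration, so the numbers say nothing about the Saratoga data.

```r
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

vars <- c("disp", "hp", "wt", "qsec")   # illustrative feature choice
set.seed(3)
test_rows <- sample(nrow(mtcars), 8)    # hold out 8 of the 32 rows
tr <- mtcars[-test_rows, ]
te <- mtcars[test_rows, ]

# PCA on the training features, then project both sets
pc <- prcomp(tr[, vars], scale. = TRUE)
tr_pca <- data.frame(mpg = tr$mpg, pc$x)
te_pca <- data.frame(mpg = te$mpg, predict(pc, newdata = te[, vars]))

# Test error of a linear model on the first k components, for k = 1..4
rmse_by_k <- sapply(seq_along(vars), function(k) {
  f <- reformulate(paste0("PC", 1:k), response = "mpg")
  fit <- lm(f, data = tr_pca)
  rmse(predict(fit, te_pca), te_pca$mpg)
})
rmse_by_k

# Linear model on the original features, for comparison
full_fit <- lm(reformulate(vars, response = "mpg"), data = tr)
full_rmse <- rmse(predict(full_fit, te), te$mpg)
full_rmse
```

Using all components reproduces the predictions of the linear model on the original features, because the components are just a rotation of the standardized features. Any saving therefore comes from dropping components, at some cost in accuracy.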