In this project, we will analyze house prices from Saratoga, CA.
First we will load the required libraries and read in a dataset of house prices.
library(tidyverse)  # read_csv(), %>%, mutate(), select()
library(knitr)      # kable()
houses = read_csv("")
houses %>% head() %>% kable()
| Price | Living.Area | Baths | Bedrooms | Fireplace | Acres | Age |
|---:|---:|---:|---:|:---|---:|---:|
| 142212 | 1982 | 1.0 | 3 | N | 2.00 | 133 |
| 134865 | 1676 | 1.5 | 3 | Y | 0.38 | 14 |
| 118007 | 1694 | 2.0 | 3 | Y | 0.96 | 15 |
| 138297 | 1800 | 1.0 | 2 | Y | 0.48 | 49 |
| 129470 | 2088 | 1.0 | 3 | Y | 1.84 | 29 |
| 206512 | 1456 | 2.0 | 3 | N | 0.98 | 10 |
Examining the first few rows shows which variables the dataset provides.
In these rows, Fireplace is the only column that is not already numeric, so next we recode it as 0 (no fireplace) or 1 (fireplace).
houses <- houses %>% mutate(Fireplace = ifelse(Fireplace == "N",0,1))
Next we will create a training set and testing set to test some machine learning algorithms. We’ll use 80% of the data for training and 20% for testing.
index <- 1:nrow(houses)
testindex <- sample(index, trunc(length(index)/5))
train <- na.omit(houses[-testindex,])
test <- na.omit(houses[testindex,])
testsize <- nrow(test)
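Because `sample()` draws at random, the split (and every error computed from it) changes from run to run. A minimal sketch of a reproducible 80/20 split follows; it uses the built-in `mtcars` data as a stand-in for `houses`, and the seed value and the names `train_demo` and `test_demo` are assumptions made for illustration.

```r
# Stand-in data: the built-in mtcars data frame plays the role of `houses`.
dat <- mtcars

# Fix the random seed so the same rows are drawn on every run (42 is arbitrary).
set.seed(42)

# Draw 20% of the row indices for testing; the remaining rows are for training.
test_idx <- sample(seq_len(nrow(dat)), trunc(nrow(dat) / 5))
train_demo <- dat[-test_idx, ]
test_demo  <- dat[test_idx, ]

# The two sets are disjoint and together cover every row.
nrow(train_demo)
nrow(test_demo)
```

Calling `set.seed()` once before `sample()` should be enough to make the later model comparisons repeatable.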
We’ll compare the different models by their mean squared error on the test set, reporting its square root so that the figures are in dollars. First we’ll try the null model (the model whose output is always the mean of the output variable).
mod <- lm(Price ~ 1, data = train)
pred <- predict(mod, test)
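mse <- sum( ( pred - test$Price )^2 ) / testsize
sqrt(mse)  # the value printed below is on the dollar scale, i.e. the square root of mse
## [1] 99636.08
So our typical prediction error (the root mean squared error) is about $100,000, which is unacceptable.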
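An intercept-only model predicts the training mean for every house, so its error mostly reflects how spread out the prices are. The sketch below illustrates this with a hypothetical helper `rmse()` and made-up prices; neither comes from the original analysis.

```r
# Hypothetical helper: root mean squared error between predictions and truth.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Made-up prices, for illustration only.
train_demo <- data.frame(Price = c(100000, 150000, 200000, 250000))
test_demo  <- data.frame(Price = c(120000, 180000))

# An intercept-only model predicts the training mean for every row.
null_mod <- lm(Price ~ 1, data = train_demo)
pred <- predict(null_mod, test_demo)

unname(pred)                   # each prediction equals mean(train_demo$Price)
rmse(pred, test_demo$Price)    # error in the same units as Price
```

Any model worth keeping should beat this baseline on the same test set.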
Next we’ll try a linear model.
mod <- lm(Price ~ ., data = train)
pred <- predict(mod, test)
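mse <- sum( ( pred - test$Price )^2 ) / testsize
sqrt(mse)  # the value printed below is on the dollar scale, i.e. the square root of mse
## [1] 57189.04
This is substantially better: the error drops to roughly $57,000.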
Now we’ll try a Conditional Inference Tree.
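library(caret)  # provides train(); method = "ctree" also needs the party package
mod <- train(Price ~ ., data = train, method = "ctree")
pred <- predict(mod, test)
mse <- sum( ( pred - test$Price )^2 ) / testsize
sqrt(mse)  # the value printed below is on the dollar scale, i.e. the square root of mse
## [1] 63163.48
This is somewhat worse than the linear model (about $63,000 versus $57,000). We can also plot the resulting tree.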
Finally we’ll apply PCA (principal component analysis) and see whether it improves our results. First we will create the principal components from the feature columns.
prin_comp <- prcomp(train %>% select(-Price), scale. = TRUE)  # standardize each feature first
std_dev <- prin_comp$sdev
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
Next we will examine the cumulative proportion of variance explained for the principal components.
plot(cumsum(prop_varex), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")
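The same cumulative proportions can also be read off numerically, for example to find how many components are needed to reach a chosen share of the variance. The sketch below uses the built-in `USArrests` data as a stand-in for the house features; the 90% threshold and the names `pca_demo`, `cum_var`, and `n_comp` are assumptions made for illustration.

```r
# Stand-in data: the built-in USArrests data frame (4 numeric columns).
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each component, then its running total.
prop_var <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
cum_var <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches the
# (arbitrary, assumed) 90% threshold.
n_comp <- which(cum_var >= 0.90)[1]

round(cum_var, 3)
n_comp
```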
Most of the variance is explained by the first two components. Next we will plot the resultant first two principal components.
biplot(prin_comp, scale = 0)
Now, to apply some machine learning algorithms, we create PCA training and testing sets.
train.pca <- data.frame(Price = train$Price, prin_comp$x)
# predict() returns a matrix, so convert it to a data frame before adding columns
test.pca <- as.data.frame(predict(prin_comp, newdata = test %>% select(-Price)))
test.pca$Price <- test$Price
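Calling `predict()` on a `prcomp` object centers and scales the new rows using the means and standard deviations of the training data, then multiplies by the rotation matrix. The result is a matrix rather than a data frame. The sketch below checks this by hand on the built-in `USArrests` data; the 40/10 row split and all variable names are assumptions made for illustration.

```r
# Stand-in data: split the built-in USArrests rows into two parts.
train_x <- USArrests[1:40, ]
test_x  <- USArrests[41:50, ]

pca_demo <- prcomp(train_x, scale. = TRUE)

# Projection done by predict(): returns a matrix, not a data frame.
proj <- predict(pca_demo, newdata = test_x)

# The same projection by hand, using the training center and scale.
manual <- scale(test_x, center = pca_demo$center, scale = pca_demo$scale) %*%
  pca_demo$rotation

is.matrix(proj)
all.equal(unname(proj), unname(manual))

# Convert to a data frame before attaching further columns with `$`.
proj_df <- as.data.frame(proj)
```

Reusing the training center and scale keeps the test rows from influencing the components they are evaluated on.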
Among the linear models fit to the principal components, the one that uses just the first principal component produces the best test error.
mod <- lm(Price ~ PC1, data = train.pca)
pred <- predict(mod, test.pca)
mse <- sum( ( pred - test.pca$Price )^2 ) / testsize
sqrt(mse)  # the value printed below is on the dollar scale, i.e. the square root of mse
## [1] 63778.04
Observe that although we failed to improve on the linear model's error, we did manage to reduce the problem from 6 dimensions to 1 dimension and obtain a similar error.
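One way to check how the test error depends on the number of components is to refit the linear model with the first k components for several values of k. The sketch below does this on the built-in `mtcars` data, predicting `mpg` instead of `Price`; the seed, the range of k, and all variable names are assumptions made for illustration, not part of the original analysis.

```r
# Stand-in data and a reproducible 80/20 split (seed chosen arbitrarily).
set.seed(1)
dat <- mtcars
test_idx <- sample(seq_len(nrow(dat)), trunc(nrow(dat) / 5))
train_demo <- dat[-test_idx, ]
test_demo  <- dat[test_idx, ]

# Principal components of the training features only (response excluded).
features <- setdiff(names(dat), "mpg")
pca_demo <- prcomp(train_demo[, features], scale. = TRUE)

# Training scores, and test rows projected onto the same components.
train_pc <- data.frame(mpg = train_demo$mpg, pca_demo$x)
test_pc  <- data.frame(mpg = test_demo$mpg,
                       predict(pca_demo, newdata = test_demo[, features]))

# Test RMSE of a linear model on the first k components, for k = 1..4.
rmse_by_k <- sapply(1:4, function(k) {
  form <- reformulate(paste0("PC", 1:k), response = "mpg")
  fit <- lm(form, data = train_pc)
  sqrt(mean((predict(fit, test_pc) - test_pc$mpg)^2))
})
rmse_by_k
```

Comparing the entries of `rmse_by_k` shows whether adding components beyond the first actually helps on held-out data.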