This project analyzes housing price data from King County, Washington, to build predictive models for home values. The analysis includes comprehensive data cleaning, exploratory data analysis, linear regression modeling, and logistic regression for home quality classification.
This section includes steps to identify and fix data entry errors and transform problematic variables in the King County housing dataset.
# Load the King County housing dataset
data <- read.csv("kc_house_data.csv", sep=",", header = TRUE)
# Data structure
# str(data)
# Check for missing values
cat("Missing values per column:\n")
colSums(is.na(data))
## Missing values per column:
## id date price bedrooms bathrooms
## 0 0 0 0 0
## sqft_living sqft_lot floors waterfront view
## 0 0 0 0 0
## condition grade sqft_above sqft_basement yr_built
## 0 0 0 0 0
## yr_renovated zipcode lat long sqft_living15
## 0 0 0 0 0
## sqft_lot15
## 0
In this section, we identify observations with clearly incorrect or suspicious values: homes with zero or implausibly large numbers of bedrooms or bathrooms, zero square footage, and unusual price or year values.
# Check for problematic bedroom and bathroom values
problem_rows <- data[
(data$bedrooms == 0 | data$bedrooms == 33 | data$bathrooms == 0) &
!is.na(data$bedrooms) & !is.na(data$bathrooms),
c("id", "bedrooms", "bathrooms", "sqft_living", "price", "zipcode")
]
cat("Properties with suspicious bedroom/bathroom values:\n")
## Properties with suspicious bedroom/bathroom values:
## id bedrooms bathrooms sqft_living price zipcode
## 876 6306400140 0 0.00 3064 1095000 98102
## 1150 3421079032 1 0.00 670 75000 98022
## 3120 3918400017 0 0.00 1470 380000 98133
## 3468 1453602309 0 1.50 1430 288000 98125
## 4869 6896300380 0 1.00 390 228000 98118
## 5833 5702500050 1 0.00 600 280000 98045
## 6995 2954400190 0 0.00 4810 1295650 98053
## 8478 2569500210 0 2.50 2290 339950 98042
## 8485 2310060040 0 2.50 1810 240000 98038
## 9774 3374500520 0 0.00 2460 355000 98031
## 9855 7849202190 0 0.00 1470 235000 98065
## 10482 203100435 1 0.00 690 484000 98053
## 12654 7849202299 0 2.50 1490 320000 98065
## 14424 9543000205 0 0.00 844 139950 98001
## 15871 2402100895 33 1.75 1620 640000 98103
## 18380 1222029077 0 0.75 384 265000 98070
## 19453 3980300371 0 0.00 290 142000 98024
##
## Found 17 properties with extreme bedroom/bathroom values
Based on external verification using the King County Parcel Viewer, we apply targeted corrections to properties with verified data entry errors:
# Apply manual corrections based on external verification
corrections <- list(
list(id = 6306400140, bedrooms = 5, bathrooms = 4.50),
list(id = 3421079032, bedrooms = 3, bathrooms = 3.75),
list(id = 3918400017, bedrooms = 3, bathrooms = 2.25),
list(id = 6896300380, bedrooms = 3, bathrooms = NA), # bedroom only
list(id = 2954400190, bedrooms = 4, bathrooms = 4),
list(id = 2569500210, bedrooms = 4, bathrooms = NA),
list(id = 2310060040, bedrooms = 4, bathrooms = NA),
list(id = 7849202190, bedrooms = 3, bathrooms = 1.50),
list(id = 7849202299, bedrooms = 0, bathrooms = NA), # verified studio
list(id = 9543000205, bedrooms = 2, bathrooms = 1),
list(id = 2402100895, bedrooms = 3, bathrooms = NA), # was 33 bedrooms
list(id = 1222029077, bedrooms = 1, bathrooms = 1.50),
list(id = 3374500520, bedrooms = 4, bathrooms = 3.5)
)
# Apply corrections
for (correction in corrections) {
if (!is.na(correction$bedrooms)) {
data[data$id == correction$id, "bedrooms"] <- correction$bedrooms
}
if (!is.na(correction$bathrooms)) {
data[data$id == correction$id, "bathrooms"] <- correction$bathrooms
}
}
# Remove unverifiable records
remove_ids <- c(5702500050, 203100435, 3980300371)
data <- data[!data$id %in% remove_ids, ]
cat("Applied corrections to", length(corrections), "properties")
## Applied corrections to 13 properties
##
## Removed 3 unverifiable records
##
## Remaining records: 21610
As part of the data cleaning process, we found 17 properties with unusual or clearly incorrect values for bedrooms and bathrooms. Some had 0 bedrooms or 0 bathrooms, and one even had 33 bedrooms, which is obviously unrealistic. Instead of deleting these rows outright, we looked each one up manually in the King County Parcel Viewer to verify what the correct values should be. For most of them, we were able to confirm the actual number of rooms and made the necessary corrections. For instance, one home listed with 0 bedrooms and 0 bathrooms was corrected to 4 bedrooms and 4 bathrooms after verification, and the home listed with 33 bedrooms was updated to 3, which is far more plausible given its size. One property turned out to be a verified studio-style layout, so its 0-bedroom value was kept as-is. In three cases, we could not find any record of the property; since the information could not be confirmed and looked suspicious, we removed those rows. Overall, this process let us clean the dataset in a careful, meaningful way, using outside sources to make informed decisions instead of relying only on assumptions or automatic removals.
We performed several additional validation checks on the dataset:

- Duplicate Property IDs: no problematic duplicates were found
- Zero Living Area/Lot Size: no homes with 0 square footage were found
- Invalid Prices: no homes with prices ≤ 0 were found
- Future Years: no homes built or renovated after 2015 were found
All validation checks confirmed the dataset integrity for these variables.
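These checks were run in code not shown above; a minimal sketch of how they can be reproduced (our reconstruction, not the report's original calls) follows. Each count should be 0 on the cleaned data, apart from repeated IDs, which can legitimately occur when a home is resold.
# Sketch of the validation checks described above
sum(data$sqft_living == 0 | data$sqft_lot == 0)       # zero living area / lot size
sum(data$price <= 0)                                  # invalid prices
sum(data$yr_built > 2015 | data$yr_renovated > 2015)  # future build/renovation years
sum(duplicated(data$id))                              # repeated IDs (resales, not errors)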
The date column is a character string and not directly useful for modeling. We'll extract the year and month of sale.
data$year_sold <- as.numeric(substr(data$date, 1, 4))
data$month_sold <- as.numeric(substr(data$date, 5, 6))
table(data$year_sold)
table(data$month_sold)
##
## 2014 2015
## 14630 6980
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 978 1250 1875 2231 2414 2180 2211 1940 1772 1878 1410 1471
zipcode has over 70 unique values. That's too many for regression modeling: it would require about 70 dummy variables, which would clutter the model and increase the risk of overfitting. Instead, we group zipcodes into three regions.
city_zips <- c(98101, 98102, 98104, 98105, 98109, 98112, 98115, 98116, 98118, 98119, 98121, 98122, 98125, 98126, 98133, 98134,98136, 98144, 98154, 98164, 98174, 98195)
suburb_zips <- c(98004, 98005, 98006, 98007, 98008, 98027, 98029, 98033,98034, 98040, 98052, 98053, 98056, 98057, 98059, 98072,98074, 98075, 98092, 98070, 98028, 98019)
rural_zips <- setdiff(unique(data$zipcode), union(city_zips, suburb_zips))
data$region <- dplyr::case_when(
data$zipcode %in% city_zips ~ "City",
data$zipcode %in% suburb_zips ~ "Suburb",
data$zipcode %in% rural_zips ~ "Rural"
)
data$region <- factor(data$region, levels = c("City", "Suburb", "Rural"))
table(data$region)
##
## City Suburb Rural
## 4471 7266 9873
4,471 homes are in the city, 7,266 in the suburbs, and 9,873 in rural areas.
data$renovation_group <- dplyr::case_when(
data$yr_renovated == 0 ~ "Never Renovated",
data$yr_renovated >= 2005 ~ "Recently Renovated",
TRUE ~ "Renovated Long Ago"
)
data$renovation_group <- factor(data$renovation_group)
table(data$renovation_group)
##
## Never Renovated Recently Renovated Renovated Long Ago
## 20696 320 594
20,696 homes were never renovated, 320 were renovated recently (2005 or later), and 594 were renovated long ago.
# Convert waterfront to factor and calculate straight-line distance (in
# decimal degrees) to downtown Seattle (lat 47.6062, long -122.3321)
data$waterfront <- factor(data$waterfront)
data$distance_to_downtown <- sqrt(
(data$lat - 47.6062)^2 + (data$long + 122.3321)^2
)
summary(data$distance_to_downtown)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01311 0.09646 0.17881 0.20076 0.28468 1.02269
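The distances above are in decimal degrees. As a rough sanity check only (a sketch using the approximation that one degree of latitude is about 111 km, with longitude degrees shrunk by the cosine of the latitude), they can be converted to kilometres; this conversion is not used in the models below.
# Approximate conversion of decimal-degree distances to kilometres
deg_to_km <- function(lat, long, ref_lat = 47.6062, ref_long = -122.3321) {
  dy <- (lat - ref_lat) * 111                              # ~111 km per degree of latitude
  dx <- (long - ref_long) * 111 * cos(ref_lat * pi / 180)  # scale longitude degrees
  sqrt(dx^2 + dy^2)
}
summary(deg_to_km(data$lat, data$long))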
# Check for multicollinearity among square footage variables
sqft_vars <- data[, c("sqft_living", "sqft_above", "sqft_basement", "sqft_living15")]
cor_matrix <- round(cor(sqft_vars), 2)
cor_matrix
## sqft_living sqft_above sqft_basement sqft_living15
## sqft_living 1.00 0.88 0.43 0.76
## sqft_above 0.88 1.00 -0.05 0.73
## sqft_basement 0.43 -0.05 1.00 0.20
## sqft_living15 0.76 0.73 0.20 1.00
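The near-zero correlation between sqft_above and sqft_basement, together with their strong correlations with sqft_living, suggests that living area may simply decompose into the other two. A quick check (ours, not shown in the original report):
# If TRUE, sqft_living is an exact linear combination of sqft_above and
# sqft_basement, so all three should not enter a model together
all(data$sqft_living == data$sqft_above + data$sqft_basement)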
This section contains visualizations that explore how price is related to the other factors.
# Split data into training and testing. We will only use the training data for the visualizations
set.seed(1)
sample<-sample.int(nrow(data), floor(.80*nrow(data)), replace = F)
train<-data[sample, ]
test<-data[-sample, ]
# distribution of price
ggplot(train, aes(x=price))+
scale_x_continuous(breaks = breaks_extended(6),labels = label_dollar())+
theme(plot.title = element_text(hjust=.5))+
labs(x="Price", y="Density", title = "Distrubution of Price")+
geom_density()
### Boxplot of price
ggplot(train, aes(x= "", y=price))+
geom_boxplot()+
scale_y_continuous(breaks = breaks_extended(10),labels = label_dollar())+
theme(plot.title = element_text(hjust=.5))+
labs(x="Price", y="Price", title = "Summary of Price")
#bar chart and box plot that show how # of bedrooms are related to price
ggplot(train, aes(x=bedrooms, y=price))+
scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
theme(plot.title = element_text(hjust=.5))+
labs(x="Bedrooms", y="Averge Price", title = "Average Price by Number of Bedrooms")+
stat_summary(geom="bar", fill="red")
ggplot(train, aes(x=as.factor(bedrooms), y=price))+
geom_boxplot(fill="lightblue")+
scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
theme(plot.title = element_text(hjust=.5))+
labs(x="Bedrooms", y="Price", title = "Price Distribtion by Number of Bedrooms")
#bar chart that shows how # of bathrooms is related to price
ggplot(train, aes(x=cut(bathrooms, breaks=c(0,1,2,3,4,5,6,7,8)), y=price))+
theme(plot.title = element_text(hjust=.5))+
labs(x="Bathrooms", y="Averge Price", title = "Average Price by Number of Bathrooms")+
scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
scale_x_discrete(labels = c("0-1", "1-2", "2-3","3-4","4-5", "5-6", "6-7", "7+"))+
stat_summary(geom="bar", fill="red")
ggplot(train, aes(x=sqft_living, y=price))+
scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
theme(plot.title = element_text(hjust=.5))+
labs(x="Living Square Footage", y="Price", title = "Price Against Living Square Footage")+
geom_point()
ggplot(train, aes(x=waterfront, y=price))+
scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
theme(plot.title = element_text(hjust=.5))+
labs(x="Waterfront", y="Price", title = "Price by Waterfront")+
geom_boxplot(fill="lightblue")
grade <- ggplot(train, aes(x=grade, y=price))+
scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
theme(plot.title = element_text(hjust=.5))+
labs(x="Grade", y="Price", title = "Price by Grade")+
stat_summary(geom="bar", fill="green")
condition <- ggplot(train, aes(x=condition, y=price))+
scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
theme(plot.title = element_text(hjust=.5))+
labs(x="Condition", y="Price", title = "Price by Condition")+
stat_summary(geom="bar", fill="green")
grid.arrange(grade, condition, ncol=2, nrow=1)
# 'Region' was derived from 'zipcode'
ggplot(train, aes(x=region, y=log(price)))+
theme(plot.title = element_text(hjust=.5))+
labs(x="Region", y="log(Price)", title = "log(Price) by Region")+
geom_boxplot(fill="lavender")
dist <-ggplot(train, aes(x=distance_to_downtown, color=region))+
theme(plot.title = element_text(hjust=.5))+
labs(x="Distance to Downtown (decimal-degrees)",
title = "Density Plot of Distance to Downtown")+
geom_density()
built <- ggplot(train,aes(x=yr_built, color=region))+
theme(plot.title = element_text(hjust=.5))+
labs(x="Year Built",
title = "Density Plot of Distance to Downtown")+
geom_density()
grid.arrange(dist, built, ncol=2, nrow=1)
ggplot(train, aes(x=distance_to_downtown, y=log(price), color=region))+
theme(plot.title=element_text(hjust=.5))+
labs(x="Distance to Downtown (decimal-degrees)",
y= "Log(price)",
color = "Region",
title = "Log(price) Against Distance to Downtown by Region")+
geom_point()
ggplot(train, aes(x=distance_to_downtown, y=log(price), color=grade))+
scale_color_gradient(low = "blue", high = "red")+
theme(plot.title=element_text(hjust=.5))+
labs(x="Distance to Downtown (decimal-degrees)",
y= "Log(price)",
color = "Grade",
title = "Log(price) Against Distance to Downtown by Grade")+
geom_point()
ggplot(train,aes(x=condition,y=log(price), color=yr_built))+
scale_color_gradientn(colors=c("steelblue", "skyblue", "lightgreen", "gold", "tomato"))+
theme(plot.title = element_text(hjust=.5))+
labs(x="Condition",
y= "log(Price)",
color = "Year Built",
title = "log(Price) Against Condition by Year Built")+
geom_point()
ggplot(train,aes(x=distance_to_downtown,y=log(price), color=renovation_group))+
theme(plot.title = element_text(hjust=.5))+
labs(x="Distance to Downtown (decimal-degrees)",
y= "log(Price)",
color = "Renovated",
title = "log(Price) Against Distance to Downtown by Renovated")+
geom_point(alpha=.5)
This section presents a linear regression model for predicting Price and outlines the methodology used to develop it.
# 21 predictor variables
full_model <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + waterfront + view + condition + grade + yr_built + yr_renovated + zipcode + lat + long + sqft_living15 + sqft_lot15 + year_sold + month_sold + region + renovation_group + distance_to_downtown, data = train)
summary(full_model)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + view + condition + grade + yr_built +
## yr_renovated + zipcode + lat + long + sqft_living15 + sqft_lot15 +
## year_sold + month_sold + region + renovation_group + distance_to_downtown,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1217493 -96073 -10555 76541 4303148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.860e+07 1.131e+07 4.298 1.73e-05 ***
## bedrooms -4.144e+04 2.131e+03 -19.440 < 2e-16 ***
## bathrooms 2.559e+04 3.469e+03 7.376 1.70e-13 ***
## sqft_living 1.860e+02 3.636e+00 51.156 < 2e-16 ***
## sqft_lot 2.201e-01 5.601e-02 3.930 8.51e-05 ***
## floors 4.230e+03 3.527e+03 1.199 0.230
## waterfront1 6.362e+05 1.885e+04 33.749 < 2e-16 ***
## view 5.193e+04 2.278e+03 22.800 < 2e-16 ***
## condition 2.822e+04 2.529e+03 11.158 < 2e-16 ***
## grade 8.793e+04 2.322e+03 37.868 < 2e-16 ***
## yr_built -1.825e+03 8.191e+01 -22.278 < 2e-16 ***
## yr_renovated 3.618e+03 6.332e+02 5.714 1.12e-08 ***
## zipcode -7.417e+02 3.883e+01 -19.099 < 2e-16 ***
## lat 2.832e+05 1.466e+04 19.318 < 2e-16 ***
## long 4.820e+05 2.394e+04 20.136 < 2e-16 ***
## sqft_living15 2.023e+01 3.690e+00 5.482 4.26e-08 ***
## sqft_lot15 -7.703e-02 7.968e-02 -0.967 0.334
## year_sold 3.615e+04 5.060e+03 7.144 9.44e-13 ***
## month_sold 1.051e+03 7.593e+02 1.384 0.166
## regionSuburb -2.805e+04 5.403e+03 -5.192 2.11e-07 ***
## regionRural -1.973e+04 4.657e+03 -4.237 2.28e-05 ***
## renovation_groupRecently Renovated -7.196e+06 1.273e+06 -5.653 1.60e-08 ***
## renovation_groupRenovated Long Ago -7.161e+06 1.259e+06 -5.690 1.29e-08 ***
## distance_to_downtown -1.092e+06 3.172e+04 -34.424 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 193500 on 17264 degrees of freedom
## Multiple R-squared: 0.7239, Adjusted R-squared: 0.7235
## F-statistic: 1968 on 23 and 17264 DF, p-value: < 2.2e-16
Note: ‘id’ and ‘date’ were excluded from the full model since ‘id’ is a unique identifier with no predictive value, and ‘date’ contains too many unique values to be effectively represented as dummy variables.
Based on the full-model summary, ‘floors’, ‘sqft_lot15’, and ‘month_sold’ are not statistically significant predictors of price and can be removed. Additionally, ‘distance_to_downtown’ makes the raw coordinates (‘lat’, ‘long’) redundant, while ‘region’ captures the variation in ‘zipcode’, allowing those to be excluded as well.
# Compute diagnostics in one cell
ext.student.res <- rstudent(full_model)
student_res_count <- sum(abs(ext.student.res) > 2)
std.res <- rstandard(full_model)
std_res_count <- sum(abs(std.res) > 2)
lev <- lm.influence(full_model)$hat
n <- nrow(train)  # use the training sample the model was fit on
p <- 21           # number of predictors in the full model
high_lev_count <- sum(lev > 2 * p / n)
DFFITS_vals <- dffits(full_model)
dffits_count <- sum(abs(DFFITS_vals) > 2 * sqrt(p / n))
COOKS_vals <- cooks.distance(full_model)
cooks_count <- sum(COOKS_vals > 1)
# Summary output
cat("The number of outliers according to studentized and standardized residuals is",
student_res_count, "and", std_res_count,
"respectively. There are", high_lev_count, "high leverage points,",
dffits_count, "influential points according to DFFITS, and",
cooks_count, "influential points according to Cook's distance.\n")
## The number of outliers according to studentized and standardized residuals is 592 and 592 respectively. There are 1860 high leverage points, 1162 influential points according to DFFITS, and 0 influential points according to Cook's distance.
# 15 predictor variables
model1 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + waterfront + view + condition + grade + yr_built + yr_renovated + sqft_living15 + year_sold + region + renovation_group + distance_to_downtown, data = train)
summary(model1)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## waterfront + view + condition + grade + yr_built + yr_renovated +
## sqft_living15 + year_sold + region + renovation_group + distance_to_downtown,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1204941 -100791 -11106 80187 4290929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.582e+07 6.638e+06 -8.410 < 2e-16 ***
## bedrooms -4.388e+04 2.216e+03 -19.799 < 2e-16 ***
## bathrooms 3.474e+04 3.536e+03 9.822 < 2e-16 ***
## sqft_living 1.845e+02 3.786e+00 48.730 < 2e-16 ***
## sqft_lot 1.774e-01 4.170e-02 4.253 2.12e-05 ***
## waterfront1 5.975e+05 1.962e+04 30.448 < 2e-16 ***
## view 4.167e+04 2.356e+03 17.690 < 2e-16 ***
## condition 2.414e+04 2.610e+03 9.246 < 2e-16 ***
## grade 9.246e+04 2.387e+03 38.737 < 2e-16 ***
## yr_built -1.995e+03 8.179e+01 -24.388 < 2e-16 ***
## yr_renovated 3.616e+03 6.604e+02 5.475 4.43e-08 ***
## sqft_living15 3.039e+01 3.826e+00 7.944 2.07e-15 ***
## year_sold 2.942e+04 3.294e+03 8.931 < 2e-16 ***
## regionSuburb 3.749e+04 5.140e+03 7.294 3.13e-13 ***
## regionRural -5.206e+04 4.702e+03 -11.072 < 2e-16 ***
## renovation_groupRecently Renovated -7.199e+06 1.328e+06 -5.422 5.97e-08 ***
## renovation_groupRenovated Long Ago -7.157e+06 1.313e+06 -5.453 5.03e-08 ***
## distance_to_downtown -6.328e+05 1.674e+04 -37.793 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 201900 on 17270 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.699
## F-statistic: 2362 on 17 and 17270 DF, p-value: < 2.2e-16
Fifteen predictors are still generally too many for a linear regression model: they increase the risk of overfitting, make the model harder to interpret, and can introduce multicollinearity, which undermines the reliability of coefficient estimates. We therefore compare candidate models on three criteria; a sketch of the subset search appears below.
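The criterion-based selections that follow come from an all-subsets search whose code is not shown in this section; a sketch of how it could be run with leaps::regsubsets (our reconstruction, assuming the leaps package) is:
# Best-subsets search over the 15 candidate predictors (sketch);
# nvmax = 17 allows for the dummy variables created from the factors
library(leaps)
subsets <- regsubsets(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + waterfront + view + condition + grade + yr_built + yr_renovated + sqft_living15 + year_sold + region + renovation_group + distance_to_downtown, data = train, nvmax = 17)
crit <- summary(subsets)
coef(subsets, which.max(crit$adjr2))  # best model by Adjusted R-squared
coef(subsets, which.min(crit$cp))     # best model by Mallows' Cp
coef(subsets, which.min(crit$bic))    # best model by BIC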
Adjusted R²: rewards goodness of fit while penalizing extra predictors.
## (Intercept) bedrooms sqft_living
## 3480306.4878 -38274.5914 213.9105
## waterfront1 view grade
## 598443.4688 45215.5715 100663.8903
## yr_built regionRural distance_to_downtown
## -1975.7104 -81514.6571 -550557.6691
Mallows’ Cp: identifies models with low bias and variance.
## (Intercept) bedrooms sqft_living
## 3480306.4878 -38274.5914 213.9105
## waterfront1 view grade
## 598443.4688 45215.5715 100663.8903
## yr_built regionRural distance_to_downtown
## -1975.7104 -81514.6571 -550557.6691
BIC: favors simpler models by heavily penalizing complexity.
## (Intercept) bedrooms sqft_living
## 3480306.4878 -38274.5914 213.9105
## waterfront1 view grade
## 598443.4688 45215.5715 100663.8903
## yr_built regionRural distance_to_downtown
## -1975.7104 -81514.6571 -550557.6691
The first-order models selected by Adjusted R², Mallows’ Cp, and BIC are identical, so our final model can be defined.
# 8 predictor variables
model2 <- lm(price ~ bedrooms + sqft_living + waterfront + view + grade + yr_built + region + distance_to_downtown, data = train)
summary(model2)
##
## Call:
## lm(formula = price ~ bedrooms + sqft_living + waterfront + view +
## grade + yr_built + region + distance_to_downtown, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1225336 -102723 -11980 80524 4244607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3624927.83 126156.88 28.733 <2e-16 ***
## bedrooms -39281.91 2179.27 -18.025 <2e-16 ***
## sqft_living 212.55 3.17 67.044 <2e-16 ***
## waterfront1 589822.81 19734.88 29.887 <2e-16 ***
## view 46319.24 2354.44 19.673 <2e-16 ***
## grade 100239.55 2284.19 43.884 <2e-16 ***
## yr_built -2053.39 67.44 -30.447 <2e-16 ***
## regionSuburb 43390.53 5078.93 8.543 <2e-16 ***
## regionRural -52496.40 4742.72 -11.069 <2e-16 ***
## distance_to_downtown -600755.30 16030.20 -37.476 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 204100 on 17278 degrees of freedom
## Multiple R-squared: 0.6926, Adjusted R-squared: 0.6925
## F-statistic: 4326 on 9 and 17278 DF, p-value: < 2.2e-16
The residual plots clearly indicate violations of both the homoscedasticity and normality assumptions: the spread of residuals in the Scale-Location plot is non-constant, and the tails of the Q-Q plot deviate from the reference line. As the diagnostic analysis showed, the 592 outliers identified in the training set likely contribute to these violations.
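The plots referenced here are R's standard regression diagnostics; a minimal sketch to reproduce them for model2:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(2, 2))
plot(model2)
par(mfrow = c(1, 1))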
# Removing 592 rows with the greatest residuals (because there are 592 outliers)
std_res <- rstandard(model2)
outlier_residuals <- order(abs(std_res), decreasing = TRUE)[1:592]
train <- train[-outlier_residuals, ]
# Define Final Model
final_model <- lm(price ~ bedrooms + sqft_living + waterfront + view + grade + yr_built + region + distance_to_downtown, data = train)
summary(final_model)
##
## Call:
## lm(formula = price ~ bedrooms + sqft_living + waterfront + view +
## grade + yr_built + region + distance_to_downtown, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -416578 -84575 -9119 74561 708921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3154992.38 83116.54 37.96 <2e-16 ***
## bedrooms -25599.69 1454.72 -17.60 <2e-16 ***
## sqft_living 154.66 2.21 69.97 <2e-16 ***
## waterfront1 584909.84 17873.87 32.72 <2e-16 ***
## view 45371.29 1609.94 28.18 <2e-16 ***
## grade 96291.53 1522.16 63.26 <2e-16 ***
## yr_built -1782.66 44.45 -40.10 <2e-16 ***
## regionSuburb 41911.57 3347.57 12.52 <2e-16 ***
## regionRural -59862.33 3106.14 -19.27 <2e-16 ***
## distance_to_downtown -457496.64 10671.37 -42.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 131300 on 16686 degrees of freedom
## Multiple R-squared: 0.758, Adjusted R-squared: 0.7579
## F-statistic: 5807 on 9 and 16686 DF, p-value: < 2.2e-16
Final Regression Equation: price = 3154992.38 - 25599.69(bedrooms) + 154.66(sqft_living) + 584909.84(waterfront1) + 45371.29(view) + 96291.53(grade) - 1782.66(yr_built) + 41911.57(regionSuburb) - 59862.33(regionRural) - 457496.64(distance_to_downtown)
Although the residual plots initially showed heavy tails, the context of our house-price data, where large price variation is expected, allows us to conclude that the linear regression assumptions are met after outlier removal. Specifically: linearity is supported by the random scatter of residuals in the Residuals vs Fitted plot; independence is indicated by the absence of patterns in the residuals; homoscedasticity is satisfied by the roughly constant spread of residuals; and normality of errors is supported by the near-linear alignment of points in the Q-Q plot.
# Predict the values using the test data
predictions <- predict(final_model, newdata = test)
actual_prices <- test$price
# Assessment metric: RMSE
mse <- mean((predictions - actual_prices)^2) # MSE
rmse <- sqrt(mse)
print(paste("Root Mean Squared Error (RMSE):", rmse))
## [1] "Root Mean Squared Error (RMSE): 204831.391833837"
# R-squared
rss <- sum((predictions - actual_prices)^2) # Residual sum of squares
tss <- sum((actual_prices - mean(actual_prices))^2) # Total sum of squares
rsq <- 1 - rss / tss # R-squared
print(paste("R-squared:", rsq))
## [1] "R-squared: 0.682522595097385"
The RMSE of approximately 204,832 indicates that, on average, the model’s predicted house prices deviate from actual prices by about $205K. This suggests a relatively high level of error, especially if most homes in the dataset are moderately priced. The R-squared value of 0.683 means the model explains about 68.3% of the variation in house prices, showing moderate predictive power.
grid.arrange(
ggplot(train, aes(x = factor(grade))) +
geom_bar(fill = "steelblue") +
labs(title = "Distribution of Home Grades",
x = "Grade (1 = Low Quality, 13 = High Quality)",
y = "Number of Homes") +
theme_minimal(),
ggplot(train, aes(x = factor(condition))) +
geom_bar(fill = "darkorange") +
labs(title = "Distribution of Home Conditions",
x = "Condition (1 = Poor, 5 = Excellent)",
y = "Number of Homes") +
theme_minimal(),
ncol = 2
)
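The binary response good_quality is constructed from grade and condition earlier in the report, but its exact definition is not shown in this section. A purely hypothetical sketch of such a flag, with placeholder cutoffs that are our assumption rather than the report's actual thresholds, would be:
# Hypothetical reconstruction -- the report's actual thresholds are not shown.
# Flags homes with both a high grade and a good condition rating.
train$good_quality <- factor(ifelse(train$grade >= 8 & train$condition >= 4, "yes", "no"))
test$good_quality <- factor(ifelse(test$grade >= 8 & test$condition >= 4, "yes", "no"))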
ggplot(train, aes(x = factor(good_quality))) +
geom_bar(fill = "lightsteelblue") +
theme(plot.title=element_text(hjust=.5))+
labs(
title = "Distribution of Good Quality Homes",
x = "Good Quality (1 = Yes, 0 = No)",
y = "Number of Homes"
)
ggplot(train, aes(x = factor(good_quality), y = price)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "Price Distribution by Home Quality",
x = "Good Quality (1 = Yes, 0 = No)",
y = "Sale Price"
) +
scale_y_continuous(breaks = breaks_extended(6), labels = label_dollar()) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))  # set after theme_minimal() so it is not overridden
### Bivariate - Region vs. Good Quality
ggplot(train, aes(x = region, fill = factor(good_quality))) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("lightgray", "steelblue"),
labels = c("Not Good Quality", "Good Quality")) +
labs(
title = "Proportion of Good Quality Homes by Region",
x = "Region",
y = "Proportion of Homes",
fill = "Home Quality"
) +
theme_minimal()
ggplot(train, aes(x = factor(good_quality), y = distance_to_downtown)) +
geom_boxplot(fill = "darkseagreen") +
labs(
title = "Distance to Downtown by Home Quality",
x = "Good Quality (1 = Yes, 0 = No)",
y = "Distance to Downtown (miles)"
) +
theme_minimal()
ggplot(train, aes(x = sqft_living, y = price, color = factor(good_quality))) +
geom_point(alpha = 0.4) +
labs(
title = "Price vs. Living Area Colored by Good Quality",
x = "Sqft Living",
y = "Price",
color = "Good Quality"
) +
scale_y_continuous(labels = label_dollar()) +
theme_minimal()
ggplot(train, aes(x = distance_to_downtown, y = price, color = factor(good_quality))) +
geom_point(alpha = 0.5) +
labs(
title = "Price vs. Distance to Downtown Colored by Home Quality",
x = "Distance to Downtown (decimal-degrees)",
y = "Price",
color = "Good Quality"
) +
scale_color_manual(values = c("gray70", "steelblue")) +
scale_y_continuous(breaks = breaks_extended(6), labels = label_dollar()) +
theme_minimal()
ggplot(train, aes(x = distance_to_downtown, y = log(price))) +
geom_point(alpha = 0.2, color = "black", size = 1) + # Light black dots
geom_smooth(method = "loess", se = FALSE, color = "blue", size = 1.2) + # Trend line
labs(title = "Log(Price) vs Distance to Downtown",
x = "Distance to Downtown Seattle (decimal-degrees)",
y = "Log(Price)") +
theme_minimal()
chart1<-ggplot(train, aes(x=waterfront, fill=good_quality))+
geom_bar(position = "fill")+
labs(x="Waterfront Property", y="Proportion",
title="Proportion of Good Quality Homes by Waterfront")
chart2<-ggplot(train, aes(x=region, fill=good_quality))+
geom_bar(position = "fill")+
labs(x="Region", y="Proportion",
title="Proportion of Good Quality Homes by Region")
chart3<-ggplot(train, aes(x=renovation_group, fill=good_quality))+
geom_bar(position = "fill")+
labs(x="Renovation Group", y="Proportion",
title="Proportion of Good Quality Homes by Renovation Group")
grid.arrange(chart1, chart2, chart3, ncol = 2, nrow = 2)
Based on these bar charts, roughly two-thirds of waterfront homes are rated good quality, the proportion of good-quality homes is higher in the suburbs than in the city or rural areas, and recently renovated homes are rarely rated good quality.
dp1<-ggplot2::ggplot(train,aes(x=log(price), color=good_quality))+
geom_density()+
labs(title="Density of Price by Good Quality")+
theme(plot.title = element_text(size=10, hjust=0.5))
dp2<-ggplot2::ggplot(train,aes(x=sqft_living, color=good_quality))+
geom_density()+
labs(title="Density of Square Feet by Good Quality")+
theme(plot.title = element_text(size=10, hjust=0.5))
dp3<-ggplot2::ggplot(train,aes(x=floors, color=good_quality))+
geom_density()+
labs(title="Density of Number of Floors by Good Quality")+
theme(plot.title = element_text(size=10, hjust=0.5))
dp4<-ggplot2::ggplot(train,aes(x=yr_built, color=good_quality))+
geom_density()+
labs(title="Density of Year Built by Good Quality")+
theme(plot.title = element_text(size=10, hjust=0.5))
dp5<-ggplot2::ggplot(train,aes(x=distance_to_downtown, color=good_quality))+
geom_density()+
labs(x="Distance to Downtown (decimal-degrees)", title="Density of
Distance to Donwtown by Good Quality")+
theme(plot.title = element_text(size=10, hjust=0.5))
dp6<-ggplot2::ggplot(train,aes(x=bedrooms, color=good_quality))+
geom_density()+
labs(title="Density of Bedrooms by Goood Quality")+
theme(plot.title = element_text(size=10, hjust=0.5))
gridExtra::grid.arrange(dp1, dp2, dp3, dp4, dp5, dp6, ncol = 2, nrow = 3)
# Correlation matrix of quantitative predictors
round(cor(train[, c("price", "bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors", "view", "yr_built", "distance_to_downtown")], use = "complete.obs"), 3)
## price bedrooms bathrooms sqft_living sqft_lot floors
## price 1.000 0.312 0.517 0.702 0.103 0.290
## bedrooms 0.312 1.000 0.514 0.594 0.036 0.169
## bathrooms 0.517 0.514 1.000 0.738 0.091 0.506
## sqft_living 0.702 0.594 0.738 1.000 0.187 0.353
## sqft_lot 0.103 0.036 0.091 0.187 1.000 -0.011
## floors 0.290 0.169 0.506 0.353 -0.011 1.000
## view 0.380 0.064 0.155 0.235 0.069 0.016
## yr_built 0.068 0.167 0.527 0.341 0.054 0.501
## distance_to_downtown -0.177 0.116 0.180 0.197 0.275 0.061
## view yr_built distance_to_downtown
## price 0.380 0.068 -0.177
## bedrooms 0.064 0.167 0.116
## bathrooms 0.155 0.527 0.180
## sqft_living 0.235 0.341 0.197
## sqft_lot 0.069 0.054 0.275
## floors 0.016 0.501 0.061
## view 1.000 -0.056 -0.062
## yr_built -0.056 1.000 0.436
## distance_to_downtown -0.062 0.436 1.000
Based on the previous visualizations, waterfront, region, renovation group, price, sqft_living, floors, yr_built, distance to downtown, and bedrooms may influence whether a home is deemed good quality.
full_log <- glm(good_quality ~ price + sqft_living + yr_built + distance_to_downtown + waterfront + region + renovation_group, data = train, family = binomial())
summary(full_log)
##
## Call:
## glm(formula = good_quality ~ price + sqft_living + yr_built +
## distance_to_downtown + waterfront + region + renovation_group,
## family = binomial(), data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.583e+01 2.360e+00 19.418 < 2e-16 ***
## price 1.712e-06 1.473e-07 11.623 < 2e-16 ***
## sqft_living 5.058e-04 4.820e-05 10.494 < 2e-16 ***
## yr_built -2.574e-02 1.234e-03 -20.853 < 2e-16 ***
## distance_to_downtown 4.982e-01 3.212e-01 1.551 0.121
## waterfront1 -1.445e+00 3.704e-01 -3.901 9.58e-05 ***
## regionSuburb 1.092e+00 8.882e-02 12.299 < 2e-16 ***
## regionRural 4.962e-01 8.396e-02 5.910 3.43e-09 ***
## renovation_groupRecently Renovated -3.125e+00 5.129e-01 -6.092 1.12e-09 ***
## renovation_groupRenovated Long Ago -1.037e+00 1.722e-01 -6.024 1.70e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12037 on 16695 degrees of freedom
## Residual deviance: 10344 on 16686 degrees of freedom
## AIC: 10364
##
## Number of Fisher Scoring iterations: 6
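The values below are variance inflation factors for the logistic model. The call that produced them is not shown; per-coefficient VIFs in this format can be obtained with, for example, faraway::vif (assuming that package):
# One VIF per model coefficient (matching the per-dummy output below)
faraway::vif(full_log)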
## price sqft_living
## 25.794900 27.314048
## yr_built distance_to_downtown
## 21.912356 24.973604
## waterfront1 regionSuburb
## 8.200852 29.237259
## regionRural renovation_groupRecently Renovated
## 29.235019 59.679356
## renovation_groupRenovated Long Ago
## 12.143194
There appears to be a high degree of multicollinearity. From the correlation matrix, sqft_living, bedrooms, and bathrooms were all correlated with one another. Let’s drop distance_to_downtown, which is not significant in the full model, along with waterfront and renovation_group, and conduct a likelihood ratio test to see whether these variables can be removed. First, we compute the accuracy and error rate from the confusion matrix for this full model.
# Predicted probs for test data
preds<-predict(full_log,newdata=test, type="response")
# Confusion matrix with threshold of 0.5
table(test$good_quality, preds>0.5)
##
## FALSE TRUE
## no 3715 80
## yes 458 69
# Calculate accuracy and error rate
accuracy <- sum(diag(table(test$good_quality, preds > 0.5))) / sum(table(test$good_quality, preds > 0.5))
error_rate <- 1 - accuracy
cat("Accuracy:", round(accuracy * 100, 2), "%\nError Rate:", round(error_rate * 100, 2), "%")
## Accuracy: 87.55 %
## Error Rate: 12.45 %
Here we have what appears to be good accuracy and a low error rate, but most of the dataset consists of homes that are not good quality, so we are dealing with imbalanced response classes. Let’s look at the ROC curve and AUC.
# Produce the numbers associated with classification table
rates<-ROCR::prediction(preds, test$good_quality)
# Store the true positive and false positive rates
roc_result<-ROCR::performance(rates,measure="tpr", x.measure="fpr")
# Plot ROC curve and overlay the diagonal line for random guessing
plot(roc_result, main="ROC Curve for Full Model")
lines(x = c(0,1), y = c(0,1), col="red")
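The AUC printed below comes from the same ROCR pipeline; the line that computes it (omitted above) is:
# Compute the area under the ROC curve
auc <- ROCR::performance(rates, measure = "auc")
auc@y.values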
## [[1]]
## [1] 0.7865488
Our full model performs better than random guessing, with an AUC of about 0.79. Let’s see whether a reduced model performs comparably.
reduced_log <- glm(good_quality ~ price + sqft_living + yr_built + region, data = train, family = binomial())
summary(reduced_log)
##
## Call:
## glm(formula = good_quality ~ price + sqft_living + yr_built +
## region, family = binomial(), data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.009e+01 2.118e+00 18.927 < 2e-16 ***
## price 1.351e-06 1.264e-07 10.686 < 2e-16 ***
## sqft_living 5.337e-04 4.474e-05 11.929 < 2e-16 ***
## yr_built -2.273e-02 1.103e-03 -20.606 < 2e-16 ***
## regionSuburb 1.121e+00 8.443e-02 13.278 < 2e-16 ***
## regionRural 4.913e-01 8.004e-02 6.138 8.34e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12037 on 16695 degrees of freedom
## Residual deviance: 10499 on 16690 degrees of freedom
## AIC: 10511
##
## Number of Fisher Scoring iterations: 5
We see that all predictors in this reduced model are significant.
# Predicted probs for test data
preds<-predict(reduced_log,newdata=test, type="response")
# Confusion matrix with threshold of 0.5
table(test$good_quality, preds>0.5)
##
## FALSE TRUE
## no 3714 81
## yes 466 61
# Calculate accuracy and error rate
accuracy <- sum(diag(table(test$good_quality, preds > 0.5))) / sum(table(test$good_quality, preds > 0.5))
error_rate <- 1 - accuracy
cat("Accuracy:", round(accuracy * 100, 2), "%\nError Rate:", round(error_rate * 100, 2), "%")
## Accuracy: 87.34 %
## Error Rate: 12.66 %
# Produce the numbers associated with classification table
rates<-ROCR::prediction(preds, test$good_quality)
# Store the true positive and false positive rates
roc_result<-ROCR::performance(rates,measure="tpr", x.measure="fpr")
# Plot ROC curve and overlay the diagonal line for random guessing
plot(roc_result, main="ROC Curve for Reduced Model")
lines(x = c(0,1), y = c(0,1), col="red")
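The corresponding AUC computation for the reduced model:
# Compute the area under the ROC curve
auc <- ROCR::performance(rates, measure = "auc")
auc@y.values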
## [[1]]
## [1] 0.7818572
The AUC for the reduced model (0.782) is slightly lower than for the full model (0.787). Let’s conduct a likelihood ratio test to determine which model to use.
H0: the coefficients on distance_to_downtown, waterfront, and the two renovation_group terms are all zero. Ha: at least one of these coefficients is nonzero.
# Residual deviances
DR <- reduced_log$deviance  # deviance of reduced model
DF <- full_log$deviance     # deviance of full model
delta_g <- DR - DF          # likelihood ratio test statistic
CV <- qchisq(0.95, 4)       # 4 coefficients dropped (df = 16690 - 16686)
delta_g > CV
## [1] TRUE
So we reject the null hypothesis: the additional predictors significantly improve the model, and we should go forward with the full model over the reduced one.