Executive Summary

This project analyzes housing price data from King County, Washington, to build predictive models for home values. The analysis includes comprehensive data cleaning, exploratory data analysis, linear regression modeling, and logistic regression for home quality classification.

Data Cleaning and Preprocessing

This section includes steps to identify and fix data entry errors and transform problematic variables in the King County housing dataset.

Load and Inspect the Data

# Load the King County housing dataset
data <- read.csv("kc_house_data.csv", sep=",", header = TRUE)

# Data structure
# str(data)

# Check for missing values
cat("Missing values per column:\n")
## Missing values per column:
colSums(is.na(data))
##            id          date         price      bedrooms     bathrooms 
##             0             0             0             0             0 
##   sqft_living      sqft_lot        floors    waterfront          view 
##             0             0             0             0             0 
##     condition         grade    sqft_above sqft_basement      yr_built 
##             0             0             0             0             0 
##  yr_renovated       zipcode           lat          long sqft_living15 
##             0             0             0             0             0 
##    sqft_lot15 
##             0

Identify Potential Data Entry Errors

In this section, we identify observations that have clearly incorrect or suspicious values. These include homes with zero or implausibly large numbers of bedrooms and bathrooms, zero square footage, and unusual price or year values.

Bedroom and Bathroom Anomalies

# Check for problematic bedroom and bathroom values
problem_rows <- data[
  (data$bedrooms == 0 | data$bedrooms == 33 | data$bathrooms == 0) &
  !is.na(data$bedrooms) & !is.na(data$bathrooms),
  c("id", "bedrooms", "bathrooms", "sqft_living", "price", "zipcode")
]

cat("Properties with suspicious bedroom/bathroom values:\n")
## Properties with suspicious bedroom/bathroom values:
print(problem_rows)
##               id bedrooms bathrooms sqft_living   price zipcode
## 876   6306400140        0      0.00        3064 1095000   98102
## 1150  3421079032        1      0.00         670   75000   98022
## 3120  3918400017        0      0.00        1470  380000   98133
## 3468  1453602309        0      1.50        1430  288000   98125
## 4869  6896300380        0      1.00         390  228000   98118
## 5833  5702500050        1      0.00         600  280000   98045
## 6995  2954400190        0      0.00        4810 1295650   98053
## 8478  2569500210        0      2.50        2290  339950   98042
## 8485  2310060040        0      2.50        1810  240000   98038
## 9774  3374500520        0      0.00        2460  355000   98031
## 9855  7849202190        0      0.00        1470  235000   98065
## 10482  203100435        1      0.00         690  484000   98053
## 12654 7849202299        0      2.50        1490  320000   98065
## 14424 9543000205        0      0.00         844  139950   98001
## 15871 2402100895       33      1.75        1620  640000   98103
## 18380 1222029077        0      0.75         384  265000   98070
## 19453 3980300371        0      0.00         290  142000   98024
cat("\nFound", nrow(problem_rows), "properties with extreme bedroom/bathroom values")
## 
## Found 17 properties with extreme bedroom/bathroom values

Manual Data Corrections

Based on external verification using King County Parcel Viewer, we apply targeted corrections to properties with verified data entry errors:

# Apply manual corrections based on external verification
corrections <- list(
  list(id = 6306400140, bedrooms = 5, bathrooms = 4.50),
  list(id = 3421079032, bedrooms = 3, bathrooms = 3.75),
  list(id = 3918400017, bedrooms = 3, bathrooms = 2.25),
  list(id = 6896300380, bedrooms = 3, bathrooms = NA), # bedroom only
  list(id = 2954400190, bedrooms = 4, bathrooms = 4),
  list(id = 2569500210, bedrooms = 4, bathrooms = NA),
  list(id = 2310060040, bedrooms = 4, bathrooms = NA),
  list(id = 7849202190, bedrooms = 3, bathrooms = 1.50),
  list(id = 7849202299, bedrooms = 0, bathrooms = NA), # verified studio
  list(id = 9543000205, bedrooms = 2, bathrooms = 1),
  list(id = 2402100895, bedrooms = 3, bathrooms = NA), # was 33 bedrooms
  list(id = 1222029077, bedrooms = 1, bathrooms = 1.50),
  list(id = 3374500520, bedrooms = 4, bathrooms = 3.5)
)

# Apply corrections
for (correction in corrections) {
  if (!is.na(correction$bedrooms)) {
    data[data$id == correction$id, "bedrooms"] <- correction$bedrooms
  }
  if (!is.na(correction$bathrooms)) {
    data[data$id == correction$id, "bathrooms"] <- correction$bathrooms
  }
}

# Remove unverifiable records
remove_ids <- c(5702500050, 203100435, 3980300371)
data <- data[!data$id %in% remove_ids, ]

cat("Applied corrections to", length(corrections), "properties")
## Applied corrections to 13 properties
cat("\nRemoved", length(remove_ids), "unverifiable records")
## 
## Removed 3 unverifiable records
cat("\nRemaining records:", nrow(data))
## 
## Remaining records: 21610

As part of the data cleaning process, we found 17 properties with unusual or clearly incorrect bedroom and bathroom values. Some had 0 bedrooms or 0 bathrooms, and one listed 33 bedrooms, which is plainly unrealistic. Instead of deleting these rows outright, we looked each one up in the King County Parcel Viewer to verify what the correct values should be. For most of them, we were able to confirm the actual number of rooms and made corrections on that basis. For instance, one home listed with 0 bedrooms and 0 bathrooms was corrected to 4 bedrooms and 4 bathrooms after verification, and the home listed with 33 bedrooms was updated to 3, which is consistent with its size. One property turned out to be a studio-style layout with 0 bedrooms and 1.5 bathrooms, so we kept it as-is. In three cases we could find no record of the property; since the information could not be confirmed and looked suspicious, we removed those rows. Overall, this process let us clean the dataset carefully, using outside sources to make informed decisions rather than relying on assumptions or automatic removals.
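As a quick sanity check, we can re-run the anomaly filter; after the corrections, the only rows flagged should be the studio-style properties we deliberately kept. A minimal sketch reusing the columns from above (the object name remaining is illustrative):

# Re-check for remaining bedroom/bathroom anomalies after corrections
remaining <- data[
  (data$bedrooms == 0 | data$bedrooms > 10 | data$bathrooms == 0) &
  !is.na(data$bedrooms) & !is.na(data$bathrooms),
  c("id", "bedrooms", "bathrooms", "sqft_living", "price", "zipcode")
]
print(remaining)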

Additional Data Quality Checks

We performed several additional validation checks on the dataset:

- Duplicate Property IDs: No problematic duplicates were found
- Zero Living Area/Lot Size: No homes with 0 square footage were found
- Invalid Prices: No homes with prices ≤ 0 were found
- Future Years: No homes built or renovated after 2015 were found

All validation checks confirmed the dataset integrity for these variables.

Feature Engineering

Date Feature Extraction

The date column is a character string and not directly useful for modeling, so we extract the year and month of sale.

data$year_sold <- as.numeric(substr(data$date, 1, 4))
data$month_sold <- as.numeric(substr(data$date, 5, 6))

table(data$year_sold)
## 
##  2014  2015 
## 14630  6980
table(data$month_sold)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
##  978 1250 1875 2231 2414 2180 2211 1940 1772 1878 1410 1471

Geographic Region Classification

zipcode has over 70 unique values. That is too many for regression modeling: representing them would require roughly 70 dummy variables, which clutters the model and increases the risk of overfitting. We instead group zip codes into three broader regions.

city_zips <- c(98101, 98102, 98104, 98105, 98109, 98112, 98115, 98116, 98118, 98119, 98121, 98122, 98125, 98126, 98133, 98134,98136, 98144, 98154, 98164, 98174, 98195)

suburb_zips <- c(98004, 98005, 98006, 98007, 98008, 98027, 98029, 98033,98034, 98040, 98052, 98053, 98056, 98057, 98059, 98072,98074, 98075, 98092, 98070, 98028, 98019)

rural_zips <- setdiff(unique(data$zipcode), union(city_zips, suburb_zips))

data$region <- case_when(
  data$zipcode %in% city_zips ~ "City",
  data$zipcode %in% suburb_zips ~ "Suburb",
  data$zipcode %in% rural_zips ~ "Rural"
)

data$region <- factor(data$region, levels = c("City", "Suburb", "Rural"))

table(data$region)
## 
##   City Suburb  Rural 
##   4471   7266   9873

4,471 homes are classified as City, 7,266 as Suburb, and 9,873 as Rural.

Grouping Homes by Renovation Recency

data$renovation_group <- case_when(
  data$yr_renovated == 0 ~ "Never Renovated",
  data$yr_renovated >= 2005 ~ "Recently Renovated",
  TRUE ~ "Renovated Long Ago"
)
data$renovation_group <- factor(data$renovation_group)
table(data$renovation_group)
## 
##    Never Renovated Recently Renovated Renovated Long Ago 
##              20696                320                594

20,696 homes were never renovated, 320 were recently renovated, and 594 were renovated long ago.

Transform Latitude & Longitude into Distance to Downtown

# Convert waterfront to factor and calculate distance to downtown Seattle
data$waterfront <- factor(data$waterfront)

data$distance_to_downtown <- sqrt(
  (data$lat - 47.6062)^2 + (data$long + 122.3321)^2
)

summary(data$distance_to_downtown)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01311 0.09646 0.17881 0.20076 0.28468 1.02269
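This distance is in raw decimal degrees, which works fine as a relative measure but has no physical unit. If a rough physical scale were wanted, one could convert to kilometers; a sketch, assuming a flat-earth approximation at Seattle's latitude, where one degree of latitude is about 111 km and one degree of longitude about 111 * cos(47.6°) ≈ 75 km:

# Approximate conversion of decimal-degree distance to kilometers
# (flat-earth approximation; adequate at county scale)
km_per_deg_lat  <- 111.0
km_per_deg_long <- 111.0 * cos(47.6062 * pi / 180)

distance_km <- sqrt(
  ((data$lat - 47.6062) * km_per_deg_lat)^2 +
  ((data$long + 122.3321) * km_per_deg_long)^2
)
summary(distance_km)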

Check for multicollinearity

# Check for multicollinearity among square footage variables
sqft_vars <- data[, c("sqft_living", "sqft_above", "sqft_basement", "sqft_living15")]
cor_matrix <- round(cor(sqft_vars), 2)
cor_matrix
##               sqft_living sqft_above sqft_basement sqft_living15
## sqft_living          1.00       0.88          0.43          0.76
## sqft_above           0.88       1.00         -0.05          0.73
## sqft_basement        0.43      -0.05          1.00          0.20
## sqft_living15        0.76       0.73          0.20          1.00
# sqft_above and sqft_basement are components of sqft_living (they sum to it),
# so we drop them to avoid redundancy with the total
data$sqft_above <- NULL
data$sqft_basement <- NULL

Exploratory Data Analysis

This section contains visualizations that explore how price is related to the other factors.

# Split data into training and testing sets; visualizations use only the training data
set.seed(1)
sample <- sample.int(nrow(data), floor(0.80 * nrow(data)), replace = FALSE)
train <- data[sample, ]
test <- data[-sample, ]

Density plot for price

# distribution of price
ggplot(train, aes(x=price))+
  scale_x_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Price", y="Density", title = "Distrubution of Price")+
  geom_density()

Boxplot of price

ggplot(train, aes(x= "", y=price))+
  geom_boxplot()+
  scale_y_continuous(breaks = breaks_extended(10),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Price", y="Price", title = "Summary of Price")

Bar chart of Average Price & Boxplot of Price by Number of Bedrooms

#bar chart and box plot that show how # of bedrooms are related to price
ggplot(train, aes(x=bedrooms, y=price))+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Bedrooms", y="Averge Price", title = "Average Price by Number of Bedrooms")+
  stat_summary(geom="bar", fill="red")

ggplot(train, aes(x=as.factor(bedrooms), y=price))+
  geom_boxplot(fill="lightblue")+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Bedrooms", y="Price", title = "Price Distribtion by Number of Bedrooms")

Bar chart of Average Price by Number of Bathrooms

# bar chart showing how # of bathrooms relates to average price
ggplot(train, aes(x=cut(bathrooms, breaks=c(0,1,2,3,4,5,6,7,8)), y=price))+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Bathrooms", y="Averge Price", title = "Average Price by Number of Bathrooms")+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  scale_x_discrete(labels = c("0-1", "1-2", "2-3","3-4","4-5", "5-6", "6-7", "7+"))+
  stat_summary(geom="bar", fill="red")

Scatter plot of price against square footage

ggplot(train, aes(x=sqft_living, y=price))+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Living Square Footage", y="Price", title = "Price Against Living Square Footage")+
  geom_point()

Boxplots of price by waterfront

ggplot(train, aes(x=waterfront, y=price))+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Waterfront", y="Price", title = "Price by Waterfront")+
  geom_boxplot(fill="lightblue")

Bar charts of Price by Grade and Condition

grade <- ggplot(train, aes(x=grade, y=price))+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Grade", y="Price", title = "Price by Grade")+
  stat_summary(geom="bar", fill="green")

condition <- ggplot(train, aes(x=condition, y=price))+
  scale_y_continuous(breaks = breaks_extended(6),labels = label_dollar())+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Condition", y="Price", title = "Price by Condition")+
  stat_summary(geom="bar", fill="green")

grid.arrange(grade, condition, ncol = 2)

Boxplots of Log (Price) by Region

# 'Region' was derived from 'zipcode'
ggplot(train, aes(x=region, y=log(price)))+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Region", y="log(Price)", title = "log(Price) by Region")+
  geom_boxplot(fill="lavender")

Density plots of Distance to Downtown & Year Built by Region

dist <-ggplot(train, aes(x=distance_to_downtown, color=region))+
        theme(plot.title = element_text(hjust=.5))+
        labs(x="Distance to Downtown (decimal-degrees)", 
            title = "Density Plot of Distance to Downtown")+
        geom_density()

built <- ggplot(train,aes(x=yr_built, color=region))+
          theme(plot.title = element_text(hjust=.5))+
          labs(x="Year Built",
               title = "Density Plot of Distance to Downtown")+
          geom_density()

grid.arrange(dist, built, ncol = 2)

Scatterplot of Log (Price) Against Distance to Downtown by Region

ggplot(train, aes(x=distance_to_downtown, y=log(price), color=region))+
  theme(plot.title=element_text(hjust=.5))+
  labs(x="Distance to Downtown (decimal-degrees)",
       y= "Log(price)",
       color = "Region",
       title = "Log(price) Against Distance to Downtown by Region")+
  geom_point()

Scatterplot of Log (Price) Against Distance to Downtown by Grade

ggplot(train, aes(x=distance_to_downtown, y=log(price), color=grade))+
  scale_color_gradient(low = "blue", high = "red")+
  theme(plot.title=element_text(hjust=.5))+
  labs(x="Distance to Downtown (decimal-degrees)",
       y= "Log(price)",
       color = "Grade",
       title = "Log(price) Against Distance to Downtown by Grade")+
  geom_point()

Scatterplot of Log (Price) Against Condition by Year Built

ggplot(train,aes(x=condition,y=log(price), color=yr_built))+
  scale_color_gradientn(colors=c("steelblue", "skyblue", "lightgreen", "gold", "tomato"))+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Condition",
       y= "log(Price)",
       color = "Year Built",
       title = "log(Price) Against Condition by Year Built")+
  geom_point()

Scatterplot of Log (Price) Against Distance to Downtown by Renovation Group

ggplot(train,aes(x=distance_to_downtown,y=log(price), color=renovation_group))+
  theme(plot.title = element_text(hjust=.5))+
  labs(x="Distance to Downtown (decimal-degrees)",
       y= "log(Price)",
       color = "Renovation Group",
       title = "log(Price) Against Distance to Downtown by Renovation Group")+
  geom_point(alpha=.5)

Linear Regression Analysis

This section presents a linear regression model for predicting Price and outlines the methodology used to develop it.

Define linear regression model with price as response variable and all other variables as predictor variables

# 21 predictor variables
full_model <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + waterfront + view + condition + grade + yr_built + yr_renovated + zipcode + lat + long + sqft_living15 + sqft_lot15 + year_sold + month_sold + region + renovation_group + distance_to_downtown, data = train)

summary(full_model)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + view + condition + grade + yr_built + 
##     yr_renovated + zipcode + lat + long + sqft_living15 + sqft_lot15 + 
##     year_sold + month_sold + region + renovation_group + distance_to_downtown, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1217493   -96073   -10555    76541  4303148 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         4.860e+07  1.131e+07   4.298 1.73e-05 ***
## bedrooms                           -4.144e+04  2.131e+03 -19.440  < 2e-16 ***
## bathrooms                           2.559e+04  3.469e+03   7.376 1.70e-13 ***
## sqft_living                         1.860e+02  3.636e+00  51.156  < 2e-16 ***
## sqft_lot                            2.201e-01  5.601e-02   3.930 8.51e-05 ***
## floors                              4.230e+03  3.527e+03   1.199    0.230    
## waterfront1                         6.362e+05  1.885e+04  33.749  < 2e-16 ***
## view                                5.193e+04  2.278e+03  22.800  < 2e-16 ***
## condition                           2.822e+04  2.529e+03  11.158  < 2e-16 ***
## grade                               8.793e+04  2.322e+03  37.868  < 2e-16 ***
## yr_built                           -1.825e+03  8.191e+01 -22.278  < 2e-16 ***
## yr_renovated                        3.618e+03  6.332e+02   5.714 1.12e-08 ***
## zipcode                            -7.417e+02  3.883e+01 -19.099  < 2e-16 ***
## lat                                 2.832e+05  1.466e+04  19.318  < 2e-16 ***
## long                                4.820e+05  2.394e+04  20.136  < 2e-16 ***
## sqft_living15                       2.023e+01  3.690e+00   5.482 4.26e-08 ***
## sqft_lot15                         -7.703e-02  7.968e-02  -0.967    0.334    
## year_sold                           3.615e+04  5.060e+03   7.144 9.44e-13 ***
## month_sold                          1.051e+03  7.593e+02   1.384    0.166    
## regionSuburb                       -2.805e+04  5.403e+03  -5.192 2.11e-07 ***
## regionRural                        -1.973e+04  4.657e+03  -4.237 2.28e-05 ***
## renovation_groupRecently Renovated -7.196e+06  1.273e+06  -5.653 1.60e-08 ***
## renovation_groupRenovated Long Ago -7.161e+06  1.259e+06  -5.690 1.29e-08 ***
## distance_to_downtown               -1.092e+06  3.172e+04 -34.424  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 193500 on 17264 degrees of freedom
## Multiple R-squared:  0.7239, Adjusted R-squared:  0.7235 
## F-statistic:  1968 on 23 and 17264 DF,  p-value: < 2.2e-16

Note: ‘id’ and ‘date’ were excluded from the full model since ‘id’ is a unique identifier with no predictive value, and ‘date’ contains too many unique values to be effectively represented as dummy variables.

Based on the full model summary, the variables ‘floors’, ‘sqft_lot15’, and ‘month_sold’ are not statistically significant predictors of price and can be removed. Additionally, ‘distance_to_downtown’ renders the geographic coordinates redundant, while ‘region’ captures the variation in ‘zipcode’, allowing both to be excluded.

Influential observations, high-leverage observations, and outliers for the Full Model

# Compute diagnostics in one cell
ext.student.res <- rstudent(full_model)
student_res_count <- sum(abs(ext.student.res) > 2)

std.res <- rstandard(full_model)
std_res_count <- sum(abs(std.res) > 2)

lev <- lm.influence(full_model)$hat
n <- nrow(data)
p <- 21
high_lev_count <- sum(lev > 2 * p / n)

DFFITS_vals <- dffits(full_model)
dffits_count <- sum(abs(DFFITS_vals) > 2 * sqrt(p / n))

COOKS_vals <- cooks.distance(full_model)
cooks_count <- sum(COOKS_vals > 1)

# Summary output
cat("The number of outliers according to studentized and standardized residuals is", 
    student_res_count, "and", std_res_count, 
    "respectively. There are", high_lev_count, "high leverage points,", 
    dffits_count, "influential points according to DFFITS, and", 
    cooks_count, "influential points according to Cook's distance.\n")
## The number of outliers according to studentized and standardized residuals is 592 and 592 respectively. There are 1860 high leverage points, 1162 influential points according to DFFITS, and 0 influential points according to Cook's distance.

Define linear regression model with fewer predictor variables

# 15 predictor variables
model1 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + waterfront + view + condition + grade + yr_built + yr_renovated + sqft_living15 + year_sold + region + renovation_group + distance_to_downtown, data = train)

summary(model1)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     waterfront + view + condition + grade + yr_built + yr_renovated + 
##     sqft_living15 + year_sold + region + renovation_group + distance_to_downtown, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1204941  -100791   -11106    80187  4290929 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -5.582e+07  6.638e+06  -8.410  < 2e-16 ***
## bedrooms                           -4.388e+04  2.216e+03 -19.799  < 2e-16 ***
## bathrooms                           3.474e+04  3.536e+03   9.822  < 2e-16 ***
## sqft_living                         1.845e+02  3.786e+00  48.730  < 2e-16 ***
## sqft_lot                            1.774e-01  4.170e-02   4.253 2.12e-05 ***
## waterfront1                         5.975e+05  1.962e+04  30.448  < 2e-16 ***
## view                                4.167e+04  2.356e+03  17.690  < 2e-16 ***
## condition                           2.414e+04  2.610e+03   9.246  < 2e-16 ***
## grade                               9.246e+04  2.387e+03  38.737  < 2e-16 ***
## yr_built                           -1.995e+03  8.179e+01 -24.388  < 2e-16 ***
## yr_renovated                        3.616e+03  6.604e+02   5.475 4.43e-08 ***
## sqft_living15                       3.039e+01  3.826e+00   7.944 2.07e-15 ***
## year_sold                           2.942e+04  3.294e+03   8.931  < 2e-16 ***
## regionSuburb                        3.749e+04  5.140e+03   7.294 3.13e-13 ***
## regionRural                        -5.206e+04  4.702e+03 -11.072  < 2e-16 ***
## renovation_groupRecently Renovated -7.199e+06  1.328e+06  -5.422 5.97e-08 ***
## renovation_groupRenovated Long Ago -7.157e+06  1.313e+06  -5.453 5.03e-08 ***
## distance_to_downtown               -6.328e+05  1.674e+04 -37.793  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 201900 on 17270 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.699 
## F-statistic:  2362 on 17 and 17270 DF,  p-value: < 2.2e-16

Fifteen predictor variables is still generally too many for a linear regression model: it increases the risk of overfitting, makes the model harder to interpret, and can introduce multicollinearity, which undermines the reliability of the coefficient estimates.
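One way to check the multicollinearity concern directly is to compute variance inflation factors for model1. A quick sketch, assuming the car package (used later for the logistic model) is available; with factor predictors, car reports generalized VIFs:

# Variance inflation factors for the 15-predictor model;
# values well above 5-10 would indicate problematic collinearity
car::vif(model1)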

Use the regsubsets function to identify the best combination of predictor variables

allreg <- regsubsets(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + waterfront + view + condition + grade + yr_built + yr_renovated + sqft_living15 + year_sold + region + renovation_group + distance_to_downtown, data = train, nbest=1)

Adjusted R²

Rewards goodness of fit while penalizing extra predictors

coef(allreg, which.max(summary(allreg)$adjr2))
##          (Intercept)             bedrooms          sqft_living 
##         3480306.4878          -38274.5914             213.9105 
##          waterfront1                 view                grade 
##          598443.4688           45215.5715          100663.8903 
##             yr_built          regionRural distance_to_downtown 
##           -1975.7104          -81514.6571         -550557.6691

Mallows’ Cp

Identifies models with low bias and variance

coef(allreg, which.min(summary(allreg)$cp))
##          (Intercept)             bedrooms          sqft_living 
##         3480306.4878          -38274.5914             213.9105 
##          waterfront1                 view                grade 
##          598443.4688           45215.5715          100663.8903 
##             yr_built          regionRural distance_to_downtown 
##           -1975.7104          -81514.6571         -550557.6691

BIC

Favors simpler models by heavily penalizing overfitting

coef(allreg, which.min(summary(allreg)$bic))
##          (Intercept)             bedrooms          sqft_living 
##         3480306.4878          -38274.5914             213.9105 
##          waterfront1                 view                grade 
##          598443.4688           45215.5715          100663.8903 
##             yr_built          regionRural distance_to_downtown 
##           -1975.7104          -81514.6571         -550557.6691

The predictor sets selected by Adjusted R², Mallows’ Cp, and BIC are identical (with regionRural standing in for the full region factor), so our final model can be defined from these predictors.

Define Model with final predictors

# 8 predictor variables
model2 <- lm(price ~ bedrooms + sqft_living + waterfront + view + grade + yr_built + region + distance_to_downtown, data = train)

summary(model2)
## 
## Call:
## lm(formula = price ~ bedrooms + sqft_living + waterfront + view + 
##     grade + yr_built + region + distance_to_downtown, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1225336  -102723   -11980    80524  4244607 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3624927.83  126156.88  28.733   <2e-16 ***
## bedrooms              -39281.91    2179.27 -18.025   <2e-16 ***
## sqft_living              212.55       3.17  67.044   <2e-16 ***
## waterfront1           589822.81   19734.88  29.887   <2e-16 ***
## view                   46319.24    2354.44  19.673   <2e-16 ***
## grade                 100239.55    2284.19  43.884   <2e-16 ***
## yr_built               -2053.39      67.44 -30.447   <2e-16 ***
## regionSuburb           43390.53    5078.93   8.543   <2e-16 ***
## regionRural           -52496.40    4742.72 -11.069   <2e-16 ***
## distance_to_downtown -600755.30   16030.20 -37.476   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 204100 on 17278 degrees of freedom
## Multiple R-squared:  0.6926, Adjusted R-squared:  0.6925 
## F-statistic:  4326 on 9 and 17278 DF,  p-value: < 2.2e-16

Check model meets linear regression assumptions

plot(model2, which = 1) # Linearity

plot(residuals(model2), type = "l", main = "Residuals Plot") # Independence of Errors

plot(model2, which = 3) # Homoscedasticity

plot(model2, which = 2) # Normality of Errors

The residual plots clearly indicate violations of both homoscedasticity and normality assumptions. The non-constant spread of residuals in the Scale-Location plot and the deviation of the Q-Q plot’s tail from the line suggest these issues. As determined from the diagnostic analysis, the 592 outliers identified in the training set likely contribute to these violations.
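One standard remedy worth noting, though not the approach taken below, is to refit the model on log(price), which often stabilizes the variance of right-skewed price data, as the log-scale plots in the EDA suggest. A sketch reusing the same predictors:

# Hypothetical alternative: refit model2 on the log scale and re-check diagnostics
log_model <- lm(log(price) ~ bedrooms + sqft_living + waterfront + view +
                  grade + yr_built + region + distance_to_downtown, data = train)
plot(log_model, which = 3) # Homoscedasticity
plot(log_model, which = 2) # Normality of Errors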

Remove outliers from train set, define Final Model, and verify assumptions are met

# Remove the 592 rows with the largest standardized residuals (the outlier count found earlier)
std_res <- rstandard(model2)
outlier_residuals <- order(abs(std_res), decreasing = TRUE)[1:592]
train <- train[-outlier_residuals, ]

# Define Final Model
final_model <- lm(price ~ bedrooms + sqft_living + waterfront + view + grade + yr_built + region + distance_to_downtown, data = train)

summary(final_model)
## 
## Call:
## lm(formula = price ~ bedrooms + sqft_living + waterfront + view + 
##     grade + yr_built + region + distance_to_downtown, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -416578  -84575   -9119   74561  708921 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3154992.38   83116.54   37.96   <2e-16 ***
## bedrooms              -25599.69    1454.72  -17.60   <2e-16 ***
## sqft_living              154.66       2.21   69.97   <2e-16 ***
## waterfront1           584909.84   17873.87   32.72   <2e-16 ***
## view                   45371.29    1609.94   28.18   <2e-16 ***
## grade                  96291.53    1522.16   63.26   <2e-16 ***
## yr_built               -1782.66      44.45  -40.10   <2e-16 ***
## regionSuburb           41911.57    3347.57   12.52   <2e-16 ***
## regionRural           -59862.33    3106.14  -19.27   <2e-16 ***
## distance_to_downtown -457496.64   10671.37  -42.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 131300 on 16686 degrees of freedom
## Multiple R-squared:  0.758,  Adjusted R-squared:  0.7579 
## F-statistic:  5807 on 9 and 16686 DF,  p-value: < 2.2e-16

Final Regression Equation: price = 3154992.38 - 25599.69(bedrooms) + 154.66(sqft_living) + 584909.84(waterfront) + 45371.29(view) + 96291.53(grade) - 1782.66(yr_built) + 41911.57(regionSuburb) - 59862.33(regionRural) - 457496.64(distance_to_downtown)
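To make the equation concrete, the fitted model can price a hypothetical home. All values below are made up for illustration:

# Hypothetical home: 3 bed, 2000 sqft, no waterfront or view, grade 8,
# built 1990, suburban, 0.2 decimal-degrees from downtown
new_home <- data.frame(
  bedrooms = 3, sqft_living = 2000,
  waterfront = factor(0, levels = c(0, 1)), view = 0, grade = 8,
  yr_built = 1990, region = factor("Suburb", levels = c("City", "Suburb", "Rural")),
  distance_to_downtown = 0.2
)
predict(final_model, newdata = new_home)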

Check Final Model meets linear regression assumptions

plot(final_model, which = 1) # Linearity

plot(residuals(final_model), type = "l", main = "Residuals Plot") # Independence of Errors

plot(final_model, which = 3) # Homoscedasticity

plot(final_model, which = 2) # Normality of Errors

Despite the initially heavy tails observed in the residual plots, the context of our house price dataset, where significant price variations are expected, allows us to conclude that the linear regression assumptions are met after outlier removal. Specifically: linearity is supported by the random scatter of residuals in the Residuals vs Fitted plot; independence is indicated by the randomness in the residuals plot; homoscedasticity is satisfied by the constant spread of residuals; and normality of errors is supported by the linear alignment of points in the Q-Q plot.

Assess the Final Model’s predictive ability on test data

# Predict the values using the test data
predictions <- predict(final_model, newdata = test)
actual_prices <- test$price

# Assessment metric: RMSE
mse <- mean((predictions - actual_prices)^2) # MSE
rmse <- sqrt(mse)
print(paste("Root Mean Squared Error (RMSE):", rmse))
## [1] "Root Mean Squared Error (RMSE): 204831.391833837"
# R-squared
rss <- sum((predictions - actual_prices)^2)  # Residual sum of squares
tss <- sum((actual_prices - mean(actual_prices))^2)  # Total sum of squares
rsq <- 1 - rss / tss  # R-squared
print(paste("R-squared:", rsq))
## [1] "R-squared: 0.682522595097385"

The RMSE of approximately 204,832 indicates that, on average, the model’s predicted house prices deviate from actual prices by about $205K. This suggests a relatively high level of error, especially if most homes in the dataset are moderately priced. The R-squared value of 0.683 means the model explains about 68.3% of the variation in house prices, showing moderate predictive power.
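Because RMSE is dominated by errors on expensive homes, absolute and percentage errors give a useful complementary view. A sketch using the same predictions (the names mae and mape are illustrative):

# Mean absolute error and mean absolute percentage error on the test set
mae <- mean(abs(predictions - actual_prices))
mape <- mean(abs(predictions - actual_prices) / actual_prices) * 100
cat("MAE:", round(mae), "\nMAPE:", round(mape, 1), "%")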

Home Quality Analysis

Univariate – Distribution of Grade and Condition

grid.arrange(
  ggplot(train, aes(x = factor(grade))) +
    geom_bar(fill = "steelblue") +
    labs(title = "Distribution of Home Grades",
         x = "Grade (1 = Low Quality, 13 = High Quality)",
         y = "Number of Homes") +
    theme_minimal(),

  ggplot(train, aes(x = factor(condition))) +
    geom_bar(fill = "darkorange") +
    labs(title = "Distribution of Home Conditions",
         x = "Condition (1 = Poor, 5 = Excellent)",
         y = "Number of Homes") +
    theme_minimal(),

  ncol = 2
)

train <- train %>% 
  mutate(good_quality = ifelse(condition > 3 & grade > 7, "yes", "no"))


test <- test %>% 
  mutate(good_quality = ifelse(condition > 3 & grade > 7, "yes", "no"))

Univariate - Class Distribution of Good Quality

ggplot(train, aes(x = factor(good_quality))) +
  geom_bar(fill = "lightsteelblue") +
  theme(plot.title=element_text(hjust=.5))+
  labs(
    title = "Distribution of Good Quality Homes",
    x = "Good Quality (1 = Yes, 0 = No)",
    y = "Number of Homes"
  ) 

Bivariate - Distribution of Sale Price by Home Quality Group

ggplot(train, aes(x = factor(good_quality), y = price)) +
  geom_boxplot(fill = "lightblue") +
  theme(plot.title = element_text(hjust=.5)) +
  labs(
    title = "Price Distribution by Home Quality",
    x = "Good Quality (1 = Yes, 0 = No)",
    y = "Sale Price"
  ) +
  scale_y_continuous(breaks = breaks_extended(6), labels = label_dollar()) +
  theme_minimal()

Bivariate - Region vs. Good Quality

ggplot(train, aes(x = region, fill = factor(good_quality))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("lightgray", "steelblue"),
                    labels = c("Not Good Quality", "Good Quality")) +
  labs(
    title = "Proportion of Good Quality Homes by Region",
    x = "Region",
    y = "Proportion of Homes",
    fill = "Home Quality"
  ) +
  theme_minimal()

Bivariate - Distance to Downtown vs. Good Quality

ggplot(train, aes(x = factor(good_quality), y = distance_to_downtown)) +
  geom_boxplot(fill = "darkseagreen") +
  labs(
    title = "Distance to Downtown by Home Quality",
    x = "Good Quality (1 = Yes, 0 = No)",
    y = "Distance to Downtown (miles)"
  ) +
  theme_minimal()

Multivariate - Price vs. sqft_living, by Good Quality

ggplot(train, aes(x = sqft_living, y = price, color = factor(good_quality))) +
  geom_point(alpha = 0.4) +
  labs(
    title = "Price vs. Living Area Colored by Good Quality",
    x = "Sqft Living",
    y = "Price",
    color = "Good Quality"
  ) +
  scale_y_continuous(labels = label_dollar()) +
  theme_minimal()

Multivariate - Price vs. Distance to Downtown, by Good Quality

ggplot(train, aes(x = distance_to_downtown, y = price, color = factor(good_quality))) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Price vs. Distance to Downtown Colored by Home Quality",
    x = "Distance to Downtown (decimal-degrees)",
    y = "Price",
    color = "Good Quality"
  ) +
  scale_color_manual(values = c("gray70", "steelblue")) +
  scale_y_continuous(breaks = breaks_extended(6), labels = label_dollar()) +
  theme_minimal()

Scatterplot of Log(Price) vs Distance to Downtown

ggplot(train, aes(x = distance_to_downtown, y = log(price))) +
  geom_point(alpha = 0.2, color = "black", size = 1) +   # Light black dots
  geom_smooth(method = "loess", se = FALSE, color = "blue", size = 1.2) +  # Trend line
  labs(title = "Log(Price) vs Distance to Downtown",
       x = "Distance to Downtown Seattle (decimal-degrees)",
       y = "Log(Price)") +
  theme_minimal()

Logistic Regression for Home Quality Classification

Data Preparation for Classification

train$good_quality <- factor(train$good_quality)
test$good_quality <- factor(test$good_quality)

Exploratory Analysis for Classification Variables

chart1<-ggplot(train, aes(x=waterfront, fill=good_quality))+
  geom_bar(position = "fill")+
  labs(x="Waterfront Property", y="Proportion",
       title="Proportion of Good Quaility Homes by Waterfront")

chart2<-ggplot(train, aes(x=region, fill=good_quality))+
  geom_bar(position = "fill")+
  labs(x="Region", y="Proportion",
       title="Proportion of Good Quaility Homes by Region")

chart3<-ggplot(train, aes(x=renovation_group, fill=good_quality))+
  geom_bar(position = "fill")+
  labs(x="Renovation Group", y="Proportion",
       title="Proportion of Good Quaility Homes by Renovation Group")

Proportion of Good Quality Homes by Waterfront

chart1

Proportion of Good Quality Homes by Region

chart2

Proportion of Good Quality Homes by Renovation Group

chart3

Based on these bar charts, roughly two-thirds of waterfront homes are good quality, the proportion of good-quality homes is highest in the suburbs, and recently renovated homes are rarely classified as good quality.
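The same proportions can be read off numerically; a quick sketch, where margin = 1 normalizes within each row so each group’s proportions sum to 1:

# Proportion of good-quality homes within each waterfront group and region
prop.table(table(train$waterfront, train$good_quality), margin = 1)
prop.table(table(train$region, train$good_quality), margin = 1)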

dp1<-ggplot2::ggplot(train,aes(x=log(price), color=good_quality))+
  geom_density()+
  labs(title="Density of Price by Good Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

dp2<-ggplot2::ggplot(train,aes(x=sqft_living, color=good_quality))+
  geom_density()+
  labs(title="Density of Square Feet by Good Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

dp3<-ggplot2::ggplot(train,aes(x=floors, color=good_quality))+
  geom_density()+
  labs(title="Density of Number of Floors by Good Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

dp4<-ggplot2::ggplot(train,aes(x=yr_built, color=good_quality))+
  geom_density()+
  labs(title="Density of Year Built by Good Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

dp5<-ggplot2::ggplot(train,aes(x=distance_to_downtown, color=good_quality))+
  geom_density()+
  labs(x="Distance to Downtown (decimal-degrees)", title="Density of 
       Distance to Donwtown by Good Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

dp6<-ggplot2::ggplot(train,aes(x=bedrooms, color=good_quality))+
  geom_density()+
  labs(title="Density of Bedrooms by Goood Quality")+
  theme(plot.title = element_text(size=10, hjust=0.5))

gridExtra::grid.arrange(dp1, dp2, ncol = 2, nrow = 1)

gridExtra::grid.arrange(dp3, dp4, ncol = 2, nrow = 1)

gridExtra::grid.arrange(dp5, dp6, ncol = 2, nrow = 1)

Correlation matrix of predictor variables

# Correlation matrix of quantitative predictors
round(cor(train[, c("price", "bedrooms", "bathrooms", "sqft_living", "sqft_lot",
                    "floors", "view", "yr_built", "distance_to_downtown")],
          use = "complete.obs"), 3)
##                       price bedrooms bathrooms sqft_living sqft_lot floors
## price                 1.000    0.312     0.517       0.702    0.103  0.290
## bedrooms              0.312    1.000     0.514       0.594    0.036  0.169
## bathrooms             0.517    0.514     1.000       0.738    0.091  0.506
## sqft_living           0.702    0.594     0.738       1.000    0.187  0.353
## sqft_lot              0.103    0.036     0.091       0.187    1.000 -0.011
## floors                0.290    0.169     0.506       0.353   -0.011  1.000
## view                  0.380    0.064     0.155       0.235    0.069  0.016
## yr_built              0.068    0.167     0.527       0.341    0.054  0.501
## distance_to_downtown -0.177    0.116     0.180       0.197    0.275  0.061
##                        view yr_built distance_to_downtown
## price                 0.380    0.068               -0.177
## bedrooms              0.064    0.167                0.116
## bathrooms             0.155    0.527                0.180
## sqft_living           0.235    0.341                0.197
## sqft_lot              0.069    0.054                0.275
## floors                0.016    0.501                0.061
## view                  1.000   -0.056               -0.062
## yr_built             -0.056    1.000                0.436
## distance_to_downtown -0.062    0.436                1.000

Based on the previous visualizations, waterfront, region, renovation group, price, sqft_living, floors, yr_built, distance to downtown, and bedrooms may all influence whether a home is deemed good quality.

Define full logistic regression model

full_log <- glm(good_quality~price + sqft_living + yr_built + distance_to_downtown + waterfront + region + renovation_group , data = train, family = binomial())

summary(full_log)
## 
## Call:
## glm(formula = good_quality ~ price + sqft_living + yr_built + 
##     distance_to_downtown + waterfront + region + renovation_group, 
##     family = binomial(), data = train)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         4.583e+01  2.360e+00  19.418  < 2e-16 ***
## price                               1.712e-06  1.473e-07  11.623  < 2e-16 ***
## sqft_living                         5.058e-04  4.820e-05  10.494  < 2e-16 ***
## yr_built                           -2.574e-02  1.234e-03 -20.853  < 2e-16 ***
## distance_to_downtown                4.982e-01  3.212e-01   1.551    0.121    
## waterfront1                        -1.445e+00  3.704e-01  -3.901 9.58e-05 ***
## regionSuburb                        1.092e+00  8.882e-02  12.299  < 2e-16 ***
## regionRural                         4.962e-01  8.396e-02   5.910 3.43e-09 ***
## renovation_groupRecently Renovated -3.125e+00  5.129e-01  -6.092 1.12e-09 ***
## renovation_groupRenovated Long Ago -1.037e+00  1.722e-01  -6.024 1.70e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12037  on 16695  degrees of freedom
## Residual deviance: 10344  on 16686  degrees of freedom
## AIC: 10364
## 
## Number of Fisher Scoring iterations: 6

Check for multicollinearity among the predictor variables

vif(full_log)
##                              price                        sqft_living 
##                          25.794900                          27.314048 
##                           yr_built               distance_to_downtown 
##                          21.912356                          24.973604 
##                        waterfront1                       regionSuburb 
##                           8.200852                          29.237259 
##                        regionRural renovation_groupRecently Renovated 
##                          29.235019                          59.679356 
## renovation_groupRenovated Long Ago 
##                          12.143194

There appears to be a high degree of multicollinearity. From the correlation matrix, sqft_living, bedrooms, and bathrooms all seem correlated with one another. Let’s consider dropping distance_to_downtown, which is not significant, along with waterfront and renovation_group, which show high VIFs. We will conduct a likelihood ratio test to see whether these variables can be dropped from the model, but first we compute the accuracy and error rate from the confusion matrix for this model.

Assess the model’s predictive ability on test data

# Predicted probs for test data
preds<-predict(full_log,newdata=test, type="response")

# Confusion matrix with threshold of 0.5
table(test$good_quality, preds>0.5)
##      
##       FALSE TRUE
##   no   3715   80
##   yes   458   69
# Calculate accuracy and error rate
accuracy <- sum(diag(table(test$good_quality, preds > 0.5))) / sum(table(test$good_quality, preds > 0.5))
error_rate <- 1 - accuracy
cat("Accuracy:", round(accuracy * 100, 2), "%\nError Rate:", round(error_rate * 100, 2), "%")
## Accuracy: 87.55 %
## Error Rate: 12.45 %

Here we have what appears to be good accuracy and a low error rate, but we must remember that most of the dataset consists of homes that are not good_quality, meaning we are dealing with an imbalanced response variable, so accuracy alone can be misleading.
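Given that imbalance, the class-specific rates are more informative than overall accuracy. A quick sketch computed from the confusion matrix above (the name conf_mat is illustrative):

# Sensitivity (true-positive rate) and specificity (true-negative rate)
conf_mat <- table(test$good_quality, preds > 0.5)
sensitivity <- conf_mat["yes", "TRUE"] / sum(conf_mat["yes", ])
specificity <- conf_mat["no", "FALSE"] / sum(conf_mat["no", ])
cat("Sensitivity:", round(sensitivity, 3),
    "\nSpecificity:", round(specificity, 3))

With that caveat in mind, let’s look at the ROC curve and AUC.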

ROC and AUC

# Produce the numbers associated with classification table
rates<-ROCR::prediction(preds, test$good_quality)

# Store the true positive and false positive rates
roc_result<-ROCR::performance(rates,measure="tpr", x.measure="fpr")

# Plot ROC curve and overlay the diagonal line for random guessing
plot(roc_result, main="ROC Curve for Full Model")
lines(x = c(0,1), y = c(0,1), col="red")

# Compute the AUC
auc<-ROCR::performance(rates, measure = "auc")
auc@y.values
## [[1]]
## [1] 0.7865488

Now we see that the full model performs better than random guessing, with an AUC of about 0.79. Let’s see whether a reduced model performs comparably.

Define reduced logistic regression model

reduced_log <- glm(good_quality~price + sqft_living  + yr_built + region , data = train, family = binomial())

summary(reduced_log)
## 
## Call:
## glm(formula = good_quality ~ price + sqft_living + yr_built + 
##     region, family = binomial(), data = train)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.009e+01  2.118e+00  18.927  < 2e-16 ***
## price         1.351e-06  1.264e-07  10.686  < 2e-16 ***
## sqft_living   5.337e-04  4.474e-05  11.929  < 2e-16 ***
## yr_built     -2.273e-02  1.103e-03 -20.606  < 2e-16 ***
## regionSuburb  1.121e+00  8.443e-02  13.278  < 2e-16 ***
## regionRural   4.913e-01  8.004e-02   6.138 8.34e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12037  on 16695  degrees of freedom
## Residual deviance: 10499  on 16690  degrees of freedom
## AIC: 10511
## 
## Number of Fisher Scoring iterations: 5

We see that all predictors in this model are significant.

Assess the reduced model’s predictive ability on test data

# Predicted probs for test data
preds<-predict(reduced_log,newdata=test, type="response")

# Confusion matrix with threshold of 0.5
table(test$good_quality, preds>0.5)
##      
##       FALSE TRUE
##   no   3714   81
##   yes   466   61
# Calculate accuracy and error rate
accuracy <- sum(diag(table(test$good_quality, preds > 0.5))) / sum(table(test$good_quality, preds > 0.5))
error_rate <- 1 - accuracy
cat("Accuracy:", round(accuracy * 100, 2), "%\nError Rate:", round(error_rate * 100, 2), "%")
## Accuracy: 87.34 %
## Error Rate: 12.66 %

ROC and AUC

# Produce the numbers associated with classification table
rates<-ROCR::prediction(preds, test$good_quality)

# Store the true positive and false positive rates
roc_result<-ROCR::performance(rates,measure="tpr", x.measure="fpr")

# Plot ROC curve and overlay the diagonal line for random guessing
plot(roc_result, main="ROC Curve for Reduced Model")
lines(x = c(0,1), y = c(0,1), col="red")

auc<-performance(rates, measure = "auc")
auc@y.values
## [[1]]
## [1] 0.7818572

As we can see, the AUC for the reduced model (0.782) is very slightly lower than for the full model (0.787). Let’s conduct a likelihood ratio test to determine which model to use.

H0: B4 = B5 = B8 = B9 = 0 (the coefficients for distance_to_downtown, waterfront, and the two renovation_group terms); Ha: at least one of these betas is not zero.

# Residual deviances
DR <- reduced_log$deviance  # Deviance of reduced model
DF <- full_log$deviance     # Deviance of full model      

delta_g <- DR - DF

CV <- qchisq(0.95,3)

delta_g > CV
## [1] TRUE
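Equivalently, a p-value for the test can be computed directly (a small sketch; the chi-squared distribution has 4 degrees of freedom, one per dropped coefficient):

# P-value for the likelihood ratio test
pchisq(delta_g, df = 4, lower.tail = FALSE)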

So we reject the null hypothesis: the additional predictors significantly improve the model, and we therefore go forward with the full model over the reduced one.
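As a closing illustration, the chosen full model can score a new listing. The values below are made up for the example; the output is the predicted probability that the home is good quality:

# Hypothetical listing scored with the full logistic model
new_listing <- data.frame(
  price = 650000, sqft_living = 2400, yr_built = 1995,
  distance_to_downtown = 0.15,
  waterfront = factor(0, levels = c(0, 1)),
  region = factor("Suburb", levels = c("City", "Suburb", "Rural")),
  renovation_group = factor("Never Renovated",
                            levels = levels(train$renovation_group))
)
predict(full_log, newdata = new_listing, type = "response")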