Housing Prices in Boston

Questions:
1. Is there evidence to suggest that the crime rate influences house prices in Boston?
2. How would you gauge the degree of dependency between variables? Consider using correlation and a linear model.
3. Establish whether incorporating additional variables into your analysis provides a better prediction of house prices in Boston.
4. Interpret the relationship between housing prices and your chosen variables.

boston = read.csv("Boston.csv")
head(boston)

# removing the 'X' column
boston = boston[,-12]
head(boston)

# stack the three density plots in a single column
layout(matrix(1:3, ncol = 1))

max.crim = which.max(boston$crim)
cat(sep = "", "Row ", max.crim, " of the neighborhoods has the highest \"crim\".")

## Row 381 of the neighborhoods has the highest "crim".

plot(density(boston$crim), main = "Density plot of crim", xlab = "crim")
rug(boston$crim)
points(boston$crim[max.crim], 0, col = "red", pch = 1, cex = 1.5)
legend(x = "topright", legend = "Neighborhood with highest crime rate", col = "red", pch = 1)

plot(density(boston$med.value), main = "Density plot of med.value", xlab = "med.value")
rug(boston$med.value)
points(boston$med.value[max.crim], 0, col = "red", pch = 1, cex = 1.5)
legend(x = "topright", legend = "Neighborhood with highest crime rate", col = "red", pch = 1)

plot(density(boston$pupil.teacher), main = "Density plot of pupil.teacher", xlab = "pupil.teacher")
rug(boston$pupil.teacher)
points(boston$pupil.teacher[max.crim], 0, col = "red", pch = 1, cex = 1.5)
legend(x = "topleft", legend = "Neighborhood with highest crime rate", col = "red", pch = 1)

layout(1)  # reset to a single plotting panel so later plots are not squeezed into thirds

any(is.na(boston))

## [1] FALSE

Question 1

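# predictor on the x-axis, response on the y-axis, matching med.value ~ crim
plot(boston$crim, boston$med.value, xlab = "crim", ylab = "med.value")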

lm_simple <- lm(med.value ~ crim, data = boston)
summary(lm_simple)

## 
## Call:
## lm(formula = med.value ~ crim, data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.957  -5.449  -2.007   2.512  29.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.03311    0.40914   58.74   <2e-16 ***
## crim        -0.41519    0.04389   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.484 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16
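
For a regression with a single predictor, the multiple R-squared equals the squared Pearson correlation between predictor and response, so the correlation between crim and med.value can be recovered from the output above. The following is a minimal sketch that uses only the two numbers copied from that summary; the cross-check at the end assumes the `boston` data frame is still loaded and is skipped otherwise.

```r
# Values copied from the summary(lm_simple) output above.
r_squared <- 0.1508
slope     <- -0.41519

# In simple linear regression R^2 = r^2, and r carries the sign of the slope.
r <- sign(slope) * sqrt(r_squared)
print(round(r, 3))

# Cross-check against the data when `boston` is loaded (skipped otherwise).
if (exists("boston")) {
  print(cor(boston$crim, boston$med.value))
}
```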

lm_full = lm(med.value ~ ., data = boston)
summary(lm_full)

## 
## Call:
## lm(formula = med.value ~ ., data = boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1304  -2.7673  -0.5814   1.9414  26.2526 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    41.617270   4.936039   8.431 3.79e-16 ***
## crim           -0.121389   0.033000  -3.678 0.000261 ***
## zn              0.046963   0.013879   3.384 0.000772 ***
## indus           0.013468   0.062145   0.217 0.828520    
## chas            2.839993   0.870007   3.264 0.001173 ** 
## nox           -18.758022   3.851355  -4.870 1.50e-06 ***
## rooms           3.658119   0.420246   8.705  < 2e-16 ***
## age             0.003611   0.013329   0.271 0.786595    
## distance       -1.490754   0.201623  -7.394 6.17e-13 ***
## highway         0.289405   0.066908   4.325 1.84e-05 ***
## tax            -0.012682   0.003801  -3.337 0.000912 ***
## pupil.teacher  -0.937533   0.132206  -7.091 4.63e-12 ***
## lstat          -0.552019   0.050659 -10.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared:  0.7343, Adjusted R-squared:  0.7278 
## F-statistic: 113.5 on 12 and 493 DF,  p-value: < 2.2e-16

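The summary of the full model shows that the crim variable (crime rate) remains significant after adjusting for the other predictors (p = 0.000261), with an estimate of -0.121389. Each one-unit increase in the crime rate is therefore associated with a decrease of about 0.121 thousand dollars (roughly $121) in the median house value, holding all other variables constant. In the simple model the estimate is larger in magnitude (-0.41519), but crim alone explains only about 15% of the variation in med.value (R-squared 0.1508). Both fits give evidence of a negative association between crime rate and house prices; a regression on observational data cannot by itself show that crime causes the lower prices.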
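
A confidence interval makes the size and uncertainty of the crim effect easier to compare across the two models than a p-value alone. The sketch below wraps base R's `confint()` in a small helper; `coef_with_ci` is a hypothetical name introduced here for illustration, not part of the original analysis, and the final call assumes `lm_simple` and `lm_full` from above are in the workspace.

```r
# Hypothetical helper (not in the original analysis): return one coefficient's
# estimate together with its confidence interval from a fitted lm.
coef_with_ci <- function(model, term, level = 0.95) {
  ci <- confint(model, parm = term, level = level)
  c(estimate = unname(coef(model)[term]), lower = ci[1, 1], upper = ci[1, 2])
}

# Uses the models fitted above; skipped if they are not in the workspace.
if (exists("lm_simple") && exists("lm_full")) {
  print(rbind(simple = coef_with_ci(lm_simple, "crim"),
              full   = coef_with_ci(lm_full, "crim")))
}
```

If both intervals exclude zero, the negative association holds with and without the other predictors; how far apart the two intervals sit shows how much of the simple-model effect is shared with those predictors.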

plot(lm_full)

T-test for Coefficients: The summaries of both linear models (simple and full) include a t-test for each coefficient. Each test assesses whether the coefficient differs significantly from zero, i.e., whether the predictor has a significant linear relationship with the outcome variable once the other predictors in the model are accounted for.
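
Each t value in the coefficient table is the estimate divided by its standard error, and the p-value is the two-sided tail area of a t distribution on the residual degrees of freedom. As a rough check, the crim row of the full model can be reproduced from the rounded numbers printed in the summary above:

```r
# Numbers copied from the crim row of summary(lm_full) above.
estimate  <- -0.121389
std_error <- 0.033000
resid_df  <- 493        # residual degrees of freedom of lm_full

# t statistic: how many standard errors the estimate lies from zero.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with resid_df degrees of freedom.
p_value <- 2 * pt(-abs(t_value), df = resid_df)

print(c(t_value = t_value, p_value = p_value))
```

Because the inputs are rounded, the result should agree with the reported t value (-3.678) and p-value (0.000261) only up to rounding.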

anova(lm_simple, lm_full)

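ANOVA for Model Comparison: We can compare the multiple linear regression (lm_full) with the simple linear regression (lm_simple) using an ANOVA. Because the simple model is nested in the full one, this F-test tells us whether the additional variables in the full model significantly improve the model’s fit compared to the simple model.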

The Sum of Sq (24926) is the reduction in the RSS (residual sum of squares) when moving from Model 1 to Model 2. This large decrease indicates that the additional predictors in Model 2 explain a substantial amount of variation in house values that is not captured by the crime rate alone.

F-Statistic and P-Value: The F-statistic (98.432) tests whether the 11 additional predictors in Model 2 jointly improve on Model 1; it is not the overall-significance F reported in each model's summary. Such a high value indicates that the model fit improves substantially when the additional predictors are added. The p-value (< 2.2e-16) is extremely small, well below any conventional significance level (such as 0.05), so the improvement in fit from Model 1 to Model 2 is statistically significant.
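
The F-statistic for comparing nested models is the drop in RSS per added predictor divided by the residual variance of the larger model. The sketch below recomputes it from rounded figures that appear elsewhere in this document: the Sum of Sq quoted above, the full-model RSS printed on the `<none>` line of the first step() table, and the full model's residual degrees of freedom.

```r
# Figures taken from this document (rounded as printed).
sum_of_sq <- 24926   # RSS(Model 1) - RSS(Model 2), from the ANOVA discussion
rss_full  <- 11349   # RSS of lm_full, from the step() output
df_full   <- 493     # residual degrees of freedom of lm_full
df_added  <- 11      # predictors in lm_full that are not in lm_simple

# Partial F statistic: mean square of the improvement over the
# residual mean square of the full model.
f_stat <- (sum_of_sq / df_added) / (rss_full / df_full)

# Upper-tail probability under the F distribution.
p_value <- pf(f_stat, df_added, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

With rounded inputs the statistic matches the reported 98.432 only approximately.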

Conclusion: The ANOVA results strongly suggest that including additional variables (as in Model 2) fits house prices in Boston significantly better than a model that only includes the crime rate (Model 1). The additional predictors contribute meaningful information that is not captured by the crime rate alone.

Question 2

#install.packages("car")
library(car)

## Loading required package: carData

vif(lm_full)

##          crim            zn         indus          chas           nox 
##      1.767486      2.298459      3.987181      1.071168      4.369093 
##         rooms           age      distance       highway           tax 
##      1.912532      3.088232      3.954037      7.445301      9.002158 
## pupil.teacher         lstat 
##      1.797060      2.870777

Multicollinearity refers to the situation where predictor variables in a regression model are highly correlated with one another, which inflates the standard errors and makes the individual effect of each predictor hard to evaluate. As a common rule of thumb, a VIF above 5 indicates a problematic amount of multicollinearity. In our output, tax (9.00) and highway (7.45) exceed that threshold, suggesting that these variables are highly correlated with other predictors in the model; the correlation matrix below shows that they are strongly correlated with each other (0.91).
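
For a numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it without the car package; `vif_by_hand` is a hypothetical helper written for illustration, and the final call assumes the `boston` data frame from above is loaded.

```r
# Hypothetical helper: VIF of one predictor, computed from the auxiliary
# regression of that predictor on every other predictor in `data`.
vif_by_hand <- function(data, response, predictor) {
  # all columns except the response and the predictor under study
  others <- setdiff(names(data), c(response, predictor))
  aux <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(aux)$r.squared)
}

# Uses the data loaded above; skipped if `boston` is not in the workspace.
if (exists("boston")) {
  print(vif_by_hand(boston, response = "med.value", predictor = "tax"))
}
```

For tax this should agree with the value that `vif(lm_full)` reports above, since every term in the model is a single numeric column.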

# correlations among the predictors only (response column excluded by name)
cor_matrix <- cor(boston[, names(boston) != "med.value"])
cor_matrix

##                      crim          zn       indus         chas         nox
## crim           1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn            -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus          0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas          -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox            0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rooms         -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age            0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## distance      -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## highway        0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax            0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## pupil.teacher  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## lstat          0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
##                     rooms         age    distance      highway         tax
## crim          -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn             0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus         -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas           0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox           -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rooms          1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age           -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## distance       0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## highway       -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax           -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## pupil.teacher -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## lstat         -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
##               pupil.teacher      lstat
## crim              0.2899456  0.4556215
## zn               -0.3916785 -0.4129946
## indus             0.3832476  0.6037997
## chas             -0.1215152 -0.0539293
## nox               0.1889327  0.5908789
## rooms            -0.3555015 -0.6138083
## age               0.2615150  0.6023385
## distance         -0.2324705 -0.4969958
## highway           0.4647412  0.4886763
## tax               0.4608530  0.5439934
## pupil.teacher     1.0000000  0.3740443
## lstat             0.3740443  1.0000000
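
A 12-by-12 matrix is hard to scan by eye, so it can help to list the strongest pairwise correlations directly. The sketch below does this for any correlation matrix; `top_correlations` is a hypothetical helper name, and the final call assumes `cor_matrix` from above is in the workspace.

```r
# Hypothetical helper: list the n variable pairs with the largest
# absolute correlation, using only the upper triangle of the matrix.
top_correlations <- function(cor_mat, n = 5) {
  idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx]
  )
  pairs <- pairs[order(-abs(pairs$r)), ]
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Uses the matrix computed above; skipped if it is not in the workspace.
if (exists("cor_matrix")) {
  print(top_correlations(cor_matrix, n = 5))
}
```

In the matrix printed above, the strongest pair is highway and tax at about 0.91, which is consistent with those two variables having the highest VIFs.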

Question 3

# Stepwise selection using both directions

step_model <- step(lm_full, direction = "both")

## Start:  AIC=1599.85
## med.value ~ crim + zn + indus + chas + nox + rooms + age + distance + 
##     highway + tax + pupil.teacher + lstat
## 
##                 Df Sum of Sq   RSS    AIC
## - indus          1      1.08 11350 1597.9
## - age            1      1.69 11351 1597.9
## <none>                       11349 1599.8
## - chas           1    245.31 11595 1608.7
## - tax            1    256.28 11606 1609.2
## - zn             1    263.59 11613 1609.5
## - crim           1    311.49 11661 1611.6
## - highway        1    430.71 11780 1616.7
## - nox            1    546.10 11896 1621.6
## - pupil.teacher  1   1157.70 12507 1647.0
## - distance       1   1258.52 12608 1651.1
## - rooms          1   1744.36 13094 1670.2
## - lstat          1   2733.54 14083 1707.0
## 
## Step:  AIC=1597.9
## med.value ~ crim + zn + chas + nox + rooms + age + distance + 
##     highway + tax + pupil.teacher + lstat
## 
##                 Df Sum of Sq   RSS    AIC
## - age            1      1.69 11352 1596.0
## <none>                       11350 1597.9
## + indus          1      1.08 11349 1599.8
## - chas           1    251.21 11602 1607.0
## - zn             1    262.99 11614 1607.5
## - tax            1    299.68 11650 1609.1
## - crim           1    313.07 11664 1609.7
## - highway        1    453.61 11804 1615.7
## - nox            1    574.23 11925 1620.9
## - pupil.teacher  1   1168.01 12518 1645.5
## - distance       1   1333.19 12684 1652.1
## - rooms          1   1750.50 13101 1668.5
## - lstat          1   2743.21 14094 1705.4
## 
## Step:  AIC=1595.98
## med.value ~ crim + zn + chas + nox + rooms + distance + highway + 
##     tax + pupil.teacher + lstat
## 
##                 Df Sum of Sq   RSS    AIC
## <none>                       11352 1596.0
## + age            1      1.69 11350 1597.9
## + indus          1      1.08 11351 1597.9
## - chas           1    254.21 11606 1605.2
## - zn             1    261.75 11614 1605.5
## - tax            1    298.57 11651 1607.1
## - crim           1    313.27 11666 1607.8
## - highway        1    452.16 11804 1613.7
## - nox            1    601.74 11954 1620.1
## - pupil.teacher  1   1168.51 12521 1643.5
## - distance       1   1496.35 12848 1656.6
## - rooms          1   1848.38 13201 1670.3
## - lstat          1   3043.23 14395 1714.2

summary(step_model)

## 
## Call:
## lm(formula = med.value ~ crim + zn + chas + nox + rooms + distance + 
##     highway + tax + pupil.teacher + lstat, data = boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1814  -2.7625  -0.6243   1.8448  26.3920 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    41.451747   4.903283   8.454 3.18e-16 ***
## crim           -0.121665   0.032919  -3.696 0.000244 ***
## zn              0.046191   0.013673   3.378 0.000787 ***
## chas            2.871873   0.862591   3.329 0.000935 ***
## nox           -18.262427   3.565247  -5.122 4.33e-07 ***
## rooms           3.672957   0.409127   8.978  < 2e-16 ***
## distance       -1.515951   0.187675  -8.078 5.08e-15 ***
## highway         0.283932   0.063945   4.440 1.11e-05 ***
## tax            -0.012292   0.003407  -3.608 0.000340 ***
## pupil.teacher  -0.930961   0.130423  -7.138 3.39e-12 ***
## lstat          -0.546509   0.047442 -11.519  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.789 on 495 degrees of freedom
## Multiple R-squared:  0.7342, Adjusted R-squared:  0.7289 
## F-statistic: 136.8 on 10 and 495 DF,  p-value: < 2.2e-16

To address this question, we compare models with different subsets of variables to see whether adding or removing variables improves the model’s fit. Stepwise regression automates this search through forward selection, backward elimination, or both (as in the step() call above), and candidate models can be compared by AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), or adjusted R-squared. By default, step() selects on AIC; here it dropped indus and then age.

Final Model Interpretation: The final model includes the variables crim, zn, chas, nox, rooms, distance, highway, tax, pupil.teacher, and lstat. All ten are significant at the 0.001 level in the summary above. Two of them are interpreted below:

crim: Negative coefficient (-0.121665), significant. Indicates that higher crime rates are associated with lower house values.

zn: Positive coefficient (0.046191), significant. Suggests that a larger proportion of land zoned for large residential lots is associated with higher house values.

The adjusted R-squared of 0.7289 (multiple R-squared 0.7342) indicates that the model explains roughly 73% of the variability in median house values. The overall F-statistic (136.8 on 10 and 495 DF, p < 2.2e-16) is significant, meaning that the predictors jointly explain far more variation than an intercept-only model.

This model improves on the full model as indicated by a lower AIC (1595.98 versus 1599.85): dropping indus and age barely changes the RSS (11349 to 11352). It therefore provides a more parsimonious explanation of the relationship between the predictor variables and median house values in Boston.
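
To put the three fitted models side by side on the criteria mentioned above, a small comparison table can be built with base R's `AIC()`, `BIC()`, and `summary()`. This is only a sketch: `compare_models` is a hypothetical helper, it expects named arguments, and the final call assumes `lm_simple`, `lm_full`, and `step_model` exist in the workspace.

```r
# Hypothetical helper: one row per model with size and fit criteria.
# Models must be passed as named arguments, e.g. compare_models(a = m1, b = m2).
compare_models <- function(...) {
  models <- list(...)
  data.frame(
    model  = names(models),
    n_coef = sapply(models, function(m) length(coef(m))),
    adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
    AIC    = sapply(models, AIC),
    BIC    = sapply(models, BIC),
    row.names = NULL
  )
}

# Uses the models fitted above; skipped if they are not in the workspace.
if (exists("lm_simple") && exists("lm_full") && exists("step_model")) {
  print(compare_models(simple = lm_simple, full = lm_full, stepwise = step_model))
}
```

The AIC values from `AIC()` will not match the numbers printed by step(), which uses `extractAIC()` and omits an additive constant; for models fitted to the same data the ranking is the same.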

Question 4

Among the chosen variables, the rooms coefficient is positive and significant (3.67): each additional room per dwelling is associated with a median value about $3,670 higher, holding the other predictors constant. Conversely, the nox variable, which represents nitrogen oxides pollution, has a negative coefficient (-18.26), indicating that higher pollution levels are associated with lower house values.
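
Since med.value is recorded in thousands of dollars (as in the Question 1 discussion), each coefficient can be read as a dollar change per one-unit increase in its predictor, holding the others constant. A minimal sketch, with the estimates copied from the summary(step_model) output above:

```r
# Estimates copied from the summary(step_model) output above.
step_coefs <- c(
  crim = -0.121665, zn = 0.046191, chas = 2.871873, nox = -18.262427,
  rooms = 3.672957, distance = -1.515951, highway = 0.283932,
  tax = -0.012292, pupil.teacher = -0.930961, lstat = -0.546509
)

# Assumes med.value is in thousands of dollars, so multiply by 1000.
dollars_per_unit <- round(step_coefs * 1000)

# Sort from the largest negative to the largest positive effect.
print(sort(dollars_per_unit))
```

These per-unit figures are not directly comparable across predictors, because the predictors are measured on very different scales.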