Section 42 Model Selection: Best Subset Selection
42.1 Statistical Model
\[ \large y_{i} = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_{i} \]
\[ i = 1,...,n; \space p = \space number \space of \space predictors \]
42.2 Steps
- Fit a Null model \(M_0\) which contains no predictors. This model simply predicts the sample mean for each observation. 
- For \(k = 1,2,...,p\) - Fit all \(p \choose k\) models that contain exactly \(k\) predictors. 
- Choose the best model \(M_k\) among these \(p \choose k\) models based on the smallest RSS, or equivalently, largest \(R^2\). 
 
- Select a single best model from among \(M_0, M_1, ..., M_p\) models using the criteria Adjusted \(R^2\), Mallow’s Cp, AIC or BIC. 
42.4 R implementation
# par(mar=c(0.5,0.5,0.5,0.5))
require(leaps)
BP <- read.csv('data/BP.csv')
# Remove NA values from all variables
BP <- na.omit(BP)
BP$DM <- as.factor(BP$DM)
# Subset the data
BP <- BP[,-1]
# Best subset selection
sub.fm <- regsubsets(SBP ~ ., data=BP, 
                      nbest=1, nvmax=6, intercept=TRUE,
                      method='exhaustive')
sub.summary <- summary(sub.fm)
# names(sub.summary)
sub.summarySubset selection object
Call: regsubsets.formula(SBP ~ ., data = BP, nbest = 1, nvmax = 6, 
    intercept = TRUE, method = "exhaustive")
6 Variables  (and intercept)
                Forced in Forced out
EthnicAsian         FALSE      FALSE
EthnicCaucasian     FALSE      FALSE
Age                 FALSE      FALSE
Income              FALSE      FALSE
BMI                 FALSE      FALSE
DM2                 FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: exhaustive
         EthnicAsian EthnicCaucasian Age Income BMI DM2
1  ( 1 ) " "         " "             " " " "    "*" " "
2  ( 1 ) " "         " "             " " " "    "*" "*"
3  ( 1 ) " "         " "             "*" " "    "*" "*"
4  ( 1 ) " "         " "             "*" "*"    "*" "*"
5  ( 1 ) " "         "*"             "*" "*"    "*" "*"
6  ( 1 ) "*"         "*"             "*" "*"    "*" "*"    (Intercept)     EthnicAsian EthnicCaucasian             Age          Income             BMI             DM2 
   -3.858060268    -0.002753277    -0.013373729     0.903319577    -0.003045831     2.348833049     4.073927322 
# Customised plot
par(mfrow=c(2,2))
xlab <- 'Number of X-variables'
plot(x = sub.summary$rss, 
     xlab=xlab, ylab='RSS', 
     type='l', col='blue')
index <- which.min(sub.summary$rss)
points(x = index, y = sub.summary$rss[index], 
       col='red', cex=2, pch=20)
plot(x=sub.summary$adjr2, 
     xlab=xlab, ylab='Adjusted R2', 
     type='l', col='blue')
index <- which.max(sub.summary$adjr2)
points(x = index, y = sub.summary$adjr2[index], 
       col='red', cex=2, pch=20)
plot(x=sub.summary$cp, 
     xlab=xlab, ylab='Mallow Cp', 
     type='l', col='blue')
index <- which.min(sub.summary$cp)
points(x = index, y = sub.summary$cp[index], 
       col='red', cex=2, pch=20)
plot(x=sub.summary$bic, 
     xlab=xlab, ylab='BIC', 
     type='l', col='blue')
index <- which.min(sub.summary$bic)
points(x = index, y = sub.summary$bic[index], 
       col='red', cex=2, pch=20)
(Intercept)         Age         BMI         DM2 
 -3.9018887   0.9022749   2.3487465   4.0731999 
42.3 Comments
There are \(2^p\) possibilities to select the best model using this approach.
The best subset models for each level may not be nested.
The best subset selection is a simple and conceptually appealing approach.
However, it suffers from computational limitations due to its large search space.
The enormouse search space can also lead to overfitting and higher variance of the coefficient estimates.