Section 41 MLR: Model Selection
Multiple Linear Regression: Model Selection
41.1 Statistical Model
\[ \large y_{i} = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \epsilon_{i} \]
\[ i = 1, \dots, n; \quad p = \text{number of predictors} \]
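As a minimal sketch, the model above can be fitted by ordinary least squares. The data below are simulated purely for illustration; all variable names are hypothetical, not from the source.

```python
# Minimal sketch: fit a multiple linear regression on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 100, 3                            # n observations, p predictors
X = rng.normal(size=(n, p))              # columns are x_1, ..., x_p
beta = np.array([1.0, 2.0, -1.5, 0.5])   # true beta_0, beta_1, ..., beta_p
y = beta[0] + X @ beta[1:] + rng.normal(size=n)   # add the error term epsilon_i

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)                        # estimates of beta_0, ..., beta_p
```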
41.2 Optimal Model Selection
- Because the model containing all predictors always has the smallest RSS and the largest \(R^2\), RSS and \(R^2\) are not suitable for selecting the best model among a collection of models with different numbers of predictors. 
- Moreover, it is generally not a good idea to build a model that includes all available predictors. 
- The model with the smallest number of variables that fits the data adequately is usually the best. 
- A parsimonious model excludes correlated predictors, reduces overfitting, reduces noise in the predictions, and improves prediction accuracy and model interpretability. 
- A parsimonious model is also less time- and resource-intensive, since fewer predictors need to be recorded. 
- Variable selection (model selection) is therefore an important step in model development. 
- Criteria that choose an optimal model by adjusting the training error for model size are: adjusted \(R^2\), AIC, BIC, and Mallows's \(C_p\) statistic. 
- Other criteria estimate the test error directly rather than adjusting the training error: the validation-set approach and cross-validation. 
- These criteria are applied within different methods of model selection, such as best subset selection and forward and backward stepwise selection (see the sketch after this list). 
- Other methods of parsimonious model development are shrinkage (regularisation) methods (ridge, lasso, elastic net) and dimension reduction methods (PCR, PLS). 
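Below is a minimal sketch of forward stepwise selection scored by AIC, assuming a predictor matrix `X` and response `y` as in the fitting example above; the function name `forward_stepwise` is hypothetical, not from any library.

```python
# Sketch: greedy forward stepwise selection scored by AIC.
import numpy as np
import statsmodels.api as sm

def forward_stepwise(X, y):
    """At each step, add the predictor that lowers AIC the most;
    stop when no remaining predictor improves the score."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic  # intercept-only model
    improved = True
    while improved and remaining:
        improved = False
        # Score every candidate model that adds one more predictor.
        scores = [(sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic, j)
                  for j in remaining]
        aic, j = min(scores)
        if aic < best_aic:            # keep the predictor only if AIC drops
            best_aic = aic
            selected.append(j)
            remaining.remove(j)
            improved = True
    return selected, best_aic
```

Best subset selection would instead score all \(2^p\) candidate models, which is exact but quickly becomes infeasible as \(p\) grows; the greedy search above fits only on the order of \(p^2\) models.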
41.3 Adjusted Coefficient of Determination (Adjusted \(R^2\))
\[ \large Adj.R^2 = 1 - \frac{\text{Residual SS} \,/\, df_{residual}}{\text{Total SS} \,/\, df_{total}} \]
\[ \large Adj.R^2 = 1 - \frac{RSS \,/\, (n-p-1)}{TSS \,/\, (n-1)} \]
- With \(p\) predictors plus an intercept, \(df_{residual} = n - p - 1\) and \(df_{total} = n - 1\). 
- A model with a large value of adjusted \(R^2\) has a small estimated test error: larger is better. Unlike ordinary \(R^2\), adjusted \(R^2\) penalises the inclusion of unnecessary variables.
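A one-function sketch of the formula above; `adjusted_r2` is an illustrative name, and `y_hat` is assumed to hold the fitted values from a model with `p` predictors plus an intercept.

```python
import numpy as np

def adjusted_r2(y, y_hat, p):
    """Adjusted R^2 from RSS and TSS, with p predictors plus an intercept."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1 - (rss / (n - p - 1)) / (tss / (n - 1))
```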
41.4 Mallows's \(C_p\) statistic
\[ \large C_p = \frac{1}{n} (RSS + 2p\hat\sigma^2) \]
- The model with the smallest \(C_p\) value has the smallest estimated test error among all candidate models: smaller is better. Here \(\hat\sigma^2\) is an estimate of the error variance, typically obtained from the full model containing all predictors.
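A minimal sketch of the \(C_p\) formula above; the name `mallows_cp` is illustrative, and `sigma2_hat` is assumed to come from the full model.

```python
import numpy as np

def mallows_cp(y, y_hat, p, sigma2_hat):
    """Mallows's Cp = (RSS + 2*p*sigma2_hat) / n for a model with p predictors."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return (rss + 2 * p * sigma2_hat) / n
```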
41.5 Akaike Information Criterion (AIC)
\[ \large AIC = -2\log L + 2p \]
- Here \(L\) is the maximised value of the likelihood function. For Gaussian errors fitted by least squares, AIC can be estimated as:
\[ \large AIC = \frac{1}{n\hat\sigma^2} \left( RSS + 2p\hat\sigma^2 \right) + \text{constant} \]
- The model with the smallest \(AIC\) value has the smallest estimated test error among all candidate models: smaller is better.
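A minimal sketch of the Gaussian least-squares form of AIC given above, dropping the additive constant; `aic_gaussian` is an illustrative name.

```python
import numpy as np

def aic_gaussian(y, y_hat, p, sigma2_hat):
    """Gaussian least-squares AIC, dropping the additive constant."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return (rss + 2 * p * sigma2_hat) / (n * sigma2_hat)
```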
41.6 Bayesian Information Criterion (BIC)
\[ \large BIC = -2\log L + \log(n)\, p \]
- For Gaussian errors fitted by least squares, BIC can be estimated as:
\[ \large BIC = \frac{1}{n\hat\sigma^2} \left( RSS + \log(n)\, p\hat\sigma^2 \right) + \text{constant} \]
- The model with the smallest \(BIC\) value has the smallest estimated test error among all candidate models: smaller is better. Because \(\log n > 2\) for \(n > 7\), BIC penalises model size more heavily than AIC and tends to select smaller models.
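A matching sketch of the Gaussian least-squares form of BIC, again dropping the additive constant; `bic_gaussian` is an illustrative name.

```python
import numpy as np

def bic_gaussian(y, y_hat, p, sigma2_hat):
    """Gaussian least-squares BIC, dropping the additive constant."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return (rss + np.log(n) * p * sigma2_hat) / (n * sigma2_hat)
```

Any of these criteria can then be compared across the candidate models produced by, for example, the forward stepwise sketch in 41.2.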