Section 41 MLR: Model Selection
Multiple Linear Regression: Model Selection
41.1 Statistical Model
\[ \large y_{i} = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_{i} \]
\[ i = 1, \dots, n; \quad p = \text{number of predictors} \]
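To make the notation concrete, the sketch below (assuming Python with `numpy` and `statsmodels`; the sample size, coefficients, and data are synthetic and purely illustrative) simulates this model and fits it by least squares:

```python
# Minimal sketch: simulate y = b0 + b1*x1 + ... + bp*xp + eps and fit by OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 100, 3                            # n observations, p predictors (assumed values)
X = rng.normal(size=(n, p))              # columns are x_1, ..., x_p
beta = np.array([2.0, 0.5, -1.0, 0.0])   # beta_0, beta_1, ..., beta_p (assumed)
y = beta[0] + X @ beta[1:] + rng.normal(size=n)  # epsilon_i ~ N(0, 1)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)                        # estimates of beta_0, ..., beta_p
```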
41.2 Optimal Model Selection
The model with all predictors always has the smallest RSS and the highest \(R^2\); therefore, RSS and \(R^2\) are not suitable for selecting the best model among a collection of models with different numbers of predictors.
However, it is not a good idea to include every available predictor in the model.
A model with fewer variables that fits the data adequately is generally the best.
A parsimonious model excludes correlated predictors, decreases overfitting, adds less noise to the predictions, and enhances prediction accuracy and model interpretability.
A parsimonious model is also less time- and resource-intensive, since fewer predictors need to be recorded.
Variable selection, or model selection, is therefore an important step in model development.
Criteria that choose an optimal model by adjusting the training error for model size are: adjusted \(R^2\), AIC, BIC, and Mallow's \(C_p\) statistic.
Other criteria estimate the test error directly: the validation and cross-validation approaches.
These criteria are applied within model-selection methods such as best subset selection and forward and backward stepwise selection (a sketch of forward stepwise selection appears after this list).
Other methods of parsimonious model development are: shrinkage (regularisation) methods (ridge, lasso, elastic net) and dimension reduction (PCR, PLS).
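As a rough illustration of how a criterion drives a selection method, here is a sketch of forward stepwise selection scored by AIC (Python with `numpy` and `statsmodels` assumed; the data are synthetic and the stopping rule shown is one simple choice among several):

```python
# Forward stepwise selection: start from the intercept-only model, greedily
# add the predictor that most improves AIC, stop when no addition improves it.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)

def forward_stepwise(X, y):
    remaining, selected = list(range(X.shape[1])), []
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic   # intercept-only model
    while remaining:
        # score each candidate addition by the AIC of the enlarged model
        aic, j = min((sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic, j)
                     for j in remaining)
        if aic >= best_aic:                           # no improvement: stop
            break
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return selected, best_aic

print(forward_stepwise(X, y))   # expected to pick columns 0 and 2 here
```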
41.3 Adjusted Coefficient of Determination (Adjusted \(R^2\))
\[ \large Adj.R^2 = 1 - \frac{\text{Residual SS} / df_{residual}}{\text{Total SS} / df_{total}} \]
\[ \large Adj.R^2 = 1 - \frac{RSS / (n-p-1)}{TSS / (n-1)} \]
- A large value of adjusted \(R^2\) indicates a model with a small test error; larger is better.
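As a quick check of the formula, the sketch below (synthetic data; Python with `numpy` and `statsmodels` assumed) computes adjusted \(R^2\) directly from RSS and TSS and compares it with the value statsmodels reports:

```python
# Compute adjusted R^2 from RSS and TSS and compare with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.5, -1.0, 0.0]) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
rss = np.sum(fit.resid ** 2)
tss = np.sum((y - y.mean()) ** 2)
adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
print(adj_r2, fit.rsquared_adj)   # the two values agree
```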
41.4 Mallow’s Cp statistic
\[ \large C_p = \frac{1}{n} (RSS + 2p\hat\sigma^2) \]
where \(\hat\sigma^2\) is an estimate of the variance of the error \(\epsilon\), typically obtained from the full model containing all predictors.
- The model with the smallest \(C_p\) value is estimated to have the smallest test error among the candidate models; smaller is better.
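Here is a sketch of computing \(C_p\) for every subset of predictors, with \(\hat\sigma^2\) estimated from the full model (a common choice; synthetic data, Python assumed):

```python
# Mallow's C_p = (RSS + 2 p sigma^2) / n for each subset of predictors.
import numpy as np
import statsmodels.api as sm
from itertools import combinations

rng = np.random.default_rng(3)
n, p_full = 100, 3
X = rng.normal(size=(n, p_full))
y = 1.0 + X @ np.array([0.5, -1.0, 0.0]) + rng.normal(size=n)

sigma2 = sm.OLS(y, sm.add_constant(X)).fit().mse_resid   # full-model estimate
for k in range(1, p_full + 1):
    for cols in combinations(range(p_full), k):
        rss = np.sum(sm.OLS(y, sm.add_constant(X[:, list(cols)])).fit().resid ** 2)
        cp = (rss + 2 * k * sigma2) / n
        print(cols, round(cp, 3))    # pick the subset with the smallest C_p
```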
41.5 Akaike Information Criterion (AIC)
\[ \large AIC = -2\log L + 2p \]
where \(L\) is the maximised value of the likelihood function.
- For Gaussian errors fitted by least squares, AIC can be computed (up to an additive constant) as:
\[ \large AIC = \frac{1}{n\hat\sigma^2} (RSS + 2p\hat\sigma^2) + \text{constant} \]
- The model with the smallest \(AIC\) value is estimated to have the smallest test error among the candidate models; smaller is better.
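The sketch below evaluates the least-squares AIC expression (with the constant dropped, so only differences between models matter) over a sequence of nested models, alongside statsmodels' likelihood-based `aic`; the two need not match numerically, since \(\hat\sigma^2\) here is held fixed at the full-model estimate (synthetic data, Python assumed):

```python
# AIC ~ (RSS + 2 p sigma^2) / (n sigma^2), constant dropped, for nested models.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 100
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.5, -1.0, 0.0]) + rng.normal(size=n)

sigma2 = sm.OLS(y, sm.add_constant(X)).fit().mse_resid
for p in range(1, 4):                                 # models using x_1, ..., x_p
    fit = sm.OLS(y, sm.add_constant(X[:, :p])).fit()
    rss = np.sum(fit.resid ** 2)
    aic_ls = (rss + 2 * p * sigma2) / (n * sigma2)
    print(p, round(aic_ls, 4), round(fit.aic, 2))     # compare how each ranks the models
```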
41.6 Bayesian Information Criterion (BIC)
\[ \large BIC = -2\log L + \log(n)\, p \]
- For Gaussian errors fitted by least squares, BIC can be computed (up to an additive constant) as:
\[ \large BIC = \frac{1}{n\hat\sigma^2} (RSS + \log(n)\, p\hat\sigma^2) + \text{constant} \]
- The model with the smallest \(BIC\) value is estimated to have the smallest test error among the candidate models; smaller is better.
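The same sketch for BIC; because the penalty multiplier is \(\log(n)\) rather than 2, BIC penalises each extra predictor more heavily than AIC whenever \(n \ge 8\), since \(\log(n) > 2\) once \(n > e^2 \approx 7.4\) (synthetic data, Python assumed):

```python
# BIC ~ (RSS + log(n) p sigma^2) / (n sigma^2), constant dropped.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 100
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.5, -1.0, 0.0]) + rng.normal(size=n)

sigma2 = sm.OLS(y, sm.add_constant(X)).fit().mse_resid
for p in range(1, 4):                                 # models using x_1, ..., x_p
    fit = sm.OLS(y, sm.add_constant(X[:, :p])).fit()
    rss = np.sum(fit.resid ** 2)
    bic_ls = (rss + np.log(n) * p * sigma2) / (n * sigma2)
    print(p, round(bic_ls, 4), round(fit.bic, 2))     # smaller is better
```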