Section 41 MLR: Model Selection

Multiple Linear Regression: Model Selection


41.1 Statistical Model

\[ \large y_{i} = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_{i} \]

\[ i = 1, \dots, n; \quad p = \text{number of predictors} \]
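As a concrete illustration, here is a minimal Python sketch that simulates data from this model and fits it by ordinary least squares (the sample size, coefficient values, and noise scale are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                              # n observations, p predictors

X = rng.normal(size=(n, p))                # predictors x_1, ..., x_p
beta = np.array([2.0, 0.5, -1.0, 0.0])     # true (beta_0, beta_1, ..., beta_p)
y = beta[0] + X @ beta[1:] + rng.normal(size=n)   # add epsilon_i

D = np.column_stack([np.ones(n), X])       # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(D, y, rcond=None)  # least-squares estimates
print("estimated coefficients:", beta_hat)
```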


41.2 Optimal Model Selection

  • Because the model containing all predictors always has the smallest RSS and the largest \(R^2\), RSS and \(R^2\) are not suitable for selecting the best model among a collection of models with different numbers of predictors.

  • Even so, it is not a good idea to build a model that includes all available predictors.

  • The model with the smallest number of variables that fits the data adequately is generally the best.

  • A parsimonious model excludes correlated predictors, reduces overfitting, reduces noise in the predictions, and enhances both prediction accuracy and model interpretability.

  • A parsimonious model is also less time- and resource-intensive, since fewer predictors need to be recorded.

  • Variable selection, or model selection, is therefore an important step in model development.

  • Criteria that choose an optimal model by adjusting the training error for model size are: adjusted \(R^2\), AIC, BIC, and Mallows’ \(C_p\) statistic.

  • Other approaches estimate the test error directly: validation and cross-validation.

  • These criteria are applied when carrying out model-selection procedures such as best subset selection and forward and backward stepwise selection (a minimal sketch of forward stepwise selection follows this list).

  • Other methods for developing a parsimonious model are shrinkage (regularisation) methods (Ridge, Lasso, Elastic Net) and dimension reduction (PCR, PLS).
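As a concrete example of these procedures, below is a minimal Python sketch of forward stepwise selection, scored with a Gaussian AIC of the form \(n \log(RSS/n) + 2k\) (one common variant, correct up to additive constants; the helper names fit_rss and forward_stepwise_aic are hypothetical):

```python
import numpy as np

def fit_rss(D, y):
    """Least-squares fit; return the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    r = y - D @ beta
    return float(r @ r)

def forward_stepwise_aic(X, y):
    """Greedy forward selection: at each step add the predictor that most
    reduces RSS; score each model size with a Gaussian AIC (up to constants)."""
    n, p = X.shape
    ones = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    best = ((), n * np.log(fit_rss(ones, y) / n) + 2)    # intercept-only model
    while remaining:
        # candidate predictor whose addition yields the smallest RSS
        j = min(remaining, key=lambda k: fit_rss(
            np.column_stack([ones, X[:, selected + [k]]]), y))
        selected.append(j)
        remaining.remove(j)
        rss = fit_rss(np.column_stack([ones, X[:, selected]]), y)
        aic = n * np.log(rss / n) + 2 * (len(selected) + 1)
        if aic < best[1]:
            best = (tuple(selected), aic)
    return best          # (chosen predictor indices, AIC score)
```

Called with the X and y simulated earlier, this returns the predictor set with the lowest AIC along the forward path; best subset selection would instead search all \(2^p\) predictor subsets.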

41.3 Adjusted Coefficient of Determination (Adjusted R²)


\[ \large \text{Adj. } R^2 = 1 - \frac{\text{Residual SS} / df_{\text{residual}}}{\text{Total SS} / df_{\text{total}}} \]

\[ \large \text{Adj. } R^2 = 1 - \frac{RSS / (n-p-1)}{TSS / (n-1)} \]

  • A model with a large adjusted \(R^2\) value has a small estimated test error; larger is better.
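A minimal sketch of the computation (the helper name adjusted_r2 is hypothetical); with \(p\) predictors plus an intercept, the residual degrees of freedom are \(n - p - 1\):

```python
import numpy as np

def adjusted_r2(D, y, p):
    """Adjusted R^2 for a least-squares fit; D is the design matrix
    (intercept column included), p is the number of predictors, so the
    residual degrees of freedom are n - p - 1."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    rss = float(np.sum((y - D @ beta) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))
```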


41.4 Mallows’ Cp Statistic

\[ \large C_p = \frac{1}{n} (RSS + 2p\hat\sigma^2) \]

  • The model with the smallest \(C_p\) value has the smallest estimated test error among the candidate models; smaller is better.
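A direct translation of the formula into Python (the helper name mallows_cp is hypothetical; \(\hat\sigma^2\) is typically estimated from the residuals of the full model containing all predictors):

```python
def mallows_cp(rss, sigma2_hat, n, p):
    """Mallows' C_p as defined above; sigma2_hat is an estimate of the
    error variance, usually taken from the full model's residuals."""
    return (rss + 2 * p * sigma2_hat) / n
```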


41.5 Akaike Information Criterion (AIC)

\[ \large AIC = -2\log L + 2p \]

  • For Gaussian errors fitted by least squares, AIC can be written as:

\[ \large AIC = \frac{1}{n\hat\sigma^2} (RSS + 2p\hat\sigma^2) + \text{constant} \]

  • The model with the smallest \(AIC\) value has the smallest estimated test error among the candidate models; smaller is better.
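The least-squares form above, dropping the additive constant since it does not affect comparisons between models (the helper name aic_gaussian is hypothetical):

```python
def aic_gaussian(rss, sigma2_hat, n, p):
    """Gaussian least-squares AIC, dropping the additive constant."""
    return (rss + 2 * p * sigma2_hat) / (n * sigma2_hat)
```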


41.6 Bayesian Information Criterion (BIC)

\[ \large BIC = -2\log L + \log(n)\, p \]

  • For Gaussian errors fitted by least squares, BIC can be written as:

\[ \large BIC = \frac{1}{n\hat\sigma^2} (RSS + \log(n)\, p\hat\sigma^2) + \text{constant} \]

  • The model with the smallest \(BIC\) value has the smallest estimated test error among the candidate models; smaller is better. Since \(\log(n) > 2\) for \(n \ge 8\), BIC penalises each additional predictor more heavily than AIC and tends to select smaller models.
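And the BIC analogue (the helper name bic_gaussian is hypothetical):

```python
import numpy as np

def bic_gaussian(rss, sigma2_hat, n, p):
    """Gaussian least-squares BIC, dropping the additive constant."""
    # log(n) > 2 once n >= 8, so BIC penalises each extra predictor more
    # heavily than AIC and tends to favour smaller models.
    return (rss + np.log(n) * p * sigma2_hat) / (n * sigma2_hat)
```

For a fitted model, libraries such as statsmodels also report the log-likelihood forms of these criteria directly (for example, the aic and bic attributes of an OLS results object).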