Section 24 MLR: \(R^2\)

Multiple Linear Regression: \(R^2\)


24.1 Estimates: Variances


fm <- lm(SBP ~ cBMI + cAge, data = BP)
summary(fm)



Call:
lm(formula = SBP ~ cBMI + cAge, data = BP)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6030 -2.0345  0.1196  1.9800  6.8630 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 101.97870    0.11777 865.899  < 2e-16 ***
cBMI          2.33147    0.07108  32.799  < 2e-16 ***
cAge          1.09032    0.15678   6.954 1.12e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.633 on 497 degrees of freedom
Multiple R-squared:  0.8175,    Adjusted R-squared:  0.8168 
F-statistic:  1113 on 2 and 497 DF,  p-value: < 2.2e-16


24.2 Explanation

Statistical Model

\[ \large y_{i} = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_{i} \]
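To make the model concrete, here is a minimal simulation sketch in R. The coefficient values are purely illustrative (chosen to loosely mirror the fitted estimates above); the real BP data are not reproduced here.

# Illustrative only: simulate data of the same form as the model above.
set.seed(1)
n    <- 500
cBMI <- rnorm(n)                 # centred BMI (hypothetical values)
cAge <- rnorm(n)                 # centred Age (hypothetical values)
eps  <- rnorm(n, sd = 2.6)       # error term, epsilon_i
SBP  <- 102 + 2.3 * cBMI + 1.1 * cAge + eps
sim  <- data.frame(SBP, cBMI, cAge)
coef(lm(SBP ~ cBMI + cAge, data = sim))  # estimates near 102, 2.3, 1.1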


Error Variance = Residual Mean Square

\[ \large \hat\sigma^2 = Var(\hat\epsilon) = MSE = \frac{Residual \space SS}{df_{residual}} \]

The "Residual standard error" reported by summary() is \(\hat\sigma = \sqrt{MSE}\); in the output above this is 2.633 on 497 degrees of freedom.
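A quick check in R, using the fitted object fm from Section 24.1:

# MSE = Residual SS / df_residual; its square root is the
# residual standard error reported by summary(fm).
rss <- sum(residuals(fm)^2)          # Residual SS
mse <- rss / df.residual(fm)         # estimate of sigma^2
sqrt(mse)                            # ~ 2.633, as in summary(fm)
sigma(fm)                            # same value, via the built-in extractor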


Coefficient of Determination (\(R^2\))


\[ \large R^2 = \frac{Treatment \space SS}{Total \space SS} = 1 - \frac{Residual \space SS}{Total \space SS}\]


R-squared quantifies the proportion of the total variance in the response that is explained by the explanatory variable(s) in a linear regression model. It is a measure of the predictive power of the model.

We can also compute the estimate of \(R^2\) from the ANOVA table, as shown in the sketch below.
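A minimal sketch, assuming the fitted object fm from Section 24.1: anova(fm) lists a sum of squares for each predictor plus the residual sum of squares, and these add up to the Total SS.

# R^2 from the ANOVA table: 1 - Residual SS / Total SS.
ss  <- anova(fm)[["Sum Sq"]]   # SS for cBMI, cAge, and Residuals
rss <- ss[length(ss)]          # Residual SS (last row)
tss <- sum(ss)                 # Total SS
1 - rss / tss                  # ~ 0.8175, matching Multiple R-squared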

If \(R^2\) is high (close to one), this indicates that the predictor variables explain (describe) a lot of the variation in the data, i.e. that there is a high signal-to-noise ratio.


Adjusted Coefficient of Determination (Adjusted \(R^2\))


\[ \large Adj. \space R^2 = 1 - \frac{Residual \space SS \space / \space df_{residual}}{Total \space SS \space / \space df_{total}}\]

Each time a predictor is added to a model, \(R^2\) can only increase (or stay the same), even if the new predictor has no real explanatory power. Adjusted \(R^2\) trades off the extra parameters in the model against the additional variability they explain.

Hence the adjustment offsets the tendency of \(R^2\) to increase as explanatory variables are added in multiple regression (i.e. more than one X variable), even when they have no explanatory power. See the sketch below.
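A minimal sketch computing Adjusted \(R^2\) directly from the definition above, reusing rss and tss from the earlier ANOVA-table sketch:

# Adjusted R^2 = 1 - (Residual SS / df_residual) / (Total SS / df_total).
df_res <- df.residual(fm)              # n - p - 1 = 497
df_tot <- nobs(fm) - 1                 # n - 1 = 499
1 - (rss / df_res) / (tss / df_tot)    # ~ 0.8168, matching Adjusted R-squared
summary(fm)$adj.r.squared              # same value, extracted from the fit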