Section 34 Summary of Simple Linear Regression


34.1 Estimates: Effects


```r
fm <- lm(SBP ~ BMI, data = BP)  # fit the simple linear regression
anova(fm)                       # ANOVA table (Section 34.3)
summary(fm)                     # estimates, SEs, t-tests, R^2 (Section 34.2)
```



34.2 Estimates

Statistical Model

\[ \large y_{i} = \beta_0 + \beta_1 x_{i} + \epsilon_{i} \]


\(\beta_1\) Estimate & SE

\[ \large \hat \beta_1 = \frac{\sum\limits_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum\limits_{i=1}^{n}(x_i-\bar x)^2} = \frac{S_{xy}}{S_{xx}} \]

\[ \large Var(\hat \beta_1) = \frac{\hat\sigma^2}{\sum\limits_{i=1}^{n}(x_i-\bar x)^2} = \frac{\hat\sigma^2}{S_{xx}} \]

\[ \large SE(\hat \beta_1) = \sqrt {Var(\hat \beta_1)} \]

95% Confidence Interval

\[ \large CI_{0.95}(\hat \beta_1) = \left[ \hat\beta_1 \pm t_{0.025, df_{residual}} * SE(\hat\beta_1) \right]\]
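A minimal R sketch of these formulas, assuming the `BP` data frame (columns `SBP` and `BMI`) and the fitted object `fm` from Section 34.1 are available; the manual slope, SE, and CI should agree with `summary(fm)` and `confint(fm)`.

```r
x <- BP$BMI
y <- BP$SBP
n <- length(y)

Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))

b1     <- Sxy / Sxx                   # slope estimate
sigma2 <- sum(resid(fm)^2) / (n - 2)  # residual mean square (Section 34.3)
se_b1  <- sqrt(sigma2 / Sxx)          # SE of the slope

# 95% CI: estimate +/- t quantile times SE, with residual df = n - 2
b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1
confint(fm)["BMI", ]                  # should agree
```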


\(\beta_0\) Estimate & SE

\[ \large \hat \beta_0 = \bar y - \hat \beta_1 \bar x \]

\[ \large Var(\hat \beta_0) = \hat\sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum\limits_{i=1}^{n}(x_i-\bar x)^2} \right] = \hat\sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{S_{xx}} \right] \]

\[ \large SE(\hat \beta_0) = \sqrt {Var(\hat \beta_0)} \]

95% Confidence Interval

\[ \large CI_{0.95}(\hat \beta_0) = \left[ \hat\beta_0 \pm t_{0.025, df_{residual}} * SE(\hat\beta_0) \right]\]
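The same check for the intercept, continuing with the objects defined in the sketch above:

```r
b0    <- mean(y) - b1 * mean(x)                   # intercept estimate
se_b0 <- sqrt(sigma2 * (1 / n + mean(x)^2 / Sxx))

b0 + c(-1, 1) * qt(0.975, df = n - 2) * se_b0     # 95% CI
confint(fm)["(Intercept)", ]                      # should agree
```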


Here,

\[ \large \bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n} x_{i} \]

\[ \large \bar{y} = \frac{1}{n}\sum\limits_{i=1}^{n} y_{i} \]

\[ \large \hat\sigma^2 = \frac{1}{n-2}\sum\limits_{i=1}^{n} \hat\epsilon_{i}^{2} \]

i.e. the residual mean square (see Section 34.3), with \(\hat\epsilon_i = y_i - \hat\beta_0 - \hat\beta_1 x_i\).
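In R the residual variance can be recovered either from the residuals directly or from `sigma(fm)`; a quick check, assuming `fm` from Section 34.1:

```r
c(manual  = sum(resid(fm)^2) / df.residual(fm),  # sum of squared residuals / (n - 2)
  from_lm = sigma(fm)^2)                         # residual standard error, squared
```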


34.3 ANOVA

Degrees of freedom (df)

\(\large n\) = Total number of observations

Regression df = BMI df = \(\large 1\)

Residual df = Total df \(-\) Regression df = \(\large n - 1 - 1 = n - 2\)

Total df = Regression df + Residual df = \(\large n - 1\)
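A quick check of these degrees of freedom against the fitted object `fm` from Section 34.1:

```r
df.residual(fm)    # residual df: n - 2
anova(fm)[["Df"]]  # regression df (1) and residual df (n - 2)
```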


Total Sum of Squares (TSS)

\[ \large TSS = \sum\limits_{i=1}^{n} (y_i-\bar y)^2 = S_{yy}\]


Sum of Squares due to Regression (SSb)

\[ \large SSb = \hat\beta_1\sum\limits_{i=1}^{n} (x_i-\bar x)(y_i-\bar y) = \hat\beta_1S_{xy}\]


Residual Sum of Squares (RSS)

\[ \large RSS = TSS - SSb = S_{yy} - \hat\beta_1S_{xy} \]
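The three sums of squares can be verified against the `anova(fm)` table, continuing with the objects from the sketches above:

```r
Syy <- sum((y - mean(y))^2)  # total SS
SSb <- b1 * Sxy              # regression SS
RSS <- Syy - SSb             # residual SS

anova(fm)  # the "BMI" row gives SSb, the "Residuals" row gives RSS
```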



Mean Squares

Mean square = Sum of squares / degrees of freedom

\(\large MS = SS / df\)


F-value (Variance Ratio)

F value = Regression MS / Residual MS


Pr(>F)

P-value: the probability of obtaining a variance ratio at least this large under the null hypothesis that the slope coefficient equals zero.

Under the null hypothesis the variance ratio has an F distribution with \(1\) and \(n-2\) degrees of freedom.
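A sketch of the variance ratio and its p-value, continuing with the sums of squares computed above:

```r
MS_reg <- SSb / 1        # regression mean square (df = 1)
MS_res <- RSS / (n - 2)  # residual mean square

F_val <- MS_reg / MS_res
pf(F_val, df1 = 1, df2 = n - 2, lower.tail = FALSE)  # Pr(>F) reported by anova(fm)
```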


Error Variance = Residual Mean Square

\(\large \hat\sigma^2 = Residual \space MS\)

Coefficient of Determination (R2)


\[ \large R^2 = \frac{Regression \space SS}{Total \space SS} = 1 - \frac{Residual \space SS}{Total \space SS}\]


Adjusted Coefficient of Determination (Adjusted R2)


\[ \large Adj.R^2 = 1 - \frac{Residual \space SS \space / \space df_{residual}}{Total \space SS \space / \space df_{total}}\]
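Both quantities follow directly from the sums of squares already computed; the results should agree with `summary(fm)`:

```r
R2     <- 1 - RSS / Syy
adj_R2 <- 1 - (RSS / (n - 2)) / (Syy / (n - 1))

c(R2, adj_R2)
c(summary(fm)$r.squared, summary(fm)$adj.r.squared)  # should agree
```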


34.4 Prediction

Point Prediction

\[ \large \widehat{E}(y|X=x^*) = \hat y^* = \hat \beta_0 + \hat \beta_1 x^* \]


Confidence Intervals for the Population Regression Line

\[ \large Var(\hat y^*) = \hat\sigma^2 \left[ \frac{1}{n} + \frac{(x^* -\bar x)^2}{S_{xx}} \right] \]

\[ \large SE(\hat y^*) = \sqrt {Var(\hat y^*)} \]

\[ \large CI_{0.95}(\hat y^*) = \left[ \hat y^* \pm t_{0.025, df_{residual}} * SE(\hat y^*) \right]\]
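In R, `predict()` with `interval = "confidence"` returns this interval. A sketch at a hypothetical new value `x_star = 25` (an illustrative BMI, not from the source), continuing with the objects defined above:

```r
x_star <- 25  # hypothetical new BMI value, for illustration only
y_hat  <- b0 + b1 * x_star

se_fit <- sqrt(sigma2 * (1 / n + (x_star - mean(x))^2 / Sxx))
y_hat + c(-1, 1) * qt(0.975, df = n - 2) * se_fit

predict(fm, newdata = data.frame(BMI = x_star), interval = "confidence")
```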


Prediction Intervals for \(y^*\), i.e. the Actual Value of \(y\)

The variability in the error for predicting a single value of \(y\) exceeds the variability for estimating its expected value, because a new observation carries its own random error \(\epsilon\).

\[ \large Var(y^* - \hat y^*) = \hat\sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x^* -\bar x)^2}{S_{xx}} \right] \]

\[ \large SE_{pred}(y^*) = \sqrt {Var(y^* - \hat y^*)} \]

\[ \large PI_{0.95}(y^*) = \left[ \hat y^* \pm t_{0.025, df_{residual}} * SE_{pred}(y^*) \right]\]
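The prediction interval differs only by the extra `1` in the variance term; `interval = "prediction"` returns it directly, continuing the sketch above:

```r
se_pred <- sqrt(sigma2 * (1 + 1 / n + (x_star - mean(x))^2 / Sxx))
y_hat + c(-1, 1) * qt(0.975, df = n - 2) * se_pred

predict(fm, newdata = data.frame(BMI = x_star), interval = "prediction")
```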


34.5 Some Technical Bits

  • The estimate of residual variance \(\hat\sigma^2\) is an unbiased estimator of the unknown \(\sigma^2\).

  • Under the standard assumptions on the residuals of the linear model (independent, mean zero, constant variance \(\sigma^2\)):

\[ \large E(\hat\beta_1|X=x) = \beta_1 \]

\[ \large Var(\hat\beta_1|X=x) = \frac{\sigma^2}{S_{xx}} \]