Gerrit Gort
We will use the ChickWeight data:
summary(ChickWeight)
##      weight           Time           Chick     Diet
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120
##  Median :103.0   Median :10.00   20     : 12   3:120
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12
##  Max.   :373.0   Max.   :21.00   19     : 12
##                                  (Other):506
Plot weight versus time with a color for each diet (using package ggplot2):
ggplot(data = ChickWeight, aes(x = Time, y = weight, colour = Diet, shape = Diet)) +
  geom_point() + theme(legend.position = c(0.1, 0.8))
In R, we fit a linear model using the function lm. Arguments of lm are:
args(lm)
## function (formula, data, subset, weights, na.action, method = "qr",
## model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
## contrasts = NULL, offset, ...)
## NULL
Remember: type ?functionName for help and examples.
From the scatterplot it was clear that there is a relationship between Time and Chick weight. We may wonder whether the relationship is linear in time, and whether all further needed assumptions, like independent errors, are fulfilled, but as a starter we just fit the simple linear regression model to ChickWeight:
\(weight_{ij} = \beta_0 + \beta_1\cdot Time_i + \epsilon_{ij}\)
(lmW1 <- lm(weight ~ Time, ChickWeight))
##
## Call:
## lm(formula = weight ~ Time, data = ChickWeight)
##
## Coefficients:
## (Intercept) Time
## 27.467 8.803
Remember classes and objects; the R language is object-oriented:
lmW1 is an object with two attributes (names and class). Its class is lm:
attributes(lmW1)
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
class(lmW1)
## [1] "lm"
An object of class lm is a list containing at least the components mentioned under the names attribute. These list elements can be extracted in the usual way or using special functions.
Access the first element of lmW1 (by order):
lmW1[[1]]
## (Intercept)        Time
##   27.467425    8.803039
We can also access the coefficients list element by name:
lmW1$coefficients
## (Intercept)        Time
##   27.467425    8.803039
A generic function is a function defined for different classes:
head(methods(summary))
## [1] "summary.aov" "summary.aovlist"
## [3] "summary.aspell" "summary.check_packages_in_dir"
## [5] "summary.connection" "summary.corAR1"
args(summary.lm)
## function (object, correlation = FALSE, symbolic.cor = FALSE,
## ...)
## NULL
Generic functions that can be applied to objects of class lm:
| Function | Use |
|---|---|
| summary | returns summary information about regression |
| plot | makes diagnostic plots |
| coef | returns the coefficients |
| residuals | returns the residuals |
| fitted | returns fitted values |
| deviance | returns residual sum of squares |
| predict | performs predictions |
| anova | finds various sums of squares, produces anova table |
| AIC | used for model selection |
Time for trying them
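For instance, a quick sketch applying a few of the generics from the table to lmW1 (the model fitted above is refitted here so the chunk is self-contained):

```r
# Refit the model from earlier so this chunk runs on its own:
lmW1 <- lm(weight ~ Time, ChickWeight)

coef(lmW1)             # the estimated coefficients
deviance(lmW1)         # residual sum of squares
AIC(lmW1)              # Akaike information criterion, used for model selection
head(residuals(lmW1))  # first few residuals
head(fitted(lmW1))     # first few fitted values
```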
The generic function summary summarizes an object of class lm:
summary(lmW1)
##
## Call:
## lm(formula = weight ~ Time, data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138.331 -14.536 0.926 13.533 160.669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.4674 3.0365 9.046 <2e-16 ***
## Time 8.8030 0.2397 36.725 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.91 on 576 degrees of freedom
## Multiple R-squared: 0.7007, Adjusted R-squared: 0.7002
## F-statistic: 1349 on 1 and 576 DF, p-value: < 2.2e-16
Using basic plotting functions
Package ggplot2 also has facilities for this: we can use the function stat_smooth to add a regression line to the plot. We can also draw the line using the coefficients stored in the lmW1 object.
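A sketch of these approaches (assuming ggplot2 is available; a base-graphics version using abline is shown first):

```r
library(ggplot2)
lmW1 <- lm(weight ~ Time, ChickWeight)  # as fitted above

# Base graphics: scatterplot with the fitted line added by abline():
plot(weight ~ Time, data = ChickWeight)
abline(lmW1)

# ggplot2: let stat_smooth fit and draw the least-squares line:
ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

# Or draw the line from the coefficients stored in lmW1:
ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_point() +
  geom_abline(intercept = coef(lmW1)[1], slope = coef(lmW1)[2])
```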
The coefficients table from summary contains the t-tests of \(H_0: \beta_i = 0\):
(betas <- summary(lmW1)$coefficients)
##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 27.467425   3.036464  9.045859  2.261555e-18
## Time         8.803039   0.239700 36.725235 5.020771e-153
Reproduce the t-test for \(\beta_1\):
b1 <- betas[2,1]
seb1 <- betas[2,2]
tb1 <- b1/seb1
(pval <- 2*pt(abs(tb1), lmW1$df.residual, lower.tail = FALSE))
## [1] 5.020771e-153
The generic function anova produces the analysis of variance table:
(aTab <- anova(lmW1))
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## Time 1 2042344 2042344 1348.7 < 2.2e-16 ***
## Residuals 576 872212 1514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reproduce the F-test, here testing \(H_0: \beta_1=0\):
SSreg <- aTab[1,2]
SSres <- aTab[2,2]
dfreg <- aTab[1,1]
dfres <- aTab[2,1]
F <- (SSreg/dfreg)/(SSres/dfres)
pf(F, dfreg, dfres, lower.tail = FALSE)
## [1] 5.020771e-153
The confidence interval of a parameter is the interval of values that are in agreement with the data. The confidence level \(1-\alpha\) specifies the strength of this agreement. The interval often has the form:
\(b_i \;\pm\; s.e.(b_i) \cdot t_{df}(1-\alpha/2)\)
Use the function confint:
confint(lmW1, level = 0.95)
##                 2.5 %    97.5 %
## (Intercept) 21.503534 33.431317
## Time         8.332247  9.273832
Reproduce them:
lmW1s <- summary(lmW1)
beta <- lmW1s$coefficients[,1]
se <- lmW1s$coefficients[,2]
cbind(beta - qt(0.975, lmW1$df.residual)*se, beta + qt(0.975, lmW1$df.residual)*se)
##                  [,1]      [,2]
## (Intercept) 21.503534 33.431317
## Time         8.332247  9.273832
Generic function fitted returns the fitted values for weight:
pred <- fitted(lmW1)
pred[1:10]
##         1         2         3         4         5         6         7
##  27.46743  45.07350  62.67958  80.28566  97.89174 115.49782 133.10390
##         8         9        10
## 150.70997 168.31605 185.92213
To predict new observations with the function predict, we first need to make a data.frame containing the new data. Then we can predict weight for the new observations:
newDat <- data.frame(Time = 2:4)
(pred <- predict(lmW1, newDat, se = TRUE))
## $fit
## 1 2 3
## 45.07350 53.87654 62.67958
##
## $se.fit
## 1 2 3
## 2.643233 2.458116 2.283177
##
## $df
## [1] 576
##
## $residual.scale
## [1] 38.91346
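The pieces returned by predict can be combined into a confidence interval for the mean weight at the new time points; predict also computes this directly via the interval argument. A quick sketch:

```r
lmW1 <- lm(weight ~ Time, ChickWeight)  # as fitted above
newDat <- data.frame(Time = 2:4)
pred <- predict(lmW1, newDat, se = TRUE)

# 95% confidence interval for the mean response: fit +/- t * se.fit
tcrit <- qt(0.975, pred$df)
cbind(lwr = pred$fit - tcrit * pred$se.fit,
      upr = pred$fit + tcrit * pred$se.fit)

# predict() gives the same interval directly:
predict(lmW1, newDat, interval = "confidence")
```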
Assumptions in regression / linear models are: independence of the errors, normally distributed errors, constant error variance, and linearity of the relationship.
Residuals are used to check the assumptions, often graphically:
df <- data.frame(res = resid(lmW1), fitted = fitted(lmW1), Chick = ChickWeight$Chick)
ggplot(data = df, aes(x = fitted, y = res, colour = Chick)) +
  geom_point() + geom_hline(aes(yintercept = 0)) + theme(legend.position = 'none')
Make a QQ (quantile-quantile) plot of the residuals to check normality, adding a line through quartiles 1 and 3:
qqnorm(df$res)
qqline(df$res)
Or with a function from ggplot2:
ggplot(data = df, aes(sample = res, colour = Chick)) + geom_point(stat = 'qq')
Standard diagnostic plots for objects of class lm using the generic function plot:
oldPar <- par(mfrow = c(2,2), mar = c(3,3,1.5,1), cex = 0.9, mgp = c(1.5,0.3,0), tck = -0.015, cex.main = 1)
plot(lmW1)
par(oldPar)
The variance of the residuals is \(Var(residual) = \sigma^2[\mathbf{I} - \mathbf{H}]\), where \(\mathbf{H}\) is the hat matrix.
Obtain the leverages \(h_{ii}\) (the diagonal elements of \(\mathbf{H}\)) by lm.influence(lmobject)$hat or by hatvalues(lmobject):
require(MASS)
lev <- lm.influence(lmW1)$hat
head(lev)
##           1           2           3           4           5           6
## 0.006088868 0.004613933 0.003442545 0.002574704 0.002010410 0.001749664
The variances of the residuals are thus not constant. To make residuals comparable, it may be better to standardize them:
\(e_i' = \frac{e_i}{s\sqrt{1-h_{ii}}}\)
In R: stdres(lmobject) (from the MASS package)
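A quick sketch verifying the formula against MASS::stdres, with hatvalues supplying the \(h_{ii}\):

```r
library(MASS)
lmW1 <- lm(weight ~ Time, ChickWeight)  # as fitted above

sres <- stdres(lmW1)  # standardized residuals from MASS
# Reproduce them from the formula e_i / (s * sqrt(1 - h_ii)):
e <- resid(lmW1)
h <- hatvalues(lmW1)
s <- summary(lmW1)$sigma
manual <- e / (s * sqrt(1 - h))
all.equal(unname(sres), unname(manual))  # should be TRUE
```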
Residuals of observations with larger leverage have smaller variance. It may be insightful to calculate the (standardized) residual of an observation without using the observation itself: jackknifed residuals or deleted residuals. In R these are called studentized residuals:
\(e_i^* = \frac{y_i - \hat{y}_{-i}}{\sqrt{var\left(y_i - \hat{y}_{-i}\right)}}\)
In R: studres(lmobject) (from the MASS package)
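A short sketch; for lm objects, MASS::studres agrees with base R's rstudent:

```r
library(MASS)
lmW1 <- lm(weight ~ Time, ChickWeight)  # as fitted above

stres <- studres(lmW1)  # studentized (jackknifed) residuals
head(stres)
# Base R computes the same externally studentized residuals:
all.equal(unname(stres), unname(rstudent(lmW1)))  # should be TRUE
```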
If not specified, the intercept is included in the fitted model by default. The intercept can be left out of the model in different ways:
lmW21 <- lm(weight ~ 0 + Time, ChickWeight)
lmW22 <- lm(weight ~ -1 + Time, ChickWeight)
coef(lmW21)
##    Time
## 10.6376
The linear model in matrix notation is
\(\mathbf{y} = \mathbf{X}\beta + \mathbf{e}\)
\(\mathbf{e} \sim \mathbf{N}(0,\sigma^2_e \mathbf{I})\)
The matrix X is called the design matrix. Here it has 2 columns: a column of ones and column Time. The least squares solution is:
\(\mathbf{b} = \mathbf{(X^TX)^{-1}X^Ty}\)
This is easily programmed in R:
X <- model.matrix(~Time,ChickWeight)
y <- ChickWeight$weight
b <- solve(crossprod(X),crossprod(X,y))
b
##                  [,1]
## (Intercept) 27.467425
## Time         8.803039
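As a sanity check (a small sketch), the matrix solution agrees with the estimates from lm up to numerical precision:

```r
lmW1 <- lm(weight ~ Time, ChickWeight)       # as fitted earlier
X <- model.matrix(~ Time, ChickWeight)       # design matrix: intercept + Time
y <- ChickWeight$weight
b <- solve(crossprod(X), crossprod(X, y))    # (X'X)^{-1} X'y
all.equal(as.numeric(b), unname(coef(lmW1))) # should be TRUE
```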