Linear Regression

Gerrit Gort

Data

We will use the ChickWeight data:

summary(ChickWeight)
##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

Plotting the data

Plot weight versus time with a color for each diet (using package ggplot2):

ggplot(data = ChickWeight,aes(x = Time,y = weight,colour = Diet,shape = Diet)) +
   geom_point() + theme(legend.position = c(0.1,0.8))

Plot of ChickWeight data

Linear models

In R, we fit a linear model using the function lm. Arguments of lm are:

args(lm)
## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## NULL

Remember: type ?functionName for help and examples.

Simple linear regression

From the scatterplot it was clear that there is a relationship between Time and Chick weight. We may wonder whether the relationship is linear in time, and whether all further needed assumptions, like independent errors, are fulfilled, but as a starter we just fit the simple linear regression model to ChickWeight:

\(weight_{ij} = \beta_0 + \beta_1\cdot Time_i + \epsilon_{ij}\)

(lmW1 <- lm(weight~Time,ChickWeight))
## 
## Call:
## lm(formula = weight ~ Time, data = ChickWeight)
## 
## Coefficients:
## (Intercept)         Time  
##      27.467        8.803

The lm class

Remember classes and objects; the R language is object-oriented:

lmW1 is an object with two attributes (names and class). Its class is lm:

attributes(lmW1)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
class(lmW1)
## [1] "lm"

Accessing elements of an lm object

An object of class lm is a list containing at least the components mentioned under the names attribute. These list elements can be extracted in the usual way or using special functions.

Access the first element of lmW1 (by order):

lmW1[[1]]
## (Intercept)        Time 
##   27.467425    8.803039

We can also access the coefficients list element by name:

lmW1$coefficients
## (Intercept)        Time 
##   27.467425    8.803039
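The generic accessor function coef extracts the same element and is usually preferred over indexing the list directly:

```r
# coef() extracts the estimated coefficients from a fitted model
coef(lmW1)
## (Intercept)        Time 
##   27.467425    8.803039
```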

Generic functions

A generic function is a function defined for different classes:

head(methods(summary))
## [1] "summary.aov"                   "summary.aovlist"              
## [3] "summary.aspell"                "summary.check_packages_in_dir"
## [5] "summary.connection"            "summary.corAR1"
args(summary.lm)
## function (object, correlation = FALSE, symbolic.cor = FALSE, 
##     ...) 
## NULL

Generic functions that can be applied to objects of class lm:

Function    Use
summary     returns summary information about the regression
plot        makes diagnostic plots
coef        returns the coefficients
residuals   returns the residuals
fitted      returns the fitted values
deviance    returns the residual sum of squares
predict     performs predictions
anova       computes various sums of squares, produces an ANOVA table
AIC         used for model selection
Time for trying them
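For example (the value of deviance should equal the residual sum of squares in the ANOVA table shown later):

```r
# Apply some of the generic functions to the fitted model
coef(lmW1)      # estimated coefficients
deviance(lmW1)  # residual sum of squares
AIC(lmW1)       # Akaike information criterion, used for model selection
```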

Exploring the fitted model

The generic function summary summarizes an object of class lm:

summary(lmW1)
## 
## Call:
## lm(formula = weight ~ Time, data = ChickWeight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -138.331  -14.536    0.926   13.533  160.669 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.4674     3.0365   9.046   <2e-16 ***
## Time          8.8030     0.2397  36.725   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.91 on 576 degrees of freedom
## Multiple R-squared:  0.7007, Adjusted R-squared:  0.7002 
## F-statistic:  1349 on 1 and 576 DF,  p-value: < 2.2e-16

Plotting the regression line

Using basic plotting functions

ChickWeight data with regression line
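The code for this figure was not retained in these notes; a minimal base-graphics sketch is:

```r
# Scatterplot of the data with the fitted regression line
plot(weight ~ Time, data = ChickWeight)
abline(lmW1, col = "red")  # abline() picks up intercept and slope from lmW1
```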

Package ggplot2 also provides facilities for this. We can use the function stat_smooth to add a regression line to the plot:

ChickWeight data with regression line using ggplot
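The ggplot2 code was likewise not retained; a sketch using stat_smooth with method = "lm":

```r
library(ggplot2)
# stat_smooth(method = "lm") fits a linear model and draws the fitted line
ggplot(data = ChickWeight, aes(x = Time, y = weight)) +
   geom_point() + stat_smooth(method = "lm", se = FALSE)
```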

And we can also use the coefficients from the lmW1 object to plot:

ChickWeight data with regression line using ggplot
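A sketch that draws the line directly from the estimated coefficients, using geom_abline:

```r
library(ggplot2)
# Take intercept and slope from the fitted lm object
ggplot(data = ChickWeight, aes(x = Time, y = weight)) +
   geom_point() +
   geom_abline(intercept = coef(lmW1)[1], slope = coef(lmW1)[2], colour = "red")
```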

t-tests for coefficients

\(H_0: \beta_i = 0\):

(betas <- summary(lmW1)$coefficients)
##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 27.467425   3.036464  9.045859  2.261555e-18
## Time         8.803039   0.239700 36.725235 5.020771e-153

Reproduce the t-test for \(\beta_1\):

b1 <- betas[2,1]
seb1 <- betas[2,2]
tb1 <- b1/seb1
(pval <- 2*pt(abs(tb1),lmW1$df.residual,lower.tail=FALSE))
## [1] 5.020771e-153

ANOVA table

(aTab <- anova(lmW1))
## Analysis of Variance Table
## 
## Response: weight
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## Time        1 2042344 2042344  1348.7 < 2.2e-16 ***
## Residuals 576  872212    1514                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Reproduce the F-test, here testing \(H_0: \beta_1=0\):

SSreg  <- aTab[1,2]
SSres <- aTab[2,2]
dfreg  <- aTab[1,1]
dfres <- aTab[2,1]
F <- (SSreg/dfreg)/(SSres/dfres)
pf(F,dfreg, dfres, lower.tail = FALSE)
## [1] 5.020771e-153

Confidence Interval

The confidence interval of a parameter is the interval of values that are in agreement with the data. The confidence level \(1-\alpha\) specifies the strength of this agreement. The interval often has the form:

\(b_i \;\pm\; t_{df}(1-\alpha/2)\cdot s.e.(b_i)\)

Use the function confint:

confint(lmW1,level = 0.95)
##                 2.5 %    97.5 %
## (Intercept) 21.503534 33.431317
## Time         8.332247  9.273832

Reproduce them:

lmW1s <- summary(lmW1)
beta <- lmW1s$coefficients[,1]
se   <- lmW1s$coefficients[,2]
cbind(beta - qt(0.975,lmW1$df.residual)*se, beta + qt(0.975,lmW1$df.residual)*se)
##                  [,1]      [,2]
## (Intercept) 21.503534 33.431317
## Time         8.332247  9.273832

Prediction

Generic function fitted returns the fitted values for weight:

pred <- fitted(lmW1)
pred[1:10]
##         1         2         3         4         5         6         7 
##  27.46743  45.07350  62.67958  80.28566  97.89174 115.49782 133.10390 
##         8         9        10 
## 150.70997 168.31605 185.92213

To predict new observations with the function predict, we first need to make a data.frame of the new data. Then we can predict weight for the new observations:

newDat <- data.frame(Time = 2:4)
(pred <- predict(lmW1,newDat,se = T))
## $fit
##        1        2        3 
## 45.07350 53.87654 62.67958 
## 
## $se.fit
##        1        2        3 
## 2.643233 2.458116 2.283177 
## 
## $df
## [1] 576
## 
## $residual.scale
## [1] 38.91346
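predict can also return confidence or prediction intervals for the new observations via its interval argument:

```r
# 95% prediction intervals (columns fit, lwr, upr)
predict(lmW1, newDat, interval = "prediction")
```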

Residuals and diagnostic plots

Assumptions in regression / linear models are: linearity of the mean, independence of the errors, constant error variance, and normality of the errors.

Residuals are used to check the assumptions, often graphically:

df <- data.frame(res = resid(lmW1),fitted = fitted(lmW1),Chick = ChickWeight$Chick)
ggplot(data = df,aes(x= fitted,y = res,colour = Chick)) +
   geom_point() + geom_hline(aes(yintercept = 0)) + theme(legend.position = 'none')

Residual plot: residuals vs fitted values

Checking normality

Make a QQ (quantile-quantile) plot of the residuals to check normality, adding a line through quartiles 1 and 3:

qqnorm(df$res)
qqline(df$res)

Basic QQ-plot

Or with a function from ggplot2:

ggplot(data = df,aes(sample = res)) + geom_point(stat = 'qq')

QQ-plot using ggplot

The standard diagnostic plots

Standard diagnosis plots for objects of class lm using generic function plot:

oldPar <- par(mfrow = c(2,2),mar=c(3,3,1.5,1), cex=0.9,  mgp=c(1.5,0.3,0), tck=-0.015, cex.main=1)
plot(lmW1)

Diagnostic panel

par(oldPar)

Leverage

\(Var(residual) = \sigma^2[\mathbf{I} - \mathbf{H}]\)

where \(\mathbf{H} = \mathbf{X(X^TX)^{-1}X^T}\) is the hat matrix. Its diagonal elements \(h_{ii}\) are the leverages.

Obtain the leverages \(h_{ii}\) by lm.influence(lmobject)$hat or by hatvalues(lmobject)

require(MASS)
lev <- lm.influence(lmW1)$hat
head(lev)
##           1           2           3           4           5           6 
## 0.006088868 0.004613933 0.003442545 0.002574704 0.002010410 0.001749664
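The function hatvalues gives the same values:

```r
# hatvalues() is a generic that returns the same leverages
all.equal(hatvalues(lmW1), lm.influence(lmW1)$hat)
## [1] TRUE
```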

Studentized residuals and leverages

Variances of the residuals are not constant. To make residuals better comparable, it may be better to standardize them:

\(e_i' = \frac{e_i}{s\sqrt{1-h_{ii}}}\)

In R: stdres(lmobject) (from the MASS package)

Residuals of observations with larger leverage have smaller variance. It may be insightful to calculate the (standardized) residual of an observation without using the observation itself: jackknifed residuals or deleted residuals. In R these are called studentized residuals:

\(e_i^* = \frac{y_i - \hat{y}_{-i}}{\sqrt{var\left(y_i - \hat{y}_{-i}\right)}}\)

In R: studres(lmobject) (from the MASS package)
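Both kinds of residuals can be computed directly from the fitted model (MASS is a recommended package that ships with R):

```r
library(MASS)
head(stdres(lmW1))   # standardized residuals
head(studres(lmW1))  # studentized (jackknifed) residuals
```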

More diagnostic plots

With or without intercept

If not specified, the intercept is included in a fitted model by default. The intercept can be left out of the model in different ways:

lmW21 <- lm(weight ~ 0 + Time, ChickWeight)
lmW22 <- lm(weight ~ -1 + Time, ChickWeight)
coef(lmW21)
##    Time 
## 10.6376
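Both formulations specify the same model, so the estimated slopes are identical:

```r
# '0 + Time' and '-1 + Time' both remove the intercept
all.equal(coef(lmW21), coef(lmW22))
## [1] TRUE
```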

Linear model in matrix notation

The linear model in matrix notation is

\(\mathbf{y} = \mathbf{X}\beta + \mathbf{e}\)

\(\mathbf{e} \sim \mathbf{N}(0,\sigma^2_e \mathbf{I})\)

The matrix \(\mathbf{X}\) is called the design matrix. Here it has 2 columns: a column of ones and a column containing Time. The least squares solution is:

\(\mathbf{b} = \mathbf{(X^TX)^{-1}X^Ty}\)

This is easily programmed in R:

X <- model.matrix(~Time,ChickWeight)
y <- ChickWeight$weight
b <- solve(crossprod(X),crossprod(X,y))
b
##                  [,1]
## (Intercept) 27.467425
## Time         8.803039
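The hand-computed solution agrees with the coefficients returned by lm:

```r
# Compare the normal-equations solution with lm's QR-based solution
all.equal(as.vector(b), unname(coef(lmW1)))
## [1] TRUE
```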

End