27  Simple Linear Regression - Examples

Author

Brian Powers

Published

March 21, 2024

27.1 Creating Linear Data

We can simulate data from a linear model \[Y_i = 5 + 5X+\epsilon\] where \(\epsilon \overset{iid}{\sim}N(\mu,\sigma^2)\).

#There is no assumption that X follows any particular distribution
X <- rpois(30, lambda=3)
e <- rnorm(30, mean=0, sd=4)

Y <- 5 + 5*X + e

plot(X, Y)

27.2 Atlanta Lead Example

lead_data <- read.csv("data/lead.csv")
atlanta_lead <- subset(lead_data, city=="Atlanta")
summary(atlanta_lead)
     city           air.pb.metric.tons aggr.assault.per.million
 Length:36          Min.   : 421.0     Min.   : 431.0          
 Class :character   1st Qu.: 603.8     1st Qu.: 924.5          
 Mode  :character   Median : 813.5     Median :1295.5          
                    Mean   : 919.9     Mean   :1399.2          
                    3rd Qu.:1247.8     3rd Qu.:1850.2          
                    Max.   :1516.0     Max.   :2301.0          
# if we call lm without storing it to a object
lm(aggr.assault.per.million ~ 1 + air.pb.metric.tons, data=atlanta_lead)

Call:
lm(formula = aggr.assault.per.million ~ 1 + air.pb.metric.tons, 
    data = atlanta_lead)

Coefficients:
       (Intercept)  air.pb.metric.tons  
           107.943               1.404  
atlanta_lead_lm <- lm(aggr.assault.per.million ~ 1 + air.pb.metric.tons, data=atlanta_lead)
summary(atlanta_lead_lm)

Call:
lm(formula = aggr.assault.per.million ~ 1 + air.pb.metric.tons, 
    data = atlanta_lead)

Residuals:
    Min      1Q  Median      3Q     Max 
-356.36  -84.55    6.89  122.93  382.88 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        107.94276   80.46409   1.342    0.189    
air.pb.metric.tons   1.40375    0.08112  17.305   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 180.6 on 34 degrees of freedom
Multiple R-squared:  0.898, Adjusted R-squared:  0.895 
F-statistic: 299.4 on 1 and 34 DF,  p-value: < 2.2e-16

Scatter Plot using Base R

plot(aggr.assault.per.million ~ air.pb.metric.tons, data=atlanta_lead)
abline(atlanta_lead_lm, col="red")

points(y=fitted(atlanta_lead_lm), x=atlanta_lead$air.pb.metric.tons, pch=16 )

atlanta_lead_lm$residuals
          1           2           3           4           5           6 
 330.080025  226.850054  155.793859 -216.454844  256.930171  102.756395 
          7           8           9          10          11          12 
-356.355996 -149.487118  382.879164 -273.899218 -268.626594 -280.180195 
         13          14          15          16          17          18 
 175.359863 -294.175006   23.787530  109.576291  -97.442440  -80.249933 
         19          20          21          22          23          24 
  -8.530910   45.337967  -43.830619  116.062179  -55.317646  -70.964906 
         25          26          27          28          29          30 
-129.344731   14.387834  -58.799484  143.529335  148.229626    9.807148 
         31          32          33          34          35          36 
 199.737410  -64.487372   16.648940   14.188998  -27.773538    3.977759 
# or resid(atlanta_lead_lm)
# or residuals(atlanta_lead_lm)

A proper residual plot

plot(y=atlanta_lead_lm$residuals,
     x=fitted(atlanta_lead_lm))

The wrong way to make a residual plot

plot(y=atlanta_lead_lm$residuals,
     x=atlanta_lead$aggr.assault.per.million)

Checking for auto-correlation among residuals

This is when the residual values tend to be associated with the ‘next’ residual.

#we look at the correlation between residuals and the 
# next residual
plot(x=resid(atlanta_lead_lm)[-length(resid(atlanta_lead_lm))], y= resid(atlanta_lead_lm)[-1])

#we're looking for independence between the residual values - no shape 

QQ plot

n <- nrow(atlanta_lead)

# Let's look at what 36 equally spaced quantiles would be
(1:36)/(37)
 [1] 0.02702703 0.05405405 0.08108108 0.10810811 0.13513514 0.16216216
 [7] 0.18918919 0.21621622 0.24324324 0.27027027 0.29729730 0.32432432
[13] 0.35135135 0.37837838 0.40540541 0.43243243 0.45945946 0.48648649
[19] 0.51351351 0.54054054 0.56756757 0.59459459 0.62162162 0.64864865
[25] 0.67567568 0.70270270 0.72972973 0.75675676 0.78378378 0.81081081
[31] 0.83783784 0.86486486 0.89189189 0.91891892 0.94594595 0.97297297
qs <- qnorm((1:36)/(37))
stripchart(qs)

The 4 pre-fab plots of the linear model object

plot(atlanta_lead_lm)

#You can run this to get just the first two plots
plot(atlanta_lead_lm, which=1:2)

27.2.1 looking at qqplots

Let’s just explore the type of variation we see with normal QQ plots for this sample size so we know what tolerable deviation from the diagonal looks like

par(mar=c(0,0,0,0))
par(mfrow=c(3,3))
qqnorm(rnorm(n))
qqnorm(rnorm(n))
qqnorm(rnorm(n))
qqnorm(rnorm(n))
qqnorm(rnorm(n))
qqnorm(rnorm(n))
qqnorm(rnorm(n))
qqnorm(rnorm(n))
qqnorm(rnorm(n))

par(mfrow=c(1,1))

#also the qqPlot function in the car library adds a 95% tolerance band
library(car)
Loading required package: carData
qqPlot(residuals(atlanta_lead_lm))

[1] 9 7

27.3 Inference on the coefficients

summary(atlanta_lead_lm)

Call:
lm(formula = aggr.assault.per.million ~ 1 + air.pb.metric.tons, 
    data = atlanta_lead)

Residuals:
    Min      1Q  Median      3Q     Max 
-356.36  -84.55    6.89  122.93  382.88 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        107.94276   80.46409   1.342    0.189    
air.pb.metric.tons   1.40375    0.08112  17.305   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 180.6 on 34 degrees of freedom
Multiple R-squared:  0.898, Adjusted R-squared:  0.895 
F-statistic: 299.4 on 1 and 34 DF,  p-value: < 2.2e-16

If for example I want to test \(H_0: \beta_1 = 1.3\) vs \(H_A: \beta_1 > 1.3\) The t statistic would be \[t = \dfrac{1.40375 - 1.3}{.08112}=1.278969\]

t <- (1.40375-1.3)/.08112

The \(p\)-value is going to be the right -tail probability from a T distribtuion with 34 degrees of freedom

pt(t, df=34, lower.tail=FALSE)
[1] 0.1047842

27.4 Prediction

Point estimate

predict(atlanta_lead_lm, newdata=data.frame(air.pb.metric.tons=1300))
       1 
1932.813 

A confidence interval for \(E(Y|X=1300)\) This estimates the location of the regression line with 95% confidence.

predict(atlanta_lead_lm, newdata=data.frame(air.pb.metric.tons=1300), interval="confidence", level=.95)
       fit      lwr      upr
1 1932.813 1845.229 2020.397

A ‘prediction’ interval for \(\hat{Y}|X=1300)\) This estimates the range of values for a individual data point line with 95% confidence.

predict(atlanta_lead_lm, newdata=data.frame(air.pb.metric.tons=1300), interval="prediction")
       fit      lwr      upr
1 1932.813 1555.393 2310.233

Demonstration of the difference between the two types of intervals.

NMC <- 1000
b0 <- 0
b1 <- 1
plot(x=NA, y=NA, xlim=c(400,1400), ylim=c(0, 2000))
for(i in 1:NMC){
  #Generate data from a linear model
  X <- runif(30, 400, 1400)
  err <- rnorm(30, 0, sd=100)
  Y <- 100 + 1.4 * X + err

  myFit <- lm(Y ~ X)
  b0[i] <- unname(coefficients(myFit)[1])
  b1[i] <- unname(coefficients(myFit)[2])
  abline(myFit, col=rgb(0,0,0,.25))
}

y.est <- b0 + 1300*b1
quantile(y.est, c(.025, .975))
    2.5%    97.5% 
1861.168 1979.547 

For the second type of confidence interval

X <- rep(1300, 1000)
y.pred <- 100 + 1.4 * X + rnorm(1000, 0, sd=100)
quantile(y.pred, c(.025, .975))
    2.5%    97.5% 
1732.464 2115.994 

27.5 Uncertainty in Prediction

Consider a linear relationship \[y = 10 + 2x + \epsilon\]

Consider if X ranges from .1 to 3, 1 to 30 or 10 to 300, and \(\sigma = .5, 5, 50\)

par(mfrow=c(3,3))
par(mar=c(2,2,1,0))
scalar <- c(.1,1,10)
for(i in 1:3){
  for(j in 1:3){
    X <- seq(scalar[i]*1, scalar[i]*30, length.out=30)
    Y <- 10 + 2 * X + rnorm(length(X), 0, sd=5*scalar[j])
    plot(X, Y, main=ifelse(i==1, paste0("sigma=",5*scalar[j]),""))
    abline(lm(Y~X))    
  }
}

So you can see that the fit of the linear model is related to the range of the \(X\) data as well as the variance of the error term.

27.5.1 Uncertainty of the linear model estimate

Let’s look at the uncertainty of the line of best fit. We’ll sample 30 points from this population and fit a line and repeat.

\(Y = 10 + 2X + \epsilon\)

slopes <- 0; intercepts <- 0; #empty vectors
set.seed(1)
for(i in 1:500){
  X <- runif(30, min=10, max=20)
  Y <- 10 + 2 * X + rnorm(30, mean=0, sd=5)
  
  if(i==1){
    plot(X,Y)
  }
  linearFit <- lm(Y~X)
  abline(linearFit, col=rgb(0,0,0,.1))
  slopes[i] <- coef(linearFit)[2]
  intercepts[i] <- coef(linearFit)[1]
}
abline(a=10, b=2, col="red")

What is the standard deviation of the intercept and slope estimates?

sd(intercepts)
[1] 5.022983
sd(slopes)
[1] 0.332713

Look at the summary output from the linear model (this is the very last model that was fit):

summary(linearFit)$coefficients
            Estimate Std. Error  t value    Pr(>|t|)
(Intercept) 18.66863  6.6534833 2.805844 0.009027105
X            1.46134  0.4236239 3.449615 0.001796701

These are not the exact same as the standard errors, but they are in the ballpark. These standard errors are related to this particular model fit with one set of \((X,Y)\) data. If we averaged the standard error over many model fits we’d probably see the values match much better.

27.5.2 Two types of interval estimates for our linear model: Confidence intervals and prediction intervals.

The standard error (i.e. uncertainty) of the slope and intercept estimates is captured by the idea of a “confidence interval” in prediction. Using the last sampled \((X,Y)\) data we fit

Ylm <- lm(Y~X)
xrng <- seq(10,20,.5)

Using the predict function we can estimate \(E(Y|X=x)\) for some particular value of \(X\). For example, if \(X=16\) a point estimate is given by

predict(Ylm, newdata=data.frame(X=16))
       1 
42.05007 

A confidence interval for \(E(Y|X=16)\) is given by adding the interval="confidence" option

predict(Ylm, newdata=data.frame(X=16), interval="confidence")
       fit      lwr      upr
1 42.05007 39.96838 44.13175

We would say “I am 95% confident that the expected value of \(Y\) when \(x=16\) is between 39.97 and 44.13”.

We can actually plot the confidence interval band over the data range to see it visualized.

CIconf <- predict(Ylm, newdata=data.frame(X=xrng), interval="confidence")
plot(X,Y)
abline(Ylm)
abline(10,2, col="red")
lines(x=xrng, y=CIconf[,2], col="black", lty=2)
lines(x=xrng, y=CIconf[,3], col="black", lty=2)

In a sense, we are 95% certain that the true line is within this interval, at least within the range of data that we have observed. But the better way to interpret it is that for any particular \(X\) value, we are 95% certain that the the expected value of \(Y\) (true line value in red) is within the interval.

For example, take \(x=16\). What is the 95% confidence interval for \(E(\hat Y|X=16)\)?

predict(Ylm, newdata=data.frame(X=16), interval="confidence")
       fit      lwr      upr
1 42.05007 39.96838 44.13175

And what is the true \(E(Y|X=16)\)?

10+2*16
[1] 42

The 95% confidence manifests itself if we were to repeat this many times, sampling a different sample each time and checking whether our interval covers the true expected value of 42.

contains42 <- 0
NMC <- 1000
for(i in 1:NMC){
  X <- runif(30, min=10, max=20)
  Y <- 10 + 2 * X + rnorm(30, mean=0, sd=5)
  Ylm <- lm(Y~X)
  CI <- predict(Ylm, newdata=data.frame(X=16), interval="confidence")[,2:3]
  contains42[i] <- CI[1]<=42 & CI[2]>=42
}
mean(contains42) #proportion of TRUEs in the results
[1] 0.95

On the other hand, a prediction interval is wider. Let’s just plot it and then talk about what it actually means.

Ylm <- lm(Y~X)
xrng <- seq(10,20,.5)
CIconf <- predict(Ylm, newdata=data.frame(X=xrng), interval="confidence")
CIpred <- predict(Ylm, newdata=data.frame(X=xrng), interval="prediction")
plot(X,Y)
abline(Ylm)
abline(10,2, col="red")
lines(x=xrng, y=CIconf[,2], col="black", lty=2)
lines(x=xrng, y=CIconf[,3], col="black", lty=2)
lines(x=xrng, y=CIpred[,2], col="blue", lty=2)
lines(x=xrng, y=CIpred[,3], col="blue", lty=2)

This interval predicts the next observed point from the population. For example, suppose we observe 1000 more points with \(X=16\)

newX <- rep(16,1000)
newY <- 10 + 2 * newX + rnorm(1000, mean=0, sd=5)

Ylm <- lm(Y~X)
xrng <- seq(10,20,.5)
CIpred <- predict(Ylm, newdata=data.frame(X=xrng), interval="prediction")
plot(X,Y, ylim=range(newY))
abline(Ylm)
abline(10,2, col="red")
lines(x=xrng, y=CIpred[,2], col="blue", lty=2)
lines(x=xrng, y=CIpred[,3], col="blue", lty=2)
points(newX, newY, col=rgb(0,0,1,.1))

Most of these new observations are within the prediction interval, but not all of them. What proportion of them?

CI <- predict(Ylm, newdata=data.frame(X=16), interval="prediction")
mean(newY >= CI[2] & newY <= CI[3])
[1] 0.963

Not a huge surprise - It’s a 95% prediction interval. How high or low this proportion is is dependent on the particular 30 data points that we used to fit the data - if our slope and intercept estimates are better, we’ll get closer to 95%, and if our standard error of residuals is higher this will increase the percentage because each prediction interval will be wider.

A better way to measure the predictive power of the interval is to repeat with new data on every model fit:

containsY <- 0
NMC <- 1000
for(i in 1:NMC){
  X <- runif(30, min=10, max=20)
  Y <- 10 + 2 * X + rnorm(30, mean=0, sd=5)
  Ylm <- lm(Y~X)
  newY <- 10 + 2 * 16 + rnorm(1, mean=0, sd=5)
  CI <- predict(Ylm, newdata=data.frame(X=16), interval="prediction")[,2:3]
  containsY[i] <- CI[1]<=newY & CI[2]>=newY
}
mean(containsY) #proportion of TRUEs in the results
[1] 0.956

27.5.3 Contrasting the interpretations

Our interpretation of a prediction interval is like this: “I am 95% confident (certain) that for a new observation with \(X=16\), \(Y\) will be within the interval”

Our interpretation of a confidence interval is : “I am 95% confident that the mean \(Y\) value for all observations with \(X=16\) is within the interval”

27.5.4 Why the difference in size?

The standard error of a confidence interval for \(E(Y|x)\) is \[ SE(\overline{Y}|x) = s_e \sqrt{\frac1n + \frac{(x-\bar{x})^2}{SS_{xx}}} \] The standard error of a prediction interval for \(\hat{Y}|x\) is \[ SE(\hat{Y}|x) = s_e \sqrt{1+\frac1n + \frac{(x-\bar{x})^2}{SS_{xx}}} \] Thus the prediction interval will be wider (notice the “1+” in the formula).

Why is the prediction interval wider than the confidence interval? There are two sources of uncertainty. The first is that we only estimated the placement of the model line - the slope and intercept are both estimates. As you saw before, new data would give us new estimates and there is some natural uncertainty.

The second source of uncertainty is that there are factors not in the model which give variation captured in the \(\sigma^2_\epsilon\) term in the model.