24 Interval Estimation Practice

25 Practice Problems

For the next few exercises, we will use actual polls from the 2016 election. You can load the data from the dslabs package.

library(dslabs)
data("polls_us_election_2016")

Specifically, we will use all the national polls that ended within one week before the election.

library(tidyverse)

polls <- polls_us_election_2016 %>%
  filter(enddate >= "2016-10-31" & state == "U.S.")

25.1 Polling Part 1

For the first poll, you can obtain the sample size and estimated Clinton percentage with:

(N <- polls$samplesize[1])

[1] 2220

(x_hat <- polls$rawpoll_clinton[1]/100)

[1] 0.47

Assume there are only two candidates and construct a 95% confidence interval for the election night proportion \(p\).
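One possible solution, sketched under the two-candidate assumption stated above: the standard error of the sample proportion is estimated by \(\sqrt{\hat{x}(1-\hat{x})/N}\), and the 95% interval is the estimate plus or minus 1.96 standard errors. The values of `N` and `x_hat` are copied from the output above so that the snippet runs without loading the data; `se_hat` and `ci` are names chosen here for illustration.

```r
# Values for the first poll, copied from the output above
N <- 2220
x_hat <- 0.47

# Estimated standard error of the sample proportion
se_hat <- sqrt(x_hat * (1 - x_hat) / N)

# 95% confidence interval: estimate plus or minus 1.96 standard errors
ci <- c(lower = x_hat - 1.96 * se_hat, upper = x_hat + 1.96 * se_hat)
ci
```

With these inputs the interval runs from roughly 0.449 to 0.491.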

25.2 Polling Part 2

Now use dplyr to add a confidence interval as two columns, call them lower and upper, to the object polls. Then use select to show the pollster, enddate, x_hat, lower, and upper variables. Hint: define temporary columns x_hat and se_hat.

Solution

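# Estimated Clinton proportion and its standard error for every poll
x_hat <- polls$rawpoll_clinton / 100
se_hat <- sqrt(x_hat * (1 - x_hat) / polls$samplesize)
# 95% confidence limits
polls$lower <- x_hat - 1.96 * se_hat
polls$upper <- x_hat + 1.96 * se_hat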
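The exercise asks for dplyr and select, so the same calculation can also be sketched as a single pipeline. This is one way to write it, not the only one; it assumes the dslabs and dplyr packages are installed, and `ci_table` is a name chosen here for illustration.

```r
library(dslabs)
library(dplyr)

# National polls that ended within one week before the election
polls <- polls_us_election_2016 %>%
  filter(enddate >= "2016-10-31" & state == "U.S.")

# x_hat and se_hat are temporary columns; only the requested columns are kept
ci_table <- polls %>%
  mutate(x_hat = rawpoll_clinton / 100,
         se_hat = sqrt(x_hat * (1 - x_hat) / samplesize),
         lower = x_hat - 1.96 * se_hat,
         upper = x_hat + 1.96 * se_hat) %>%
  select(pollster, enddate, x_hat, lower, upper)

head(ci_table)
```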

25.3 Polling Part 3

The final tally for the popular vote was Clinton 48.2% and Trump 46.1%. Add a column, call it hit, to the previous table indicating whether the confidence interval included the true proportion \(p = 0.482\).

Solution

polls$hit <- 0.482 >= polls$lower & 0.482 <= polls$upper

25.4 Polling Part 4

For the table you just created, what proportion of confidence intervals included \(p\)?

Solution

mean(polls$hit)

[1] 0.3142857

25.5 Polling Part 5

If these confidence intervals are constructed correctly, and the theory holds up, what proportion should include \(p\)?

Solution

95% of the confidence intervals should include \(p\).
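A small Monte Carlo simulation can illustrate this claim. The sketch below assumes every poll is a simple random sample of the same size with no bias; the sample size, the number of simulated polls, and the seed are arbitrary choices made here, not part of the exercise.

```r
set.seed(1)  # arbitrary seed for reproducibility

p <- 0.482   # true proportion, taken from Polling Part 3
N <- 2220    # sample size of the first poll, used here as an assumption
B <- 10000   # number of simulated polls (an arbitrary choice)

# Each simulated poll: N independent voters, each supporting Clinton with probability p
x_hat <- rbinom(B, size = N, prob = p) / N
se_hat <- sqrt(x_hat * (1 - x_hat) / N)

# Proportion of simulated intervals that contain p
coverage <- mean(p >= x_hat - 1.96 * se_hat & p <= x_hat + 1.96 * se_hat)
coverage
```

Under these idealized assumptions the simulated coverage should land close to 0.95.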

25.6 Polling Part 6

A much smaller proportion of the polls than expected produce confidence intervals containing \(p\). If you look closely at the table, you will see that most polls that fail to include \(p\) are underestimating it. The reason is undecided voters: individuals polled who do not yet know whom they will vote for or do not want to say. Because, historically, undecideds divide evenly between the two main candidates on election day, it is more informative to estimate the spread, the difference between the proportions of the two candidates, \(d\), which in this election was \(0.482 - 0.461 = 0.021\). Assume that there are only two parties and that \(d = 2p - 1\). Redefine polls as below and redo Polling Part 1, but for the difference.

Solution

polls <- polls_us_election_2016 %>%
  filter(enddate >= "2016-10-31" & state == "U.S.") %>%
  mutate(d_hat = rawpoll_clinton / 100 - rawpoll_trump / 100)

(N <- polls$samplesize[1])

[1] 2220

(d_hat <- polls$d_hat[1])

[1] 0.04
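To finish the interval for the difference, one possible approach uses the two-party assumption \(d = 2p - 1\). The implied estimate of \(p\) is \((\hat{d} + 1)/2\), and because \(\hat{d}\) is twice a proportion minus a constant, its standard error is \(2\sqrt{p(1-p)/N}\). The sketch below plugs in the values printed above; `p_hat`, `se_d`, and `ci_d` are names chosen here for illustration.

```r
# Values for the first poll, copied from the output above
N <- 2220
d_hat <- 0.04

# Under d = 2p - 1, the implied estimate of p is (d_hat + 1) / 2
p_hat <- (d_hat + 1) / 2

# Doubling an estimate doubles its standard error
se_d <- 2 * sqrt(p_hat * (1 - p_hat) / N)

# 95% confidence interval for the spread
ci_d <- c(lower = d_hat - 1.96 * se_d, upper = d_hat + 1.96 * se_d)
ci_d
```

With these inputs the interval runs from roughly -0.002 to 0.082.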

25.7 Polling Part 7

Now repeat Polling Part 3, but for the difference.

Solution

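# Standard error of the Clinton proportion, as in Polling Part 2.
# Note: under d = 2p - 1 the spread has about twice this standard error,
# so using 2 * se_hat below would give wider intervals.
x_hat <- polls$rawpoll_clinton / 100
se_hat <- sqrt(x_hat * (1 - x_hat) / polls$samplesize)
polls$lower <- polls$d_hat - 1.96 * se_hat
polls$upper <- polls$d_hat + 1.96 * se_hat

polls$hit <- 0.021 >= polls$lower & 0.021 <= polls$upper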

25.8 Polling Part 8

Now repeat Polling Part 4, but for the difference.

Solution

mean(polls$hit)

[1] 0.6

25.9 Polling Part 9

Although the proportion of confidence intervals that include the true value goes up substantially, it is still lower than 0.95. In the next chapter, we learn the reason for this. To motivate this, make a plot of the error, the difference between each poll’s estimate and the actual \(d = 0.021\). Stratify by pollster.
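One way to draw such a plot is sketched below. It assumes the dslabs, dplyr, and ggplot2 packages are installed; the column name `error`, the object name `error_plot`, and the styling choices (dashed zero line, rotated labels) are illustrative rather than prescribed by the exercise.

```r
library(dslabs)
library(dplyr)
library(ggplot2)

# Rebuild the table of national polls and add the error of each estimated spread
polls <- polls_us_election_2016 %>%
  filter(enddate >= "2016-10-31" & state == "U.S.") %>%
  mutate(d_hat = rawpoll_clinton / 100 - rawpoll_trump / 100,
         error = d_hat - 0.021)

# One point per poll, grouped along the x-axis by pollster
error_plot <- ggplot(polls, aes(x = pollster, y = error)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(error_plot)
```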

25.10 Polling Part 10

Redo the plot that you made for Polling Part 9, but only for pollsters that took five or more polls.
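A possible way to restrict the plot is to group by pollster and keep only groups with at least five rows. The sketch assumes the dslabs, dplyr, and ggplot2 packages are installed; `frequent` and `frequent_plot` are names chosen here for illustration.

```r
library(dslabs)
library(dplyr)
library(ggplot2)

# Rebuild the table of national polls with the error of each estimated spread
polls <- polls_us_election_2016 %>%
  filter(enddate >= "2016-10-31" & state == "U.S.") %>%
  mutate(d_hat = rawpoll_clinton / 100 - rawpoll_trump / 100,
         error = d_hat - 0.021)

# Keep only pollsters that contributed five or more polls
frequent <- polls %>%
  group_by(pollster) %>%
  filter(n() >= 5) %>%
  ungroup()

frequent_plot <- ggplot(frequent, aes(x = pollster, y = error)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(frequent_plot)
```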

26 Beyond STAT 340

These problems are excellent practice, but they are beyond the material we cover in STAT 340.