class: center, middle, inverse, title-slide

.title[
# More on Confidence Intervals
]
.subtitle[
## STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### November 25, 2024
]

---
layout: true

<!-- <div class="my-footer"> -->
<!-- <span> -->
<!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> -->
<!-- </span> -->
<!-- </div> -->

---

<style type="text/css">
.tiny .remark-code { font-size: 70%; }
.small .remark-code { font-size: 80%; }
.tiny { font-size: 60%; }
.small { font-size: 80%; }
</style>

## Today

- More on confidence intervals:
  - Construction using Central Limit Theorem
  - Changing n and alpha
  - Simulation example
  - `\(\sigma^2\)` unknown
  - Confidence interval for population proportion

---

## Proof

$$
`\begin{aligned}
& P\left(\overline{X}-1.96\frac{\sigma}{\sqrt{n}} \leq \mu \leq \overline{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) \\
&= P\left(\overline{X}-1.96\frac{\sigma}{\sqrt{n}} \leq \mu \text{ and } \mu \leq \overline{X} + 1.96\frac{\sigma}{\sqrt{n}} \right) \\
&= P\left(\overline{X}-\mu \leq 1.96\frac{\sigma}{\sqrt{n}} \text{ and } \overline{X} - \mu \geq -1.96\frac{\sigma}{\sqrt{n}} \right) \\
&= P\left( \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}} \leq 1.96 \text{ and } \frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \geq -1.96 \right) \\
&= P\left( -1.96 \leq \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}} \leq 1.96\right) \\
&\approx P\left( -1.96 \leq Z \leq 1.96\right) \text{ by the Central Limit Theorem} \\
&= .95
\end{aligned}`
$$

---

## Confidence Interval for Population Mean

- Collect sample
- Compute sample mean `\(\bar{x}\)`
- A `\(100(1-\alpha)\%\)` confidence interval for `\(\mu\)` is
`$$\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)$$`

Notice that the random variable `\(\overline{X}\)` has been replaced by the sample statistic `\(\overline{x}\)`.

---

## Example

Let `\(X_1, X_2, ..., X_{200}\)` be independent `\(N(\mu, 4)\)` random variables.
We collect the sample of size 200, and the resulting sample mean, `\(\overline{x}\)`, is `\(\overline{x} = 24\)`. What is a 95% confidence interval for `\(\mu\)`?

---

## Confidence Interval Width: changing n

`$$\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)$$`

- For a given `\(z_{\frac{\alpha}{2}}\)`, confidence intervals that are **narrower** indicate **greater certainty** in estimated values
- We can get narrower intervals by increasing `\(n\)`, the sample size

---

## Earlier example

We had `\(n = 200\)`, `\(\overline{x} = 24\)`, `\(\sigma = 2\)`, and our 95% confidence interval for `\(\mu\)` was (23.72, 24.28).

What about when `\(n = 2000\)`?

--

`$$\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)$$`

Substitute `\(\overline{x} = 24\)`, `\(z_{\frac{\alpha}{2}} = 1.96\)`, `\(\sigma = 2\)`, `\(n = 2000\)`, giving `\(\left(24-1.96*2/ \sqrt{2000}, 24 + 1.96*2/ \sqrt{2000} \right)\)` or (23.91, 24.09).

---

## Confidence Interval Width: changing alpha

- 95% confidence intervals are the most common
- Simple to generate other intervals, by **replacing the critical value**
- E.g., for a 99% interval, need the z-score that cuts off the upper 0.005 of the distribution

``` r
qnorm(.995)
```

```
## [1] 2.575829
```

--

Earlier example:

- `\(n = 200\)`, `\(\overline{x} = 24\)`, `\(\sigma = 2\)`; 95% confidence interval for `\(\mu\)` was (23.72, 24.28)
- When `\(\alpha = .01\)`, a 99% confidence interval is `\(\left(24-2.58*2/ \sqrt{200}, 24 + 2.58*2/ \sqrt{200} \right)\)` or (23.64, 24.36).

---

## Confidence Interval Width: changing alpha

In a recent study, 50 randomly selected statistics students were asked the number of hours per week they spend studying for their statistics classes. The results were used to estimate the mean time for all statistics students with 90%, 95% and 99% confidence intervals.
These were (not necessarily in the same order):
`$$(7.5, 8.5) ~~~ (7.6, 8.4) ~~~(7.7, 8.3).$$`

Which interval is which?

---

## Simulation example

Let `\(X\)` be normally distributed with mean `\(\mu = 3\)` and variance `\(\sigma^2 = 25\)`, i.e., `\(X \sim N(3, 5^2)\)`

``` r
myDraws <- rnorm(1000, mean = 3, sd = 5)  # one sample of size 1000
xBar <- mean(myDraws)
n <- length(myDraws)
halfWidth <- qnorm(.975)*5/sqrt(n)        # sigma = 5 is known here
(lower <- xBar - halfWidth)
```

```
## [1] 2.557116
```

``` r
(upper <- xBar + halfWidth)
```

```
## [1] 3.176912
```

Hence a 95% confidence interval for `\(\mu\)` is (2.5571165, 3.1769115).

---

## Simulation example: 5000 confidence intervals

``` r
set.seed(0)
CIsDF <- data.frame(lower = rep(NA, 5000), upper = rep(NA, 5000))
for (i in 1:nrow(CIsDF)) {
  myDraws <- rnorm(1000, mean = 3, sd = 5)
  xBar <- mean(myDraws)
  n <- length(myDraws)
  halfWidth <- qnorm(.975)*5/sqrt(n)
  CIsDF[i, "lower"] <- xBar - halfWidth
  CIsDF[i, "upper"] <- xBar + halfWidth
}
```

---

## Simulation example: 5000 confidence intervals

How many of the 5000 CIs do we expect to include the true population mean, `\(\mu = 3\)`?

--

- Recall interpretation of confidence intervals: If we were to **repeat this procedure** a large number of times, sampling and constructing confidence intervals in the same way, **95% of constructed intervals would contain the true population parameter**.
- If we repeat the experiment 5,000 times, i.e., draw samples and construct 5,000 confidence intervals, we would expect about 4,750 of these (95% of 5,000) to contain the true population parameter

.small[
.pull-left[

``` r
library(tidyverse)  # provides %>%, mutate(), slice() and ggplot()
CIsDF %>%
  mutate(index = 1:nrow(CIsDF), estimate = (lower + upper)/2) %>%
  slice(1:20) %>%
  ggplot(aes(estimate, index)) +
  geom_pointrange(aes(xmin = lower, xmax = upper)) +
  geom_vline(xintercept = 3, colour = "grey60", linetype = 2) +
  labs(title = "First 20 CIs for population mean", x = "Estimate", y = "Index")
```
]
]

.pull-right[
<img src="lecture22_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Simulation example: 5000 confidence intervals

``` r
sum(CIsDF$lower <= 3 & CIsDF$upper >= 3)
```

```
## [1] 4769
```

``` r
sum(CIsDF$lower <= 3 & CIsDF$upper >= 3) / nrow(CIsDF)
```

```
## [1] 0.9538
```

---

## Simulation example: 5000 confidence intervals, 90% level of confidence

``` r
set.seed(0)
CIsDF <- data.frame(lower = rep(NA, 5000), upper = rep(NA, 5000))
for (i in 1:nrow(CIsDF)) {
  myDraws <- rnorm(1000, mean = 3, sd = 5)
  xBar <- mean(myDraws)
  n <- length(myDraws)
  halfWidth <- qnorm(.95)*5/sqrt(n)
  CIsDF[i, "lower"] <- xBar - halfWidth
  CIsDF[i, "upper"] <- xBar + halfWidth
}
sum(CIsDF$lower <= 3 & CIsDF$upper >= 3)
```

```
## [1] 4542
```

``` r
sum(CIsDF$lower <= 3 & CIsDF$upper >= 3) / nrow(CIsDF)
```

```
## [1] 0.9084
```

---

## What happens if the population variance is unknown?

- The confidence interval involves `\(\sigma^2\)`; most of the time, this is unknown
- Recall that we can **use the sample variance**, `\(s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2\)`, to estimate `\(\sigma^2\)`
- Before collecting the data: `\(S^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X})^2\)`
- If we estimate `\(\sigma^2\)` using `\(S^2\)`, then we can't use the Central Limit Theorem in the same way
- Two ways we can make progress:
  1. When `\(n\)` is large
  2. When `\(n\)` is small (we're going to skip this)

---

## First case (n large)

- If `\(n\)` is large, CLT still holds (for the `\(\sigma^2\)` version), and using another theorem (out of scope for this class), we can prove that `\(\frac{\overline{X} - \mu}{S / \sqrt{n}} \approx N(0, 1)\)`
- Only difference: replaced `\(\sigma\)` by `\(S\)`
- A `\(100(1-\alpha)\%\)` confidence interval for `\(\mu\)` is `\(\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}} \right)\)`

---

## Previous example: `\(\sigma\)` known

Let `\(X_1, X_2, ..., X_{200}\)` be independent `\(N(\mu, 4)\)` random variables. We collect the sample of size 200, and the resulting sample mean, `\(\overline{x}\)`, is `\(\overline{x} = 24\)`. What is a 95% confidence interval for `\(\mu\)`?

--

Since `\(n = 200\)` is large, by CLT, a 95% confidence interval for `\(\mu\)` is given by
`$$\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)$$`

Substitute `\(\overline{x} = 24\)`, `\(z_{\frac{\alpha}{2}} = 1.96\)`, `\(\sigma = 2\)`, `\(n = 200\)`, giving `\(\left(24-1.96*2/ \sqrt{200}, 24 + 1.96*2/ \sqrt{200} \right)\)` or (23.72, 24.28).

We are 95% confident that `\(\mu\)` falls within the interval (23.72, 24.28).

---

## Previous example: `\(\sigma\)` unknown

Let `\(X_1, X_2, ..., X_{200}\)` be independent `\(N(\mu, \sigma^2)\)` random variables. We collect the sample of size 200, and the resulting sample mean, `\(\overline{x}\)`, is `\(\overline{x} = 24\)`. `\(\sigma^2\)` is unknown, but using our sample, we calculate the sample variance, `\(s^2\)`, to be 4.1. What is a 95% confidence interval for `\(\mu\)`?

---

## Example: Lead in Flint, MI

From April 25, 2014 to October 15, 2015, the water supply source for Flint, MI was switched to the Flint River from the Detroit water system.
Without corrosion inhibitors, the Flint River water, which is high in chloride, caused lead from aging pipes to leach into the water supply. We have data from Flint collected as part of a citizen-science project involving Virginia Tech researchers.

<img src="img/flintdetroit.jpg" width="60%" style="display: block; margin: auto;" />

---

## Example: Lead in Flint, MI

.small[

``` r
flint <- readxl::read_excel("./data/Flint-Samples.xlsx", sheet = 1) %>%
  rename("Pb_initial"="Pb Bottle 1 (ppb) - First Draw")
str(flint)
```

```
## tibble [271 × 7] (S3: tbl_df/tbl/data.frame)
##  $ SampleID                            : num [1:271] 1 2 4 5 6 7 8 9 12 13 ...
##  $ Zip Code                            : num [1:271] 48504 48507 48504 48507 48505 ...
##  $ Ward                                : num [1:271] 6 9 1 8 3 9 9 5 9 3 ...
##  $ Pb_initial                          : num [1:271] 0.344 8.133 1.111 8.007 1.951 ...
##  $ Pb Bottle 2 (ppb) - 45 secs flushing: num [1:271] 0.226 10.77 0.11 7.446 0.048 ...
##  $ Pb Bottle 3 (ppb) - 2 mins flushing : num [1:271] 0.145 2.761 0.123 3.384 0.035 ...
##  $ Notes                               : chr [1:271] NA NA NA NA ...
```
]

- Each row is a water sample; lead level in `Pb_initial` column (units are parts per billion (ppb))
- Want a confidence interval for the unknown population mean lead level in Flint water
- What is the sample size `\(n\)`? What confidence interval should we use?
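
---

## Aside: checking the large-n interval by simulation

Before computing an interval for the Flint data, here is a minimal sketch that checks, in the same style as the earlier simulation example, how often the interval with `\(s\)` in place of `\(\sigma\)` covers the true mean when the population is skewed. The Exponential(1) population, `\(n = 200\)`, the 2000 repetitions and the names `reps`, `trueMean`, `covered` and `coverage` are illustrative assumptions; none of them come from the Flint data.

```r
set.seed(1)
n <- 200          # sample size (assumed for illustration)
reps <- 2000      # number of simulated samples (assumed)
trueMean <- 1     # mean of an Exponential(rate = 1) population
covered <- logical(reps)
for (i in seq_len(reps)) {
  draws <- rexp(n, rate = 1)
  # sigma is treated as unknown: use the sample standard deviation s
  halfWidth <- qnorm(.975) * sd(draws) / sqrt(n)
  covered[i] <- abs(mean(draws) - trueMean) <= halfWidth
}
coverage <- mean(covered)  # proportion of intervals containing trueMean
coverage
```

If the large-`\(n\)` approximation is reasonable, `coverage` should be close to 0.95, although it need not equal it exactly.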
---

## Calculating a Confidence Interval for Flint Lead

- 271 water samples, so `\(n\)` qualifies as large
- For a 95% confidence interval:
`$$\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}} \right)$$`

.small[

``` r
xBar <- mean(flint$Pb_initial)
sigma2 <- var(flint$Pb_initial)  # sample variance s^2, used to estimate sigma^2
n <- length(flint$Pb_initial)
halfWidth <- qnorm(.975)*sqrt(sigma2/n)
(lower <- xBar - halfWidth)
```

```
## [1] 8.078981
```

``` r
(upper <- xBar + halfWidth)
```

```
## [1] 13.213
```
]

Hence a 95% confidence interval for the mean lead level in water in Flint is (8.0789808, 13.2130044) ppb.

---

## Confidence interval for population proportion

- Recall: when `\(X \sim \text{Bernoulli}(p)\)`, sample mean `\(\overline{X} = \frac{\sum_{i = 1}^n{X_i}}{n}\)` is the same as the sample proportion `\(\hat{P}\)`
- We have `\(\overline{X} = \hat{P} \approx N(p, \frac{p(1-p)}{n})\)` by the Central Limit Theorem, when `\(n\)` is large
- `\(p\)` is unknown, but we can replace it by `\(\hat{p}\)`, calculated from the sample, in the same way we replaced `\(\sigma\)` by `\(s\)` for the population mean (case 1, when `\(n\)` is large), so a `\(100(1-\alpha)\%\)` confidence interval for `\(p\)` is
`$$\left(\hat{p}-z_{\frac{\alpha}{2}}\frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}, \hat{p}+z_{\frac{\alpha}{2}}\frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}} \right)$$`

---

## Example: Approval ratings

<img src="img/approval.png" width="70%" style="display: block; margin: auto;" />

.tiny[
Source: https://www.rasmussenreports.com/public_content/politics/biden_administration/prez_track_sep23
]

--

- Margin of error = `\(z_{\frac{\alpha}{2}}\frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}\)`

``` r
qnorm(.975)*sqrt(.5^2/1500)
```

```
## [1] 0.02530303
```

---

## Example: Flint

- The EPA action level for lead in public water supplies is 15 ppb
- What is the estimated proportion of water samples (homes) with lead levels over 15 ppb?

``` r
flint <- flint %>%
  mutate(Pbover15 = Pb_initial > 15)
pHat <- mean(flint$Pbover15)
# estimated variance of the sample proportion (named to avoid masking var())
varPHat <- pHat*(1 - pHat)/length(flint$Pbover15)
halfWidth <- qnorm(.975)*sqrt(varPHat)
(lower <- pHat - halfWidth)
```

```
## [1] 0.1217465
```

``` r
(upper <- pHat + halfWidth)
```

```
## [1] 0.2103569
```

A 95% confidence interval for the proportion of homes in Flint with lead level above 15 ppb is (0.1217465, 0.2103569).

---

## Summary

- More on confidence intervals:
  - Construction using Central Limit Theorem: `\(\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)\)`
  - Changing n and alpha
  - Simulation example
  - `\(\sigma^2\)` unknown:
      - `\(n\)` large: `\(\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}} \right)\)`
      <!-- - `\(n\)` small and `\(X_i\)` normal: `\(\left(\overline{x}-t_{n-1,\frac{\alpha}{2}}\frac{s}{\sqrt{n}}, \overline{x}+t_{n-1,\frac{\alpha}{2}}\frac{s}{\sqrt{n}} \right)\)` -->
  - Confidence interval for population proportion: `\(\left(\hat{p}-z_{\frac{\alpha}{2}}\frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}, \hat{p}+z_{\frac{\alpha}{2}}\frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}} \right)\)`
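
---

## Summary: the intervals as R helpers

The intervals summarized above all share the form "estimate plus or minus critical value times standard error", so they can be sketched as two small helpers. The function names `zIntervalMean` and `zIntervalProp` and the `level` argument are hypothetical, introduced only for illustration; they are not part of base R or of the course code.

```r
# Large-sample z-interval for a mean.
# `sd` is sigma when it is known, otherwise the sample standard deviation s.
zIntervalMean <- function(xBar, sd, n, level = 0.95) {
  alpha <- 1 - level
  halfWidth <- qnorm(1 - alpha / 2) * sd / sqrt(n)
  c(lower = xBar - halfWidth, upper = xBar + halfWidth)
}

# Large-sample z-interval for a proportion: same form, with
# sqrt(pHat * (1 - pHat)) playing the role of the standard deviation.
zIntervalProp <- function(pHat, n, level = 0.95) {
  zIntervalMean(pHat, sqrt(pHat * (1 - pHat)), n, level)
}

zIntervalMean(24, 2, 200)                 # sigma-known example, 95%
zIntervalMean(24, 2, 200, level = 0.99)   # same data, 99%: a wider interval
zIntervalProp(0.5, 1500)                  # half-width is the margin of error
```

The first call should reproduce the earlier interval of roughly (23.72, 24.28), and the last has a half-width of roughly 0.0253, matching the margin of error in the approval ratings example.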