Sample proportion and Confidence intervals

class: center, middle, inverse, title-slide

.title[
# Sample proportion and Confidence intervals
]
.subtitle[
## <br><br> STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### November 22, 2024
]

---

layout: true

---

## Today
- Sampling distribution of the sample proportion

- Normal approximation to binomial

- Confidence intervals:

  - Introduction and interpretation

  - Construction using Central Limit Theorem

---

## Example: Bernoulli data

Assume that 67% of the population has had a prior COVID infection. Define the variable `$X_i$` to take value 1 if the `$i$`th randomly sampled resident has been infected, and let it be 0 otherwise. Assume the samples are independent.

What distribution does `$X_i$` follow? What are the parameters? What is the mean? What is the variance?

--
- The random variable `$X_i$` follows a Bernoulli distribution. A Bernoulli random variable has mean `$p$` and variance `$p(1- p)$`. Here `$p=0.67$`.

For a random sample, the sample mean `$\overline{x}=\hat{p}$` is an **estimate** of this probability (the fraction of 1's).

**Sample proportion** is `$\hat{P} = \frac{1}{n}\sum_{i = 1}^n X_i$`, where `$X_i \sim$` Bernoulli(p)

---

## Simulation: Prior Infections

- For repeated samples of residents, if we compute the proportion with prior infection in each, what values will we see? Will we get 0.67 each time?

- What is the effect of the sample size `$n$`?

---

## Simulation: Prior Infections

Again, we perform 10,000 experiments, each time sampling from a Bernoulli distribution with `prob = 0.67`, for several values of `n`.

``` r
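# Each replicate draws n Bernoulli(0.67) values; t() puts one experiment per row
set.seed(0)
n10 <- t(replicate(n = 10000, rbinom(10, size = 1, prob = .67)))
# Row means are the 10000 simulated sample proportions for n = 10
n10 <- rowMeans(n10)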

set.seed(0)
n50 <- t(replicate(n = 10000, rbinom(50, size = 1, prob = .67)))
n50 <- rowMeans(n50)

set.seed(0)
n100 <- t(replicate(n = 10000, rbinom(100, size = 1, prob = .67)))
n100 <- rowMeans(n100)

set.seed(0)
n500 <- t(replicate(n = 10000, rbinom(500, size = 1, prob = .67)))
n500 <- rowMeans(n500)
```

---
## Simulation: Prior Infections

.small[
.pull-left[

``` r
library(dplyr)    # %>%, bind_rows(), rename()
library(ggplot2)  # ggplot(), geom_density(), labs()

data.frame(propCovid = n10, sampleSize = 10) %>%
  bind_rows(
    data.frame(n50, sampleSize = 50) %>%
      rename(propCovid = n50)
  ) %>%
  bind_rows(
    data.frame(n100, sampleSize = 100) %>%
      rename(propCovid = n100)
  ) %>%
  bind_rows(
    data.frame(n500, sampleSize = 500) %>%
      rename(propCovid = n500)
  ) %>%
  ggplot(aes(x = propCovid,
             fill = as.factor(sampleSize))) +
  geom_density(alpha = .5) +
  labs(x = "Proportion of people with prior infection",
       y = "Density",
       title = "Sampling distribution from Bernoulli(.67)",
       fill = "Sample size")  +
  scale_fill_viridis_d()
```
]
]

.pull-right[
<img src="lecture21_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />
]

---
## Sampling distribution of the sample proportion

- Sample proportion is the **same as the sample mean**, when the **distribution is Bernoulli**

- Sample proportion `$\hat{P} = \frac{1}{n}\sum_{i = 1}^n X_i$`, where `$X_i \sim$` Bernoulli(p)

---
## Earlier example

- The random variable `$X$` is Bernoulli. A Bernoulli random variable has mean `$p$` and variance `$p(1- p)$`. Here `$p=0.67$`.

- The sample proportion that had a prior infection is just the fraction of 1's

- The sample mean is defined as `$\overline{X} = \frac{\sum_{i = 1}^n{X_i}}{n}$`. When `$X_i$` is Bernoulli, it can only take the values 0 and 1, so the sample mean is the same as the sample proportion: `$\overline{X} = \hat{P}$`.

---
## Sampling distribution of the sample proportion

- Recall: the Central Limit Theorem does not restrict the distribution of the underlying `$X_1, ..., X_n$`

- `$\overline{X}  \approx N(\mu, \frac{\sigma^2}{n})$`

- We get the sampling distribution of the sample proportion for free!

- When `$X_i \sim$` Bernoulli(p), `$\mu = p$` and `$\sigma^2 = p(1-p)$`.

- By the Central Limit Theorem,
`$\overline{X} = \hat{P} \approx N(p, \frac{p(1-p)}{n})$`
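
A quick simulation can make this result concrete. The sketch below is illustrative only: the values `n = 100` and `p = 0.67` and all variable names are chosen here for the example. It compares the mean and standard deviation of simulated sample proportions with the values `$p$` and `$\sqrt{p(1-p)/n}$` suggested by the Central Limit Theorem.

``` r
set.seed(1)
p <- 0.67      # assumed population proportion (borrowed from the earlier example)
n <- 100       # assumed sample size
reps <- 10000  # number of simulated samples

# rbinom(reps, size = n, prob = p) gives the number of 1's in each sample;
# dividing by n turns each count into a sample proportion
p_hat <- rbinom(reps, size = n, prob = p) / n

# Simulated vs. theoretical center and spread
c(simulated = mean(p_hat), theory = p)
c(simulated = sd(p_hat), theory = sqrt(p * (1 - p) / n))
```

The simulated and theoretical values should agree closely, and the agreement of the spread is what the `$\frac{p(1-p)}{n}$` term describes.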

---
## Exercise
We ask a random sample of 435 people if they like cats. Assume that the probability that a randomly selected person likes cats is 75%. Let `$X_i$` be the Bernoulli random variable representing whether or not the `$i$`th person likes cats. What is the sampling distribution of the sample proportion? What is the probability that the sample proportion is smaller than .7?
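
One way to check an answer in R, as a sketch that relies on the normal approximation `$\hat{P} \approx N(p, \frac{p(1-p)}{n})$` (the variable names are chosen here for illustration):

``` r
p <- 0.75   # probability that a randomly selected person likes cats
n <- 435    # sample size

# Standard deviation of the approximate sampling distribution of P-hat
se <- sqrt(p * (1 - p) / n)

# P(P-hat < 0.7) under the normal approximation
prob_below <- pnorm(0.7, mean = p, sd = se)
prob_below
```

Under these assumptions the probability comes out a little below 0.01, so a sample proportion under .7 would be unusual.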

---
## Normal approximation to the Binomial distribution
- Consider `$Y \sim$` Binomial(n, p)

- As the number of trials increases, the binomial distribution becomes symmetric and bell-shaped

- For large `$n$`, `$Y \approx N(np, np(1-p))$`

- Commonly used rule of thumb: `$np > 5$` and `$n(1-p) > 5$`

- This approximation comes from the Central Limit Theorem: `$\overline{X} = \hat{P} = \frac{\sum_{i = 1}^n{X_i}}{n} \approx N(p, \frac{p(1-p)}{n})$` where `$X_i \sim$` Bernoulli(p).

- Now, `$Y = \sum_{i = 1}^n{X_i} = n \overline{X}$`

- By the rules for expectation and variance, `$E(n\overline{X}) = nE(\overline{X}) = np$` and `$Var(n\overline{X}) = n^2 Var(\overline{X}) = np(1-p)$`
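
The mean and variance of `$Y$` can also be checked by simulation. This is a minimal sketch with values picked for illustration (`n = 50`, `p = 0.67`, so both rule-of-thumb conditions hold); it builds `$Y$` explicitly as a sum of Bernoulli draws.

``` r
set.seed(2)
n <- 50        # assumed number of trials; np = 33.5 and n(1 - p) = 16.5
p <- 0.67      # assumed success probability
reps <- 10000  # number of simulated values of Y

# Each row holds n Bernoulli(p) draws; the row sum is Y = n * X-bar
x <- matrix(rbinom(reps * n, size = 1, prob = p), nrow = reps)
y <- rowSums(x)

# Compare with the theoretical mean np and variance np(1 - p)
c(simulated = mean(y), theory = n * p)
c(simulated = var(y), theory = n * p * (1 - p))
```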

---
## Exercise
We ask a random sample of 435 people if they like cats. Assume that the probability that a randomly selected person likes cats is 75%.

Find the probability that at least 325 people like cats.

---
## Exercise
We ask a random sample of 435 people if they like cats. Assume that the probability that a randomly selected person likes cats is 75%.

Using a normal approximation, find the probability that at least 325 people like cats.
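
A possible way to compare the two approaches in R, offered as a sketch rather than the official solution (variable names are chosen here). The exact answer uses the binomial distribution directly; the approximate one uses `$Y \approx N(np, np(1-p))$`.

``` r
n <- 435
p <- 0.75

# Exact: P(Y >= 325) = 1 - P(Y <= 324) for Y ~ Binomial(n, p)
exact <- 1 - pbinom(324, size = n, prob = p)

# Normal approximation: Y is treated as N(np, np(1 - p))
mu <- n * p
sigma <- sqrt(n * p * (1 - p))
approx <- 1 - pnorm(325, mean = mu, sd = sigma)

c(exact = exact, normal_approx = approx)
```

The two probabilities should be close but not identical, since the normal distribution is continuous and the binomial is discrete.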

---

## Course content

1. Fundamentals of R
  - Overview of data types and structures
  - Data manipulation and data visualization tools

2. Descriptive statistics for numerical and categorical data

3. Probability
  - Rules of probability computation; conditional probability
  - Basic probability models: Binomial, Normal and Poisson

4. Statistical inference
  - Sampling distributions of sample mean and sample proportion
  - **Hypothesis testing and confidence intervals for population mean and population proportion**

---
## Estimation and testing

- **Estimation**: a "guess" for the unknown parameter

  - **Point estimate**: "Best guess", or estimate, for the population parameter. "Based on a survey of 300 UC Davis students, we estimate that 50% of students ride bikes on campus."

  - **Interval estimate**: "Based on a survey of 300 UC Davis students, we are 95% confident that the proportion of students riding bikes on campus is between .45 and .55."

- **Testing**: evaluating whether our observed sample provides evidence against some claim about the population

  - "We reject the hypothesis that 10% of UC Davis students ride bikes on campus"

---

## Point estimates vs. confidence intervals

- **Point estimate**: single value

- For example, the sample mean `$\overline{X}$` is often used as an estimator for the population mean `$\mu$`

- In R: `mean()` calculates the estimate (sample statistic) from a given sample

- Point estimates vs. confidence intervals: If you want to catch a fish, do you prefer a spear or a net?

.pull-left[
<img src="img/spear.png" width="70%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/net.png" width="70%" style="display: block; margin: auto;" />
]

---

## Point estimates vs. confidence intervals

If you want to estimate a population parameter, do you prefer to report a range of values or a single value?

.pull-left[
<img src="img/spear.png" width="400" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/net.png" width="400" style="display: block; margin: auto;" />
]

---
## What is a Confidence Interval?
- Range of reasonable values that are intended to **contain the parameter of interest** with a certain **degree of confidence**

- Written as (point estimate - margin of error, point estimate + margin of error).

- Accompanied by a **level of confidence** (often 95%)

- "We are 95% **confident** that the true population parameter falls within the interval (point estimate - margin of error, point estimate + margin of error)"

---
## Example: Approval ratings

.tiny[
Source: https://www.rasmussenreports.com/public_content/politics/biden_administration/prez_track_sep23
]

What is the population parameter of interest? What is the sample statistic?

**Population parameter**: Proportion of likely voters that strongly disapprove of President Biden (red line; alternatively, strongly approve (green line))

**Sample statistic**: Proportion based on 1500 likely voters (on a given day)

**95% confidence interval**: (37%, 43%) for strongly disapprove
---
## Example: Vaccine efficacy

"The uncertainty around a point estimate can be small or large. Scientists represent this uncertainty by calculating a range of possibilities, which they call a confidence interval."

---
## Elements and interpretation

.pull-left[
<img src="img/vaccine.png" width="75%" style="display: block; margin: auto;" />

"We are 95% **confident** that the true population parameter falls within the interval (.85, .95)."
]

.pull-right[
**Elements**:
- Confidence level
- Point estimate
- Lower bound
- Upper bound
]

**Interpretation**: If we were to **repeat this procedure** a large number of times, sampling and constructing confidence intervals in the same way, **95% of constructed intervals would contain the true population parameter**.

---
## Interpretation 
"We are 95% **confident** that the true population parameter falls within the interval (.85, .95)."

- If we repeat the experiment 10,000 times, i.e., draw samples and construct 10,000 confidence intervals, we would expect 9,500 of these to contain the true population parameter

- **Incorrect interpretation**: *Observed interval* has a 95% *probability* of containing the population parameter

- Do not confuse **confidence** with **probability**

- Once we **observe an interval** (i.e., have collected a sample), there is no more variability. The observed interval either contains the population parameter, or it does not.
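
The repeated-sampling interpretation can be illustrated with a simulation. The sketch below makes several assumptions that are not part of the slide: Bernoulli data with `p = 0.67`, samples of size `n = 100`, a known `$\sigma = \sqrt{p(1-p)}$`, and the interval `$\overline{X} \pm 1.96\frac{\sigma}{\sqrt{n}}$` that is constructed later in these slides.

``` r
set.seed(3)
p <- 0.67                    # assumed true population parameter
n <- 100                     # assumed sample size
sigma <- sqrt(p * (1 - p))   # population standard deviation, treated as known
reps <- 10000                # number of repeated experiments

# Sample mean (= sample proportion) of each simulated sample
x_bar <- rbinom(reps, size = n, prob = p) / n

# One interval per experiment
margin <- 1.96 * sigma / sqrt(n)
lower <- x_bar - margin
upper <- x_bar + margin

# Fraction of the 10,000 intervals that contain the true parameter
coverage <- mean(lower <= p & p <= upper)
coverage
```

The coverage should land near 0.95, though not exactly on it, because the normal approximation is not exact for discrete data. Each individual interval either contains `p` or does not; the 95% describes the long-run behavior of the procedure.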

---
## Confidence Level: Procedure, not Specific Realization

- The **confidence level** reflects the measure of confidence in the **procedure** that led to the confidence interval

- In our fishing example, it is a statement about the properties of the net (e.g., size, quality) and your fishing technique. Once you've made an attempt (cast the net into the water), you either caught a fish or you did not.

- Another analogy: a game of horseshoes

.pull-left[

**Correct**: Based on your throwing technique, 95% of the horseshoes you throw will encircle the stake

**Incorrect**: After you throw a horseshoe, it has a 95% *probability* of encircling the stake. It does not! It either encircles the stake or does not
]

.pull-right[
<img src="img/horseshoes.jpg" width="100%" style="display: block; margin: auto;" />

.tiny[
Source: https://www.istockphoto.com/photo/pitching-horseshoes-gm140471913-3413882
]
]
---

## Constructing confidence intervals

- Need to quantify the **variability of our sample statistic**

- Use the **sampling distribution of the sample mean**, which we derived using the **Central Limit Theorem**

- In the following results, assume independent samples and `$n$` large enough, so that the Central Limit Theorem holds

- Recall: z-scores corresponding to particular probabilities are often written as `$z_p$`, where `$p$` denotes the probability in the right tail, e.g., `$z_{.025} \approx 1.96$`
  - (These z-scores are called critical values)

---

## Confidence intervals using CLT

- What we want: "If we were to **repeat this procedure** a large number of times, sampling and constructing confidence intervals in the same way, **95% of constructed intervals would contain the true population parameter**"

- Translates to P(CI contains true parameter) = .95

- **Setup**:
  - `$X_i$` are independent and identically distributed with population mean `$\mu$` and variance `$\sigma^2$`
  - Want confidence interval for the unknown population parameter `$\mu$`
  - We use the sample mean, `$\overline{X}$`, constructed from a sample of size `$n$`, as an estimator for the population mean
  - Assume `$n$` is large

---
## Confidence intervals using CLT

- Final ingredient: `$\alpha$`, which is the **significance level**

- **Confidence level** = `$100(1 - \alpha)$`%, i.e., a **95% confidence interval** will need `$\alpha = .05$`
  
  - P(CI contains true parameter) = `$1 - \alpha$`

- `$z_{\frac{\alpha}{2}}$` is the z-score corresponding to a probability of `$\frac{\alpha}{2}$` in the right tail
  - For `$\alpha = .05$`, `$z_{\frac{\alpha}{2}}$` is approximately 1.96, i.e., `$P(Z > 1.96) \approx .025$`.
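
In R, critical values like this can be looked up with `qnorm()`. A short sketch (the variable names are chosen here for illustration):

``` r
alpha <- 0.05

# qnorm() takes the probability to the LEFT of the z-score, so a right-tail
# probability of alpha / 2 corresponds to a left-tail probability of 1 - alpha / 2
z <- qnorm(1 - alpha / 2)
z

# Check: the right-tail probability beyond z should be alpha / 2
pnorm(z, lower.tail = FALSE)

# Critical values for 90%, 95% and 99% confidence levels
qnorm(1 - c(0.10, 0.05, 0.01) / 2)
```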

---
## Confidence intervals using CLT

- Consider the interval

`$$\left(\overline{X}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{X}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)$$`

- We will prove that P(CI contains true parameter) = `$1 - \alpha$`, i.e., `$P(\overline{X}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \leq \mu \leq \overline{X}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}) = 1 - \alpha$`

- For simplicity, we will use `$\alpha = .05$`, and replace `$z_{\frac{\alpha}{2}}$` by 1.96 in the proof, but note that this will work for any `$\alpha$` between 0 and 1
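
To see the formula in action, the interval can be computed for a single sample. This is a minimal sketch: `z_interval()` is a hypothetical helper written for this example (not a built-in R function), and the data are simulated Bernoulli draws with `$\sigma$` assumed known.

``` r
# Hypothetical helper: z-based confidence interval for the mean, sigma known
z_interval <- function(x, sigma, alpha = 0.05) {
  n <- length(x)
  z <- qnorm(1 - alpha / 2)        # critical value z_{alpha/2}
  margin <- z * sigma / sqrt(n)    # margin of error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

set.seed(4)
p <- 0.67                              # assumed true proportion
x <- rbinom(200, size = 1, prob = p)   # one simulated sample of size 200
ci <- z_interval(x, sigma = sqrt(p * (1 - p)))
ci
```

Passing a smaller `alpha` (for example `alpha = 0.01`) gives a wider interval, since a higher confidence level requires a larger critical value.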

---
## Proof
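
A sketch of the standard argument, under the setup above (independent `$X_i$`, large `$n$`). The probabilities are written with `$\approx$` because they rest on the Central Limit Theorem approximation, and `$Z$` denotes a standard normal random variable.

- By the Central Limit Theorem, `$\overline{X} \approx N(\mu, \frac{\sigma^2}{n})$`, so standardizing gives

`$$Z = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1)$$`

- Since `$P(Z > 1.96) \approx .025$` and the normal distribution is symmetric,

`$$P\left(-1.96 \leq \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \leq 1.96\right) \approx .95$$`

- Multiplying all three parts by `$\frac{\sigma}{\sqrt{n}}$`,

`$$P\left(-1.96\frac{\sigma}{\sqrt{n}} \leq \overline{X} - \mu \leq 1.96\frac{\sigma}{\sqrt{n}}\right) \approx .95$$`

- Subtracting `$\overline{X}$` and then multiplying by `$-1$` (which reverses the inequalities),

`$$P\left(\overline{X}-1.96\frac{\sigma}{\sqrt{n}} \leq \mu \leq \overline{X}+1.96\frac{\sigma}{\sqrt{n}}\right) \approx .95$$`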

---
## Summary
- Sampling distribution of the sample proportion: `$\overline{X} = \hat{P} \approx N(p, \frac{p(1-p)}{n})$`

- Normal approximation to binomial: For `$Y \sim$` Binomial(n, p), when `$np > 5$` and `$n(1-p) > 5$`, `$Y \approx N(np, np(1-p))$`

- Confidence intervals:

  - Introduction and interpretation

  - Construction using Central Limit Theorem: `$\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)$`