class: center, middle, inverse, title-slide .title[ # Sample proportion and Confidence intervals ] .subtitle[ ##
STA35A: Statistical Data Science 1 ] .author[ ### Xiao Hui Tai ] .date[ ### November 22, 2024 ] --- layout: true <!-- <div class="my-footer"> --> <!-- <span> --> <!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> --> <!-- </span> --> <!-- </div> --> --- <style type="text/css"> .tiny .remark-code { font-size: 70%; } .small .remark-code { font-size: 80%; } .tiny { font-size: 60%; } .small { font-size: 80%; } </style> ## Today - Sampling distribution of the sample proportion - Normal approximation to binomial - Confidence intervals: - Introduction and interpretation - Construction using Central Limit Theorem --- ## Example: Bernoulli data Assume that 67% of the population has had a prior COVID infection. Define the variable `\(X_i\)` to take value 1 if the `\(i\)`th randomly sampled resident has been infected, and let it be 0 otherwise. Assume the samples are independent. What distribution does `\(X_i\)` follow? What are the parameters? What is the mean? What is the variance? -- - The random variable `\(X_i\)` follows a Bernoulli distribution. A Bernoulli random variable has mean `\(p\)` and variance `\(p(1- p)\)`. Here `\(p=0.67\)`. -- For a random sample, the sample mean `\(\overline{x}=\hat{p}\)` is an **estimate** of this probability (fraction of 1's) **Sample proportion** is `\(\hat{P} = \frac{1}{n}\sum_{i = 1}^n X_i\)`, where `\(X_i \sim\)` Bernoulli(p) --- ## Simulation: Prior Infections - For repeated samples of residents, if we compute the proportion with prior infection in each, what values will we see? Will we get 0.67 each time? - What is the effect of the sample size `\(n\)`? --- ## Simulation: Prior Infections Again, we perform 10000 experiments where each time we sample from a Bernoulli distribution with `prob = 0.67` and different values of `n`. ``` r set.seed(0) n10 <- t(replicate(n = 10000, rbinom(10, size = 1, prob = .67))) n10 <- rowMeans(n10) set.seed(0) n50 <- t(replicate(n = 10000, rbinom(50, size = 1, prob = .67))) n50 <- rowMeans(n50) set.seed(0) n100 <- t(replicate(n = 10000, rbinom(100, size = 1, prob = .67))) n100 <- rowMeans(n100) set.seed(0) n500 <- t(replicate(n = 10000, rbinom(500, size = 1, prob = .67))) n500 <- rowMeans(n500) ``` --- ## Simulation: Prior Infections .small[ .pull-left[ ``` r data.frame(propCovid = n10, sampleSize = 10) %>% bind_rows( data.frame(n50, sampleSize = 50) %>% rename(propCovid = n50) ) %>% bind_rows( data.frame(n100, sampleSize = 100) %>% rename(propCovid = n100) )%>% bind_rows( data.frame(n500, sampleSize = 500) %>% rename(propCovid = n500) ) %>% ggplot(aes(x = propCovid, fill = as.factor(sampleSize))) + geom_density(alpha = .5) + labs(x = "Proportion of people with prior infection", y = "Density", title = "Sampling distribution from Bernoulli(.67)", fill = "Sample size") + scale_fill_viridis_d() ``` ] ] .pull-right[ <img src="lecture21_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Sampling distribution of the sample proportion - Sample proportion is the **same as the sample mean**, when the **distribution is Bernoulli** - Sample proportion `\(\hat{P} = \frac{1}{n}\sum_{i = 1}^n X_i\)`, where `\(X_i \sim\)` Bernoulli(p) --- ## Earlier example Assume that 67% of the population has had a prior COVID infection. Define the variable `\(X_i\)` to take value 1 if the `\(i\)`th randomly sampled resident has been infected, and let it be 0 otherwise. Assume the samples are independent. - The random variable `\(X\)` is Bernoulli. A Bernoulli random variable has mean `\(p\)` and variance `\(p(1- p)\)`. Here `\(p=0.67\)`. - The sample proportion that had a prior infection is just the fraction of 1's - Sample mean defined as `\(\overline{X} = \frac{\sum_{i = 1}^n{X_i}}{n}\)`. When `\(X_i\)` is Bernoulli, it can only take the values 0 and 1, and the sample mean is the same as the sample proportion. `\(\overline{X} = \hat{P}\)`. --- ## Sampling distribution of the sample proportion - Recall: the Central Limit Theorem does not restrict the distribution of the underlying `\(X_1, ..., X_n\)` - `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)` - We get the sampling distribution of the sample proportion for free! - When `\(X_i \sim\)` Bernoulli(p), `\(\mu = p\)` and `\(\sigma^2 = p(1-p)\)`. - By the Central Limit Theorem, `\(\overline{X} = \hat{P} \approx N(p, \frac{p(1-p)}{n})\)` --- ## Exercise We ask a random sample of 435 people if they like cats. Assume that the probability that a randomly selected person likes cats is 75%. Let `\(X_i\)` be the Bernoulli random variable representing whether or not the `\(i\)`th person likes cats. What is the sampling distribution of the sample proportion? What is the probability that the sample proportion is smaller than .7? --- ## Normal approximation to the Binomial distribution - Consider `\(Y \sim\)` Binomial(n, p) - As the number of trials increases, the binomial distribution becomes symmetric and bell-shaped - For large `\(n\)`, `\(Y \approx N(np, np(1-p))\)` - Commonly used rule of thumb: `\(np > 5\)` and `\(n(1-p) > 5\)` - This approximation comes from the Central Limit Theorem: `\(\overline{X} = \hat{P} = \frac{\sum_{i = 1}^n{X_i}}{n} \approx N(p, \frac{p(1-p)}{n})\)` where `\(X_i \sim\)` Bernoulli(p). - Now, `\(Y = \sum_{i = 1}^n{X_i} = n \overline{X}\)` - By the rules for expectation and variance, `\(E(n\overline{X}) = nE(\overline{X}) = np\)` and `\(Var(n\overline{X}) = n^2 Var(\overline{X}) = np(1-p)\)` --- ## Exercise We ask a random sample of 435 people if they like cats. Assume that the probability that a randomly selected person likes cats is 75%. Find the probability that at least 325 people like cats. --- ## Exercise We ask a random sample of 435 people if they like cats. Assume that the probability that a randomly selected person likes cats is 75%. Using a normal approximation, find the probability that at least 325 people like cats. --- ## Course content 1. Fundamentals of R - Overview of data types and structures - Data manipulation and data visualization tools 2. Descriptive statistics for numerical and categorical data 3. Probability - Rules of probability computation; conditional probability - Basic probability models: Binomial, Normal and Poisson 4. Statistical inference - Sampling distributions of sample mean and sample proportion - **Hypothesis testing and confidence intervals for population mean and population proportion** --- ## Estimation and testing - **Estimation**: "guess" for the unknown parameter - **Point estimate**: "Best guess", or estimate, for the population parameter. "Based on a survey of 300 UC Davis students, we estimate that 50% of students ride bikes on campus." - **Interval estimate**: "Based on a survey of 300 UC Davis students, we are 95% confident that the proportion of students riding bikes on campus is (.45, .55)." - **Testing**: evaluating whether our observed sample provides evidence against some claim about the population - "We reject the hypothesis that 10% of UC Davis students ride bikes on campus" --- ## Point estimates vs. confidence intervals - **Point estimate**: single value - For example, the sample mean `\(\overline{X}\)` is often used as an estimator for the population mean `\(\mu\)` - In R: `mean()` calculates the estimate (sample statistic) from a given sample - Point estimates vs. confidence intervals: If you want to catch a fish, do you prefer a spear or a net? .pull-left[ <img src="img/spear.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="img/net.png" width="70%" style="display: block; margin: auto;" /> ] --- ## Point estimates vs. confidence intervals If you want to estimate a population parameter, do you prefer to report a range of values or a single value? .pull-left[ <img src="img/spear.png" width="400" style="display: block; margin: auto;" /> ] .pull-right[ <img src="img/net.png" width="400" style="display: block; margin: auto;" /> ] --- ## What is a Confidence Interval? - Range of reasonable values that are intended to **contain the parameter of interest** with a certain **degree of confidence** - Written as (point estimate - margin of error, point estimate + margin of error). <img src="img/ci2.png" width="40%" style="display: block; margin: auto;" /> - Accompanied by a **level of confidence** (often 95%) - "We are 95% **confident** that the true population parameter falls within the interval (point estimate - margin of error, point estimate + margin of error)" --- ## Example: Approval ratings <img src="img/approval.png" width="70%" style="display: block; margin: auto;" /> .tiny[ Source: https://www.rasmussenreports.com/public_content/politics/biden_administration/prez_track_sep23 ] What is the population parameter of interest? What is the sample statistic? -- **Population parameter**: Proportion of likely voters that strongly disapprove of President Biden (red line; alternatively, strongly approve (green line)) **Sample statistic**: Proportion based on 1500 likely voters (on a given day) **95% confidence interval**: (37%, 43%) for strongly disapprove --- ## Example: Vaccine efficacy <img src="img/vaccine.png" width="45%" style="display: block; margin: auto;" /> "The uncertainty around a point estimate can be small or large. Scientists represent this uncertainty by calculating a range of possibilities, which they call a confidence interval." --- ## Elements and interpretation .pull-left[ <img src="img/vaccine.png" width="75%" style="display: block; margin: auto;" /> "We are 95% **confident** that the true population parameter falls within the interval (.85, .95)." ] .pull-right[ **Elements**: - Confidence level - Point estimate - Lower bound - Upper bound ] **Interpretation**: If we were to **repeat this procedure** a large number of times, sampling and constructing confidence intervals in the same way, **95% of constructed intervals would contain the true population parameter**. --- ## Interpretation "We are 95% **confident** that the true population parameter falls within the interval (.85, .95)." **Interpretation**: If we were to **repeat this procedure** a large number of times, sampling and constructing confidence intervals in the same way, **95% of constructed intervals would contain the true population parameter**. - If we repeat the experiment 10,000 times, i.e., draw samples and construct 10,000 confidence intervals, we would expect 9,500 of these to contain the true population parameter - **Incorrect interpretation**: *Observed interval* has a 95% *probability* of containing the population parameter - Do not confuse **confidence** with **probability** - Once we **observe an interval** (collected a sample), there is no more variability. The observed interval either contains the population parameter, or it does not. --- ## Confidence Level: Procedure, not Specific Realization - The **confidence level** reflects the measure of confidence in the **procedure** that led to the confidence interval - In our fishing example, it is a statement about the properties of the net (e.g., size, quality) and your fishing technique. Once you've made an attempt (cast net into water), you either caught fish or did not. -- - Another analogy: a game of horseshoes .pull-left[ **Correct**: Based on your throwing technique, 95% of the horseshoes you throw will encircle the stake **Incorrect**: After you throw a horseshoe, it has a 95% *probability* of encircling the stake. It does not! It either encircles the stake or does not ] .pull-right[ <img src="img/horseshoes.jpg" width="100%" style="display: block; margin: auto;" /> .tiny[ Source: https://www.istockphoto.com/photo/pitching-horseshoes-gm140471913-3413882 ] ] --- ## Constructing confidence intervals - Need to quantify the **variability of our sample statistic** - Use the **sampling distribution of the sample mean**, which we derived using the **Central Limit Theorem** - In the following results, assume independent samples and `\(n\)` large enough, so that the Central Limit Theorem holds - Recall: z-scores corresponding to particular probabilities are often written as `\(z_p\)`, where `\(p\)` denotes the probability in the right tail, e.g., `\(z_{.025} \approx 1.96\)` - (These z-scores are called critical values) --- ## Confidence intervals using CLT - What we want: "If we were to **repeat this procedure** a large number of times, sampling and constructing confidence intervals in the same way, **95% of constructed intervals would contain the true population parameter**" - Translates to P(CI contains true parameter) = .95 -- - **Set up**: - `\(X_i\)` are independent and identically distributed with population mean `\(\mu\)` and variance `\(\sigma^2\)` - Want confidence interval for the unknown population parameter `\(\mu\)` - We use the sample mean, `\(\overline{X}\)`, constructed from a sample of size `\(n\)`, as an estimator for the population mean - Assume `\(n\)` is large --- ## Confidence intervals using CLT - Final ingredient: `\(\alpha\)`, which is the **significance level** - **Confidence level** = `\(100(1 - \alpha)\)`%, i.e., a **95% confidence interval** will need `\(\alpha = .05\)` - P(CI contains true parameter) = `\(1 - \alpha\)` - `\(z_{\frac{\alpha}{2}}\)` is the z-score corresponding to a probability of `\(\frac{\alpha}{2}\)` in the right tail - For `\(\alpha = .05\)`, `\(z_{\frac{\alpha}{2}}\)` is 1.96, i.e., `\(P(Z > 1.96) \approx .025\)`. --- ## Confidence intervals using CLT - Consider the interval `$$\left(\overline{X}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{X}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)$$` - We will prove that P(CI contains true parameter) = `\(1 - \alpha\)`, i.e., `\(P(\overline{X}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \leq \mu \leq \overline{X}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}) = 1 - \alpha\)` - For simplicity, we will use `\(\alpha = .05\)`, and replace `\(z_{\frac{\alpha}{2}}\)` by 1.96 in the proof, but note that this will work for any `\(\alpha\)` between 0 and 1 --- ## Proof --- ## Summary - Sampling distribution of the sample proportion: `\(\overline{X} = \hat{P} \approx N(p, \frac{p(1-p)}{n})\)` - Normal approximation to binomial: For `\(Y \sim\)` Binomial(n, p), when `\(np > 5\)` and `\(n(1-p) > 5\)`, `\(Y \approx N(np, np(1-p))\)` - Confidence intervals: - Introduction and interpretation - Construction using Central Limit Theorem: `\(\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)\)`