class: center, middle, inverse, title-slide .title[ # Sample proportion and Confidence intervals ] .subtitle[ ##
STA35A: Statistical Data Science 1 ] .author[ ### Xiao Hui Tai ] .date[ ### November 29, 2023 ] --- layout: true <!-- <div class="my-footer"> --> <!-- <span> --> <!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> --> <!-- </span> --> <!-- </div> --> --- <style type="text/css"> .tiny .remark-code { font-size: 70%; } .small .remark-code { font-size: 80%; } .tiny { font-size: 60%; } .small { font-size: 80%; } </style> ## Recap - Central Limit Theorem: `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)` - Sampling distribution of the sample mean --- ## Today - Sampling distribution of the sample proportion: `\(\overline{X} = \hat{P} \approx N(p, \frac{p(1-p)}{n})\)` - Normal approximation to binomial - Confidence intervals: - Introduction and interpretation - Construction using Central Limit Theorem --- ## Example: Normal data In the shoe size example, we have `\(X_1, ..., X_n \sim N(11, 1.5^2)\)`. Say we collected 1000 samples, so the sample size `\(n = 1000\)`. What distribution does the sampling distribution of the sample mean follow? -- - By the Central Limit Theorem, `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)`, in this case `\(N(11, \frac{1.5^2}{1000})\)`. -- What is `\(P(\overline{X} < 10.9)\)`? Calculate this in two ways: using the original distribution, and using the standard normal distribution. -- - Standardizing, we have `\(P(\overline{X} < 10.9) = P(\frac{\overline{X} - 11}{1.5/\sqrt{1000}} < \frac{10.9 - 11}{1.5/\sqrt{1000}}) = P(Z < \frac{-.1}{1.5/\sqrt{1000}})\)` ```r pnorm(10.9, mean = 11, sd = sqrt(1.5^2/1000)) ``` ``` ## [1] 0.01750749 ``` ```r pnorm((10.9 - 11)/(1.5/sqrt(1000)) ) # standardized version ``` ``` ## [1] 0.01750749 ``` --- ## Example: Bernoulli data Assume that 67% of the population in India has had a prior COVID infection. Define the variable `\(X_i\)` to take value 1 if the `\(i\)`th randomly sampled Indian resident has been infected, and let it be 0 otherwise. Assume the samples are independent. What distribution does `\(X_i\)` follow? What are the parameters? What is the mean? What is the variance? -- - The random variable `\(X_i\)` follows a Bernoulli distribution. A Bernoulli random variable has mean `\(p\)` and variance `\(p(1- p)\)`. Here `\(p=0.67\)`. -- For a random sample, the sample mean `\(\overline{x}=\hat{p}\)` is an **estimate** of this probability (fraction of 1's) **Sample proportion** is `\(\hat{P} = \frac{1}{n}\sum_{i = 1}^n X_i\)`, where `\(X_i \sim\)` Bernoulli(p) --- ## Simulation: Prior Infections - For repeated samples of Indian residents, if we compute the proportion with prior infection in each, what values will we see? Will we get 0.67 each time? Again, we perform 10000 experiments where each time we sample from a Bernoulli distribution with `prob = 0.67` and different values of `n`. ```r set.seed(0) n10 <- t(replicate(n = 10000, rbinom(10, size = 1, prob = .67))) n10 <- rowMeans(n10) set.seed(0) n50 <- t(replicate(n = 10000, rbinom(50, size = 1, prob = .67))) n50 <- rowMeans(n50) set.seed(0) n100 <- t(replicate(n = 10000, rbinom(100, size = 1, prob = .67))) n100 <- rowMeans(n100) set.seed(0) n500 <- t(replicate(n = 10000, rbinom(500, size = 1, prob = .67))) n500 <- rowMeans(n500) ``` --- ## Simulation: Prior Infections .small[ .pull-left[ ```r data.frame(propCovid = n10, sampleSize = 10) %>% bind_rows( data.frame(n50, sampleSize = 50) %>% rename(propCovid = n50) ) %>% bind_rows( data.frame(n100, sampleSize = 100) %>% rename(propCovid = n100) )%>% bind_rows( data.frame(n500, sampleSize = 500) %>% rename(propCovid = n500) ) %>% ggplot(aes(x = propCovid, fill = as.factor(sampleSize))) + geom_density(alpha = .5) + labs(x = "Proportion of people with prior infection", y = "Density", title = "Sampling distribution from Bernoulli(.67)", fill = "Sample size") + scale_fill_viridis_d() ``` ] ] .pull-right[ <img src="lecture21_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Sampling distribution of the sample proportion - Sample proportion is the **same as the sample mean**, when the **distribution is Bernoulli** - Sample proportion `\(\hat{P} = \frac{1}{n}\sum_{i = 1}^n X_i\)`, where `\(X_i \sim\)` Bernoulli(p) -- - Earlier example: Assume that 67% of the population in India has had a prior COVID infection. Define the variable `\(X_i\)` to take value 1 if the `\(i\)`th randomly sampled Indian resident has been infected, and let it be 0 otherwise. Assume the samples are independent. - The random variable `\(X\)` is Bernoulli. A Bernoulli random variable has mean `\(p\)` and variance `\(p(1- p)\)`. Here `\(p=0.67\)`. - The sample proportion that had a prior infection is just the fraction of 1's - Sample mean defined as `\(\overline{X} = \frac{\sum_{i = 1}^n{X_i}}{n}\)`. When `\(X_i\)` is Bernoulli, it can only take the values 0 and 1, and the sample mean is the same as the sample proportion. `\(\overline{X} = \hat{P}\)`. --- ## Sampling distribution of the sample proportion - Recall: the Central Limit Theorem does not restrict the distribution of the underlying `\(X_1, ..., X_n\)` - `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)` - We get the sampling distribution of the sample proportion for free! - When `\(X_i \sim\)` Bernoulli(p), `\(\mu = p\)` and `\(\sigma^2 = p(1-p)\)`. - By the Central Limit Theorem, `\(\overline{X} = \hat{P} \approx N(p, \frac{p(1-p)}{n})\)` --- ## Example: cats We ask a random sample of 435 people if they like cats. Assume that the probability that a randomly selected person likes cats is 75%. Let `\(X_i\)` be the Bernoulli variable representing whether or not the `\(i\)`th person likes cats. What is the sampling distribution of the sample proportion? What is the probability that the sample proportion is smaller than .7? -- By the Central Limit Theorem, `\(\overline{X} = \hat{P} = \frac{\sum_{i = 1}^n{X_i}}{n} \approx N(p, \frac{p(1-p)}{n})\)`. In this case, `\(\overline{X} = \hat{P} = \frac{\sum_{i = 1}^{435}{X_i}}{435} \approx N(.75, \frac{.75(1-.75)}{435})\)`. ```r pnorm(.7, mean = .75, sd = sqrt(.75*(1-.75)/435)) ``` ``` ## [1] 0.008013087 ``` --- ## Normal approximation to the Binomial distribution - Consider `\(Y \sim\)` Binomial(n, p) - As the number of trials increases, the binomial distribution becomes symmetric and bell-shaped - For large `\(n\)`, `\(Y \approx N(np, np(1-p))\)` - Commonly used rule of thumb: `\(np > 5\)` and `\(n(1-p) > 5\)` - This approximation comes from the Central Limit Theorem: `\(\overline{X} = \hat{P} = \frac{\sum_{i = 1}^n{X_i}}{n} \approx N(p, \frac{p(1-p)}{n})\)` where `\(X_i \sim\)` Bernoulli(p). - Now, `\(Y = \sum_{i = 1}^n{X_i} = n \overline{X}\)` - By the rules for expectation and variance, `\(E(n\overline{X}) = nE(\overline{X}) = np\)` and `\(Var(n\overline{X}) = n^2 Var(\overline{X}) = np(1-p)\)` --- ## Example: cats We ask a random sample of 435 people if they like cats. Assume that the probability that a randomly selected person likes cats is 75%. Find the probability that at least 325 people like cats. -- Let `\(Y\)` be the number of people, out of 435, that like cats. Then `\(Y \sim\)` Binomial(435, .75). We want `\(P(Y \geq 325)\)`. In R: ```r sum(dbinom(325:435, 435, .75)) ``` ``` ## [1] 0.5802823 ``` ```r 1 - pbinom(324, 435, .75) # pbinom alternative ``` ``` ## [1] 0.5802823 ``` --- ## Example: cats We ask a random sample of 435 people if they like cats. Assume that the probability that a randomly selected person likes cats is 75%. Using a normal approximation, find the probability that at least 325 people like cats. -- Using the Central Limit Theorem, we have `\(\overline{X} = \hat{P} = \frac{\sum_{i = 1}^n{X_i}}{n} \approx N(p, \frac{p(1-p)}{n})\)`, so `\(Y = \sum_{i = 1}^n{X_i} = n \overline{X} \approx N(np, np(1-p))\)`. In this case, `\(Y \approx N(435*.75, 435*.75*.25)\)`. We want `\(P(Y \geq 325)\)`, so in R: ```r 1 - pnorm(324, 435*.75, sd = sqrt(435*.75*.25)) ``` ``` ## [1] 0.5983724 ``` Note that this is only an approximation! --- ## Course content 1. Fundamentals of R - Overview of data types and structures - Data manipulation and data visualization tools 2. Descriptive statistics for numerical and categorical data 3. Probability - Rules of probability computation; conditional probability - Basic probability models: Binomial, Normal and Poisson 4. Statistical inference - Sampling distributions of sample mean and sample proportion - **Hypothesis testing and confidence intervals for population mean and population proportion** --- ## Estimation and testing - **Estimation**: "guess" for the unknown parameter - **Point estimate**: "Best guess", or estimate, for the population parameter. "Based on a survey of 300 UC Davis students, we estimate that 50% of students ride bikes on campus." - **Interval estimate**: "Based on a survey of 300 UC Davis students, we are 95% confident that the proportion of students riding bikes on campus is (.45, .55)." - **Testing**: evaluating whether our observed sample provides evidence against some claim about the population - "We reject the hypothesis that 10% of UC Davis students ride bikes on campus" --- ## Point estimates vs. confidence intervals - **Point estimate**: single value - For example, the sample mean `\(\overline{X}\)` is often used as an estimator for the population mean `\(\mu\)` - In R: `mean()` calculates the estimate (sample statistic) from a given sample - Point estimates vs. confidence intervals: If you want to catch a fish, do you prefer a spear or a net? .pull-left[ <img src="img/spear.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="img/net.png" width="70%" style="display: block; margin: auto;" /> ] --- ## Point estimates vs. confidence intervals If you want to estimate a population parameter, do you prefer to report a range of values or a single value? .pull-left[ <img src="img/spear.png" width="400" style="display: block; margin: auto;" /> ] .pull-right[ <img src="img/net.png" width="400" style="display: block; margin: auto;" /> ] --- ## What is a Confidence Interval? - Range of reasonable values that are intended to **contain the parameter of interest** with a certain **degree of confidence** - Written as (point estimate - margin of error, point estimate + margin of error). <img src="img/ci2.png" width="40%" style="display: block; margin: auto;" /> - Accompanied by a **level of confidence** (often 95%) - "We are 95% **confident** that the true population parameter falls within the interval (point estimate - margin of error, point estimate + margin of error)" --- ## Example: Approval ratings <img src="img/approval.png" width="70%" style="display: block; margin: auto;" /> .tiny[ Source: https://www.rasmussenreports.com/public_content/politics/biden_administration/prez_track_sep23 ] What is the population parameter of interest? What is the sample statistic? -- **Population parameter**: Proportion of likely voters that strongly disapprove of President Biden (red line; alternatively, strongly approve (green line)) **Sample statistic**: Proportion based on 1500 likely voters (on a given day) **95% confidence interval**: (37%, 43%) for strongly disapprove --- ## Example: Vaccine efficacy <img src="img/vaccine.png" width="45%" style="display: block; margin: auto;" /> "The uncertainty around a point estimate can be small or large. Scientists represent this uncertainty by calculating a range of possibilities, which they call a confidence interval." --- ## Elements and interpretation .pull-left[ <img src="img/vaccine.png" width="75%" style="display: block; margin: auto;" /> "We are 95% **confident** that the true population parameter falls within the interval (.85, .95)." ] .pull-right[ **Elements**: - Confidence level - Point estimate - Lower bound - Upper bound ] **Interpretation**: If we were to **repeat this procedure** a large number of times, sampling and constructing confidence intervals in the same way, **95% of constructed intervals would contain the true population parameter**. --- ## Interpretation "We are 95% **confident** that the true population parameter falls within the interval (.85, .95)." **Interpretation**: If we were to **repeat this procedure** a large number of times, sampling and constructing confidence intervals in the same way, **95% of constructed intervals would contain the true population parameter**. - If we repeat the experiment 10,000 times, i.e., draw samples and construct 10,000 confidence intervals, we would expect 9,500 of these to contain the true population parameter - **Incorrect interpretation**: *Observed interval* has a 95% *probability* of containing the population parameter - Do not confuse **confidence** with **probability** - Once we **observe an interval** (collected a sample), there is no more variability. The observed interval either contains the population parameter, or it does not. --- ## Confidence Level: Procedure, not Specific Realization - The **confidence level** reflects the measure of confidence in the **procedure** that led to the confidence interval - In our fishing example, it is a statement about the properties of the net (e.g., size, quality) and your fishing technique. Once you've made an attempt (cast net into water), you either caught fish or did not. -- - Another analogy: a game of horseshoes .pull-left[ **Correct**: Based on your throwing technique, 95% of the horseshoes you throw will encircle the stake **Incorrect**: After you throw a horseshoe, it has a 95% *probability* of encircling the stake. It does not! It either encircles the stake or does not ] .pull-right[ <img src="img/horseshoes.jpg" width="100%" style="display: block; margin: auto;" /> .tiny[ Source: https://www.istockphoto.com/photo/pitching-horseshoes-gm140471913-3413882 ] ] --- ## Constructing confidence intervals - Need to quantify the **variability of our sample statistic** - Use the **sampling distribution of the sample mean**, which we derived using the **Central Limit Theorem** - In the following results, assume independent samples and `\(n\)` large enough, so that the Central Limit Theorem holds - Recall: z-scores corresponding to particular probabilities are often written as `\(z_p\)`, where `\(p\)` denotes the probability in the right tail, e.g., `\(z_{.025} \approx 1.96\)` - (These z-scores are called critical values) --- ## Confidence intervals using CLT - What we want: "If we were to **repeat this procedure** a large number of times, sampling and constructing confidence intervals in the same way, **95% of constructed intervals would contain the true population parameter**" - Translates to P(CI contains true parameter) = .95 -- - **Set up**: - `\(X_i\)` are independent and identically distributed with population mean `\(\mu\)` and variance `\(\sigma^2\)` - Want confidence interval for the unknown population parameter `\(\mu\)` - We use the sample mean, `\(\overline{X}\)`, constructed from a sample of size `\(n\)`, as an estimator for the population mean - Assume `\(n\)` is large --- ## Confidence intervals using CLT - Final ingredient: `\(\alpha\)`, which is the **significance level** - **Confidence level** = `\(100(1 - \alpha)\)`%, i.e., a **95% confidence interval** will need `\(\alpha = .05\)` - P(CI contains true parameter) = `\(1 - \alpha\)` - `\(z_{\frac{\alpha}{2}}\)` is the z-score corresponding to a probability of `\(\frac{\alpha}{2}\)` in the right tail - For `\(\alpha = .05\)`, `\(z_{\frac{\alpha}{2}}\)` is 1.96, i.e., `\(P(Z > 1.96) \approx .025\)`. --- ## Confidence intervals using CLT - Consider the interval `$$\left(\overline{X}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{X}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)$$` - We will prove that P(CI contains true parameter) = `\(1 - \alpha\)`, i.e., `\(P(\overline{X}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \leq \mu \leq \overline{X}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}) = 1 - \alpha\)` - For simplicity, we will use `\(\alpha = .05\)`, and replace `\(z_{\frac{\alpha}{2}}\)` by 1.96 in the proof, but note that this will work for any `\(\alpha\)` between 0 and 1 --- ## Proof $$ `\begin{aligned} & P\left(\overline{X}-1.96\frac{\sigma}{\sqrt{n}} \leq \mu \leq \overline{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) \\ &= P\left(\overline{X}-1.96\frac{\sigma}{\sqrt{n}} \leq \mu \text{ and } \mu \leq \overline{X} + 1.96\frac{\sigma}{\sqrt{n}} \right) \\ &= P\left(\overline{X}-\mu \leq 1.96\frac{\sigma}{\sqrt{n}} \text{ and } \overline{X} - \mu \geq -1.96\frac{\sigma}{\sqrt{n}} \right) \\ &= P\left( \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}} \leq 1.96 \text{ and } \frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \geq -1.96 \right) \\ &= P\left( -1.96 \leq \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}} \leq 1.96\right) \\ &\approx P\left( -1.96 \leq Z \leq 1.96\right) \text{ by the Central Limit Theorem} \\ &= .95 \end{aligned}` $$ --- ## Confidence Interval for Population Mean - Collect sample - Compute sample mean `\(\bar{x}\)` - An `\(\alpha\)`-level confidence interval for `\(\mu\)` is `$$\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)$$` Notice that the random variable `\(\overline{X}\)` has been replaced by the sample statistic `\(\overline{x}\)` --- ## Example Let `\(X_1, X_2, ..., X_{200}\)` be independent `\(N(\mu, 4)\)` random variables. We collect the sample of size 200, and the resulting sample mean, `\(\overline{x}\)`, is `\(\overline{x} = 24\)`. What is a 95% confidence interval for `\(\mu\)`? -- Since `\(n = 200\)` is large, by CLT, a 95% confidence interval for `\(\mu\)` is given by `$$\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)$$` Substitute `\(\overline{x} = 24\)`, `\(z_{\frac{\alpha}{2}} = 1.96\)`, `\(\sigma = 2\)`, `\(n = 200\)`, giving `\(\left(24-1.96*2/ \sqrt{200}, 24 + 1.96*2/ \sqrt{200} \right)\)` or (23.72, 24.28) We are 95% confident that `\(\mu\)` falls within the interval (23.72, 24.28). --- ## Summary - Sampling distribution of the sample proportion: `\(\overline{X} = \hat{P} \approx N(p, \frac{p(1-p)}{n})\)` - Normal approximation to binomial: For `\(Y \sim\)` Binomial(n, p), when `\(np > 5\)` and `\(n(1-p) > 5\)`, `\(Y \approx N(np, np(1-p))\)` - Confidence intervals: - Introduction and interpretation - Construction using Central Limit Theorem: `\(\left(\overline{x}-z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \overline{x}+z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)\)`