class: center, middle, inverse, title-slide .title[ # Sampling distribution of sample mean ] .subtitle[ ##
STA35A: Statistical Data Science 1 ] .author[ ### Xiao Hui Tai ] .date[ ### November 27, 2023 ] --- layout: true <!-- <div class="my-footer"> --> <!-- <span> --> <!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> --> <!-- </span> --> <!-- </div> --> --- <style type="text/css"> .tiny .remark-code { font-size: 70%; } .small .remark-code { font-size: 80%; } .tiny { font-size: 60%; } .small { font-size: 80%; } </style> ## Course content 1. Fundamentals of R - Overview of data types and structures - Data manipulation and data visualization tools 2. Descriptive statistics for numerical and categorical data 3. Probability - Rules of probability computation; conditional probability - Basic probability models: Binomial, Normal and Poisson 4. **Statistical inference** - **Sampling distributions of sample mean and sample proportion** - Hypothesis testing and confidence intervals for population mean and population proportion --- ## Today - Introduction to sampling distributions - Central Limit Theorem - Sampling distribution of the sample mean --- ## Recall (lecture 13): What is statistical inference? - **Descriptive statistics**: summarize and describe data. - Do not necessarily generalize beyond the data - **Statistical inference** - Draw conclusions about the larger population - Using sample data to make conclusions about the underlying population the sample came from .pull-left[ <img src="img/soup.png" width="50%" style="display: block; margin: auto;" /> ] .pull-right[ Similar to tasting a spoonful of soup while cooking to make an inference about the entire pot. ] --- ## Recall: Example of statistical inference in action When a **sample statistic** is used to estimate a **population parameter**, it will be accompanied by a **margin of error** <img src="img/approval.png" width="85%" style="display: block; margin: auto;" /> .tiny[ Source: https://www.rasmussenreports.com/public_content/politics/biden_administration/prez_track_sep23 ] --- ## Recall: Many Topics in Statistical Inference - Fundamentals: probability, distributions, random variables, ... - **Sampling** - Hypothesis testing - Point estimates and confidence intervals - Modeling: Linear regression, analysis of variance, nonparametric models, machine learning, ... --- ## Sampling Distribution of the Sample Mean Recall our shoe size example, where wearers of men's shoe sizes follow a `\(N(11, 1.5^2)\)` distribution. Say we are interested in the sample mean of shoe sizes. We have a sample of 1000 observations. ```r set.seed(0) sampled1000_1 <- rnorm(1000, 11, 1.5) head(sampled1000_1, 20) ``` ``` ## [1] 12.894431 10.510650 12.994699 12.908644 11.621962 8.690075 ## [7] 9.607149 10.557919 10.991349 14.606980 12.145390 9.801486 ## [13] 9.278514 10.565808 10.551177 10.382734 11.378335 9.662118 ## [19] 11.653525 9.143692 ``` ```r mean(sampled1000_1) ``` ``` ## [1] 10.97626 ``` --- ## Sampling Distribution of the Sample Mean Now we repeat the experiment, i.e., get a different sample of 1000 observations. ```r set.seed(10) sampled1000_2 <- rnorm(1000, 11, 1.5) head(sampled1000_2, 20) ``` ``` ## [1] 11.028119 10.723621 8.943004 10.101248 11.441818 11.584691 ## [7] 9.187886 10.454486 8.559991 10.615282 12.652669 12.133672 ## [13] 10.642650 12.481167 12.112085 11.134021 9.567584 10.707274 ## [19] 12.388282 11.724468 ``` ```r mean(sampled1000_2) ``` ``` ## [1] 11.01706 ``` ```r all.equal(mean(sampled1000_1), mean(sampled1000_2)) ``` ``` ## [1] "Mean relative difference: 0.003717704" ``` --- ## Sampling Distribution of the Sample Mean If we repeat the experiment an infinite number of times, what distribution of sample means would we get? This is known as the **sampling distribution**. 1. Draw a sample of size `\(n\)`, calculate its mean `\(\overline{x}_1\)` 2. Draw a second sample of the same size, calculate its mean `\(\overline{x}_2\)` 3. Repeat this many times to get sample means `\(\overline{x}_1, \overline{x}_2, \ldots\)` `\(\overline{x}_1, \overline{x}_2, \ldots\)` are **sample statistics** What is the distribution of `\(\overline{x}_1, \overline{x}_2, \overline{x}_3, \ldots\)`? --- ## Sampling Distribution of the Sample Mean - The sample mean `\(\overline{X}\)`, is defined as `\(\overline{X} = \frac{\sum_{i = 1}^n X_i}{n}\)` - `\(\overline{x}_1, \overline{x}_2, \overline{x}_3, \ldots\)` are draws from `\(\overline{X}\)` Here we consider `\(X_1, ..., X_n\)` that are **independent and identically distributed**. (E.g., `\(X_1, ... X_n \sim N(11, 1.5^2)\)` for the shoe size distribution.) --- ## Sampling Distribution of the Sample Mean 1. Draw a sample of size `\(n\)` and calculate its mean `\(\overline{x}_1\)` 2. Draw a second sample of the same size and calculate its mean `\(\overline{x}_2\)` 3. Repeat this many times to get sample means `\(\overline{x}_1, \overline{x}_2, \ldots\)` We cannot repeat an infinite number of times, but we do this 10,000 times in R. .small[ ```r set.seed(0) repeat10000 <- t(replicate(n = 10000, rnorm(1000, 11, 1.5))) str(repeat10000) ``` ``` ## num [1:10000, 1:1000] 12.9 10.6 11.7 10.2 12.2 ... ``` ```r head(rowMeans(repeat10000), 20) ``` ``` ## [1] 10.97626 10.96282 11.10221 11.00373 11.00616 11.02959 ## [7] 10.99695 11.06309 10.92670 11.08920 10.97525 10.95720 ## [13] 10.95437 11.07828 11.02202 11.05240 11.00455 10.95824 ## [19] 10.96431 10.96413 ``` ```r means10000 <- rowMeans(repeat10000) ``` ] --- ## Sampling Distribution of the Sample Mean ```r data.frame(shoesMean = means10000) %>% ggplot(aes(x = shoesMean)) + geom_density() + labs(x = "Mean of sample of size 1000", y = "Density", title = "Sampling distribution from N(11, 1.5^2)") ``` <img src="lecture20_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Sampling Distribution of the Sample Mean .pull-left[ <img src="lecture20_files/figure-html/unnamed-chunk-9-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ How would we describe this distribution? - Center - Spread - Shape ] --- ## Sampling Distribution of the Sample Mean .pull-left[ <img src="lecture20_files/figure-html/unnamed-chunk-10-1.png" width="100%" style="display: block; margin: auto;" /> ] .small[ .pull-right[ How would we describe this distribution? - Center - Centered at 11 (same as the population parameter) - Spread - Looks to be much smaller than the original distribution (the original distribution has standard deviation 1.5) - Shape - Symmetric and bell-shaped ] ] --- ## Effect of changing sample size - Earlier we used a sample size of 1000. What if we used a sample size of 50? ```r set.seed(0) repeat10000_n50 <- t(replicate(n = 10000, rnorm(50, 11, 1.5))) str(repeat10000_n50) ``` ``` ## num [1:10000, 1:50] 12.89 11.4 12.17 13.21 9.43 ... ``` ```r head(rowMeans(repeat10000_n50), 20) ``` ``` ## [1] 11.03590 11.03211 10.68366 11.17968 11.06924 11.13242 ## [7] 11.03710 10.97258 10.96971 10.88460 10.93739 10.80427 ## [13] 11.03709 10.90858 10.84133 11.32991 11.04654 10.74770 ## [19] 10.98847 10.88685 ``` ```r means10000_n50 <- rowMeans(repeat10000_n50) ``` --- ## Effect of changing sample size .tiny[ .pull-left[ ```r data.frame(shoesMean = means10000, sampleSize = 1000) %>% bind_rows( data.frame(means10000_n50, sampleSize = 50) %>% rename(shoesMean = means10000_n50) ) %>% ggplot(aes(x = shoesMean, fill = as.factor(sampleSize))) + geom_density() + labs(x = "Mean shoe size", y = "Density", title = "Sampling distribution from N(11, 1.5^2)", fill = "Sample size") + scale_fill_viridis_d() # guides(fill = "none") ``` ] ] .pull-right[ <img src="lecture20_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" /> ] - What do you notice about the spread? -- - A larger sample size produces more precise estimates - We will formalize this intuition using the **Central Limit Theorem** --- ## Note on sampling distributions - "Sampling distributions are never observed, but we keep them in mind" - **Real-world applications**: one draw from the sampling distribution, `\(\overline{x}\)` - **Simulations**: we cannot run experiments an infinite number of times to generate the sampling distribution - Useful to think of a sample statistic as coming from such a hypothetical distribution - Helps us characterize sample statistics that we observe --- ## Sampling distributions, confidence intervals and hypothesis testing What can we do with the sampling distribution? - **Confidence intervals**: Estimate a population parameter as point estimate `\(\pm\)` margin of error - Margin of error: (1) how confident we want to be (2) sample statistic's variability - **Hypothesis testing**: Test whether a population parameter is equal to some value - How likely is it that we have obtained the observed sample statistic, if the population parameter is indeed that value? --- ## Central Limit Theorem - In words: for *any distribution* with a well-defined mean and variance, the **sample mean** is approximately normally distributed - Formally: - Population with mean `\(\mu\)` and standard deviation `\(\sigma\)` - Independent samples `\(X_1, ..., X_n\)` - `\(\overline{X} = \frac{\sum_{i = 1}^n X_i}{n}\)` -- - Properties of sampling distribution `\(\overline{X}\)`: - Mean is identical to the population mean `\(\mu\)`, i.e., `\(E(\overline{X}) = \mu\)` - Standard deviation `\(\frac{\sigma}{\sqrt{n}}\)`, i.e., `\(Var(\overline{X}) = \frac{\sigma^2}{n}\)` - For large `\(n\)` ( `\(n \rightarrow \infty\)` ), distribution is approximately normal - i.e., `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)` --- ## Intuition - The average of **many measurements** of the same unknown quantity tends to give a **better estimate** than a single measurement - If we want to know the population mean test score of the class, getting information from a sample of 10 students is better than asking a single student - Recall the **law of large numbers**: as `\(n \rightarrow \infty\)` , `\(\overline{X} \rightarrow E(X)\)` --- ## Intuition .pull-left[ <img src="lecture20_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" /> - Note that here we are using Bernoulli(.3), so `\(\mu = p = .3\)` and `\(\sigma^2 = p(1-p) = .3(.7)\)` ] .pull-right[ - In this illustration, think of each point plotted as a single draw from `\(\overline{X} \sim N(\mu, \frac{\sigma^2}{n})\)`, as we vary `\(n\)`, the sample size (plotted on the horizontal axis) - CLT tells us the distribution of `\(\overline{X}\)` at each value of `\(n\)` - For large values of `\(n\)`, we see that `\(Var(\overline{X}) = \frac{\sigma^2}{n}\)` gets very small, so any draw from this distribution will be very close to `\(E(\overline{X}) = \mu\)` ] --- ## Intuition - We can see the same narrowing distribution (smaller variance) with our shoes example: <img src="lecture20_files/figure-html/unnamed-chunk-16-1.png" width="70%" style="display: block; margin: auto;" /> - Nice applet where you can adjust the sample size and other parameters: http://demonstrations.wolfram.com/SamplingDistributionOfTheSampleMean/ --- ## Central Limit Theorem Set-up: - Population with mean `\(\mu\)` and standard deviation `\(\sigma\)` - Independent samples `\(X_1, ..., X_n\)` - `\(\overline{X} = \frac{\sum_{i = 1}^n X_i}{n}\)` Properties of sampling distribution `\(\overline{X}\)`: - Mean is identical to the population mean `\(\mu\)`, i.e., `\(E(\overline{X}) = \mu\)` - Standard deviation `\(\frac{\sigma}{\sqrt{n}}\)`, i.e., `\(Var(\overline{X}) = \frac{\sigma^2}{n}\)` - For large `\(n\)` ( `\(n \rightarrow \infty\)` ), distribution is approximately normal - i.e., `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)` Notice that this does not restrict the distribution of the underlying `\(X_1, ..., X_n\)` in any way. These can be normal, binomial, Poisson, ... --- ## Central Limit Theorem with different underlying distributions - All we need to know is `\(E(X_i)\)` or `\(\mu\)`, and `\(Var(X_i)\)` or `\(\sigma^2\)` - For **normally distributed** random variables with mean `\(\mu\)` and variance `\(\sigma^2\)` -- - `\(\overline{X} \sim N(\mu, \frac{\sigma^2}{n})\)` (actually, we don't need CLT for this. Why?) - For **Bernoulli distributed** random variables with probability of success `\(p\)`, `\(Var(X_i)= p(1-p)\)` -- - `\(\overline{X} = \hat{P} \approx N(p, \frac{p(1-p)}{n})\)` - For **Poisson( `\(\lambda\)` ) distributed** random variables with `\(E(X_i) = \lambda\)` and `\(Var(X_i) = \lambda\)` -- - `\(\overline{X} \approx N(\lambda, \frac{\lambda}{n})\)` --- ## How Large is Large Enough for n? - A commonly used rule of thumb is `\(n > 50\)` - For Bernoulli data, one rule of thumb is that `\(n\)` should be large enough that `\(n p>5\)` and `\(n(1- p)>5\)`. Sometimes you also see `\(n p>10\)` and `\(n(1- p)> 10\)` - If p is around a half, you need a smaller sample size than if p is close to 0 or 1 --- ## Example: Normal data In the shoe size example, we have `\(X_1, ..., X_n \sim N(11, 1.5^2)\)`. Say we collected 1000 samples, so the sample size `\(n = 1000\)`. What distribution does the sampling distribution of the sample mean follow? -- - By the Central Limit Theorem, `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)`, in this case `\(N(11, \frac{1.5^2}{1000})\)`. -- What is `\(P(\overline{X} < 10.9)\)`? Calculate this in two ways: using the original distribution, and using the standard normal distribution. -- - Standardizing, we have `\(P(\overline{X} < 10.9) = P(\frac{\overline{X} - 11}{1.5/\sqrt{1000}} < \frac{10.9 - 11}{1.5/\sqrt{1000}}) = P(Z < \frac{-.1}{1.5/\sqrt{1000}})\)` ```r pnorm(10.9, mean = 11, sd = sqrt(1.5^2/1000)) ``` ``` ## [1] 0.01750749 ``` ```r pnorm((10.9 - 11)/(1.5/sqrt(1000)) ) # standardized version ``` ``` ## [1] 0.01750749 ``` --- ## Example: Bernoulli data Assume that 67% of the population in India has had a prior COVID infection. Define the variable `\(X_i\)` to take value 1 if the `\(i\)`th randomly sampled Indian resident has been infected, and let it be 0 otherwise. Assume the samples are independent. What distribution does `\(X_i\)` follow? What are the parameters? What is the mean? What is the variance? -- - The random variable `\(X_i\)` follows a Bernoulli distribution. A Bernoulli random variable has mean `\(p\)` and variance `\(p(1- p)\)`. Here `\(p=0.67\)`. -- For a random sample, the sample mean `\(\overline{x}=\hat{p}\)` is an **estimate** of this probability (fraction of 1's) **Sample proportion** is `\(\hat{P} = \frac{1}{n}\sum_{i = 1}^n X_i\)`, where `\(X_i \sim\)` Bernoulli(p) --- ## Simulation: Prior Infections - For repeated samples of Indian residents, if we compute the proportion with prior infection in each, what values will we see? Will we get 0.67 each time? Again, we perform 10000 experiments where each time we sample from a Bernoulli distribution with `prob = 0.67` and different values of `n`. ```r set.seed(0) n10 <- t(replicate(n = 10000, rbinom(10, size = 1, prob = .67))) n10 <- rowMeans(n10) set.seed(0) n50 <- t(replicate(n = 10000, rbinom(50, size = 1, prob = .67))) n50 <- rowMeans(n50) set.seed(0) n100 <- t(replicate(n = 10000, rbinom(100, size = 1, prob = .67))) n100 <- rowMeans(n100) set.seed(0) n500 <- t(replicate(n = 10000, rbinom(500, size = 1, prob = .67))) n500 <- rowMeans(n500) ``` --- ## Simulation: Prior Infections .small[ .pull-left[ ```r data.frame(propCovid = n10, sampleSize = 10) %>% bind_rows( data.frame(n50, sampleSize = 50) %>% rename(propCovid = n50) ) %>% bind_rows( data.frame(n100, sampleSize = 100) %>% rename(propCovid = n100) )%>% bind_rows( data.frame(n500, sampleSize = 500) %>% rename(propCovid = n500) ) %>% ggplot(aes(x = propCovid, fill = as.factor(sampleSize))) + geom_density(alpha = .5) + labs(x = "Proportion of people with prior infection", y = "Density", title = "Sampling distribution from Bernoulli(.67)", fill = "Sample size") + scale_fill_viridis_d() ``` ] ] .pull-right[ <img src="lecture20_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" /> ] Next time: sampling distribution of sample proportion --- ## Summary - Central Limit Theorem: `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)` - Sampling distribution of the sample mean