Sampling distributions

class: center, middle, inverse, title-slide

.title[
# Sampling distributions
]
.subtitle[
## <br><br> STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### November 20, 2024
]

---

layout: true

---

## Today

- Sum of independent normal distributions

- Introduction to sampling distributions

- Central Limit Theorem

- Sampling distribution of the sample mean

---
## Sum of independent normal random variables

- Important property: **Any linear combination of independent normal random variables is a normal random variable**
--

- A linear combination of two random variables, `\(X\)` and `\(Y\)`, is of the form `\(aX+bY\)`, where `\(a\)`
and `\(b\)` are constants

- Recall: 
  - `\(E(aX + bY) = aE(X) + bE(Y)\)`
  - For a linear combination of **independent** random variables, `\(Var(aX + bY) = a^2 Var(X) + b^2 Var(Y)\)`

- If `\(X \sim N(\mu_x, \sigma_x^2)\)` and `\(Y \sim N(\mu_y, \sigma_y^2)\)` are independent, then `\(W = X + Y \sim N(\mu_x + \mu_y, \sigma_x^2 + \sigma_y^2)\)`
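
A quick simulation can make this property concrete. The sketch below is illustrative only: the parameter values (`mu_x`, `sigma_x`, `mu_y`, `sigma_y`) and the number of draws are arbitrary assumptions, not values from the lecture.

``` r
set.seed(1)
n_draws <- 100000  # arbitrary number of simulated draws

# Assumed parameters, chosen only for illustration
mu_x <- 2; sigma_x <- 3
mu_y <- 5; sigma_y <- 4

x <- rnorm(n_draws, mean = mu_x, sd = sigma_x)
y <- rnorm(n_draws, mean = mu_y, sd = sigma_y)  # drawn independently of x
w <- x + y

# Theory: W ~ N(mu_x + mu_y, sigma_x^2 + sigma_y^2) = N(7, 25)
c(simulated_mean = mean(w), theoretical_mean = mu_x + mu_y)
c(simulated_var = var(w), theoretical_var = sigma_x^2 + sigma_y^2)
```

The simulated mean and variance should be close to, but not exactly equal to, the theoretical values.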

---
## Sum of independent normal random variables

- Extends to more than two random variables in the linear combination

- `\(E(aX + bY) = aE(X) + bE(Y)\)`
  - `\(b\)` can be negative, e.g., `\(E(X - Y) = E(X) - E(Y)\)` and, for independent `\(X\)` and `\(Y\)`, `\(Var(X - Y) = Var(X) + Var(Y)\)`.
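
The sign in the variance of a difference is easy to get wrong, so here is a small check by simulation. The means and standard deviations used are assumed values chosen for illustration.

``` r
set.seed(2)
n_draws <- 100000  # arbitrary number of simulated draws

# Two independent normal random variables (assumed parameters)
x <- rnorm(n_draws, mean = 10, sd = 2)
y <- rnorm(n_draws, mean = 4, sd = 1)
d <- x - y

mean(d)  # close to 10 - 4 = 6
var(d)   # close to 2^2 + 1^2 = 5, not 2^2 - 1^2 = 3
```

The variances add because `\(b = -1\)` enters the variance formula as `\(b^2 = 1\)`.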

---

## Course content

1. Fundamentals of R
  - Overview of data types and structures
  - Data manipulation and data visualization tools

2. Descriptive statistics for numerical and categorical data

3. Probability
  - Rules of probability computation; conditional probability
  - Basic probability models: Binomial, Normal and Poisson

4. **Statistical inference**
  - **Sampling distributions of sample mean and sample proportion**
  - Hypothesis testing and confidence intervals for population mean and population proportion

---

## Recall (lecture 13): What is statistical inference?
- **Descriptive statistics**: summarize and describe data. 
  - Do not necessarily generalize beyond the data

- **Statistical inference**
  - Draw conclusions about the larger population 
  - Using sample data to make conclusions about the underlying population the sample came from

.pull-left[
<img src="img/soup.png" width="50%" style="display: block; margin: auto;" />
]

.pull-right[
Similar to tasting a spoonful of soup while cooking to make an inference about the entire pot.
]

---
## Recall: Example of statistical inference in action

When a **sample statistic** is used to estimate a **population parameter**, it will be accompanied by a **margin of error**

.tiny[
Source: https://www.rasmussenreports.com/public_content/politics/biden_administration/prez_track_sep23
]

---
## Recall: Many Topics in Statistical Inference

- Fundamentals: probability, distributions, random variables, ...

- **Sampling**

- Hypothesis testing

- Point estimates and confidence intervals

- Modeling: Linear regression, analysis of variance, nonparametric models, machine learning, ...

---
## Sampling Distribution of the Sample Mean

Recall our shoe size example, where men's shoe sizes follow a `\(N(11, 1.5^2)\)` distribution.

Say we are interested in the sample mean of shoe sizes. We have a sample of 1000 observations.

``` r
set.seed(0)
sampled1000_1 <- rnorm(1000, 11, 1.5)
head(sampled1000_1, 20)
```

```
##  [1] 12.894431 10.510650 12.994699 12.908644 11.621962  8.690075
##  [7]  9.607149 10.557919 10.991349 14.606980 12.145390  9.801486
## [13]  9.278514 10.565808 10.551177 10.382734 11.378335  9.662118
## [19] 11.653525  9.143692
```

``` r
mean(sampled1000_1) 
```

```
## [1] 10.97626
```

---
## Sampling Distribution of the Sample Mean 
Now we repeat the experiment, i.e., get a different sample of 1000 observations.

``` r
set.seed(10)
sampled1000_2 <- rnorm(1000, 11, 1.5)
head(sampled1000_2, 20)
```

```
##  [1] 11.028119 10.723621  8.943004 10.101248 11.441818 11.584691
##  [7]  9.187886 10.454486  8.559991 10.615282 12.652669 12.133672
## [13] 10.642650 12.481167 12.112085 11.134021  9.567584 10.707274
## [19] 12.388282 11.724468
```

``` r
mean(sampled1000_2) 
```

```
## [1] 11.01706
```

``` r
all.equal(mean(sampled1000_1), mean(sampled1000_2))
```

```
## [1] "Mean relative difference: 0.003717704"
```
---
## Sampling Distribution of the Sample Mean

If we repeat the experiment an infinite number of times, what distribution of sample means would we get? This is known as the **sampling distribution**.

1. Draw a sample of size `\(n\)`, calculate its mean `\(\overline{x}_1\)`
2. Draw a second sample of the same size, calculate its mean `\(\overline{x}_2\)`
3. Repeat this many times to get sample means `\(\overline{x}_1, \overline{x}_2, \ldots\)`

`\(\overline{x}_1, \overline{x}_2, \ldots\)` are **sample statistics**

What is the distribution of `\(\overline{x}_1, \overline{x}_2, \overline{x}_3, \ldots\)`?

---
## Sampling Distribution of the Sample Mean

- The sample mean `\(\overline{X}\)` is defined as `\(\overline{X} = \frac{\sum_{i = 1}^n X_i}{n}\)`

- `\(\overline{x}_1, \overline{x}_2, \overline{x}_3, \ldots\)` are draws from `\(\overline{X}\)`

Here we consider `\(X_1, ..., X_n\)` that are **independent and identically distributed**. (E.g., `\(X_1, ..., X_n \sim N(11, 1.5^2)\)` for the shoe size distribution.)

---
## Sampling Distribution of the Sample Mean

1. Draw a sample of size `\(n\)` and calculate its mean `\(\overline{x}_1\)`
2. Draw a second sample of the same size and calculate its mean `\(\overline{x}_2\)`
3. Repeat this many times to get sample means `\(\overline{x}_1, \overline{x}_2, \ldots\)`

We cannot repeat this an infinite number of times, but we can do it 10,000 times in R.

.small[

``` r
set.seed(0)
repeat10000 <- t(replicate(n = 10000, rnorm(1000, 11, 1.5)))
str(repeat10000)
```

```
##  num [1:10000, 1:1000] 12.9 10.6 11.7 10.2 12.2 ...
```

``` r
head(rowMeans(repeat10000), 20)
```

```
##  [1] 10.97626 10.96282 11.10221 11.00373 11.00616 11.02959
##  [7] 10.99695 11.06309 10.92670 11.08920 10.97525 10.95720
## [13] 10.95437 11.07828 11.02202 11.05240 11.00455 10.95824
## [19] 10.96431 10.96413
```

``` r
means10000 <- rowMeans(repeat10000)
```
]

---
## Sampling Distribution of the Sample Mean

``` r
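library(tidyverse)  # loads ggplot2 and dplyr (which provides %>%)

# means10000 holds the 10,000 simulated sample means from the previous slide
data.frame(shoesMean = means10000) %>%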
  ggplot(aes(x = shoesMean)) +
  geom_density() +
  labs(x = "Mean of sample of size 1000",
       y = "Density",
       title = "Sampling distribution from N(11, 1.5^2)")
```

---
## Sampling Distribution of the Sample Mean

.pull-left[
<img src="lecture20_files/figure-html/unnamed-chunk-9-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
How would we describe this distribution?

- Center

- Spread

- Shape
]
---
## Sampling Distribution of the Sample Mean

.pull-left[
<img src="lecture20_files/figure-html/unnamed-chunk-10-1.png" width="100%" style="display: block; margin: auto;" />
]
.small[
.pull-right[
How would we describe this distribution?

- Center
  - Centered at 11 (same as the population parameter)

- Spread
  - Looks to be much smaller than the original distribution (the original distribution has standard deviation 1.5)

- Shape
  - Symmetric and bell-shaped
]
]
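
The visual description can be backed up with numerical summaries. This is a minimal sketch that regenerates the sample means with the same seed and settings as the earlier simulation, so that it runs on its own.

``` r
set.seed(0)
# Same settings as the earlier simulation: 10,000 samples, each of size 1000
means10000 <- replicate(n = 10000, mean(rnorm(1000, 11, 1.5)))

mean(means10000)  # center: compare with the population mean, 11
sd(means10000)    # spread: compare with the population sd, 1.5
```

The first number summarizes the center of the simulated sampling distribution and the second its spread.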

---
## Effect of changing sample size

- Earlier we used a sample size of 1000. What if we used a sample size of 50?

``` r
set.seed(0)
repeat10000_n50 <- t(replicate(n = 10000, rnorm(50, 11, 1.5)))
str(repeat10000_n50)
```

```
##  num [1:10000, 1:50] 12.89 11.4 12.17 13.21 9.43 ...
```

``` r
head(rowMeans(repeat10000_n50), 20)
```

```
##  [1] 11.03590 11.03211 10.68366 11.17968 11.06924 11.13242
##  [7] 11.03710 10.97258 10.96971 10.88460 10.93739 10.80427
## [13] 11.03709 10.90858 10.84133 11.32991 11.04654 10.74770
## [19] 10.98847 10.88685
```

``` r
means10000_n50 <- rowMeans(repeat10000_n50)
```

---
## Effect of changing sample size

.tiny[
.pull-left[

``` r
data.frame(shoesMean = means10000, sampleSize = 1000) %>%
  bind_rows(
    data.frame(means10000_n50, sampleSize = 50) %>%
      rename(shoesMean = means10000_n50)
  )  %>%
  ggplot(aes(x = shoesMean,
             fill = as.factor(sampleSize))) +
  geom_density() +
  labs(x = "Mean shoe size",
       y = "Density",
       title = "Sampling distribution from N(11, 1.5^2)",
       fill = "Sample size")  +
  scale_fill_viridis_d()
 # guides(fill = "none")
```
]
]

.pull-right[
<img src="lecture20_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]

- What do you notice about the spread?

- A larger sample size produces more precise estimates 
  
  - We will formalize this intuition using the **Central Limit Theorem**
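
The difference in spread can also be measured directly. In the sketch below, `sd_of_means()` is a hypothetical helper written for this illustration; it is not part of the lecture code.

``` r
set.seed(0)

# Hypothetical helper: standard deviation of `reps` simulated sample means,
# each computed from a sample of size n drawn from N(11, 1.5^2)
sd_of_means <- function(n, reps = 10000) {
  sd(replicate(reps, mean(rnorm(n, 11, 1.5))))
}

sd_n50 <- sd_of_means(50)
sd_n1000 <- sd_of_means(1000)

c(n50 = sd_n50, n1000 = sd_n1000)
sd_n50 / sd_n1000  # how many times more spread out the n = 50 means are
```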
  
---
## Note on sampling distributions

- "Sampling distributions are never observed, but we keep them in mind"

- **Real-world applications**: we observe only one draw from the sampling distribution, `\(\overline{x}\)`

- **Simulations**: we cannot run experiments an infinite number of times, so we can only approximate the sampling distribution with a large number of repetitions

- Useful to think of a sample statistic as coming from such a hypothetical distribution

- Helps us characterize sample statistics that we observe

---
## Sampling distributions, confidence intervals and hypothesis testing

What can we do with the sampling distribution?

- **Confidence intervals**: Estimate a population parameter as point estimate `\(\pm\)` margin of error

- Margin of error depends on (1) how confident we want to be and (2) the sample statistic's variability

- **Hypothesis testing**: Test whether a population parameter is equal to some value

- How likely is it that we have obtained the observed sample statistic, if the population parameter is indeed that value?

---
## Central Limit Theorem

- In words: for *any distribution* with a well-defined mean and variance, the **sample mean** of a large enough sample is approximately normally distributed

- Formally: 
  - Population with mean `\(\mu\)` and standard deviation `\(\sigma\)`
  - Independent samples `\(X_1, ..., X_n\)`
  - `\(\overline{X} = \frac{\sum_{i = 1}^n X_i}{n}\)`

- Properties of the sampling distribution of `\(\overline{X}\)`:
  - Mean is identical to the population mean `\(\mu\)`, i.e., `\(E(\overline{X}) = \mu\)`

  - Standard deviation `\(\frac{\sigma}{\sqrt{n}}\)`, i.e., `\(Var(\overline{X}) = \frac{\sigma^2}{n}\)`

  - For large `\(n\)` ( `\(n \rightarrow \infty\)` ), distribution is approximately normal
  
    - i.e., `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)`
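
The mean and variance of `\(\overline{X}\)` follow from the linear combination rules recalled at the start of the lecture, because `\(\overline{X} = \frac{1}{n}X_1 + \cdots + \frac{1}{n}X_n\)` is a linear combination of independent random variables, each with mean `\(\mu\)` and variance `\(\sigma^2\)`. A short worked derivation:

- `\(E(\overline{X}) = \frac{1}{n}E(X_1) + \cdots + \frac{1}{n}E(X_n) = \frac{1}{n} \cdot n\mu = \mu\)`

- `\(Var(\overline{X}) = \frac{1}{n^2}Var(X_1) + \cdots + \frac{1}{n^2}Var(X_n) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}\)`

Neither step uses normality. The Central Limit Theorem is only needed for the approximately normal shape.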

---
## Intuition

- The average of **many measurements** of the same unknown quantity tends to give a **better estimate** than a single measurement

- If we want to know the population mean test score of the class, getting information from a sample of 10 students is better than asking a single student

- Recall the **law of large numbers**: as `\(n \rightarrow \infty\)`, `\(\overline{X} \rightarrow E(X)\)`

---
## Intuition
.pull-left[
<img src="lecture20_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" />

- Note that here we are using Bernoulli(.3), so `\(\mu = p = .3\)` and `\(\sigma^2 = p(1-p) = .3(.7)\)`
]

.pull-right[
- In this illustration, think of each point plotted as a single draw from `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)` as we vary the sample size `\(n\)` (plotted on the horizontal axis)

- CLT tells us the distribution of `\(\overline{X}\)` at each value of `\(n\)`

- For large values of `\(n\)`, we see that `\(Var(\overline{X}) = \frac{\sigma^2}{n}\)` gets very small, so any draw from this distribution will be very close to `\(E(\overline{X}) = \mu\)`

]
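
The code behind the figure is not shown, so the following is only one way such an illustration could be produced. It assumes a running mean, where the sample of size `\(n\)` is the first `\(n\)` of a single sequence of Bernoulli(.3) draws, so the points are not independent across values of `\(n\)`. The seed, `n_max`, and `n_show` are arbitrary choices.

``` r
set.seed(3)
p <- 0.3
n_max <- 10000  # largest sample size considered (assumed)

draws <- rbinom(n_max, size = 1, prob = p)      # Bernoulli(0.3) draws
running_mean <- cumsum(draws) / seq_len(n_max)  # sample mean of the first n draws

# Sample mean at a few sample sizes, next to the CLT standard deviation
n_show <- c(10, 100, 1000, 10000)
data.frame(n = n_show,
           sample_mean = running_mean[n_show],
           clt_sd = sqrt(p * (1 - p) / n_show))
```

As `n` grows, `clt_sd` shrinks and `sample_mean` settles near `\(\mu = .3\)`.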

---
## Intuition

- We saw the same narrowing distribution (smaller variance) in our shoe size example: the sample means for `\(n = 1000\)` were much less spread out than those for `\(n = 50\)`

- Nice applet where you can adjust the sample size and other parameters: http://demonstrations.wolfram.com/SamplingDistributionOfTheSampleMean/

---
## Central Limit Theorem

Set-up:
- Population with mean `\(\mu\)` and standard deviation `\(\sigma\)`
- Independent samples `\(X_1, ..., X_n\)`
- `\(\overline{X} = \frac{\sum_{i = 1}^n X_i}{n}\)`

Properties of the sampling distribution of `\(\overline{X}\)`:

- Mean is identical to the population mean `\(\mu\)`, i.e., `\(E(\overline{X}) = \mu\)`

- Standard deviation `\(\frac{\sigma}{\sqrt{n}}\)`, i.e., `\(Var(\overline{X}) = \frac{\sigma^2}{n}\)`

- For large `\(n\)` ( `\(n \rightarrow \infty\)` ), distribution is approximately normal

  - i.e., `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)`

Notice that, beyond requiring a well-defined mean and variance, this does not restrict the distribution of the underlying `\(X_1, ..., X_n\)` in any way. These can be normal, binomial, Poisson, ...
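
To see these properties with non-normal data, here is a minimal simulation sketch using a Binomial(10, 0.3) population. The choice of distribution, the sample size `n`, and the number of repetitions `reps` are all assumptions made for illustration.

``` r
set.seed(4)
n <- 100       # sample size (assumed)
reps <- 10000  # number of simulated samples (assumed)

# Underlying distribution: Binomial(10, 0.3), with mu = 3 and sigma^2 = 2.1
size <- 10; p <- 0.3
mu <- size * p
sigma2 <- size * p * (1 - p)

xbar <- replicate(reps, mean(rbinom(n, size = size, prob = p)))

c(simulated = mean(xbar), clt = mu)         # E(X-bar) = mu
c(simulated = var(xbar), clt = sigma2 / n)  # Var(X-bar) = sigma^2 / n
```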

---
## Central Limit Theorem with different underlying distributions

- All we need to know is `\(E(X_i)\)` or `\(\mu\)`, and `\(Var(X_i)\)` or `\(\sigma^2\)`

- For **normally distributed** random variables with mean `\(\mu\)` and variance `\(\sigma^2\)`

  - `\(\overline{X} \sim N(\mu, \frac{\sigma^2}{n})\)` (actually, we don't need CLT for this. Why?)

- For **Poisson( `\(\lambda\)` ) distributed** random variables with `\(E(X_i) = \lambda\)` and `\(Var(X_i) = \lambda\)`

  - `\(\overline{X} \approx N(\lambda, \frac{\lambda}{n})\)`
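
One way to check the Poisson approximation is to see how often simulated sample means fall within 1.96 standard deviations of `\(\lambda\)`, which should happen about 95% of the time under a normal distribution. The values of `lambda`, `n`, and `reps` below are assumed for illustration.

``` r
set.seed(5)
lambda <- 4; n <- 100; reps <- 10000  # assumed values

xbar <- replicate(reps, mean(rpois(n, lambda)))

clt_sd <- sqrt(lambda / n)  # standard deviation of X-bar from the CLT

# Under N(lambda, lambda / n), about 95% of draws fall within 1.96 sd of lambda
inside <- abs(xbar - lambda) <= 1.96 * clt_sd
mean(inside)
```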
  
---

## How Large is Large Enough for n?

- A commonly used rule of thumb is `\(n > 50\)`

- For Bernoulli data, one rule of thumb is that `\(n\)` should be large enough that `\(n p>5\)` and `\(n(1- p)>5\)`. Sometimes you also see `\(n p>10\)` and `\(n(1- p)> 10\)`

- If `\(p\)` is around a half, you need a smaller sample size than if `\(p\)` is close to 0 or 1
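
The Bernoulli rule of thumb is simple to check in code. `meets_rule_of_thumb()` is a hypothetical helper written for this sketch, with the threshold `k` defaulting to 5.

``` r
# Hypothetical helper: does (n, p) satisfy n*p > k and n*(1-p) > k?
meets_rule_of_thumb <- function(n, p, k = 5) {
  n * p > k & n * (1 - p) > k
}

meets_rule_of_thumb(n = 20, p = 0.5)    # n*p = 10 and n*(1-p) = 10
meets_rule_of_thumb(n = 20, p = 0.05)   # n*p = 1, so the rule fails
meets_rule_of_thumb(n = 200, p = 0.05)  # n*p = 10 and n*(1-p) = 190
```

With `\(p = .05\)`, a sample of 20 is too small under this rule, while the same sample size is enough when `\(p = .5\)`.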

---

## Example: Normal data

In the shoe size example, we have `\(X_1, ..., X_n \sim N(11, 1.5^2)\)`. Say we collected a sample of 1000 observations, so the sample size is `\(n = 1000\)`. What is the sampling distribution of the sample mean?

What is `\(P(\overline{X} < 10.9)\)`? Calculate this in two ways: using the distribution of `\(\overline{X}\)` directly, and by standardizing and using the standard normal distribution.
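
A sketch of how both calculations might look in R, using `pnorm()`. Because the `\(X_i\)` are normal here, `\(\overline{X} \sim N(11, \frac{1.5^2}{1000})\)` holds exactly. The variable names are chosen for this illustration.

``` r
n <- 1000
se <- 1.5 / sqrt(n)  # standard deviation of the sample mean, sigma / sqrt(n)

# Way 1: use the distribution of the sample mean, N(11, 1.5^2 / 1000), directly
p_direct <- pnorm(10.9, mean = 11, sd = se)

# Way 2: standardize, then use the standard normal distribution
z <- (10.9 - 11) / se
p_standardized <- pnorm(z)

c(direct = p_direct, standardized = p_standardized)
```

Both approaches give the same probability, since standardizing only rescales the sample mean.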

---
## Summary
  
- Sum of independent normal distributions: any linear combination of independent normal random variables is a normal random variable

- Central Limit Theorem: `\(\overline{X} \approx N(\mu, \frac{\sigma^2}{n})\)`

- Sampling distribution of the sample mean