class: center, middle, inverse, title-slide .title[ # Normal Distribution ] .subtitle[ ##
STA35A: Statistical Data Science 1 ] .author[ ### Xiao Hui Tai ] .date[ ### November 18, 2024 ] --- layout: true <!-- <div class="my-footer"> --> <!-- <span> --> <!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> --> <!-- </span> --> <!-- </div> --> --- <style type="text/css"> .tiny .remark-code { font-size: 70%; } .small .remark-code { font-size: 80%; } .tiny { font-size: 60%; } .small { font-size: 80%; } </style> ## Today - Common probability distributions: Normal - Standard normal distribution - R functions: - `dnorm()` for densities - `pnorm()` for `\(P(X\leq x)\)` - `rnorm()` for random sample - `qnorm()` for quantiles (the value corresponding to an input probability, e.g., `\(P(X \leq \ ?) = p\)`) --- ## Recall: Continuous random variables <img src="img/density.png" width="40%" style="display: block; margin: auto;" /> - Probability distribution for a discrete random variable: **probability mass function** - Continuous random variable: **probability density functions** - For a continuous random variable, probability for any exact value is zero - Instead, we think about probabilities in ranges. - `\(P(a \leq X \leq b)\)` is the area under the density function between `\(a\)` and `\(b\)`. --- ## Normal Distribution - The **normal distribution** is an example of a continuous distribution - It is a very important distribution and one of the primary inferential tools in statistics - Many **natural phenomenon** approximate the normal distribution, such as weight, height, blood pressure, annual rainfall - Commonly called the *Gaussian distribution* after [Carl Friedrich Gauss](https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss) - Also sometimes called a *bell curve* --- ## Illustration: Shoe sizes - Mickle et al (2010 *Footwear Science*) showed the following bimodal distribution of shoe sizes in the US. <img src="img/bimodalshoes.png" width="80%" style="display: block; margin: auto;" /> Note that standard shoe sizes are discrete. --- ## Illustration: Shoe sizes - Let `\(X\)` represent the shoe size for wearers of men's shoes - (Hypothetical) probability distribution of shoe sizes of wearers of men's shoes. <img src="lecture19_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Illustration: Shoe sizes What is the probability of a customer wanting a men's shoe size smaller than 9? <img src="lecture19_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Smaller Shoes .pull-left[ ``` ## size probability ## 5.5 0.0001 ## 6.0 0.0006 ## 6.5 0.0012 ## 7.0 0.0032 ## 7.5 0.0081 ## 8.0 0.0180 ## 8.5 0.0334 ## 9.0 0.0556 ## 9.5 0.0805 ## 10.0 0.1072 ## 10.5 0.1202 ``` ] .pull-right[ ``` ## size probability ## 11.0 0.1326 ## 11.5 0.1247 ## 12.0 0.1109 ## 12.5 0.0807 ## 13.0 0.0550 ## 13.5 0.0345 ## 14.0 0.0182 ## 14.5 0.0086 ## 15.0 0.0050 ## 15.5 0.0012 ## 16.0 0.0004 ``` ] The probability of a random men's shoe wearer having a shoe size less than 9 in this population is 0.0646. What is the probability of shoe size 10-11.5? --- ## Moving to Continuous Distributions - Now suppose we could get *really* well-fitting shoes, using quarter sizes (9, 9.25, 9.5, 9.75, ...) or even tenth sizes (9, 9.1, 9.2, ...), or shoes specifically made to fit your feet perfectly. - As the number of sizes increases, the bar widths become narrower -> probability distribution of continuous random variable .pull-left[ <img src="lecture19_files/figure-html/normal-1.png" width="90%" style="display: block; margin: auto;" /> ] .pull-right[ This is a **probability density function**. ] --- ## Moving to Continuous Distributions - Probability density function can be used to get the probability of any range of continuous shoe sizes <img src="lecture19_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" /> E.g., probability of shoe size being less than 9 (shaded area) --- ## Moving to Continuous Distributions <img src="lecture19_files/figure-html/unnamed-chunk-8-1.png" width="40%" style="display: block; margin: auto;" /> - How do we find this area of interest? - Calculus! `$$P(a \leq X \leq b)=\text{area between a and b below the curve}=\int_a^b f(x)dx$$` where `\(f(x)\)` represents the density curve - In this course, we will use R --- ## Normal Distribution - **Symmetric, bell-shaped** - Characterized by the mean, `\(\mu\)`, and the standard deviation, `\(\sigma\)` (or variance, `\(\sigma^2\)`) - For the normal distribution, the **density function** is given by `$$f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}$$` - Notation: `\(N(\mu,\sigma^2)\)` - The normal distribution with mean 0 and standard deviation 1 is called the **standard normal distribution**. It is commonly denoted `\(Z \sim N(0, 1)\)`. --- ## Probability density function for Normal Distribution - Like `dbinom()` and `dpois()`, `dnorm()` in R gives us the probability density function - Here instead of `\(P(X = x)\)`, it is the **value of the probability density function**, `\(f(x)\)` on the previous slide, at values that we input - `dnorm()` has arguments `x`, `mean` and `sd`, where `mean` and `sd` are the mean and standard deviation of the normal distribution that we want - **Remember that `\(P(X = x) = 0\)` for a continuous random variable**; the value that `dnorm()` gives us is not a probability but the height of the density function --- ## Probability density function for Normal Distribution ``` r dnorm(x = -3:3, mean = 0, sd = 1) ``` ``` ## [1] 0.004431848 0.053990967 0.241970725 0.398942280 0.241970725 ## [6] 0.053990967 0.004431848 ``` .small[ ``` r data.frame(x = c(-3, 3)) %>% ggplot(aes(x)) + stat_function(fun = dnorm, args = list(mean = 0, sd = 1)) + labs(title = "Probability distribution of N(0, 1)", y = "f(x)") ``` <img src="lecture19_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" /> ] --- ## Normal Distribution varying mean - Which of the three distributions have means 0, 1, and 4? <img src="lecture19_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Normal Distribution varying standard deviation - Which has standard deviation 1, 2, and 4? <img src="lecture19_files/figure-html/unnamed-chunk-12-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Calculating probabilities for the normal distribution - Recall `pbinom()`: `\(P(X \leq x)\)` for binomial - `pnorm()` for `\(P(X \leq x)\)` for the normal distribution - Arguments: - `q`, "vector of quantiles" ( `\(x\)` in `\(P(X \leq x)\)` ) - `mean`, `\(\mu\)` (default value 0) - `sd`, standard deviation `\(\sigma\)` (default value 1) ``` r pnorm(0) ``` ``` ## [1] 0.5 ``` --- ## Back to shoes example Men's shoe sizes follow a normal distribution with mean 11 and standard deviation 1.5, i.e., `\(N(\mu = 11,\sigma^2 = 1.5^2)\)` <img src="lecture19_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" /> E.g., probability of shoe size being less than 9 (shaded area) --- ## Calculating probabilities for our shoes example Given `\(N(\mu = 11,\sigma^2 = 1.5^2)\)`, what is the probability of shoe sizes less than 9? -- ``` r pnorm(9, mean = 11, sd = 1.5) ``` ``` ## [1] 0.09121122 ``` What is the probability of shoe sizes greater than 9? -- ``` r 1 - pnorm(9, mean = 11, sd = 1.5) ``` ``` ## [1] 0.9087888 ``` --- ## Calculating probabilities for our shoes example What is the probability of shoe sizes less than 13? -- ``` r pnorm(13, mean = 11, sd = 1.5) ``` ``` ## [1] 0.9087888 ``` -- What is the probability of shoe size 10-11.5? -- ``` r pnorm(11.5, mean = 11, sd = 1.5) - pnorm(10, mean = 11, sd = 1.5) ``` ``` ## [1] 0.3780661 ``` --- ## Probabilities between two values <img src="img/stdnorm6.png" width="60%" style="display: block; margin: auto;" /> What is the probability of shoe size 10-11.5? ``` r pnorm(11.5, mean = 11, sd = 1.5) - pnorm(10, mean = 11, sd = 1.5) ``` ``` ## [1] 0.3780661 ``` --- ## Sampling from Normal distribution in R - `rnorm()` - Arguments `n, mean, sd` ``` r set.seed(0) # so results are reproducible normalDraws <- rnorm(n = 100, mean = 0, sd = 1) head(normalDraws, 20) ``` ``` ## [1] 1.262954285 -0.326233361 1.329799263 1.272429321 ## [5] 0.414641434 -1.539950042 -0.928567035 -0.294720447 ## [9] -0.005767173 2.404653389 0.763593461 -0.799009249 ## [13] -1.147657009 -0.289461574 -0.299215118 -0.411510833 ## [17] 0.252223448 -0.891921127 0.435683299 -1.237538422 ``` --- ## Frequency distribution varying mean and sd .small[ ``` r set.seed(0) # so results are reproducible normal1 <- rnorm(n = 5000, mean = 3, sd = 2) normal2 <- rnorm(n = 5000, mean = 3, sd = 10) normal3 <- rnorm(n = 5000, mean = 11, sd = 1.5) # shoe size distribution ``` ] <img src="lecture19_files/figure-html/unnamed-chunk-23-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Frequency distribution varying mean and sd .small[ ``` r set.seed(0) # so results are reproducible normal1 <- rnorm(n = 5000, mean = 3, sd = 2) normal2 <- rnorm(n = 5000, mean = 3, sd = 10) normal3 <- rnorm(n = 5000, mean = 11, sd = 1.5) ``` ] <img src="lecture19_files/figure-html/unnamed-chunk-25-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Standard normal distribution - Recall: `\(Z \sim N(0, 1)\)` - Any normally distributed random variable can be expressed as a standard normal by **subtracting the mean and dividing by the standard deviation** - This process is called **standardization** - `\(Y \sim N(\mu, \sigma^2)\)` - `\(Z = \frac{Y - \mu}{\sigma}\)` - `\(E\left(\frac{Y - \mu}{\sigma}\right) = \frac{1}{\sigma}[E(Y) - \mu] = 0\)` - `\(Var\left(\frac{Y - \mu}{\sigma}\right) = \frac{1}{\sigma^2}[Var(Y)] = \frac{1}{\sigma^2}[\sigma^2] = 1\)` - **Moving the location** (mean moves to 0) and **changing the scale** (standard deviation becomes 1) --- ## More about the standard normal distribution - Probability of shoe sizes smaller than 13: .small[ ``` r pnorm(13, mean = 11, sd = 1.5) ``` ``` ## [1] 0.9087888 ``` ] - Let `\(Y\)` be the random variable denoting men's shoe sizes. Then `\(Y \sim N(11, 1.5^2)\)`. .tiny[ $$ `\begin{aligned} P(Y \leq 13) &= P\left(\frac{Y - \mu_y}{\sigma_y} \leq \frac{13 - \mu_y}{\sigma_y} \right) \\ &=P\left( Z \leq \frac{13-11}{1.5} \right) \\ &=P(Z \leq \frac{2}{1.5}) \end{aligned}` $$ ] .small[ ``` r pnorm(2/1.5, mean = 0, sd = 1) ``` ``` ## [1] 0.9087888 ``` ] --- ## z-score .tiny[ $$ `\begin{aligned} P(Y \leq 13) &= P\left(\frac{Y - \mu_y}{\sigma_y} \leq \frac{13 - \mu_y}{\sigma_y} \right) \\ &=P\left( Z \leq \frac{13-11}{1.5} \right) \\ &=P(Z \leq \frac{2}{1.5}) \end{aligned}` $$ ] - Standardized value `\(\frac{13-11}{1.5}\)` is a z-score - `\(z = \frac{x - \mu}{\sigma} = \frac{\text{value - mean}}{\text{standard deviation}}\)` - **Number of standard deviations above (positive z-scores) or below the mean (negative z-scores)** --- ## z-score .tiny[ $$ `\begin{aligned} P(Y \leq 13) &= P\left(\frac{Y - \mu_y}{\sigma_y} \leq \frac{13 - \mu_y}{\sigma_y} \right) \\ &=P\left( Z \leq \frac{13-11}{1.5} \right) \\ &=P(Z \leq \frac{2}{1.5}) \end{aligned}` $$ ] - `\(x - \mu\)` is the number relative to the mean, e.g., shoe size 13 is 2 above the mean - Dividing by `\(\sigma\)`: gives number of standard deviations above the mean - e.g., shoe size distribution has sd = 1.5, so shoe size 13 is `\(\frac{2}{1.5} = 1.33\)` standard deviations above the mean - **Relative** positions stay the same, i.e., `\(P(Y \leq 13) = P(Z \leq \frac{2}{1.5})\)` --- ## Recall: Variance and standard deviation - **Rules of thumb** for symmetric, bell-shaped distributions: 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively <img src="img/sdRules.png" width="60%" style="display: block; margin: auto;" /> ``` r pnorm(2) ``` ``` ## [1] 0.9772499 ``` --- ## Standardizing in R Consider the samples we drew earlier from `\(N \sim (11, 1.5^2)\)` .small[ ``` r set.seed(0) # so results are reproducible normal3 <- rnorm(n = 5000, mean = 11, sd = 1.5) # shoe size distribution standardizedNormal3 <- (normal3 - 11)/1.5 ``` ] <img src="lecture19_files/figure-html/unnamed-chunk-31-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Standardizing in R <img src="lecture19_files/figure-html/unnamed-chunk-32-1.png" width="60%" style="display: block; margin: auto;" /> .small[ ``` r sum(normal3 <= 13)/length(normal3) ``` ``` ## [1] 0.9072 ``` ``` r sum(standardizedNormal3 <= 2/1.5)/length(standardizedNormal3) ``` ``` ## [1] 0.9072 ``` ] --- ## Exercise Assume that player ratings in chess tournaments follow a symmetric, bell-shaped distribution with mean 1600 and standard deviation 350. What common probability distribution do player ratings follow, and what are the parameters? -- A player with a rating of 2650 enters the tournament. What is the probability of a rating higher than this player? -- What is the probability of ratings between 1200 and 1800? --- ## More about the standard normal distribution - We saw earlier that `\(P(Z \leq 0) = .5\)`. This is because the standard normal distribution is symmetric with mean 0. ``` r pnorm(0) # default value of mean = 0 and sd = 1 ``` ``` ## [1] 0.5 ``` - Tail probabilities of the standard normal distribution - The symmetry of the normal distribution allows us to calculate the probability of values falling in the tails - For any `\(z\)`-score, `\(P(Z \leq -z) = P(Z \geq z)\)` <img src="img/stdnorm5.png" width="20%" style="display: block; margin: auto;" /> --- ## Quantiles for the normal distribution - Quantiles are cut points dividing the range of a probability distribution into continuous intervals - Recall: quartiles (four groups) and percentiles (100 groups) - `\(P(X \leq q) = p\)`, where `\(q\)` is the quantile (think of value on the horizontal axis), e.g., `\(P(Z \leq 0) = .5\)` - Recall: `pnorm(q, mean, sd)` for `\(P(X\leq x)\)`, or `\(P(Z \leq z)\)` for standard normal. `pnorm()` returns the probability, `p` ``` r pnorm(q = 0, mean = 0, sd = 1) ``` ``` ## [1] 0.5 ``` - `qnorm(p, mean, sd)` for the quantile, e.g., `\(P(X \leq \ ?) = p\)`. `qnorm()` returns the quantile, `q` ``` r qnorm(p = .5, mean = 0, sd = 1) ``` ``` ## [1] 0 ``` --- ## Important reference points for the normal distribution - z-scores (quantiles) corresponding to particular probabilities are often written as `\(z_p\)` - `\(p\)` denotes the probability in the **right tail**, e.g., `\(z_{.025} \approx 1.96\)` - Important reference points: 2.5% in the left and right tails - In R: .pull-left[ ``` r qnorm(.025, lower.tail = FALSE) ``` ``` ## [1] 1.959964 ``` ``` r qnorm(.975) ``` ``` ## [1] 1.959964 ``` ] .pull-right[ <img src="img/stdnorm5.png" width="70%" style="display: block; margin: auto;" /> ] --- ## Important reference points for the normal distribution .pull-left[ ``` r pnorm(1.96) ``` ``` ## [1] 0.9750021 ``` ``` r pnorm(1.96, lower.tail = FALSE) ``` ``` ## [1] 0.0249979 ``` <img src="img/stdnorm1.png" width="78%" style="display: block; margin: auto;" /> ] .pull-right[ ``` r pnorm(-1.96) ``` ``` ## [1] 0.0249979 ``` ``` r pnorm(-1.96, lower.tail = FALSE) ``` ``` ## [1] 0.9750021 ``` <img src="img/stdnorm2.png" width="70%" style="display: block; margin: auto;" /> ] --- ## Standard normal table <img src="img/normaltable.png" width="60%" style="display: block; margin: auto;" /> --- ## Standard normal table .pull-left[ What is the probability of a shoe size bigger than 13 (z-score 1.33)? <img src="img/normalcurveupper.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ .small[ ``` r pnorm(13, mean = 11, sd = 1.5, lower.tail = FALSE) ``` ``` ## [1] 0.09121122 ``` ``` r pnorm(2/1.5, lower.tail = FALSE) ``` ``` ## [1] 0.09121122 ``` ``` r 1 - pnorm(2/1.5) ``` ``` ## [1] 0.09121122 ``` ] ] --- ## Summary: Distributions in R - For each distribution, R has a family of commands, starting with the letters `d`, `p`, `q` and `r` - `d` for density - `p` for cumulative density up to input value `\(P(X \leq x)\)`. Think of `\(P(X \leq q) = p\)` - `q` for the quantile, e.g., `\(P(X \leq \ ?) = p\)` - `r` for a random sample from the distribution --- ## Summary - Common probability distributions: Normal - Standard normal distribution - R functions: - `dnorm()` for densities - `pnorm()` for `\(P(X\leq x)\)` - `rnorm()` for random sample - `qnorm()` for quantiles (the value corresponding to an input probability, e.g., `\(P(X \leq \ ?) = p\)`)