class: center, middle, inverse, title-slide

.title[
# Bayes' Theorem and Random Variables
]
.subtitle[
##
STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### November 1, 2024
]

---
layout: true

<!-- <div class="my-footer"> -->
<!-- <span> -->
<!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> -->
<!-- </span> -->
<!-- </div> -->

---

<style type="text/css">
.tiny .remark-code { font-size: 60%; }
.small .remark-code { font-size: 80%; }
.tiny { font-size: 60%; }
.small { font-size: 80%; }
</style>

## Today

- Bayes' Theorem
- Random variables
- Expectation and variance
- Discrete and continuous random variables

---

## Bayes' Theorem

- Often we know `\(P(B|A)\)` when we really want `\(P(A|B)\)`
- For example, a 40-year-old woman has a positive screening mammogram.
  - `\(A\)` = cancer
  - `\(B\)` = positive mammogram screening result.

--

- Using Bayes' Theorem is sometimes described as "updating our beliefs"
  - Without any information on the test result, the probability of cancer is `\(P(A)\)`. With the test result, we can calculate `\(P(A|B)\)`
- Want `\(P(A|B)\)` <!-- , which is the probability of having cancer given a positive screening result. -->

--

- Need `\(P(A)\)`, the prevalence of breast cancer among 40-year-old women.
- Properties of the screening tool:
  - `\(P(B | A)\)`
  - `\(P(B | A^c)\)`
  - `\(P(B)\)` <!-- (called **sensitivity**, the probability of a positive result given that a person has cancer) -->

---

## Bayes' Theorem

**Bayes' theorem** says that
`\(P(A \mid B) =\frac{P(A \cap B)}{P(B)} =\frac{P(B \mid A)P(A)}{P(B)} =\frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)}\)`.

- The last equality is by the law of total probability

<img src="img/bayes.webp" width="50%" style="display: block; margin: auto;" />

Credit: Matt Buck, Flickr, CC BY-SA 2.0

---

## Bayes' Theorem

`\(P(A \mid B) =\frac{P(A \cap B)}{P(B)} =\frac{P(B \mid A)P(A)}{P(B)} =\frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)}\)`

- `\(A\)` = cancer, `\(B\)` = positive screening result.
- Say `\(P(A) = .01\)`
- `\(P(B | A)= .85\)`
- `\(P(B|A^c) = .1\)` (probability of a positive screening result given that a person does not have cancer)

What is `\(P(A \mid B)\)`?

---

## Bayes' Theorem in Action

`\(P(A \mid B) =\frac{P(A \cap B)}{P(B)} =\frac{P(B \mid A)P(A)}{P(B)} =\frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)}\)`

- `\(P(A) = .01\)`
- `\(P(B | A) = .85\)`
- `\(P(B|A^c) = .1\)`
- Then `\(P(A \mid B) =\frac{.85*.01}{.85*.01 + .1*(1 - .01)} = 0.079\)`

---

## Behind the scenes: Hypothetical 10,000

Consider a hypothetical population of 10,000 40-year-old women.

.pull-left[
We have

1. The prevalence of breast cancer among 40-year-old women, `\(P(A)=.01\)`.
2. The sensitivity of a screening mammogram for diagnosing cancer, `\(P(B | A)=.85\)`.
3. The probability of a positive screening result given that a person does not have cancer, `\(P(B|A^c) = .1\)`.
]
.pull-right[

|  | Cancer (A) | No Cancer | Total |
|---|---:|---:|---:|
| Mammo + (B) |  |  |  |
| Mammo - |  |  |  |
| Total |  |  | 10000 |

]

---

## Behind the scenes: Hypothetical 10,000

.pull-left[
We have

1. **The prevalence of breast cancer among 40-year-old women, `\(P(A)=.01\)`.**
2. The sensitivity of a screening mammogram for diagnosing cancer, `\(P(B | A)=.85\)`.
3. The probability of a positive screening result given that a person does not have cancer, `\(P(B|A^c) = .1\)`.
]
.pull-right[

|  | Cancer (A) | No Cancer | Total |
|---|---:|---:|---:|
| Mammo + (B) |  |  |  |
| Mammo - |  |  |  |
| Total | 100 | 9900 | 10000 |

]

---

## Behind the scenes: Hypothetical 10,000

.pull-left[
We have

1. The prevalence of breast cancer among 40-year-old women, `\(P(A)=.01\)`.
2. **The sensitivity of a screening mammogram for diagnosing cancer, `\(P(B | A)=.85\)`.**
3. The probability of a positive screening result given that a person does not have cancer, `\(P(B|A^c) = .1\)`.
]
.pull-right[

|  | Cancer (A) | No Cancer | Total |
|---|---:|---:|---:|
| Mammo + (B) | 85 |  |  |
| Mammo - | 15 |  |  |
| Total | 100 | 9900 | 10000 |

]

---

## Behind the scenes: Hypothetical 10,000

.pull-left[
We have

1. The prevalence of breast cancer among 40-year-old women, `\(P(A)=.01\)`.
2. The sensitivity of a screening mammogram for diagnosing cancer, `\(P(B | A)=.85\)`.
3. **The probability of a positive screening result given that a person does not have cancer, `\(P(B|A^c) = .1\)`.**
]
.pull-right[

|  | Cancer (A) | No Cancer | Total |
|---|---:|---:|---:|
| Mammo + (B) | 85 | 990 |  |
| Mammo - | 15 | 8910 |  |
| Total | 100 | 9900 | 10000 |

]

---

## Behind the scenes: Hypothetical 10,000

.pull-left[
- Now fill in the row totals
- What is the conditional probability of cancer given a positive mammogram?
- This entire computation is equivalent to doing `\(P(A \mid B) =\frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)}\)`
]
.pull-right[

|  | Cancer (A) | No Cancer | Total |
|---|---:|---:|---:|
| Mammo + (B) | 85 | 990 | 1075 |
| Mammo - | 15 | 8910 | 8925 |
| Total | 100 | 9900 | 10000 |

]

---

## Exercise

Edison Research gathered exit poll results from several sources for the Wisconsin recall election of Scott Walker. They found that 53% of the respondents voted in favor of Scott Walker. Additionally, they estimated that of those who voted in favor of Scott Walker, 37% had a college degree, while 44% of those who voted against Scott Walker had a college degree. Suppose we randomly sampled a person who participated in the exit poll and found that he had a college degree. What is the probability that he voted in favor of Scott Walker?

---

## Random variables

- A **random variable** is a mapping from possible outcomes in a sample space to real numbers.
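
A minimal sketch of this mapping in Python (illustrative only; the fair-coin probabilities below are an assumed example, not part of the slides):

```python
# A random variable maps each outcome in the sample space to a number.
# Assumed example: a fair coin, with heads mapped to 1 and tails to 0.
sample_space = {"H": 0.5, "T": 0.5}  # outcome -> probability
X = {"H": 1, "T": 0}                 # random variable: outcome -> real number

# P(X = 1) sums the probabilities of all outcomes that X maps to 1
p_heads = sum(p for outcome, p in sample_space.items() if X[outcome] == 1)
```

Here `P(X = 1)` recovers `P(heads)`.
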
- Recall: the **sample space** is the set of all possible outcomes from a **random process**

<img src="img/coinToss.png" width="60%" style="display: block; margin: auto;" />

.small[
Source: https://medium.com/jun94-devpblog/prob-stats-1-random-variable-483c45242b3c
]

---

## Random variables

<img src="img/coinToss.png" width="60%" style="display: block; margin: auto;" />

.small[
Source: https://medium.com/jun94-devpblog/prob-stats-1-random-variable-483c45242b3c
]

- Let `\(X\)` be the random variable indicating whether a coin flip results in heads.
- Instead of saying `\(P(\texttt{heads})\)`, we say `\(P(X = 1)\)`
- This representation allows us to apply mathematical frameworks and get a better understanding of real-world phenomena

---

## Random variables

- Denoted by capital letters, e.g., `\(X\)`, `\(Y\)`, `\(Z\)`
- A realization or draw of the random variable is denoted by a lowercase letter, e.g., `\(x\)`, `\(y\)`, `\(z\)`

--

- Other examples of random variables:
  - Mass of classroom chairs
  - Ages of students at UC Davis

--

- **Discrete** random variables:
  - Each outcome has an associated probability `\(P(X = x_i)\)`, where `\(i = 1, ..., k\)` (the `\(k\)` outcomes are denoted by lowercase `\(x_1, ..., x_k\)`)
  - These probabilities are sometimes also written as `\(p_1, ..., p_k\)`

---

## Expectation

- Two books are assigned: a textbook and a study guide.
- 20% of students do not buy either book, 55% buy the textbook only, and 25% buy both books
- Say the class has 100 students. How many books should the bookstore expect to sell to the class?

--

- How many books should the bookstore expect to sell per student?
- Intuitively: `\(.2*0 + .55*1 + .25*2 = 1.05\)`

---

## Expectation

- Let `\(X=\)` number of books sold per student
- The three possible outcomes are `\(x_1\)` = 0 books, `\(x_2\)` = 1 book (1 textbook for each student), and `\(x_3\)` = 2 books (1 textbook and 1 study guide for each student)

i | 1 | 2 | 3
--|--|--|--
`\(x_i\)` | 0 | 1 | 2
`\(P(X = x_i)\)` | .2 | .55 | .25

$$
`\begin{aligned}
E(X) &= x_1 \times P(X = x_1) + x_2 \times P(X = x_2) + ... + x_k \times P(X = x_k)\\
&= \sum_{i=1}^k x_i P(X = x_i)
\end{aligned}`
$$

- Using this definition: `\(E(X) = 0*.2 + 1*.55 + 2*.25 = 1.05\)`.

---

## Expectation

- Say we are interested in the **amount of revenue** that the bookstore can expect to earn per student.
- The textbook costs $ 137 and the study guide costs $ 33.
- What modifications do we need?

Before:

- Let `\(X=\)` number of books sold per student
- The three possible outcomes are `\(x_1\)` = 0 books, `\(x_2\)` = 1 book (1 textbook for each student), and `\(x_3\)` = 2 books (1 textbook and 1 study guide for each student)

i | 1 | 2 | 3
--|--|--|--
`\(x_i\)` | 0 | 1 | 2
`\(P(X = x_i)\)` | .2 | .55 | .25

- `\(E(X) = 0*.2 + 1*.55 + 2*.25 = 1.05\)`.

---

## Expectation

- Let `\(X=\)` revenue from books sold per student
- The three possible outcomes are `\(x_1\)` = $ 0, `\(x_2\)` = $ 137 (1 textbook for each student), and `\(x_3\)` = $ 170 (1 textbook and 1 study guide for each student)

i | 1 | 2 | 3
--|--|--|--
`\(x_i\)` | 0 | 137 | 170
`\(P(X = x_i)\)` | .2 | .55 | .25

- Using this definition: `\(E(X) = 0*.2 + 137*.55 + 170*.25 = 117.85\)`.

---

## Expectation

- Denoted by `\(E(X)\)`, `\(\mu\)`, or `\(\mu_x\)`
- The **expected or average outcome** of `\(X\)`, where `\(X\)` is a random variable

--

- Given a probability distribution for a **discrete random variable**, we can calculate it using

$$
`\begin{aligned}
E(X) &= x_1 \times P(X = x_1) + x_2 \times P(X = x_2) + ...
+ x_k \times P(X = x_k)\\
&= \sum_{i=1}^k x_i P(X = x_i)
\end{aligned}`
$$

- Recall: this is a **population parameter**, a fixed quantity
- The sample version, the **sample statistic**, is the sample mean `\(\bar{X}\)`

---

## Properties of the expectation

- `\(E[c] = c\)`, where `\(c\)` is a constant
- `\(E[aX] = aE[X]\)`
- `\(E[aX + c] = aE[X] + c\)`
- To calculate `\(E(X^2)\)` (will be useful later), simply replace `\(x_i\)` in the sum by `\(x_i^2\)`, i.e.,

$$
`\begin{aligned}
E(X^2) &= x_1^2 \times P(X = x_1) + x_2^2 \times P(X = x_2) + ... + x_k^2 \times P(X = x_k)\\
&= \sum_{i=1}^k x_i^2 P(X = x_i)
\end{aligned}`
$$

- More generally, `\(E[g(X)] = \sum_{i=1}^k g(x_i) P(X = x_i)\)` (the law of the unconscious statistician)

---

## Variance

- Recall: we saw the **sample variance**, calculated for a data set
  - Take the squared deviations and average them
  - `\(s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n - 1}\)`
- The **population variance** is often denoted by `\(\sigma^2\)`, `\(\sigma_x^2\)`, or `\(Var(X)\)`
- Given a probability distribution for a discrete random variable, we can calculate it using

.small[
$$
`\begin{aligned}
Var(X) &= E[(X-\mu)^2] \\
&= (x_1 - \mu)^2 \times P(X = x_1) + (x_2 - \mu)^2 \times P(X = x_2) + ... + (x_k - \mu)^2 \times P(X = x_k)\\
&= \sum_{i=1}^k (x_i - \mu)^2 P(X = x_i)
\end{aligned}`
$$
]

- Note: rather than summing over observations, these sums are over possible outcomes, weighted by their probabilities

---

## Variance

Another common way to write the variance is

$$
`\begin{aligned}
Var(X) &= E[(X-E(X))^2] \\
&= E[ (X^2 - 2XE(X) + [E(X)]^2) ] \\
&= E(X^2) - 2E(X)E(X) + [E(X)]^2 \\
&= E(X^2) - [E(X)]^2
\end{aligned}`
$$

- Recall: `\(E(X) = \sum_{i=1}^k x_i P(X = x_i)\)`
- To calculate `\(E(X^2)\)`, simply replace `\(x_i\)` in the sum above by `\(x_i^2\)`, i.e.,

$$
`\begin{aligned}
E(X^2) &= x_1^2 \times P(X = x_1) + x_2^2 \times P(X = x_2) + ...
+ x_k^2 \times P(X = x_k)\\
&= \sum_{i=1}^k x_i^2 P(X = x_i)
\end{aligned}`
$$

---

## Properties of the variance

- `\(Var[c] = 0\)`, where `\(c\)` is a constant
- `\(Var[aX] = a^2Var[X]\)`
- `\(Var[aX + c] = a^2Var[X]\)`

---

## Linear combinations of random variables

- Often we care not just about a single random variable, but about a combination of them
- E.g.,
  - The total revenue of our bookstore is a combination of books from different classes, not just our one statistics class
  - The total gain or loss in a stock portfolio is the sum of the gains and losses in its components
  - Total weekly commute time is a combination of the daily commutes

---

## Linear combinations of random variables

- Let `\(W\)` be the weekly commute time per student at UC Davis
  - `\(X_1\)` = commute time per student on Monday
  - `\(X_2\)` = commute time per student on Tuesday
  - ...
  - `\(X_5\)` = commute time per student on Friday
- `\(W = X_1 + X_2 + ... + X_5\)` is also a random variable

---

## Linear combinations of random variables

- A **linear combination** of two random variables, `\(X\)` and `\(Y\)`, is of the form
$$
aX + bY,
$$
where `\(a\)` and `\(b\)` are constants.
- `\(a\)` and `\(b\)` are also called coefficients
- In our example, `\(W = X_1 + X_2 + ... + X_5\)` is a linear combination with all coefficients equal to 1.

---

## Expectation of linear combinations of random variables

- The expectation of a linear combination of random variables is given by
$$
E(aX + bY) = aE(X) + bE(Y)
$$
- In our example, say `\(E(X_1) = ... = E(X_5) = 21\)` minutes.
- Then, `\(E(W) = 1*21 + 1*21 + 1*21 + 1*21 + 1*21 = 105\)` minutes.

---

## Variance of linear combinations of random variables

- The variance of a linear combination of **independent** random variables is given by
$$
Var(aX + bY) = a^2 Var(X) + b^2 Var(Y)
$$
- **Note: this is only true if `\(X\)` and `\(Y\)` are independent**.

--

- In our example, say `\(Var(X_1) = ... = Var(X_5) = 5\)` minutes².
- Commute times on each day of the week are independent.
- Then, `\(Var(W) = 1^2*5 + 1^2*5 + 1^2*5 + 1^2*5 + 1^2*5 = 25\)` minutes².

---

## Exercise

A portfolio's value increases by 18% during a financial boom and by 9% during normal times. It decreases by 12% during a recession. What is the expected return on this portfolio if each scenario is equally likely?

---

## Exercise

Suppose we have independent observations `\(X_1\)` and `\(X_2\)` from a distribution with mean `\(\mu\)` and standard deviation `\(\sigma\)`. What is the variance of the mean of the two values, `\(\frac{X_1 + X_2}{2}\)`?

--

Suppose we have 3 independent observations `\(X_1\)`, `\(X_2\)`, and `\(X_3\)` from a distribution with mean `\(\mu\)` and standard deviation `\(\sigma\)`. What is the variance of the mean of these 3 values, `\(\frac{X_1 + X_2 + X_3}{3}\)`?

--

Suppose we have `\(n\)` independent observations `\(X_1\)`, `\(X_2\)`, ..., `\(X_n\)` from a distribution with mean `\(\mu\)` and standard deviation `\(\sigma\)`. What is the variance of the mean of these `\(n\)` values, `\(\frac{X_1 + X_2 + ... + X_n}{n}\)`?

---

## Summary

--

- Bayes' Theorem
  - `\(P(A \mid B) =\frac{P(A \cap B)}{P(B)} =\frac{P(B \mid A)P(A)}{P(B)} =\frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)}\)`
- Random variables
- Expectation and variance
- Discrete random variables
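
---

## Appendix: checking the calculations in code

The slide computations can be reproduced in a few lines of Python (a sketch; the numbers are the ones used in the slides):

```python
# Bayes' Theorem: P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|A^c)P(A^c))
p_A, p_B_given_A, p_B_given_Ac = 0.01, 0.85, 0.1
posterior = (p_B_given_A * p_A) / (p_B_given_A * p_A + p_B_given_Ac * (1 - p_A))
# round(posterior, 3) gives 0.079, matching the hypothetical-10,000 table (85/1075)

# Expectation and variance for the bookstore example
xs = [0, 1, 2]          # number of books per student
ps = [0.2, 0.55, 0.25]  # P(X = x_i)
E_X = sum(x * p for x, p in zip(xs, ps))      # 0*.2 + 1*.55 + 2*.25 = 1.05
E_X2 = sum(x**2 * p for x, p in zip(xs, ps))  # 0*.2 + 1*.55 + 4*.25 = 1.55
Var_X = E_X2 - E_X**2                         # Var(X) = E(X^2) - [E(X)]^2
```
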