Descriptive Statistics: Numerical Data

class: center, middle, inverse, title-slide

.title[
# Descriptive Statistics: Numerical Data
]
.subtitle[
## <br><br> STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 23, 2024
]

---

layout: true

---

## Today
- Relationships between numerical variables

- Describing categorical distributions

- Bar plot

- Relationships between categorical data

- Contingency tables

---
## Relationships between numerical variables

- Paired or bivariate data

- Scatterplot

- Hexplot

- Correlation
  
- Line graph

---

## Scatterplot
We have seen many examples of scatterplots:

--
- UN Votes

- Star Wars

- Anscombe's quartet

- Palmer Penguins

---

## Scatterplot
Each point is a single observation with **two characteristics**, or variables: one plotted on the x-axis and the other on the y-axis

``` r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()
```

---

## Scatterplot in base R

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()
```

```
## Warning: Removed 24 rows containing missing values or values outside the
## scale range (`geom_point()`).
```

<img src="lecture11_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[

``` r
plot(loans$debt_to_income, loans$interest_rate)
```

<img src="lecture11_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Overplotting and hex plots

- **Overplotting** is when points are plotted on top of each other

- Common in **large data sets**

- Ways to deal with this include making points semi-transparent with `alpha`, or adding a small amount of random noise with `jitter()`

- Alternatively, use **hex plots** (also called hexbin plots)

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()
```

<img src="lecture11_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[

``` r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_hex()
```

<img src="lecture11_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]
]
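
The transparency and jitter options above can be sketched in base R. This is a minimal illustration on simulated data (the variables `x` and `y` are made up here and are not the `loans` data); in ggplot2 the rough counterparts would be `geom_point(alpha = ...)` and `geom_jitter()`.

``` r
# Simulated data (hypothetical, not the loans data): rounding creates ties.
set.seed(1)
n <- 2000
x <- round(rnorm(n, mean = 20, sd = 5))
y <- round(10 + 0.2 * x + rnorm(n, sd = 2))

# Many rows share exactly the same (x, y) pair, so points overlap.
n_distinct_raw <- nrow(unique(data.frame(x, y)))

# Option 1: transparency. Darker areas contain more overlapping points.
plot(x, y, pch = 16, col = adjustcolor("black", alpha.f = 0.1))

# Option 2: jitter() adds a small amount of random noise to break ties.
x_jit <- jitter(x)
y_jit <- jitter(y)
n_distinct_jit <- nrow(unique(data.frame(x = x_jit, y = y_jit)))
plot(x_jit, y_jit, pch = 16)

# Number of distinct plotting positions before and after jittering.
c(raw = n_distinct_raw, jittered = n_distinct_jit)
```

Jittering changes the plotted positions slightly, so it is better suited to exploring data than to reading off exact values.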

---

## Hex plot

- Hex plots **divide the graphing surface into hexagons**

- All points are grouped into their respective hexagonal regions 
  
  - **Color gradient** indicates the number of observations (count) in each hexagonal area.
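
The idea of grouping points into regions and coloring by count can be sketched by hand. To keep the code short, this example swaps hexagons for square bins, so it is an analogy to what `geom_hex()` does rather than its actual algorithm; the data are simulated and the choice of 10 bins per axis is arbitrary.

``` r
# Simulated data (hypothetical): two positively related variables.
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)

# Split each axis into 10 equal-width intervals.
x_bin <- cut(x, breaks = 10)
y_bin <- cut(y, breaks = 10)

# Count the observations in each (x_bin, y_bin) cell.
counts <- table(x_bin, y_bin)

# Every observation falls into exactly one cell.
sum(counts) == n

# Draw the counts as a color gradient, one colored square per cell.
image(unclass(counts), main = "Counts per square bin")
```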

---
## Correlation

- Correlation is the **association between two variables**

- **(Pearson) Correlation coefficient** is a measure of **linear** correlation between two sets of data

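- Ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship); 0 indicates no linear relationship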
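
A small sketch with base R's `cor()`, which computes the Pearson correlation coefficient by default, illustrates the range and the emphasis on *linear* association. The vectors are made up for illustration.

``` r
x <- -5:5

cor(x, 2 * x + 1)   # 1: perfectly linear, increasing
cor(x, -3 * x)      # -1: perfectly linear, decreasing
cor(x, x^2)         # essentially 0: y depends on x, but not linearly

# With noise added, the coefficient falls strictly between -1 and 1.
set.seed(1)
y_noisy <- x + rnorm(length(x))
r_noisy <- cor(x, y_noisy)
r_noisy
```

The third case shows why a correlation near 0 does not by itself mean the variables are unrelated.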

---
## Correlation

Recall:
- Sample mean: `$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i = 1}^{n} x_i}{n}$`
- Sample variance: `$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{n-1}$`

- Population mean: `$\mu$`
- Population variance: `$\sigma^2$`

- When talking about a population parameter for a variable `$x$`, we might add the subscript `$x$`, e.g., `$\mu_x$`, `$\sigma^2_x$`; similarly for a sample statistic, e.g., `$s_x^2$`
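
The sample mean and sample variance formulas recalled above can be checked by computing them by hand and comparing with R's built-in `mean()` and `var()`. The data vector is made up for illustration.

``` r
x <- c(2, 4, 4, 6, 9)
n <- length(x)

# Sample mean: sum of the observations divided by n.
x_bar <- sum(x) / n

# Sample variance: sum of squared deviations divided by n - 1.
s2 <- sum((x - x_bar)^2) / (n - 1)

x_bar  # 5
s2     # 7

# The built-in functions give the same values.
mean(x)
var(x)
```

Note that `var()` divides by `n - 1`, matching the sample variance formula rather than a population variance with divisor `n`.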