Descriptive Statistics: Categorical Data

class: center, middle, inverse, title-slide

.title[
# Descriptive Statistics: Categorical Data
]
.subtitle[
## <br><br> STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 25, 2024
]

---

layout: true

---

## Today

- Describing categorical distributions

- Relationships between categorical data

- Relationships between numerical and categorical data

---
## Data: Lending Club

- Lending Club is a platform that allows individuals to lend to other individuals

``` r
library(tidyverse)
library(openintro)  # provides loans_full_schema
loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade, 
         state, annual_income, homeownership, debt_to_income,
         issue_month)
glimpse(loans)
```

```
## Rows: 10,000
## Columns: 9
## $ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade          <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …
## $ issue_month    <fct> Mar-2018, Feb-2018, Feb-2018, Jan-2018, …
```

---
## Bar plot

A bar plot is a common way to display a **single categorical variable**.

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = homeownership)) +
  geom_bar()
```

<img src="lecture12_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[

``` r
barplot(table(loans$homeownership)
        [table(loans$homeownership) > 0])
```

<img src="lecture12_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />
]
]
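
The base R call above removes empty categories by indexing the table. A minimal sketch of an alternative, using `droplevels()` on a small made-up factor (the `home` vector is hypothetical, not the Lending Club data):

``` r
# Hypothetical toy factor with two unused levels ("" and "ANY")
home <- factor(c("RENT", "OWN", "RENT", "MORTGAGE", "RENT"),
               levels = c("", "ANY", "MORTGAGE", "OWN", "RENT"))

# Approach from the slide: keep only categories with a positive count
counts_filtered <- table(home)[table(home) > 0]

# Alternative: drop unused levels before tabulating
counts_dropped <- table(droplevels(home))

counts_dropped
barplot(counts_dropped)
```

Both approaches should give the same three bars; `droplevels()` avoids calling `table()` twice.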

---
## Bar plot with proportions

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = homeownership)) +
  geom_bar()
```

<img src="lecture12_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[

``` r
ggplot(loans, aes(x = homeownership)) +
  geom_bar(aes(y = after_stat(count)/sum(after_stat(count))))
```

<img src="lecture12_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]
]
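
The proportions on the right are each category's count divided by the total count. As a rough check, the same numbers can be computed in base R. The `home` vector here is made up for illustration:

``` r
# Hypothetical toy data, not the Lending Club data
home <- c("RENT", "OWN", "RENT", "MORTGAGE", "RENT", "MORTGAGE", "RENT", "OWN")

counts <- table(home)

# Count divided by total, the same ratio used in the y aesthetic
props <- counts / sum(counts)

# prop.table() with no margin computes the same thing
props_alt <- prop.table(counts)

props
```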

---
## Contingency tables
A contingency table summarizes data for **two categorical variables**.

``` r
xtabs(~ homeownership + grade, data = loans_full_schema)
```

```
##              grade
## homeownership         A    B    C    D    E    F    G
##                  0    0    0    0    0    0    0    0
##      ANY         0    0    0    0    0    0    0    0
##      MORTGAGE    0 1285 1499 1234  587  148   32    4
##      OWN         0  347  414  335  211   38    5    3
##      RENT        0  827 1124 1084  648  149   21    5
```

Each value in the table is the number of times a particular combination of outcomes occurred. Together, the values give the **frequency distribution** of the two variables.

---
## Contingency tables

``` r
xtabs(~ homeownership + grade, data = loans_full_schema)
```

- An additional row for **column totals** is often included
- Similarly, an additional column for **row totals**
- How do we code this in R?
  - `rowSums` and `colSums`

---
## Contingency tables with row and column totals

``` r
outTable <- xtabs(~ homeownership + grade, data = loans_full_schema)
outTableTotals <- outTable %>%
  cbind(rowTotal = rowSums(outTable)) 
outTableTotals <- outTableTotals %>%
  rbind(columnTotal = colSums(outTableTotals))
outTableTotals
```

```
##                  A    B    C    D   E  F  G rowTotal
##             0    0    0    0    0   0  0  0        0
## ANY         0    0    0    0    0   0  0  0        0
## MORTGAGE    0 1285 1499 1234  587 148 32  4     4789
## OWN         0  347  414  335  211  38  5  3     1353
## RENT        0  827 1124 1084  648 149 21  5     3858
## columnTotal 0 2459 3037 2653 1446 335 58 12    10000
```
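
Base R also provides `addmargins()`, which appends both sets of totals in one call. A minimal sketch on a made-up table (the `toy` data frame is hypothetical); note that the margins are labelled `Sum` by default rather than `rowTotal` and `columnTotal`:

``` r
# Hypothetical toy data, not the Lending Club data
toy <- data.frame(
  homeownership = c("RENT", "RENT", "OWN", "MORTGAGE", "MORTGAGE", "MORTGAGE"),
  grade         = c("A", "B", "A", "A", "B", "B")
)

toyTable <- xtabs(~ homeownership + grade, data = toy)

# Adds a "Sum" row (column totals) and a "Sum" column (row totals)
toyTotals <- addmargins(toyTable)
toyTotals
```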

---
## Contingency tables with proportions
- Sometimes, proportions might be more useful than totals

- Row proportions are the proportion out of **row totals**

- Column proportions are the proportion out of **column totals**

- How do we code **row proportions** in R?

- How about **column proportions**?

---
## Contingency tables with proportions
- How do we code **row proportions** in R?

.tiny[

``` r
prop.table(outTable, margin = 1)
```

```
##              grade
## homeownership                         A            B
##                                                     
##      ANY                                            
##      MORTGAGE 0.0000000000 0.2683232408 0.3130089789
##      OWN      0.0000000000 0.2564671101 0.3059866962
##      RENT     0.0000000000 0.2143597719 0.2913426646
##              grade
## homeownership            C            D            E
##                                                     
##      ANY                                            
##      MORTGAGE 0.2576738359 0.1225725621 0.0309041554
##      OWN      0.2475979305 0.1559497413 0.0280857354
##      RENT     0.2809745982 0.1679626750 0.0386210472
##              grade
## homeownership            F            G
##                                        
##      ANY                               
##      MORTGAGE 0.0066819795 0.0008352474
##      OWN      0.0036954915 0.0022172949
##      RENT     0.0054432348 0.0012960083
```
]

- Note that each row should sum to 1
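
A quick check of this, as a sketch on made-up data (the `toy` data frame is hypothetical). Rows with no observations, like the blank and `ANY` rows above, have a total of 0, so their proportions are 0/0 and are not defined:

``` r
# Hypothetical toy data with an unused level, "ANY"
toy <- data.frame(
  homeownership = factor(c("RENT", "RENT", "OWN", "MORTGAGE", "MORTGAGE", "MORTGAGE"),
                         levels = c("ANY", "MORTGAGE", "OWN", "RENT")),
  grade = factor(c("A", "B", "A", "A", "B", "B"))
)
toyTable <- xtabs(~ homeownership + grade, data = toy)

# Row proportions: each cell divided by its row total
rowProps <- prop.table(toyTable, margin = 1)
rowSums(rowProps)   # the empty ANY row is not a number

# Dropping empty rows first avoids the undefined values
nonEmpty <- toyTable[rowSums(toyTable) > 0, , drop = FALSE]
rowSums(prop.table(nonEmpty, margin = 1))
```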

---
## Contingency tables with proportions
- How about **column proportions**?

.tiny[

``` r
prop.table(outTable, margin = 2)
```

```
##              grade
## homeownership          A         B         C         D         E
##                0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
##      ANY       0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
##      MORTGAGE  0.5225702 0.4935792 0.4651338 0.4059474 0.4417910
##      OWN       0.1411143 0.1363187 0.1262721 0.1459198 0.1134328
##      RENT      0.3363156 0.3701021 0.4085940 0.4481328 0.4447761
##              grade
## homeownership         F         G
##               0.0000000 0.0000000
##      ANY      0.0000000 0.0000000
##      MORTGAGE 0.5517241 0.3333333
##      OWN      0.0862069 0.2500000
##      RENT     0.3620690 0.4166667
```
]

- Note that each column should sum to 1
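
To see what `margin = 2` does, here is a sketch that divides each column by its total by hand and compares the result with `prop.table()`. The small count matrix is made up for illustration:

``` r
# Hypothetical counts, not the Lending Club data
counts <- matrix(c(4, 1, 5,
                   6, 9, 5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(homeownership = c("OWN", "RENT"),
                                 grade = c("A", "B", "C")))

# Divide every column (margin 2) by its column total
byHand <- sweep(counts, 2, colSums(counts), "/")

builtIn <- prop.table(counts, margin = 2)

all.equal(byHand, builtIn)
colSums(builtIn)
```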

---
## Stacked bar plot

- A stacked bar plot displays counts across two categorical variables

- Each bar in a standard bar plot is divided into stacked sub-bars, each one corresponding to a level of the second categorical variable.

``` r
ggplot(loans, aes(x = homeownership, 
                  fill = grade)) + #<<
  geom_bar()
```

<img src="lecture12_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" />
---

## Stacked bar plot
Turning `grade` into an ordered factor makes `ggplot` use the `viridis` scale by default

``` r
str(loans$grade)
```

```
##  Factor w/ 8 levels "","A","B","C",..: 4 4 5 2 4 2 4 3 4 2 ...
```

``` r
loans <- loans %>%
  mutate(grade = factor(grade, ordered = TRUE))
str(loans$grade)
```

```
##  Ord.factor w/ 7 levels "A"<"B"<"C"<"D"<..: 3 3 4 1 3 1 3 2 3 1 ...
```
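
Two things happen in the call above: the unused empty-string level is dropped (8 levels become 7), and the remaining levels keep their existing order. A sketch with a made-up factor, also showing how the order could be set explicitly with `levels` (the `g` vector is hypothetical):

``` r
# Hypothetical toy factor; "" and "B" are levels that never occur
g <- factor(c("C", "C", "D", "A"), levels = c("", "A", "B", "C", "D"))
nlevels(g)

# Re-creating the factor drops unused levels and marks it as ordered
gOrd <- factor(g, ordered = TRUE)
levels(gOrd)
gOrd[1] > gOrd[4]   # compares "C" with "A" using the level order

# An explicit, reversed order: D < C < B < A
gRev <- factor(g, levels = c("D", "C", "B", "A"), ordered = TRUE)
gRev[1] > gRev[4]   # same values, opposite result
```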

---
## Stacked bar plot

``` r
ggplot(loans, aes(x = homeownership, 
                  fill = grade)) + #<<
  geom_bar()
```

---

## Stacked bar plot
Adding the `position = "fill"` argument switches the plot from counts to proportions and standardizes the height of the bars

``` r
ggplot(loans, aes(x = homeownership, fill = grade)) +
  geom_bar(position = "fill") #<<
```
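
The segment heights drawn by `position = "fill"` are within-bar proportions. A sketch of the same calculation with `dplyr`, on made-up data (the `toy` tibble and the `prop` column are hypothetical names):

``` r
library(dplyr)

# Hypothetical toy data, not the Lending Club data
toy <- tibble(
  homeownership = c("RENT", "RENT", "RENT", "RENT", "OWN", "OWN"),
  grade         = c("A", "A", "A", "B", "A", "B")
)

toyProps <- toy %>%
  count(homeownership, grade) %>%   # one row per combination, count in n
  group_by(homeownership) %>%       # proportions are taken within each bar
  mutate(prop = n / sum(n)) %>%
  ungroup()

toyProps
```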

---
## Stacked bar plot: counts vs. proportions
Which bar plot is a **more useful representation** for visualizing the relationship between homeownership and grade?

.pull-left[
<img src="lecture12_files/figure-html/unnamed-chunk-17-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="lecture12_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" />
]

If there were **no relationship** between homeownership and grade, we would expect each grade's segment to have a similar length across the homeownership statuses (columns).
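
A sketch of what "no relationship" looks like numerically, using a made-up table in which the second row is exactly twice the first. The counts differ, but the row proportions are identical, so the proportion bars would be divided in the same places:

``` r
# Hypothetical counts constructed so that grade does not depend on homeownership
counts <- matrix(c(10, 20, 30,
                   20, 40, 60),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(homeownership = c("OWN", "RENT"),
                                 grade = c("A", "B", "C")))

# Proportion of each grade within each homeownership status
rowProps <- prop.table(counts, margin = 1)
rowProps
```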

---
## Stacked bar plot: counts vs. proportions
Is there a relationship between homeownership and grade?

.pull-left[
<img src="lecture12_files/figure-html/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="lecture12_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Customizing bar plots

.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-21-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(y = homeownership, #<<
                  fill = grade)) +
  geom_bar(position = "fill") +
  labs( #<<
    x = "Proportion", #<<
    y = "Homeownership", #<<
    fill = "Grade", #<<
    title = "Grades of Lending Club loans", #<<
    subtitle = "and homeownership of lendee" #<<
  ) #<<
```
]
]

---
## Relationships between numerical and categorical data 
- We saw histograms, boxplots, and density plots earlier, for describing a **single numerical variable**

- To look at **relationships between these numerical data and a categorical variable**, we can:

  - Fill and facet histograms and density plots
  
  - Use side-by-side boxplots
  
  - Violin plots
  
  - Ridge plots

- Numerical summaries
  - `group_by()`

---

## Fill a histogram with a categorical variable

Is there a relationship between loan amount and home-ownership status?

.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount, 
                  fill = homeownership)) + #<<
  geom_histogram(binwidth = 5000,
                 alpha = 0.5) + #<<
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
  scale_fill_viridis_d()
```
]
]

---

## Fill a histogram with a categorical variable

Is there a relationship between loan amount and home-ownership status?
  - Use the `position = "identity"` argument if the histogram bars should overlap rather than be stacked on top of one another

.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-23-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount, 
                  fill = homeownership)) + 
  geom_histogram(position = "identity", #<<
                 binwidth = 5000,
                 alpha = 0.5) + 
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
  scale_fill_viridis_d()
```
]
]

---

## Facet a histogram with a categorical variable

Is there a relationship between loan amount and home-ownership status?

.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-24-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount, fill = homeownership)) + 
  geom_histogram(binwidth = 5000) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
  facet_wrap(~ homeownership, nrow = 3) + #<<
  scale_fill_viridis_d()
```
]
]

---
## Filling density plots with a categorical variable

Is there a relationship between loan amount and home-ownership status?

.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-25-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount, 
                  fill = homeownership)) + #<<
  geom_density(adjust = 2, 
               alpha = 0.5) + #<<
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of Lending Club loans", 
    fill = "Homeownership" #<<
  ) +
  scale_fill_viridis_d()
```
]
]

---

## Side-by-side boxplots

Is there a relationship between loan amount and home-ownership status?

.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-26-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount,
                  y = homeownership)) + #<<
  geom_boxplot() +
  labs(
    x = "Loan amount ($)",
    y = "Home-ownership status",
    title = "Amounts of Lending Club loans" #<<
  )
```
]
]
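
Each box is drawn from a five-number summary of its group. As a sketch, base R's `boxplot()` returns these numbers when called with `plot = FALSE`. The `toy` data frame below is made up:

``` r
# Hypothetical toy data, not the Lending Club data
toy <- data.frame(
  loan_amount   = c(1000, 2000, 3000, 4000, 5000,
                    2000, 4000, 6000, 8000, 10000),
  homeownership = rep(c("OWN", "RENT"), each = 5)
)

# plot = FALSE returns the summaries without drawing anything
b <- boxplot(loan_amount ~ homeownership, data = toy, plot = FALSE)

b$names   # group labels, one per column of b$stats
b$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
```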

---

## Violin plots

Is there a relationship between loan amount and home-ownership status?

- A violin plot is a hybrid of a boxplot and a density plot

``` r
ggplot(loans, aes(x = homeownership, y = loan_amount)) +
  geom_violin()
```

---

## Ridge plots

``` r
library(ggridges)
ggplot(loans, aes(x = loan_amount, y = homeownership, 
                  fill = homeownership)) + 
  geom_density_ridges(alpha = 0.5) +
  scale_fill_viridis_d()
```

---

## Ridge plots

Ridge plots can also be used to investigate **more complicated relationships**, such as those between a categorical and numerical variable, conditional on another categorical variable

Here, we consider the relationship between loan grade and loan amount, conditional on each level of home ownership

``` r
ggplot(loans, aes(x = loan_amount, y = homeownership, fill = grade, color = grade)) + 
  geom_density_ridges(alpha = 0.5)
```

---

## Ridge plots

Here, we consider the relationship between loan grade and loan amount, conditional on each level of home ownership

``` r
ggplot(loans, aes(x = loan_amount, y = homeownership, fill = grade, color = grade)) + 
  geom_density_ridges(alpha = 0.5)
```

Interestingly, among those with mortgages, a higher proportion of grade G loans have high loan amounts.

---
## Numerical summaries in R, grouping by a categorical variable 
Question: `homeownership` is a factor variable with three observed levels: `MORTGAGE`, `OWN` and `RENT`. How do we calculate the mean loan amount for each home ownership status?

``` r
loans %>%
  group_by(homeownership) %>%
  summarize(meanLoan = mean(loan_amount))
```

```
## # A tibble: 3 × 2
##   homeownership meanLoan
##   <fct>            <dbl>
## 1 MORTGAGE        18129.
## 2 OWN             15684.
## 3 RENT            14406.
```
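
The same grouped mean can be computed in base R with `tapply()`, which can be handy for checking a `group_by()` result. A sketch on made-up data (the `toy` data frame is hypothetical):

``` r
# Hypothetical toy data, not the Lending Club data
toy <- data.frame(
  loan_amount   = c(10000, 20000, 6000, 8000, 5000, 7000, 9000),
  homeownership = c("MORTGAGE", "MORTGAGE", "OWN", "OWN", "RENT", "RENT", "RENT")
)

# Apply mean() to loan_amount separately within each homeownership group
meanLoan <- tapply(toy$loan_amount, toy$homeownership, mean)
meanLoan
```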

---

## What descriptive statistics to use? Which plots to produce?

- Fancier does not always mean better; a pretty plot can look great but tell us nothing

- Think about what question you are trying to answer and pick the figure that best suits the purpose

- Hadley Wickham on exploratory data analysis: "EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind." (Chapter 10, R for Data Science)

---

## Exercise

Rainfall data:

We are interested in looking at patterns of total daily rainfall in 2020.

- What type of visualization would be most suitable? 
- Write code to manipulate the data to a suitable form for plotting
- Write code to generate the visualization.

What if we are interested in comparing total daily rainfall during the summer months (June, July, August) with total daily rainfall during the winter months (December, January, February), over the entire data period?

---
## Summary

- Describing categorical distributions

  - Bar plot

- Relationships between categorical data

  - Contingency tables
  
  - Stacked bar plot

- Relationships between numerical and categorical data

  - Fill and facet
  
  - Side-by-side boxplots
  
  - Other fancy plots
  
  - Numerical summaries in R