class: center, middle, inverse, title-slide

.title[
# Descriptive Statistics: Categorical Data
]
.subtitle[
## STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 25, 2024
]

---

layout: true

---

<style type="text/css">
.tiny .remark-code {
  font-size: 60%;
}
.small .remark-code {
  font-size: 80%;
}
</style>

## Today

- Describing categorical distributions
- Relationships between categorical data
- Relationships between numerical and categorical data

---

## Data: Lending Club

- Lending Club is a platform that allows individuals to lend to other individuals

``` r
library(tidyverse)  # dplyr verbs and ggplot2
library(openintro)  # provides loans_full_schema

# Keep only the variables used in this lecture
loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade,
         state, annual_income, homeownership, debt_to_income, issue_month)
glimpse(loans)
```

```
## Rows: 10,000
## Columns: 9
## $ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade          <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …
## $ issue_month    <fct> Mar-2018, Feb-2018, Feb-2018, Jan-2018, …
```

---

## Bar plot

A bar plot is a common way to display a **single categorical variable**.
.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = homeownership)) +
  geom_bar()
```

<img src="lecture12_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
barplot(table(loans$homeownership)
        [table(loans$homeownership) > 0])
```

<img src="lecture12_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Bar plot with proportions

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = homeownership)) +
  geom_bar()
```

<img src="lecture12_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
ggplot(loans, aes(x = homeownership)) +
  geom_bar(aes(y = after_stat(count)/sum(after_stat(count))))
```

<img src="lecture12_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Contingency tables

A contingency table summarizes data for **two categorical variables**.

``` r
xtabs(~ homeownership + grade, data = loans_full_schema)
```

```
##              grade
## homeownership         A    B    C    D    E    F    G
##                  0    0    0    0    0    0    0    0
##   ANY            0    0    0    0    0    0    0    0
##   MORTGAGE       0 1285 1499 1234  587  148   32    4
##   OWN            0  347  414  335  211   38    5    3
##   RENT           0  827 1124 1084  648  149   21    5
```

Each value in the table is the number of times a particular combination of variable outcomes occurred; in other words, the table shows the **frequency distribution** of the variables.

---

## Contingency tables

``` r
xtabs(~ homeownership + grade, data = loans_full_schema)
```

```
##              grade
## homeownership         A    B    C    D    E    F    G
##                  0    0    0    0    0    0    0    0
##   ANY            0    0    0    0    0    0    0    0
##   MORTGAGE       0 1285 1499 1234  587  148   32    4
##   OWN            0  347  414  335  211   38    5    3
##   RENT           0  827 1124 1084  648  149   21    5
```

- An additional row for **column totals** is often included
- Similarly, an additional column for **row totals**
- How do we code this in R?
- `rowSums` and `colSums`

---

## Contingency tables with row and column totals

``` r
outTable <- xtabs(~ homeownership + grade, data = loans_full_schema)
outTableTotals <- outTable %>%
  cbind(rowTotal = rowSums(outTable))
outTableTotals <- outTableTotals %>%
  rbind(columnTotal = colSums(outTableTotals))
outTableTotals
```

```
##                  A    B    C    D   E  F  G rowTotal
##             0    0    0    0    0   0  0  0        0
## ANY         0    0    0    0    0   0  0  0        0
## MORTGAGE    0 1285 1499 1234  587 148 32  4     4789
## OWN         0  347  414  335  211  38  5  3     1353
## RENT        0  827 1124 1084  648 149 21  5     3858
## columnTotal 0 2459 3037 2653 1446 335 58 12    10000
```

---

## Contingency tables with proportions

- Sometimes, proportions are more useful than raw counts
- Row proportions are counts divided by their **row totals**
- Column proportions are counts divided by their **column totals**
- How do we code **row proportions** in R?
- How about **column proportions**?

---

## Contingency tables with proportions

- How do we code **row proportions** in R?

.tiny[

``` r
prop.table(outTable, margin = 1)
```

```
##              grade
## homeownership                         A            B
##
##   ANY
##   MORTGAGE    0.0000000000 0.2683232408 0.3130089789
##   OWN         0.0000000000 0.2564671101 0.3059866962
##   RENT        0.0000000000 0.2143597719 0.2913426646
##              grade
## homeownership            C            D            E
##
##   ANY
##   MORTGAGE    0.2576738359 0.1225725621 0.0309041554
##   OWN         0.2475979305 0.1559497413 0.0280857354
##   RENT        0.2809745982 0.1679626750 0.0386210472
##              grade
## homeownership            F            G
##
##   ANY
##   MORTGAGE    0.0066819795 0.0008352474
##   OWN         0.0036954915 0.0022172949
##   RENT        0.0054432348 0.0012960083
```
]

- Note that each row with at least one observation should sum to 1

---

## Contingency tables with proportions

- How about **column proportions**?

---

## Contingency tables with proportions

- How about **column proportions**?
.tiny[

``` r
prop.table(outTable, margin = 2)
```

```
##              grade
## homeownership                   A         B         C         D         E
##                         0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
##   ANY                   0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
##   MORTGAGE              0.5225702 0.4935792 0.4651338 0.4059474 0.4417910
##   OWN                   0.1411143 0.1363187 0.1262721 0.1459198 0.1134328
##   RENT                  0.3363156 0.3701021 0.4085940 0.4481328 0.4447761
##              grade
## homeownership         F         G
##               0.0000000 0.0000000
##   ANY         0.0000000 0.0000000
##   MORTGAGE    0.5517241 0.3333333
##   OWN         0.0862069 0.2500000
##   RENT        0.3620690 0.4166667
```
]

- Note that each column with at least one observation should sum to 1

---

## Stacked bar plot

- A stacked bar plot shows counts (or proportions) across two categorical variables
- Each bar in a standard bar plot is divided into stacked sub-bars, each one corresponding to a level of the second categorical variable.

``` r
ggplot(loans, aes(x = homeownership,
                  fill = grade)) + #<<
  geom_bar()
```

<img src="lecture12_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" />

---

## Stacked bar plot

Turning `grade` into an ordered variable makes `ggplot` use the `viridis` scale by default

``` r
str(loans$grade)
```

```
##  Factor w/ 8 levels "","A","B","C",..: 4 4 5 2 4 2 4 3 4 2 ...
```

``` r
loans <- loans %>%
  mutate(grade = factor(grade, ordered = TRUE))
str(loans$grade)
```

```
##  Ord.factor w/ 7 levels "A"<"B"<"C"<"D"<..: 3 3 4 1 3 1 3 2 3 1 ...
```

---

## Stacked bar plot

``` r
ggplot(loans, aes(x = homeownership,
                  fill = grade)) + #<<
  geom_bar()
```

<img src="lecture12_files/figure-html/unnamed-chunk-15-1.png" width="60%" style="display: block; margin: auto;" />

---

## Stacked bar plot

Adding the `position = "fill"` argument changes the visualization to proportions and standardizes the height of the columns

``` r
ggplot(loans, aes(x = homeownership, fill = grade)) +
  geom_bar(position = "fill") #<<
```

<img src="lecture12_files/figure-html/unnamed-chunk-16-1.png" width="60%" style="display: block; margin: auto;" />

---

## Stacked bar plot: counts vs. proportions

Which bar plot is a **more useful representation** for visualizing the relationship between homeownership and grade?

.pull-left[
<img src="lecture12_files/figure-html/unnamed-chunk-17-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="lecture12_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" />
]

If there were **no relationship** between homeownership and grade, we would expect the sub-bar for each grade to have a similar length across the homeownership statuses (columns).

---

## Stacked bar plot: counts vs. proportions

Is there a relationship between homeownership and grade?
.pull-left[
<img src="lecture12_files/figure-html/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="lecture12_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Customizing bar plots

.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-21-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(y = homeownership, #<<
                  fill = grade)) +
  geom_bar(position = "fill") +
  labs( #<<
    x = "Proportion", #<<
    y = "Homeownership", #<<
    fill = "Grade", #<<
    title = "Grades of Lending Club loans", #<<
    subtitle = "and homeownership of lendee" #<<
  ) #<<
```
]
]

---

## Relationships between numerical and categorical data

- We saw histograms, boxplots, and density plots earlier, for describing a **single numerical variable**
- To look at **relationships between these numerical data and a categorical variable**, we can:
    - Fill and facet histograms and density plots
    - Use side-by-side boxplots
    - Violin plots
    - Ridge plots
    - Numerical summaries
        - `group_by()`

---

## Fill a histogram with a categorical variable

Is there a relationship between loan amount and home-ownership status?

.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount,
                  fill = homeownership)) + #<<
  geom_histogram(binwidth = 5000,
                 alpha = 0.5) + #<<
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
  scale_fill_viridis_d()
```
]
]

---

## Fill a histogram with a categorical variable

Is there a relationship between loan amount and home-ownership status?
- We need the `position = "identity"` argument if we don't want the histogram bars to be stacked on top of one another

.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-23-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount,
                  fill = homeownership)) +
  geom_histogram(position = "identity", #<<
                 binwidth = 5000,
                 alpha = 0.5) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
  scale_fill_viridis_d()
```
]
]

---

## Facet a histogram with a categorical variable

Is there a relationship between loan amount and home-ownership status?

.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-24-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount, fill = homeownership)) +
  geom_histogram(binwidth = 5000) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
  facet_wrap(~ homeownership, nrow = 3) + #<<
  scale_fill_viridis_d()
```
]
]

---

## Filling density plots with a categorical variable

Is there a relationship between loan amount and home-ownership status?

.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-25-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount,
                  fill = homeownership)) + #<<
  geom_density(adjust = 2,
               alpha = 0.5) + #<<
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of Lending Club loans",
    fill = "Homeownership" #<<
  ) +
  scale_fill_viridis_d()
```
]
]

---

## Side-by-side boxplots

Is there a relationship between loan amount and home-ownership status?
.panelset[
.panel[.panel-name[Plot]
<img src="lecture12_files/figure-html/unnamed-chunk-26-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount,
                  y = homeownership)) + #<<
  geom_boxplot() +
  labs(
    x = "Loan amount ($)",
    y = "Home-ownership status",
    title = "Amounts of Lending Club loans" #<<
  )
```
]
]

---

## Violin plots

Is there a relationship between loan amount and home-ownership status?

- A violin plot is a hybrid of a boxplot and a density plot

``` r
ggplot(loans, aes(x = homeownership, y = loan_amount)) +
  geom_violin()
```

<img src="lecture12_files/figure-html/unnamed-chunk-27-1.png" width="60%" style="display: block; margin: auto;" />

---

## Ridge plots

``` r
library(ggridges)
ggplot(loans, aes(x = loan_amount, y = homeownership,
                  fill = homeownership)) +
  geom_density_ridges(alpha = 0.5) +
  scale_fill_viridis_d()
```

<img src="lecture12_files/figure-html/unnamed-chunk-28-1.png" width="60%" style="display: block; margin: auto;" />

---

## Ridge plots

Ridge plots can also be used to investigate **more complicated relationships**, such as those between a categorical and a numerical variable, conditional on another categorical variable.

Here, we consider the relationship between loan grade and loan amount, conditional on each level of home ownership.

``` r
ggplot(loans, aes(x = loan_amount, y = homeownership,
                  fill = grade, color = grade)) +
  geom_density_ridges(alpha = 0.5)
```

<img src="lecture12_files/figure-html/unnamed-chunk-29-1.png" width="60%" style="display: block; margin: auto;" />

---

## Ridge plots

Here, we consider the relationship between loan grade and loan amount, conditional on each level of home ownership.

``` r
ggplot(loans, aes(x = loan_amount, y = homeownership,
                  fill = grade, color = grade)) +
  geom_density_ridges(alpha = 0.5)
```

<img src="lecture12_files/figure-html/unnamed-chunk-30-1.png" width="60%" style="display: block; margin: auto;" />

Interestingly, for borrowers with mortgages, a higher proportion of grade G loans have large loan amounts.

---

## Numerical summaries in R, grouping by a categorical variable

Question: `homeownership` is a factor variable with three observed levels, `MORTGAGE`, `OWN` and `RENT`. How do we calculate the mean loan amount for each type of home ownership status?

--

``` r
loans %>%
  group_by(homeownership) %>%
  summarize(meanLoan = mean(loan_amount))
```

```
## # A tibble: 3 × 2
##   homeownership meanLoan
##   <fct>            <dbl>
## 1 MORTGAGE        18129.
## 2 OWN             15684.
## 3 RENT            14406.
```

---

## What descriptive statistics to use? Which plots to produce?

- Fancier does not always mean better; a pretty plot can look great but tell us nothing
- Think about what question you are trying to answer and pick the figure that best suits the purpose
- Hadley Wickham on exploratory data analysis: "EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind." (Chapter 10, R for Data Science)

---

## Exercise

Rainfall data: We are interested in looking at patterns of total daily rainfall in 2020.

- What type of visualization would be most suitable?
- Write code to manipulate the data to a suitable form for plotting
- Write code to generate the visualization.

What if we are interested in comparing total daily rainfall during the summer months (June, July, August) and during the winter months (December, January, February), for the entire data period?

---

## Summary

--

- Describing categorical distributions
    - Bar plot
- Relationships between categorical data
    - Contingency tables
    - Stacked bar plot
- Relationships between numerical and categorical data
    - Fill and facet
    - Side-by-side boxplots
    - Other fancy plots
    - Numerical summaries in R
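
---

## Contingency tables: a toy sketch

The contingency-table steps summarized above can be tried on a small made-up data set. This is only a sketch: the data frame `toy` and its columns `home` and `grade` are hypothetical (they are not the Lending Club data), and `addmargins()` is a base R shortcut swapped in here for the `cbind()`/`rbind()` approach with `rowSums()` and `colSums()`.

``` r
# Toy data (hypothetical): 8 made-up loans, not the Lending Club data
toy <- data.frame(
  home  = factor(c("OWN", "RENT", "RENT", "OWN",
                   "MORTGAGE", "RENT", "MORTGAGE", "MORTGAGE")),
  grade = factor(c("A", "B", "A", "A", "B", "B", "A", "B"))
)

# Contingency table of counts for the two categorical variables
tab <- xtabs(~ home + grade, data = toy)

# Append a "Sum" row and a "Sum" column (row and column totals)
tabTotals <- addmargins(tab)

# Row proportions: each count divided by its row total
rowProps <- prop.table(tab, margin = 1)

# Column proportions: each count divided by its column total
colProps <- prop.table(tab, margin = 2)

print(tabTotals)
print(round(rowProps, 2))
print(round(colProps, 2))
```

Because the code uses only base R, it runs without loading any packages. If there were no relationship between `home` and `grade`, the rows of `rowProps` would look similar to one another.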