class: center, middle, inverse, title-slide

.title[
# Descriptive Statistics: Numerical and Categorical Data
]
.subtitle[
## STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 25, 2023
]

---

layout: true

<!-- <div class="my-footer"> -->
<!-- <span> -->
<!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> -->
<!-- </span> -->
<!-- </div> -->

---

<style type="text/css">
.tiny .remark-code {
  font-size: 60%;
}
.small .remark-code {
  font-size: 80%;
}
</style>

## Reminders

- Issues with Gradescope assignments
  - Select **all** corresponding pages
  - Show **all code/plot outputs**
    - Do not just submit a text file with an R script
    - Print saved variables
  - **Write answers in the markdown portion of the file**, rather than in code comments
  - Break up long comments into multiple lines
- Anonymous mid-quarter survey on Canvas, closes on Friday

---

## Recap

--

- Describing numerical distributions
  - Histograms
  - Measures of central tendency: mean, median, mode
  - Shape: skewness and modality
  - Spread: variance and standard deviation, range and interquartile range
  - Boxplots
  - Unusual observations
  - Density plot

---

## Today

- Relationships between numerical variables
  - Scatterplot
  - Hex plot
  - Correlation coefficient
  - Line graph
- Describing categorical distributions
  - Bar plot
- Relationships between categorical data
  - Contingency tables

---

## Data: Lending Club

- Lending Club is a platform that allows individuals to lend to other individuals

```r
loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade,
         state, annual_income, homeownership, debt_to_income, issue_month)
glimpse(loans)
```

```
## Rows: 10,000
## Columns: 9
## $ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade          <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …
## $ issue_month    <fct> Mar-2018, Feb-2018, Feb-2018, Jan-2018, …
```

---

## Relationships between numerical variables

- Paired or bivariate data
- Scatterplot
- Hex plot
- Correlation
- Line graph

---

## Scatterplot

We have seen many examples of scatterplots

--

- UN Votes
- Star Wars
- Anscombe's quartet
- Palmer Penguins

---

## Scatterplot

Each point is a single observation with **two characteristics**, or variables, plotted on the x- and y-axes, respectively

```r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()
```

<img src="lecture11_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" />

---

## Scatterplot in base R

.tiny[
.pull-left[

```r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()
```

```
## Warning: Removed 24 rows containing missing values (geom_point).
```

<img src="lecture11_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

```r
plot(loans$debt_to_income, loans$interest_rate)
```

<img src="lecture11_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Overplotting and hex plots

- **Overplotting** is when points are plotted on top of each other
- Common in **large data sets**
- Ways to deal with this include using `alpha` or `jitter()`
- Alternatively, use **hex plots** (also called hexbin plots)

.tiny[
.pull-left[

```r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()
```

<img src="lecture11_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

```r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_hex()
```

<img src="lecture11_files/figure-html/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Hex plot

- Hex plots **divide the graphing surface into hexagons**
- All points are grouped into their respective hexagonal regions
- **Color gradient** indicates the number of observations (count) in each hexagonal area

<img src="lecture11_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />

---

## Correlation

- Correlation is the **association between two variables**
- **(Pearson) Correlation coefficient** is a measure of **linear** correlation between two sets of data
- Ranges from -1 to 1

<img src="img/corr.svg" width="70%" style="display: block; margin: auto;" />

---

## Correlation

Recall:

- Sample mean: `\(\bar{x} = \frac{x_1 + x_2 + ... + x_n}{n} = \frac{\sum_{i = 1}^{n} x_i}{n}\)`
- Sample variance: `\(s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{n-1}\)`
- Population mean: `\(\mu\)`
- Population variance: `\(\sigma^2\)`
- When talking about a population parameter for a variable `\(x\)`, we might use a subscript `\(x\)`, e.g., `\(\mu_x\)`, `\(\sigma^2_x\)`; similarly for a sample statistic, e.g., `\(s_x^2\)`

---

## Correlation

- Sample correlation: `\(r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}\)`
- Population correlation: `\(\rho\)`
- `cor()` in R

---

## Correlation

- Sample correlation: `\(r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}\)`

<img src="img/corr1.png" width="80%" style="display: block; margin: auto;" />

---

## Correlation

- Sample correlation: `\(r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}\)`

<img src="img/corr2.png" width="80%" style="display: block; margin: auto;" />

---

## Correlation

- Sample correlation: `\(r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}\)`

<img src="img/corr3.png" width="80%" style="display: block; margin: auto;" />

---

## Correlation

- Sample correlation: `\(r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}\)`
- What does the **denominator** look like?

--

  - Recall: Sample variance `\(s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{n-1}\)`
  - Denominator: `\(\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2} = \sqrt{(n-1)s_x^2 (n - 1)s_y^2} = (n-1)s_xs_y\)`
- Pearson correlation coefficient is **scale and location-invariant**
  - **Subtract sample means**, `\(\bar{x}\)` and `\(\bar{y}\)`
  - You can think of the denominator as a **scaling factor**

---

## Guess the correlation

http://guessthecorrelation.com/

<img src="img/corrGame.png" width="80%" style="display: block; margin: auto;" />

---

## Line graphs

Line graphs are most commonly used for data over time, i.e., **time series data**

.tiny[
.panelset[
.panel[.panel-name[Plot]

<img src="lecture11_files/figure-html/unnamed-chunk-15-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
loans %>%
  group_by(issue_month) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = as.Date(paste0("01-", issue_month), format = "%d-%b-%Y"),
             y = count)) +
  geom_point() +
  geom_line() +
  scale_x_date(labels = scales::date_format(format = "%m/%Y"),
               breaks = scales::date_breaks(width = "1 month"),
               expand = c(.02, .02)) +
  labs(title = "Number of monthly loans",
       y = "Number of loans", x = "Month")
```

]
]
]

---

## Line graphs

- Be careful of `geom_path()` vs. `geom_line()`:
  - `geom_path()` connects the observations in the order in which they appear in the data
  - `geom_line()` connects them in order of the variable on the x-axis
- In base R: `plot(x, y, type = "l")`. Also see `lines()`

---

## Line graphs in base R

.tiny[
.panelset[
.panel[.panel-name[Plot]

<img src="lecture11_files/figure-html/unnamed-chunk-16-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
tmpDF <- loans %>%
  group_by(issue_month) %>%
  summarize(count = n()) %>%
  mutate(issue_month = as.Date(paste0("01-", issue_month), format = "%d-%b-%Y")) %>%
  arrange(issue_month)
plot(tmpDF$issue_month, tmpDF$count, type = "l",
     main = "Number of monthly loans",
     xlab = "Month", ylab = "Number of loans")
```

]
]
]

---

## Summary

--

- Relationships between numerical variables
  - Scatterplot
  - Hex plot
  - Correlation coefficient
  - Line graph
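
---

## Appendix: checking the correlation formula in R

The sample correlation formula, its denominator `\((n-1)s_xs_y\)`, and the location and scale invariance from the Correlation slides can be checked numerically. The sketch below is an illustration only: it uses simulated data rather than the Lending Club data, and the variable names (`x`, `y`, `r_manual`, and so on) are hypothetical.

```r
# Simulated paired data (an assumption for illustration; not the loans data)
set.seed(35)
x <- rnorm(100, mean = 20, sd = 5)
y <- 0.5 * x + rnorm(100, sd = 3)
n <- length(x)

# Sample correlation computed directly from the formula on the slides
num <- sum((x - mean(x)) * (y - mean(y)))
den <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_manual <- num / den

# The denominator should equal (n - 1) * s_x * s_y
den_matches <- isTRUE(all.equal(den, (n - 1) * sd(x) * sd(y)))

# The manual value should agree with the built-in cor()
cor_matches <- isTRUE(all.equal(r_manual, cor(x, y)))

# Location and scale invariance: shifting x by 10 and rescaling it by a
# positive factor of 3 should leave the correlation unchanged
invariant <- isTRUE(all.equal(cor(10 + 3 * x, y), cor(x, y)))

print(c(den_matches = den_matches,
        cor_matches = cor_matches,
        invariant = invariant))
```

With real data that contain missing values (the scatterplot warning suggests the loans data do), `cor()` returns `NA` by default; passing `use = "complete.obs"` drops incomplete pairs first.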