class: center, middle, inverse, title-slide

.title[
# Descriptive Statistics: Numerical Data
]
.subtitle[
## STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 23, 2024
]

---

layout: true

<!-- <div class="my-footer"> -->
<!-- <span> -->
<!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> -->
<!-- </span> -->
<!-- </div> -->

---

<style type="text/css">
.tiny .remark-code {
  font-size: 60%;
}
.small .remark-code {
  font-size: 80%;
}
</style>

## Today

- Relationships between numerical variables
- Describing categorical distributions
  - Bar plot
- Relationships between categorical data
  - Contingency tables

---

## Relationships between numerical variables

- Paired or bivariate data
- Scatterplot
- Hexplot
- Correlation
- Line graph

---

## Scatterplot

We have seen many examples of scatterplots

--

- UN Votes
- Star Wars
- Anscombe's quartet
- Palmer Penguins

---

## Scatterplot

Each point is a single observation with **two characteristics**, or variables, plotted on the x- and y-axis respectively

``` r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()
```

<img src="lecture11_files/figure-html/unnamed-chunk-3-1.png" width="60%" style="display: block; margin: auto;" />

---

## Scatterplot in base R

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()
```

```
## Warning: Removed 24 rows containing missing values or values outside the
## scale range (`geom_point()`).
```

<img src="lecture11_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
plot(loans$debt_to_income, loans$interest_rate)
```

<img src="lecture11_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Overplotting and hex plots

- **Overplotting** is when points are plotted on top of each other
- Common in **large data sets**
- A few ways to deal with this include using `alpha`, or `jitter()`
- Alternatively, **hex plots** or hexbin plots

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()
```

<img src="lecture11_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_hex()
```

<img src="lecture11_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Hex plot

- Hex plots **divide the graphing surface into hexagons**
- All points are grouped into their respective hexagonal regions
- **Color gradient** indicates the number of observations (count) in each hexagonal area.

<img src="lecture11_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />

---

## Correlation

- Correlation is the **association between two variables**
- **(Pearson) Correlation coefficient** is a measure of **linear** correlation between two sets of data
- Ranges from -1 to 1

<img src="img/corr.svg" width="70%" style="display: block; margin: auto;" />

---

## Correlation

Recall:

- Sample mean: `\(\bar{x} = \frac{x_1 + x_2 + ... + x_n}{n} = \frac{\sum_{i = 1}^{n} x_i}{n}\)`

- Sample variance: `\(s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{n-1}\)`

- Population mean: `\(\mu\)`

- Population variance: `\(\sigma^2\)`

- When talking about a population parameter for a variable `\(x\)`, might use subscript `\(x\)`, e.g., `\(\mu_x\)`, `\(\sigma^2_x\)`; similarly for a sample statistic, e.g., `\(s_x^2\)`

---

## Correlation

- Sample correlation: `\(r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}\)`

- Population correlation: `\(\rho\)`

- `cor()` in R

---

## Correlation

- Sample correlation: `\(r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}\)`

<img src="img/corr1.png" width="80%" style="display: block; margin: auto;" />

---

## Correlation

- Sample correlation: `\(r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}\)`

<img src="img/corr2.png" width="80%" style="display: block; margin: auto;" />

---

## Correlation

- Sample correlation: `\(r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}\)`

<img src="img/corr3.png" width="80%" style="display: block; margin: auto;" />

---

## Correlation

- Sample correlation: `\(r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}\)`

- What does the **denominator** look like?

--

- Recall: Sample variance `\(s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{n-1}\)`

- Denominator: `\(\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2} = \sqrt{(n-1)s_x^2 (n - 1)s_y^2} = (n-1)s_xs_y\)`

- Pearson correlation coefficient is **scale and location-invariant**
  - **Subtract sample means**, `\(\bar{x}\)` and `\(\bar{y}\)`
  - You can think of the denominator as a **scaling factor**

---

## Guess the correlation

http://guessthecorrelation.com/

<img src="img/corrGame.png" width="80%" style="display: block; margin: auto;" />

---

## Line graphs

Line graphs are most commonly used for data over time, **time series data**

.tiny[
.panelset[
.panel[.panel-name[Plot]

<img src="lecture11_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
loans %>%
  group_by(issue_month) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = as.Date(paste0("01-", issue_month), format = "%d-%b-%Y"), y = count)) +
  geom_point() +
  geom_line() +
  scale_x_date(labels = scales::date_format(format = "%m/%Y"),
               breaks = scales::date_breaks(width = "1 month"),
               expand = c(.02, .02)) +
  labs(title = "Number of monthly loans", y = "Number of loans", x = "Month")
```
]
]
]

---

## Line graphs

- Be careful of `geom_path()` vs. `geom_line()`:
  - `geom_path()` connects the observations in the order in which they appear in the data
  - `geom_line()` connects them in order of the variable on the x axis.
- In base R: `plot(x, y, type = "l")`. Also see `lines()`

---

## Line graphs in base R

.tiny[
.panelset[
.panel[.panel-name[Plot]

<img src="lecture11_files/figure-html/unnamed-chunk-15-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
tmpDF <- loans %>%
  group_by(issue_month) %>%
  summarize(count = n()) %>%
  mutate(issue_month = as.Date(paste0("01-", issue_month), format = "%d-%b-%Y")) %>%
  arrange(issue_month)
plot(tmpDF$issue_month, tmpDF$count, type = "l",
     main = "Number of monthly loans", xlab = "Month", ylab = "Number of loans")
```
]
]
]

---

## Exercises

- Two employees at a grocery store are weighing produce. One records weights in pounds (lb) and one in kilograms. What should we expect the correlation coefficient between their measurements to be?

`\(r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}\)`

- Write R code to calculate the sample correlation coefficient using just basic arithmetic operations (`+` or `sum()`, `-`, ...) and the `length()` function. (Do not use functions like `cor()`, `mean()` or `sd()`.)

---

## Summary

--

- Relationships between numerical variables
  - Scatterplot
  - Hex plot
  - Correlation coefficient
  - Line graph
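
---

## Correlation: checking the algebra in R

The denominator identity and the invariance claim from the correlation slides can be checked numerically. This is a minimal sketch, assuming small made-up vectors `x` and `y` (hypothetical values, not taken from `loans`); the names `denom_sums`, `denom_sd`, `x_new` and `y_new` are illustrative only.

```r
# Hypothetical toy data, made up for illustration (not from the loans data)
x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.8)
y <- c(10.5, 12.1, 9.8, 16.3, 13.9, 12.7)
n <- length(x)

# Denominator of r_xy written two ways:
# (1) square root of the product of the two sums of squared deviations
denom_sums <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
# (2) (n - 1) times the product of the sample standard deviations
denom_sd <- (n - 1) * sd(x) * sd(y)

# Compare up to floating-point tolerance
denom_check <- all.equal(denom_sums, denom_sd)
print(denom_check)

# Shift and rescale each variable by a positive constant
x_new <- 2.5 * x + 7
y_new <- 0.1 * y - 3

# The correlation should be unchanged by these transformations
invariance_check <- all.equal(cor(x, y), cor(x_new, y_new))
print(invariance_check)
```

Both checks should print `TRUE`. The rescaling constants here are positive; multiplying one of the variables by a negative constant would flip the sign of the correlation.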