class: center, middle, inverse, title-slide

.title[
# Descriptive Statistics: Numerical Data
]
.subtitle[
## STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 21, 2024
]

---
layout: true

<!-- <div class="my-footer"> -->
<!-- <span> -->
<!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> -->
<!-- </span> -->
<!-- </div> -->

---

<style type="text/css">
.tiny .remark-code {
  font-size: 60%;
}
.small .remark-code {
  font-size: 80%;
}
</style>

## Course content

1. Fundamentals of R
    - Overview of data types and structures
    - Data manipulation and data visualization tools
2. **Descriptive statistics for numerical and categorical data**
3. Probability
    - Rules of probability computation; conditional probability
    - Basic probability models: Binomial, Normal and Poisson
4. Statistical inference
    - Sampling distributions of sample mean and sample proportion
    - Hypothesis testing and confidence intervals for population mean and population proportion

---

## Today

- Descriptive statistics
- Types of variables (numerical and categorical)
- Describing numerical distributions
    - Histograms
    - Measures of central tendency: mean, median, mode
    - Shape: skewness and modality
    - Spread: variance and standard deviation, range and interquartile range
    - Boxplots
    - Unusual observations
    - Density plot

---

## Descriptive statistics

- We've now learned about data manipulation and visualization tools
    - Which visualizations should we make, and which summary statistics should we actually calculate?
- **Descriptive statistics** summarize and describe data
    - They can be **numerical** (summary numbers) or **graphical** (plots)
- Why is this a useful thing to do?

**Ages of students**: 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19 --- ## Terminology: Number of variables involved - **Univariate** data analysis: distribution of single variable - **Bivariate** data analysis: relationship between two variables - **Multivariate** data analysis: relationship between many variables at once, usually focusing on the relationship between two while conditioning for others --- ## Terminology: Types of variables - **Numerical** variables - E.g., age, length, temperature - **Continuous** variables can take on an infinite number of values - **Discrete** variables only take on non-negative whole numbers - **Categorical** variables - E.g., year in college, type of bike, meal - **Ordinal** variables have levels that have a natural ordering --- ## Data: Lending Club - Lending Club is a platform that allows individuals to lend to other individuals - Data are available in the `openintro` package, called `loans_full_schema` - Includes 10,000 loans made through the Lending Club; has 55 columns .tiny[ ``` r library(openintro) dplyr::glimpse(loans_full_schema) ``` ``` ## Rows: 10,000 ## Columns: 55 ## $ emp_title <chr> "global config enginee… ## $ emp_length <dbl> 3, 10, 3, 1, 10, NA, 1… ## $ state <fct> NJ, HI, WI, PA, CA, KY… ## $ homeownership <fct> MORTGAGE, RENT, RENT, … ## $ annual_income <dbl> 90000, 40000, 40000, 3… ## $ verified_income <fct> Verified, Not Verified… ## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10… ## $ annual_income_joint <dbl> NA, NA, NA, NA, 57000,… ## $ verification_income_joint <fct> , , , , Verified, , No… ## $ debt_to_income_joint <dbl> NA, NA, NA, NA, 37.66,… ## $ delinq_2y <int> 0, 0, 0, 0, 0, 1, 0, 1… ## 
$ months_since_last_delinq <int> 38, NA, 28, NA, NA, 3,…
## $ earliest_credit_line <dbl> 2001, 1996, 2006, 2007…
## $ inquiries_last_12m <int> 6, 1, 4, 0, 7, 6, 1, 1…
## $ total_credit_lines <int> 28, 30, 31, 4, 22, 32,…
## $ open_credit_lines <int> 10, 14, 10, 4, 16, 12,…
## $ total_credit_limit <int> 70795, 28800, 24193, 2…
## $ total_credit_utilized <int> 38767, 4321, 16000, 49…
## $ num_collections_last_12m <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_historical_failed_to_pay <int> 0, 1, 0, 1, 0, 0, 0, 0…
## $ months_since_90d_late <int> 38, NA, 28, NA, NA, 60…
## $ current_accounts_delinq <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_collection_amount_ever <int> 1250, 0, 432, 0, 0, 0,…
## $ current_installment_accounts <int> 2, 0, 1, 1, 1, 0, 2, 2…
## $ accounts_opened_24m <int> 5, 11, 13, 1, 6, 2, 1,…
## $ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, …
## $ num_satisfactory_accounts <int> 10, 14, 10, 4, 16, 12,…
## $ num_accounts_120d_past_due <int> 0, 0, 0, 0, 0, 0, 0, N…
## $ num_accounts_30d_past_due <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_active_debit_accounts <int> 2, 3, 3, 2, 10, 1, 3, …
## $ total_debit_limit <int> 11100, 16500, 4300, 19…
## $ num_total_cc_accounts <int> 14, 24, 14, 3, 20, 27,…
## $ num_open_cc_accounts <int> 8, 14, 8, 3, 15, 12, 7…
## $ num_cc_carrying_balance <int> 6, 4, 6, 2, 13, 5, 6, …
## $ num_mort_accounts <int> 1, 0, 0, 0, 0, 3, 2, 7…
## $ account_never_delinq_percent <dbl> 92.9, 100.0, 93.5, 100…
## $ tax_liens <int> 0, 0, 0, 1, 0, 0, 0, 0…
## $ public_record_bankrupt <int> 0, 1, 0, 0, 0, 0, 0, 0…
## $ loan_purpose <fct> moving, debt_consolida…
## $ application_type <fct> individual, individual…
## $ loan_amount <int> 28000, 5000, 2000, 216…
## $ term <dbl> 60, 36, 36, 36, 36, 36…
## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6…
## $ installment <dbl> 652.53, 167.54, 71.40,…
## $ grade <fct> C, C, D, A, C, A, C, B…
## $ sub_grade <fct> C3, C1, D1, A3, C3, A3…
## $ issue_month <fct> Mar-2018, Feb-2018, Fe…
## $ loan_status <fct> Current, Current, Curr…
## $ initial_listing_status <fct> whole, whole, fraction…
## $ disbursement_method <fct> Cash, Cash, Cash, Cash…
## $ balance <dbl> 27015.86, 4651.37, 182…
## $ paid_total <dbl> 1999.330, 499.120, 281…
## $ paid_principal <dbl> 984.14, 348.63, 175.37…
## $ paid_interest <dbl> 1015.19, 150.49, 106.4…
## $ paid_late_fees <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
```

]

---

## Selected variables

``` r
library(dplyr)  # provides %>%, select() and glimpse()

loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade,
         state, annual_income, homeownership, debt_to_income, issue_month)
glimpse(loans)
```

```
## Rows: 10,000
## Columns: 9
## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …
## $ issue_month <fct> Mar-2018, Feb-2018, Feb-2018, Jan-2018, …
```

---

## Selected variables

.small[
Variable         | Description
-----------------|-------------
`loan_amount`    | Amount of the loan received, in US dollars
`interest_rate`  | Annual interest rate on the loan, as a percentage
`term`           | The length of the loan, which is always set as a whole number of months
`grade`          | Loan grade (A-G), which represents the quality of the loan and its likelihood of being repaid
`state`          | US state where the borrower resides
`annual_income`  | Borrower’s annual income, including any second income, in US dollars
`homeownership`  | Whether the borrower owns their home outright, owns with a mortgage, or rents
`debt_to_income` | Debt-to-income ratio
`issue_month`    | Month the loan was issued
]

---

## Variable types

.small[
Variable         | Description
-----------------|-------------
`loan_amount`    | Amount of the loan received, in US dollars
`interest_rate`  | Annual interest rate on the loan, as a percentage
`term`           | The length of the loan, which is always set as a whole number of months
`grade`          | Loan grade (A-G), which represents the quality of the loan and its likelihood of being repaid
`state`          | US state where the borrower resides
`annual_income`  | Borrower’s annual income, including any second income, in US dollars
`homeownership`  | Whether the borrower owns their home outright, owns with a mortgage, or rents
`debt_to_income` | Debt-to-income ratio
`issue_month`    | Month the loan was issued
]

- Numerical variables: Continuous or discrete?
- Categorical: Ordinal or not?

---

## Variable types

Variable         | Type
-----------------|-------------
`loan_amount`    | numerical, continuous
`interest_rate`  | numerical, continuous
`term`           | numerical, discrete
`grade`          | categorical, ordinal
`state`          | categorical, not ordinal
`annual_income`  | numerical, continuous
`homeownership`  | categorical, not ordinal
`debt_to_income` | numerical, continuous
`issue_month`    | date

---

## Describing numerical distributions

- **Visual summaries**:
    - Histogram
    - Boxplot
    - Density plot
    - Line graph
- Measures of **central tendency**: mean, median, mode
- **Shape**:
    - Skewness: right-skewed, left-skewed, symmetric
    - Modality: unimodal, bimodal, multimodal, uniform
- **Spread**: variance and standard deviation, range and interquartile range
- **Unusual observations**
- A **summary statistic** is a single number summarizing a large amount of data

---

## Histogram

- Shows **shape, center, and spread** of the data
- Contiguous (adjoining) bars, with no gaps between bins
- Horizontal axis: values of the variable
- Vertical axis: frequency or relative frequency

.tiny[
.pull-left[

``` r
library(ggplot2)  # provides ggplot() and the geom_*() layers

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram()
```

```
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
```

<img src="lecture10_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
hist(loans_full_schema$loan_amount)
```

<img src="lecture10_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Histograms and binwidth

.panelset[
.panel[.panel-name[binwidth = 1000]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 1000)
```

<img src="lecture10_files/figure-html/unnamed-chunk-8-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 5000]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000)
```

<img src="lecture10_files/figure-html/unnamed-chunk-9-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 20000]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 20000)
```

<img src="lecture10_files/figure-html/unnamed-chunk-10-1.png" width="50%" style="display: block; margin: auto;" />
]
]

---

## Adding labels

.panelset[
.panel[.panel-name[Plot]
<img src="lecture10_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000) +
  labs( #<<
    x = "Loan amount ($)", #<<
    y = "Frequency", #<<
    title = "Amounts of Lending Club loans" #<<
  ) #<<
```

]
]

---

## Population vs. sample (briefly; more later)

- A **sample** is a portion or **subset** of the larger **population**
    - E.g., the population may be UC Davis students; a sample may be 300 students randomly selected on the Quad this morning
- Population **parameter**, e.g., population mean
    - This is a fixed quantity
- Sample **statistic**, e.g., sample mean
    - Depends on the sample

---

## Measures of central tendency

- **Mean**: "Average", sum the numbers and divide by the count (`mean()`)

`\(\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\)`, where `\(x\)` is the variable of interest, the subscripts index the `\(n\)` observations, and `\(\bar{x}\)` denotes the **sample mean**. The **population mean** is often denoted by `\(\mu\)`.

- **Median**: "Middle value" once the data are arranged in ascending order; with an even number of observations, the average of the two middle values (`median()`)
- **Mode**: Most frequent value (in R, `mode()` returns an object's storage mode, not the most frequent value)
- Note: when the data contain missing values (`NA`), you will need the `na.rm = TRUE` option

---

## Measures of central tendency

- `mean(loans$loan_amount)` = 16361.92
- `median(loans$loan_amount)` = 14500
- Mode is 10000 (Exercise: how to get this in R?)

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram() +
  geom_vline(xintercept = median(loans$loan_amount), col = "blue") +
  geom_vline(xintercept = mean(loans$loan_amount), col = "red")
```

]
]
.pull-right[
<img src="lecture10_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Measures of shape: Skewness

<img src="img/skew.png" width="80%" style="display: block; margin: auto;" />

- A distribution is **skewed** toward the side of its longer *tail*
    - **Positive skew/right skew**: typically mean > median
    - **Negative skew/left skew**: typically mean < median

---

## Measures of shape: Modality

- **Mode** is the most frequent value, but in real-world data sets, especially with continuous variables, there might not be any two observations with exactly the same value.
- When describing shape, a mode is instead a **prominent peak in the distribution**
- **Unimodal** = one prominent peak, **bimodal** = two prominent peaks, **multimodal** = more than two prominent peaks, **uniform** = no prominent peak

<img src="img/modality.png" width="80%" style="display: block; margin: auto;" />

---

## Loans data

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000)
```

<img src="lecture10_files/figure-html/unnamed-chunk-16-1.png" width="70%" style="display: block; margin: auto;" />

What are the skewness and modality?

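---

## Finding the mode in R

The mode exercise from the central tendency slide can be approached with `table()`. This is a minimal sketch rather than the official solution: `stat_mode()` is a hypothetical helper name (base R has no built-in function for the statistical mode), and `ages` is a small made-up vector.

``` r
# Hypothetical helper: return the most frequent value(s) of a numeric vector
stat_mode <- function(x) {
  counts <- table(x)                           # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]  # value(s) with the highest count
  as.numeric(top)                              # table() names are character
}

ages <- c(18, 19, 20, 18, 20, 20, 19)  # made-up illustrative data

stat_mode(ages)  # most frequent value
mode(ages)       # base R's mode(): the storage mode ("numeric"), not the statistical mode
```

The same helper could be applied to `loans$loan_amount`. If several values tie for the highest count, all of them are returned.
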
---

## Measures of spread: Variance and standard deviation

<img src="img/sd.svg" width="70%" style="display: block; margin: auto;" />

- **Red distribution**: concentrated closely near the mean
- **Blue distribution**: more widely spread out from the mean
- They have the same mean, skewness, and modality

---

## Measures of spread: Variance and standard deviation

- **Standard deviation** measures how far data values are from their mean
- **Deviation** is the signed difference between an observation and the mean, `\(x_i - \bar{x}\)`
- **Sample variance**: roughly, "square the deviations and find their mean"
    - `\(s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}\)`
    - For the denominator, use `\(n-1\)` instead of `\(n\)` to make it an *unbiased estimator of the population variance*
- **Sample standard deviation**, `\(s = \sqrt{s^2}\)`
- In R, `sd()` for sample standard deviation, `var()` for sample variance
- **Population** variance and standard deviation are often denoted by `\(\sigma^2\)` and `\(\sigma\)`

---

## Measures of spread: Variance and standard deviation

- Standard deviation can roughly be interpreted as the typical distance of observations from the mean
- **Rules of thumb** for symmetric, bell-shaped distributions: about 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively

<img src="img/sdRules.png" width="60%" style="display: block; margin: auto;" />

---

## Measures of spread: Variance and standard deviation

<img src="img/sdRules.png" width="60%" style="display: block; margin: auto;" />

Example: Wait times at a restaurant have a mean of 25 minutes and a standard deviation of 5 minutes. Assume the wait times are roughly symmetric and bell-shaped.

Suppose David waits for 35 minutes. This wait time is longer than roughly __% of wait times.

Suppose David instead waits for 10 minutes. This wait time is shorter than roughly __% of wait times.

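---

## Computing variance by hand

The sample variance formula from the earlier slide can be checked step by step against `var()` and `sd()`. A minimal sketch, assuming a small made-up vector `x` of wait times (not data from the course):

``` r
x <- c(20, 25, 30, 25, 25)  # made-up wait times in minutes

n <- length(x)
deviations <- x - mean(x)          # x_i minus the sample mean, for each observation
s2 <- sum(deviations^2) / (n - 1)  # sample variance with the n - 1 denominator
s <- sqrt(s2)                      # sample standard deviation

s2                     # 12.5
all.equal(s2, var(x))  # TRUE: matches R's built-in sample variance
all.equal(s, sd(x))    # TRUE: matches R's built-in sample standard deviation
```

Dividing by `n` instead would give 10 here, so it matters which denominator is used; `var()` and `sd()` use `n - 1`.
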
---

## Measures of spread: Range and interquartile range

<img src="img/IQR.png" width="30%" style="display: block; margin: auto;" />

- **Percentiles** divide ordered data into hundredths: about p% of observations fall at or below the pth percentile
    - Median = 50th percentile
- **Quartiles** divide ordered data into quarters
    - First quartile = 25th percentile
    - Second quartile = Median = 50th percentile
    - Third quartile = 75th percentile
- **Interquartile range** (IQR) = 3rd quartile - 1st quartile
    - `IQR()` in R
- **Range** = Max - min
- **Five-number summary**: Min, 1Q, Median, 3Q, Max
    - `summary()` in R (also gives mean)

---

## Loans data

.tiny[

``` r
sd(loans$loan_amount)
```

```
## [1] 10301.96
```

``` r
var(loans$loan_amount)
```

```
## [1] 106130313
```

``` r
sqrt(var(loans$loan_amount))
```

```
## [1] 10301.96
```

``` r
summary(loans$loan_amount)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    8000   14500   16362   24000   40000
```

``` r
IQR(loans$loan_amount)
```

```
## [1] 16000
```

]

`homeownership` is a factor variable with three levels, `MORTGAGE`, `OWN` and `RENT`. How do we calculate the variance of the loan amount for each type of home ownership status?

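---

## Variance by group

One possible way to approach the question above is to split the loan amounts by the factor and apply `var()` to each group. The sketch below uses base R's `tapply()` on a made-up data frame `toy` (hypothetical values, not rows from `loans`) so that it runs without any packages:

``` r
# Made-up data with the same two column names as in `loans`
toy <- data.frame(
  homeownership = factor(c("RENT", "RENT", "OWN", "OWN", "MORTGAGE", "MORTGAGE")),
  loan_amount = c(5000, 7000, 10000, 14000, 20000, 26000)
)

# Apply var() to loan_amount separately within each homeownership level
var_by_group <- tapply(toy$loan_amount, toy$homeownership, var)
var_by_group
```

With the dplyr tools used elsewhere in these slides, an analogous pipeline would be `loans %>% group_by(homeownership) %>% summarize(var_loan = var(loan_amount))`, where `var_loan` is just an illustrative column name.
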
---

## Percentiles

Vertical lines mark the 5th and 25th percentiles, the median, the mean, and the 75th and 95th percentiles

.panelset[
.panel[.panel-name[Plot]
<img src="lecture10_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram() +
  geom_vline(xintercept = median(loans$loan_amount), col = "blue") +
  geom_vline(xintercept = quantile(loans$loan_amount, .05), col = "lightblue") +
  geom_vline(xintercept = quantile(loans$loan_amount, .25), col = "lightblue") +
  geom_vline(xintercept = quantile(loans$loan_amount, .75), col = "lightblue") +
  geom_vline(xintercept = quantile(loans$loan_amount, .95), col = "lightblue") +
  geom_vline(xintercept = mean(loans$loan_amount), col = "red")
```

]
]

---

## Boxplots

<img src="img/boxHist.png" width="80%" style="display: block; margin: auto;" />

---

## Boxplots

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_boxplot() +
  labs(x = "Loan amount") +
  scale_y_continuous(breaks = NULL)
```

<img src="lecture10_files/figure-html/unnamed-chunk-24-1.png" width="60%" style="display: block; margin: auto;" />

- Lower whisker, box (1Q, median, 3Q), upper whisker
- Total length of the box is the IQR
- Each whisker extends to the most extreme observation within 1.5*IQR of the box, so its length is at most 1.5*IQR
- Any points beyond the whiskers are **outliers**, observations that are unusually far from the rest of the data
    - Outliers appear as points

---

## Box plot and outliers

Income data are often skewed (right or left?)

``` r
ggplot(loans, aes(x = annual_income)) +
  geom_boxplot() +
  scale_y_continuous(breaks = NULL)
```

<img src="lecture10_files/figure-html/unnamed-chunk-25-1.png" width="60%" style="display: block; margin: auto;" />

---

## Boxplots in base R

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_boxplot()
```

<img src="lecture10_files/figure-html/unnamed-chunk-26-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
boxplot(loans$loan_amount)
```

<img src="lecture10_files/figure-html/unnamed-chunk-27-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Another way to remove y-axis labels

We saw `scale_y_continuous(breaks = NULL)` earlier

.panelset[
.panel[.panel-name[Plot]
<img src="lecture10_files/figure-html/unnamed-chunk-28-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_boxplot() +
  labs(
    x = "Loan amount ($)",
    y = NULL,
    title = "Loan amounts of Lending Club loans"
  ) +
  theme( #<<
    axis.ticks.y = element_blank(), #<<
    axis.text.y = element_blank() #<<
  ) #<<
```

]
]

---

## Density plot

Density plots are an alternative to histograms

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_density()
```

<img src="lecture10_files/figure-html/unnamed-chunk-29-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
plot(density(loans$loan_amount))
```

<img src="lecture10_files/figure-html/unnamed-chunk-30-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Density plots and adjusting bandwidth

.panelset[
.panel[.panel-name[adjust = 0.5]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 0.5)
```

<img src="lecture10_files/figure-html/unnamed-chunk-31-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[adjust = 1]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 1) # default bandwidth
```

<img src="lecture10_files/figure-html/unnamed-chunk-32-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[adjust = 2]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2)
```

<img src="lecture10_files/figure-html/unnamed-chunk-33-1.png" width="50%" style="display: block; margin: auto;" />
]
]

---

## Title and labels

.panelset[
.panel[.panel-name[Plot]
<img src="lecture10_files/figure-html/unnamed-chunk-34-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2) +
  labs( #<<
    x = "Loan amount ($)", #<<
    y = "Density", #<<
    title = "Amounts of Lending Club loans" #<<
  ) #<<
```

]
]

---

## Relationships between numerical variables

- Paired or bivariate data
- Scatterplot
- Hexplot
- Correlation
- Line graph

---

## Summary

--

- Describing numerical distributions
    - Histograms
    - Measures of central tendency: mean, median, mode
    - Shape: skewness and modality
    - Spread: variance and standard deviation, range and interquartile range
    - Boxplots
    - Unusual observations
    - Density plot