class: center, middle, inverse, title-slide .title[ # Data visualization and Descriptive Statistics ] .subtitle[ ##
STA35A: Statistical Data Science 1 ] .author[ ### Xiao Hui Tai ] .date[ ### October 14, 2024 ] --- layout: true <!-- <div class="my-footer"> --> <!-- <span> --> <!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> --> <!-- </span> --> <!-- </div> --> --- <style type="text/css"> .tiny .remark-code { font-size: 60%; } .small .remark-code { font-size: 80%; } .tiny { font-size: 60%; } .small { font-size: 80%; } </style> ## Reminders/announcements - Lab is due today at 9pm - HW 2 posted on course website, due Thursday at 9pm - Note: solutions will be released at 9pm - Schedule for rest of this week: - Wednesday: Oscar Rivera will do review during regular lecture time (same room); **UPDATE: **2-3 PM OH (OR) at MSB 1143 - Thursday: **no lab**, instead 12-1PM OH (XHT; virtual, will post link on Piazza); 3-4 PM OH (OR); HW 2 due - Friday: midterm during regular lecture time (same room); **no homework and no OH** --- ## Midterm - Midterm will cover material until today - Closed-book - You don't need your computers - No make-up exams - Drop policy for exams: 1 midterm may be dropped - Practice problems on Canvas --- ## Today - Misc `ggplot()` items - Descriptive statistics --- ## A note on piping and layering - Pipe `%>%` used mainly in `dplyr` pipelines - Pipe the output of the previous line of code as the first input of the next line of code - `+` used in `ggplot2` plots is used for "layering" - Create the plot in layers, separated by `+` --- ## dplyr Incorrect: ``` r hotels + select(hotel, lead_time) ``` ``` ## Error in eval(expr, envir, enclos): object 'hotel' not found ``` Correct: ``` r hotels %>% select(hotel, lead_time) ``` .tiny[ ``` ## # A tibble: 119,390 × 2 ## hotel lead_time ## <chr> <dbl> ## 1 Resort Hotel 342 ## 2 Resort Hotel 737 ## 3 Resort Hotel 7 ## 4 Resort Hotel 13 ## 5 Resort Hotel 14 ## 6 Resort Hotel 14 ## # ℹ 119,384 more rows ``` ] --- ## ggplot2 Incorrect: .small[ ``` r ggplot(hotels, aes(x = hotel, fill = deposit_type)) %>% geom_bar() ``` ``` ## Error in `geom_bar()`: ## ! `mapping` must be created by `aes()`. ## ℹ Did you use `%>%` or `|>` instead of `+`? ``` ] Correct: ``` r ggplot(hotels, aes(x = hotel, fill = deposit_type)) + geom_bar() ``` <img src="lecture9_files/figure-html/unnamed-chunk-8-1.png" width="25%" style="display: block; margin: auto;" /> --- ## Code styling Many of the styling principles are consistent across `%>%` and `+`: - always a space before - always a line break after (for pipelines with more than 2 lines) Not recommended: ``` r ggplot(hotels,aes(x=hotel,y=deposit_type))+geom_bar() ``` Recommended: ``` r ggplot(hotels, aes(x = hotel, y = deposit_type)) + geom_bar() ``` --- ## Midterm essentials - What `dplyr` function filters only observations that fulfill some condition? - What function returns the number of rows in a data frame? - What `dplyr` function creates new variables? - How to create a binary variable that depends on whether a variable fulfills some condition? (e.g., `nycflights13` example: Create a new variable that indicates whether or not the flight departed late.) - `group_by` and `summarize` - How to make scatterplots --- ## Course content 1. Fundamentals of R - Overview of data types and structures - Data manipulation and data visualization tools 2. **Descriptive statistics for numerical and categorical data** 3. Probability - Rules of probability computation; conditional probability - Basic probability models: Binomial, Normal and Poisson 4. Statistical inference - Sampling distributions of sample mean and sample proportion - Hypothesis testing and confidence intervals for population mean and population proportion --- ## Descriptive statistics - We've now learned about data manipulation and visualization tools - What visualizations to do and what summary statistics to actually calculate? - **Descriptive statistics** are numbers that are used to summarize and describe data - **Numerical** or **graphical** ways to display the data - Why is this a useful thing to do? **Ages of students**: 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19 --- ## Terminology: Number of variables involved - **Univariate** data analysis: distribution of single variable - **Bivariate** data analysis: relationship between two variables - **Multivariate** data analysis: relationship between many variables at once, usually focusing on the relationship between two while conditioning for others --- ## Terminology: Types of variables - **Numerical** variables - E.g., age, length, temperature - **Continuous** variables can take on an infinite number of values - **Discrete** variables only take on non-negative whole numbers - **Categorical** variables - E.g., year in college, type of bike, meal - **Ordinal** variables have levels that have a natural ordering --- ## Data: Lending Club - Lending Club is a platform that allows individuals to lend to other individuals - Data are available in the `openintro` package, called `loans_full_schema` - Includes 10,000 loans made through the Lending Club; has 55 columns .tiny[ ``` r library(openintro) dplyr::glimpse(loans_full_schema) ``` ``` ## Rows: 10,000 ## Columns: 55 ## $ emp_title <chr> "global config enginee… ## $ emp_length <dbl> 3, 10, 3, 1, 10, NA, 1… ## $ state <fct> NJ, HI, WI, PA, CA, KY… ## $ homeownership <fct> MORTGAGE, RENT, RENT, … ## $ annual_income <dbl> 90000, 40000, 40000, 3… ## $ verified_income <fct> Verified, Not Verified… ## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10… ## $ annual_income_joint <dbl> NA, NA, NA, NA, 57000,… ## $ verification_income_joint <fct> , , , , Verified, , No… ## $ debt_to_income_joint <dbl> NA, NA, NA, NA, 37.66,… ## $ delinq_2y <int> 0, 0, 0, 0, 0, 1, 0, 1… ## $ months_since_last_delinq <int> 38, NA, 28, NA, NA, 3,… ## $ earliest_credit_line <dbl> 2001, 1996, 2006, 2007… ## $ inquiries_last_12m <int> 6, 1, 4, 0, 7, 6, 1, 1… ## $ total_credit_lines <int> 28, 30, 31, 4, 22, 32,… ## $ open_credit_lines <int> 10, 14, 10, 4, 16, 12,… ## $ total_credit_limit <int> 70795, 28800, 24193, 2… ## $ total_credit_utilized <int> 38767, 4321, 16000, 49… ## $ num_collections_last_12m <int> 0, 0, 0, 0, 0, 0, 0, 0… ## $ num_historical_failed_to_pay <int> 0, 1, 0, 1, 0, 0, 0, 0… ## $ months_since_90d_late <int> 38, NA, 28, NA, NA, 60… ## $ current_accounts_delinq <int> 0, 0, 0, 0, 0, 0, 0, 0… ## $ total_collection_amount_ever <int> 1250, 0, 432, 0, 0, 0,… ## $ current_installment_accounts <int> 2, 0, 1, 1, 1, 0, 2, 2… ## $ accounts_opened_24m <int> 5, 11, 13, 1, 6, 2, 1,… ## $ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, … ## $ num_satisfactory_accounts <int> 10, 14, 10, 4, 16, 12,… ## $ num_accounts_120d_past_due <int> 0, 0, 0, 0, 0, 0, 0, N… ## $ num_accounts_30d_past_due <int> 0, 0, 0, 0, 0, 0, 0, 0… ## $ num_active_debit_accounts <int> 2, 3, 3, 2, 10, 1, 3, … ## $ total_debit_limit <int> 11100, 16500, 4300, 19… ## $ num_total_cc_accounts <int> 14, 24, 14, 3, 20, 27,… ## $ num_open_cc_accounts <int> 8, 14, 8, 3, 15, 12, 7… ## $ num_cc_carrying_balance <int> 6, 4, 6, 2, 13, 5, 6, … ## $ num_mort_accounts <int> 1, 0, 0, 0, 0, 3, 2, 7… ## $ account_never_delinq_percent <dbl> 92.9, 100.0, 93.5, 100… ## $ tax_liens <int> 0, 0, 0, 1, 0, 0, 0, 0… ## $ public_record_bankrupt <int> 0, 1, 0, 0, 0, 0, 0, 0… ## $ loan_purpose <fct> moving, debt_consolida… ## $ application_type <fct> individual, individual… ## $ loan_amount <int> 28000, 5000, 2000, 216… ## $ term <dbl> 60, 36, 36, 36, 36, 36… ## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6… ## $ installment <dbl> 652.53, 167.54, 71.40,… ## $ grade <fct> C, C, D, A, C, A, C, B… ## $ sub_grade <fct> C3, C1, D1, A3, C3, A3… ## $ issue_month <fct> Mar-2018, Feb-2018, Fe… ## $ loan_status <fct> Current, Current, Curr… ## $ initial_listing_status <fct> whole, whole, fraction… ## $ disbursement_method <fct> Cash, Cash, Cash, Cash… ## $ balance <dbl> 27015.86, 4651.37, 182… ## $ paid_total <dbl> 1999.330, 499.120, 281… ## $ paid_principal <dbl> 984.14, 348.63, 175.37… ## $ paid_interest <dbl> 1015.19, 150.49, 106.4… ## $ paid_late_fees <dbl> 0, 0, 0, 0, 0, 0, 0, 0… ``` ] --- ## Selected variables ``` r loans <- loans_full_schema %>% select(loan_amount, interest_rate, term, grade, state, annual_income, homeownership, debt_to_income, issue_month) glimpse(loans) ``` ``` ## Rows: 10,000 ## Columns: 9 ## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000, 2… ## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, … ## $ term <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, … ## $ grade <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B… ## $ state <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, … ## $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 34000… ## $ homeownership <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M… ## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, … ## $ issue_month <fct> Mar-2018, Feb-2018, Feb-2018, Jan-2018, … ``` --- ## Selected variables .small[ Variable | Description ----------------|------------- `loan_amount` | Amount of the loan received, in US dollars `interest_rate` | Interest rate on the loan, in an annual percentage `term` | The length of the loan, which is always set as a whole number of months `grade` | Loan grade, which takes a values A through G and represents the quality of the loan and its likelihood of being repaid `state` | US state where the borrower resides `annual_income` | Borrower’s annual income, including any second income, in US dollars `homeownership` | Indicates whether the person owns, owns but has a mortgage, or rents `debt_to_income` | Debt-to-income ratio `issue_month` | Month the loan was issued ] --- ## Variable types .small[ Variable | Description ----------------|------------- `loan_amount` | Amount of the loan received, in US dollars `interest_rate` | Interest rate on the loan, in an annual percentage `term` | The length of the loan, which is always set as a whole number of months `grade` | Loan grade, which takes a values A through G and represents the quality of the loan and its likelihood of being repaid `state` | US state where the borrower resides `annual_income` | Borrower’s annual income, including any second income, in US dollars `homeownership` | Indicates whether the person owns, owns but has a mortgage, or rents `debt_to_income` | Debt-to-income ratio `issue_month` | Month the loan was issued ] - Numerical variables: Continuous or discrete? - Categorical: Ordinal or not? --- ## Variable types Variable | Type ----------------|------------- `loan_amount` | numerical, continuous `interest_rate` | numerical, continuous `term` | numerical, discrete `grade` | categorical, ordinal `state` | categorical, not ordinal `annual_income` | numerical, continuous `homeownership` | categorical, not ordinal `debt_to_income` | numerical, continuous `issue_month` | date --- ## Describing numerical distributions - **Visual summaries**: - Histogram - Boxplot - Density plot - Line graph - Measures of **central tendency**: mean, median, mode - **Shape**: - Skewness: right-skewed, left-skewed, symmetric - Modality: unimodal, bimodal, multimodal, uniform - **Spread**: variance and standard deviation, range and interquartile range - **Unusual observations** - A **summary statistic** is a single number summarizing a large amount of data --- ## Histogram - Shows **shape, center, and spread** of the data - Contiguous (adjoining) boxes - Horizontal axis: what the data represents - Vertical axis: frequency or relative frequency .tiny[ .pull-left[ ``` r ggplot(loans, aes(x = loan_amount)) + geom_histogram() ``` ``` ## `stat_bin()` using `bins = 30`. Pick better value with ## `binwidth`. ``` <img src="lecture9_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ``` r hist(loans_full_schema$loan_amount) ``` <img src="lecture9_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" /> ] ] --- ## Histograms and binwidth .panelset[ .panel[.panel-name[binwidth = 1000] ``` r ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 1000) ``` <img src="lecture9_files/figure-html/unnamed-chunk-15-1.png" width="50%" style="display: block; margin: auto;" /> ] .panel[.panel-name[binwidth = 5000] ``` r ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 5000) ``` <img src="lecture9_files/figure-html/unnamed-chunk-16-1.png" width="50%" style="display: block; margin: auto;" /> ] .panel[.panel-name[binwidth = 20000] ``` r ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 20000) ``` <img src="lecture9_files/figure-html/unnamed-chunk-17-1.png" width="50%" style="display: block; margin: auto;" /> ] ] --- ## Adding labels .panelset[ .panel[.panel-name[Plot] <img src="lecture9_files/figure-html/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 5000) + labs( #<< x = "Loan amount ($)", #<< y = "Frequency", #<< title = "Amounts of Lending Club loans" #<< ) #<< ``` ] ] --- ## Summary -- - Descriptive statistics - Types of variables (numerical and categorical) - Describing numerical distributions - Histograms