Data visualization and Descriptive Statistics

class: center, middle, inverse, title-slide

.title[
# Data visualization and Descriptive Statistics
]
.subtitle[
## <br><br> STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 14, 2024
]

---

layout: true

---

## Reminders/announcements

- Lab is due today at 9pm

- HW 2 posted on course website, due Thursday at 9pm
  - Note: solutions will be released at 9pm

- Schedule for rest of this week:
  - Wednesday: Oscar Rivera will do review during regular lecture time (same room); **UPDATE:** 2-3 PM OH (OR) at MSB 1143
  - Thursday: **no lab**, instead 12-1 PM OH (XHT; virtual, will post link on Piazza); 3-4 PM OH (OR); HW 2 due
  - Friday: midterm during regular lecture time (same room); **no homework and no OH**

---
## Midterm
- Midterm will cover material until today

- Closed-book

- You don't need your computers

- No make-up exams

- Drop policy for exams: 1 midterm may be dropped

- Practice problems on Canvas

---
## Today

- Misc `ggplot()` items
  
- Descriptive statistics

---
## A note on piping and layering

- The pipe `%>%` is used mainly in `dplyr` pipelines
  - It passes the output of the previous line of code as the first input of the next line of code

- `+` in `ggplot2` plots is used for "layering"
  - Create the plot in layers, separated by `+`

---

## dplyr

Incorrect:

``` r
hotels +
  select(hotel, lead_time)
```

```
## Error in eval(expr, envir, enclos): object 'hotel' not found
```

Correct:

``` r
hotels %>%
  select(hotel, lead_time)
```

.tiny[

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel       342
## 2 Resort Hotel       737
## 3 Resort Hotel         7
## 4 Resort Hotel        13
## 5 Resort Hotel        14
## 6 Resort Hotel        14
## # ℹ 119,384 more rows
```
]

---

## ggplot2

Incorrect:

.small[

``` r
ggplot(hotels, aes(x = hotel, fill = deposit_type)) %>%
  geom_bar()
```

```
## Error in `geom_bar()`:
## ! `mapping` must be created by `aes()`.
## ℹ Did you use `%>%` or `|>` instead of `+`?
```
]

Correct:

``` r
ggplot(hotels, aes(x = hotel, fill = deposit_type)) +
  geom_bar()
```

---
## Code styling

Many of the styling principles are consistent across `%>%` and `+`:

- always a space before
- always a line break after (for pipelines with more than 2 lines)

Not recommended:

``` r
ggplot(hotels,aes(x=hotel,fill=deposit_type))+geom_bar()
```

Recommended:

``` r
ggplot(hotels, aes(x = hotel, fill = deposit_type)) +
  geom_bar()
```

---

## Midterm essentials

- What `dplyr` function filters only observations that fulfill some condition?

- What function returns the number of rows in a data frame?

- What `dplyr` function creates new variables?

- How to create a binary variable that depends on whether a variable fulfills some condition? (e.g., `nycflights13` example: Create a new variable that indicates whether or not the flight departed late.)

- `group_by` and `summarize`

- How to make scatterplots
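
---

## Midterm essentials: a sketch

One possible way to answer the questions on the previous slide. The data frame `flights_toy` and its values are made up for illustration; it only stands in for `nycflights13::flights`, and the variable name `departed_late` is likewise an assumption.

``` r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (NOT the real nycflights13 data)
flights_toy <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(-3, 12, 0, 45, -1),
  arr_delay = c(-10, 20, 5, 50, -4)
)

# filter(): keep only observations that fulfill a condition
late <- flights_toy %>%
  filter(dep_delay > 0)

# nrow(): number of rows in a data frame
nrow(late)

# mutate() creates new variables; if_else() makes the variable binary
flights_toy <- flights_toy %>%
  mutate(departed_late = if_else(dep_delay > 0, "late", "on time"))

# group_by() + summarize(): one summary row per group
by_carrier <- flights_toy %>%
  group_by(carrier) %>%
  summarize(mean_dep_delay = mean(dep_delay), n = n())

# Scatterplot: two numerical variables with geom_point()
p <- ggplot(flights_toy, aes(x = dep_delay, y = arr_delay)) +
  geom_point()
```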

---
## Course content

1. Fundamentals of R
  - Overview of data types and structures
  - Data manipulation and data visualization tools

2. **Descriptive statistics for numerical and categorical data**

3. Probability
  - Rules of probability computation; conditional probability
  - Basic probability models: Binomial, Normal and Poisson

4. Statistical inference
  - Sampling distributions of sample mean and sample proportion 
  - Hypothesis testing and confidence intervals for population mean and population proportion

---
## Descriptive statistics

- We've now learned about data manipulation and visualization tools

- Which visualizations should we make, and which summary statistics should we actually calculate?

- **Descriptive statistics** are numbers that are used to summarize and describe data

- **Numerical** or **graphical** ways to display the data

- Why is this a useful thing to do?

**Ages of students**: 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19
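
---

## Descriptive statistics: a quick illustration

A long list of raw values is hard to read, but a few numbers summarize it quickly. The sketch below uses only the first 14 ages from the previous slide (an assumption made to keep the example short), so the counts describe that subset and not the full list.

``` r
# First 14 ages from the list on the previous slide
ages <- c(18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19)

# Frequency of each age
table(ages)

# Two single-number summaries of the center
mean(ages)
median(ages)  # 19
```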

---

## Terminology: Number of variables involved

- **Univariate** data analysis: distribution of a single variable

- **Bivariate** data analysis: relationship between two variables

- **Multivariate** data analysis: relationships among many variables at once, usually focusing on the relationship between two while conditioning on the others

---

## Terminology: Types of variables

- **Numerical** variables
  - E.g., age, length, temperature
  
  - **Continuous** variables can take on any value within a range, so infinitely many values are possible
  
  - **Discrete** variables only take on non-negative whole numbers

- **Categorical** variables
  - E.g., year in college, type of bike, meal
  
  - **Ordinal** variables have levels that have a natural ordering
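
---

## Types of variables in R: a sketch

One way these variable types can be represented in base R. All object names and values below are made up for illustration.

``` r
# Numerical, continuous: stored as double
temperature <- c(18.5, 21.3, 19.8)

# Numerical, discrete: whole-number counts stored as integer
n_courses <- c(3L, 4L, 5L)

# Categorical, not ordinal: a factor with unordered levels
bike <- factor(c("road", "mountain", "road"))

# Categorical, ordinal: an ordered factor with explicit level order
year <- factor(
  c("freshman", "senior", "junior"),
  levels = c("freshman", "sophomore", "junior", "senior"),
  ordered = TRUE
)

is.ordered(bike)   # FALSE
is.ordered(year)   # TRUE
year[1] < year[2]  # TRUE: freshman comes before senior
```

Comparisons such as `<` are only meaningful for ordered factors, which is what distinguishes ordinal variables in practice.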

---
## Data: Lending Club

- Lending Club is a platform that allows individuals to lend to other individuals

- Data are available in the `openintro` package, called `loans_full_schema`

- Includes 10,000 loans made through the Lending Club; has 55 columns

.tiny[

``` r
library(openintro)
dplyr::glimpse(loans_full_schema) 
```

```
## Rows: 10,000
## Columns: 55
## $ emp_title                        <chr> "global config enginee…
## $ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 1…
## $ state                            <fct> NJ, HI, WI, PA, CA, KY…
## $ homeownership                    <fct> MORTGAGE, RENT, RENT, …
## $ annual_income                    <dbl> 90000, 40000, 40000, 3…
## $ verified_income                  <fct> Verified, Not Verified…
## $ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10…
## $ annual_income_joint              <dbl> NA, NA, NA, NA, 57000,…
## $ verification_income_joint        <fct> , , , , Verified, , No…
## $ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66,…
## $ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1…
## $ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3,…
## $ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007…
## $ inquiries_last_12m               <int> 6, 1, 4, 0, 7, 6, 1, 1…
## $ total_credit_lines               <int> 28, 30, 31, 4, 22, 32,…
## $ open_credit_lines                <int> 10, 14, 10, 4, 16, 12,…
## $ total_credit_limit               <int> 70795, 28800, 24193, 2…
## $ total_credit_utilized            <int> 38767, 4321, 16000, 49…
## $ num_collections_last_12m         <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_historical_failed_to_pay     <int> 0, 1, 0, 1, 0, 0, 0, 0…
## $ months_since_90d_late            <int> 38, NA, 28, NA, NA, 60…
## $ current_accounts_delinq          <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_collection_amount_ever     <int> 1250, 0, 432, 0, 0, 0,…
## $ current_installment_accounts     <int> 2, 0, 1, 1, 1, 0, 2, 2…
## $ accounts_opened_24m              <int> 5, 11, 13, 1, 6, 2, 1,…
## $ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, …
## $ num_satisfactory_accounts        <int> 10, 14, 10, 4, 16, 12,…
## $ num_accounts_120d_past_due       <int> 0, 0, 0, 0, 0, 0, 0, N…
## $ num_accounts_30d_past_due        <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_active_debit_accounts        <int> 2, 3, 3, 2, 10, 1, 3, …
## $ total_debit_limit                <int> 11100, 16500, 4300, 19…
## $ num_total_cc_accounts            <int> 14, 24, 14, 3, 20, 27,…
## $ num_open_cc_accounts             <int> 8, 14, 8, 3, 15, 12, 7…
## $ num_cc_carrying_balance          <int> 6, 4, 6, 2, 13, 5, 6, …
## $ num_mort_accounts                <int> 1, 0, 0, 0, 0, 3, 2, 7…
## $ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100…
## $ tax_liens                        <int> 0, 0, 0, 1, 0, 0, 0, 0…
## $ public_record_bankrupt           <int> 0, 1, 0, 0, 0, 0, 0, 0…
## $ loan_purpose                     <fct> moving, debt_consolida…
## $ application_type                 <fct> individual, individual…
## $ loan_amount                      <int> 28000, 5000, 2000, 216…
## $ term                             <dbl> 60, 36, 36, 36, 36, 36…
## $ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6…
## $ installment                      <dbl> 652.53, 167.54, 71.40,…
## $ grade                            <fct> C, C, D, A, C, A, C, B…
## $ sub_grade                        <fct> C3, C1, D1, A3, C3, A3…
## $ issue_month                      <fct> Mar-2018, Feb-2018, Fe…
## $ loan_status                      <fct> Current, Current, Curr…
## $ initial_listing_status           <fct> whole, whole, fraction…
## $ disbursement_method              <fct> Cash, Cash, Cash, Cash…
## $ balance                          <dbl> 27015.86, 4651.37, 182…
## $ paid_total                       <dbl> 1999.330, 499.120, 281…
## $ paid_principal                   <dbl> 984.14, 348.63, 175.37…
## $ paid_interest                    <dbl> 1015.19, 150.49, 106.4…
## $ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
```
]
---

## Selected variables

``` r
loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade, 
         state, annual_income, homeownership, debt_to_income,
         issue_month)
glimpse(loans)
```

```
## Rows: 10,000
## Columns: 9
## $ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade          <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …
## $ issue_month    <fct> Mar-2018, Feb-2018, Feb-2018, Jan-2018, …
```

---

## Selected variables

.small[
Variable        | Description
----------------|-------------
`loan_amount`   | Amount of the loan received, in US dollars
`interest_rate` | Interest rate on the loan, as an annual percentage
`term`          | The length of the loan, which is always set as a whole number of months
`grade`         | Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid
`state`         | US state where the borrower resides
`annual_income` | Borrower’s annual income, including any second income, in US dollars
`homeownership` | Indicates whether the person owns, owns but has a mortgage, or rents
`debt_to_income` | Debt-to-income ratio
`issue_month` | Month the loan was issued
]

---

## Variable types

- Numerical variables: Continuous or discrete?
- Categorical: Ordinal or not?

---

## Variable types

Variable        | Type
----------------|-------------
`loan_amount`   | numerical, continuous
`interest_rate` | numerical, continuous
`term`          | numerical, discrete
`grade`         | categorical, ordinal
`state`         | categorical, not ordinal
`annual_income` | numerical, continuous
`homeownership` | categorical, not ordinal
`debt_to_income` | numerical, continuous
`issue_month` | date

---

## Describing numerical distributions

- **Visual summaries**:
  - Histogram
  - Boxplot
  - Density plot
  - Line graph 
  
- Measures of **central tendency**: mean, median, mode

- **Shape**:
    - Skewness: right-skewed, left-skewed, symmetric 
    - Modality: unimodal, bimodal, multimodal, uniform

- **Spread**: variance and standard deviation, range and interquartile range

- **Unusual observations**

- A **summary statistic** is a single number summarizing a large amount of data
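
---

## Summary statistics in R: a sketch

The measures of center and spread listed on the previous slide map onto base R functions. The vector below holds only the first six loan amounts visible in the `glimpse()` output, so the results illustrate the functions and do not describe the full data set.

``` r
# First six loan amounts shown in the glimpse() output (illustration only)
loan_amount <- c(28000, 5000, 2000, 21600, 23000, 5000)

# Central tendency
mean(loan_amount)    # 14100
median(loan_amount)  # 13300

# Base R has no function for the statistical mode, so tabulate the
# values and take the most frequent one
names(which.max(table(loan_amount)))  # "5000"

# Spread
var(loan_amount)    # variance
sd(loan_amount)     # standard deviation (square root of the variance)
range(loan_amount)  # minimum and maximum: 2000 28000
IQR(loan_amount)    # interquartile range: Q3 minus Q1
```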

---

## Histogram

- Shows **shape, center, and spread** of the data

- Contiguous (adjoining) boxes
  - Horizontal axis: values of the variable, divided into bins
  - Vertical axis: frequency or relative frequency

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram()
```

```
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
```

<img src="lecture9_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[

``` r
hist(loans_full_schema$loan_amount)
```

<img src="lecture9_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---
## Histograms and binwidth

.panelset[
.panel[.panel-name[binwidth = 1000]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 1000)
```

<img src="lecture9_files/figure-html/unnamed-chunk-15-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 5000]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000)
```

<img src="lecture9_files/figure-html/unnamed-chunk-16-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 20000]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 20000)
```

<img src="lecture9_files/figure-html/unnamed-chunk-17-1.png" width="50%" style="display: block; margin: auto;" />
]
]
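
---

## Binwidth and bin counts: a sketch

To see numerically why the binwidth changes the picture, the sketch below bins the same values two ways with base R's `hist()` and `plot = FALSE`. It uses only the first six loan amounts from the `glimpse()` output, and the bin edges are chosen for illustration. `geom_histogram()` positions its bins differently by default, so its counts need not match these exactly.

``` r
# First six loan amounts shown in the glimpse() output (illustration only)
loan_amount <- c(28000, 5000, 2000, 21600, 23000, 5000)

# Bin edges every $5,000 versus every $10,000
narrow <- hist(loan_amount, breaks = seq(0, 30000, by = 5000), plot = FALSE)
wide   <- hist(loan_amount, breaks = seq(0, 30000, by = 10000), plot = FALSE)

narrow$counts  # 3 0 0 0 2 1
wide$counts    # 3 0 3
```

Narrow bins show the gap between the small and large loans; wide bins hide detail within each group.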

---

## Adding labels

.panelset[
.panel[.panel-name[Plot]
<img src="lecture9_files/figure-html/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000) +
  labs( #<<
    x = "Loan amount ($)", #<<
    y = "Frequency", #<<
    title = "Amounts of Lending Club loans" #<<
  ) #<<
```
]
]

---
## Summary
--

- Descriptive statistics

- Types of variables (numerical and categorical)
  
- Describing numerical distributions

- Histograms