Descriptive Statistics: Numerical Data

class: center, middle, inverse, title-slide

.title[
# Descriptive Statistics: Numerical Data
]
.subtitle[
## <br><br> STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 21, 2024
]

---

layout: true

---

## Course content

1. Fundamentals of R
  - Overview of data types and structures
  - Data manipulation and data visualization tools

2. **Descriptive statistics for numerical and categorical data**

3. Probability
  - Rules of probability computation; conditional probability
  - Basic probability models: Binomial, Normal and Poisson

4. Statistical inference
  - Sampling distributions of sample mean and sample proportion 
  - Hypothesis testing and confidence intervals for population mean and population proportion

---
## Today

- Descriptive statistics

- Types of variables (numerical and categorical)

- Describing numerical distributions

- Histograms
  
  - Measures of central tendency: mean, median, mode

- Shape: skewness and modality
  
  - Spread: variance and standard deviation, range and interquartile range

- Boxplots
    - Unusual observations
  
  - Density plot

---
## Descriptive statistics

- We've now learned about data manipulation and visualization tools

- What visualizations to do and what summary statistics to actually calculate?

- **Descriptive statistics** are numbers that are used to summarize and describe data

- **Numerical** or **graphical** ways to display the data

- Why is this a useful thing to do?

**Ages of students**: 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19, 18, 19, 20, 18, 20, 20, 19

---

## Terminology: Number of variables involved

- **Univariate** data analysis: distribution of single variable

- **Bivariate** data analysis: relationship between two variables

- **Multivariate** data analysis: relationship between many variables at once, usually focusing on the relationship between two while conditioning for others

---

## Terminology: Types of variables

- **Numerical** variables
  - E.g., age, length, temperature
  
  - **Continuous** variables can take on an infinite number of values
  
  - **Discrete** variables only take on non-negative whole numbers

- **Categorical** variables

- E.g., year in college, type of bike, meal 
  
  - **Ordinal** variables have levels that have a natural ordering

---
## Data: Lending Club

- Lending Club is a platform that allows individuals to lend to other individuals

- Data are available in the `openintro` package, called `loans_full_schema`

- Includes 10,000 loans made through the Lending Club; has 55 columns

.tiny[

``` r
library(openintro)
dplyr::glimpse(loans_full_schema) 
```

```
## Rows: 10,000
## Columns: 55
## $ emp_title                        <chr> "global config enginee…
## $ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 1…
## $ state                            <fct> NJ, HI, WI, PA, CA, KY…
## $ homeownership                    <fct> MORTGAGE, RENT, RENT, …
## $ annual_income                    <dbl> 90000, 40000, 40000, 3…
## $ verified_income                  <fct> Verified, Not Verified…
## $ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10…
## $ annual_income_joint              <dbl> NA, NA, NA, NA, 57000,…
## $ verification_income_joint        <fct> , , , , Verified, , No…
## $ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66,…
## $ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1…
## $ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3,…
## $ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007…
## $ inquiries_last_12m               <int> 6, 1, 4, 0, 7, 6, 1, 1…
## $ total_credit_lines               <int> 28, 30, 31, 4, 22, 32,…
## $ open_credit_lines                <int> 10, 14, 10, 4, 16, 12,…
## $ total_credit_limit               <int> 70795, 28800, 24193, 2…
## $ total_credit_utilized            <int> 38767, 4321, 16000, 49…
## $ num_collections_last_12m         <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_historical_failed_to_pay     <int> 0, 1, 0, 1, 0, 0, 0, 0…
## $ months_since_90d_late            <int> 38, NA, 28, NA, NA, 60…
## $ current_accounts_delinq          <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_collection_amount_ever     <int> 1250, 0, 432, 0, 0, 0,…
## $ current_installment_accounts     <int> 2, 0, 1, 1, 1, 0, 2, 2…
## $ accounts_opened_24m              <int> 5, 11, 13, 1, 6, 2, 1,…
## $ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, …
## $ num_satisfactory_accounts        <int> 10, 14, 10, 4, 16, 12,…
## $ num_accounts_120d_past_due       <int> 0, 0, 0, 0, 0, 0, 0, N…
## $ num_accounts_30d_past_due        <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_active_debit_accounts        <int> 2, 3, 3, 2, 10, 1, 3, …
## $ total_debit_limit                <int> 11100, 16500, 4300, 19…
## $ num_total_cc_accounts            <int> 14, 24, 14, 3, 20, 27,…
## $ num_open_cc_accounts             <int> 8, 14, 8, 3, 15, 12, 7…
## $ num_cc_carrying_balance          <int> 6, 4, 6, 2, 13, 5, 6, …
## $ num_mort_accounts                <int> 1, 0, 0, 0, 0, 3, 2, 7…
## $ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100…
## $ tax_liens                        <int> 0, 0, 0, 1, 0, 0, 0, 0…
## $ public_record_bankrupt           <int> 0, 1, 0, 0, 0, 0, 0, 0…
## $ loan_purpose                     <fct> moving, debt_consolida…
## $ application_type                 <fct> individual, individual…
## $ loan_amount                      <int> 28000, 5000, 2000, 216…
## $ term                             <dbl> 60, 36, 36, 36, 36, 36…
## $ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6…
## $ installment                      <dbl> 652.53, 167.54, 71.40,…
## $ grade                            <fct> C, C, D, A, C, A, C, B…
## $ sub_grade                        <fct> C3, C1, D1, A3, C3, A3…
## $ issue_month                      <fct> Mar-2018, Feb-2018, Fe…
## $ loan_status                      <fct> Current, Current, Curr…
## $ initial_listing_status           <fct> whole, whole, fraction…
## $ disbursement_method              <fct> Cash, Cash, Cash, Cash…
## $ balance                          <dbl> 27015.86, 4651.37, 182…
## $ paid_total                       <dbl> 1999.330, 499.120, 281…
## $ paid_principal                   <dbl> 984.14, 348.63, 175.37…
## $ paid_interest                    <dbl> 1015.19, 150.49, 106.4…
## $ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
```
]
---

## Selected variables

``` r
loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade, 
         state, annual_income, homeownership, debt_to_income,
         issue_month)
glimpse(loans)
```

```
## Rows: 10,000
## Columns: 9
## $ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade          <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …
## $ issue_month    <fct> Mar-2018, Feb-2018, Feb-2018, Jan-2018, …
```

---

## Selected variables

.small[
Variable        | Description
----------------|-------------
`loan_amount`   |	Amount of the loan received, in US dollars
`interest_rate` |	Interest rate on the loan, in an annual percentage
`term`	        | The length of the loan, which is always set as a whole number of months
`grade`	        | Loan grade (A-G), represents the quality of the loan and its likelihood of being repaid
`state`         |	US state where the borrower resides
`annual_income` |	Borrower’s annual income, including any second income, in US dollars
`homeownership`	| Indicates whether the person owns, owns but has a mortgage, or rents
`debt_to_income` | Debt-to-income ratio
`issue_month` | Month the loan was issued
]

---

## Variable types

- Numerical variables: Continuous or discrete?
- Categorical: Ordinal or not?

---

## Variable types

Variable        | Type
----------------|-------------
`loan_amount`   |	numerical, continuous
`interest_rate` |	numerical, continuous
`term`	        | numerical, discrete
`grade`	        | categorical, ordinal
`state`         |	categorical, not ordinal
`annual_income` |	numerical, continuous
`homeownership`	| categorical, not ordinal
`debt_to_income` | numerical, continuous
`issue_month` | date

---

## Describing numerical distributions

- **Visual summaries**:
  - Histogram
  - Boxplot
  - Density plot
  - Line graph 
  
- Measures of **central tendency**: mean, median, mode

- **Shape**:
    - Skewness: right-skewed, left-skewed, symmetric 
    - Modality: unimodal, bimodal, multimodal, uniform

- **Spread**: variance and standard deviation, range and interquartile range

- **Unusual observations**

- A **summary statistic** is a single number summarizing a large amount of data

---

## Histogram

- Shows **shape, center, and spread** of the data

- Contiguous (adjoining) boxes
  - Horizontal axis: what the data represents
  - Vertical axis: frequency or relative frequency

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram()
```

```
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
```

<img src="lecture10_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[

``` r
hist(loans_full_schema$loan_amount)
```

<img src="lecture10_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---
## Histograms and binwidth

.panelset[
.panel[.panel-name[binwidth = 1000]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 1000)
```

<img src="lecture10_files/figure-html/unnamed-chunk-8-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 5000]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000)
```

<img src="lecture10_files/figure-html/unnamed-chunk-9-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 20000]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 20000)
```

<img src="lecture10_files/figure-html/unnamed-chunk-10-1.png" width="50%" style="display: block; margin: auto;" />
]
]

---

## Adding labels

.panelset[
.panel[.panel-name[Plot]
<img src="lecture10_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000) +
  labs( #<<
    x = "Loan amount ($)", #<<
    y = "Frequency", #<<
    title = "Amounts of Lending Club loans" #<<
  ) #<<
```
]
]

---
## Population vs. sample (briefly; more later)
- A **sample** is a portion or **subset** of the larger **population**

- E.g., population may be UC Davis students; randomly sample 300 students on the Quad this morning

- Population **parameter**, e.g., population mean
  - This is a fixed quantity

- Sample **statistic**, e.g., sample mean
  - Depends on the sample

---
## Measures of central tendency

- **Mean**: "Average", sum the numbers and divide by the count (`mean()`)
  
  `$\bar{x} = \frac{x_1 + x_2 + ... + x_n}{n}$`, where `$x$` is the variable of interest, the subscripts index the `$n$` observations, and `$\bar{x}$` denotes the **sample mean**. 
  
  The **population mean** is often denoted by `$\mu$`.
  
- **Median**: "Middle value", arrange in ascending order (`median()`)
  
- **Mode**: Most frequent value (`mode()` does not do what you might think)

- Note: you will sometimes need the `na.rm = TRUE` option

---
## Measures of central tendency

- `mean(loans$loan_amount)` = 16361.92

- `median(loans$loan_amount)` = 14500

- Mode is 10000 (Exercise: how to get this in R?)

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram() +
  geom_vline(xintercept = median(loans$loan_amount),
             col = "blue") +
  geom_vline(xintercept = mean(loans$loan_amount),
             col = "red")
```

]
]

.pull-right[
<img src="lecture10_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]

---
## Measures of shape: Skewness

- **Skewness** is to the side of the longer *tail*

- **Positive skew/right skew**: mean > median
  
  - **Negative skew/left skew**: mean < median

---
## Measures of shape: Modality
- **Mode** is the most frequent value, but in real-world data sets, there might not be any observations with the same value.

- A mode is represented by a **prominent peak in the distribution**

- **Unimodal** = one prominent peak, **bimodal** = two prominent peaks, **multimodal** = more than two prominent peaks, uniform

---
## Loans data

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000)
```

What is the skewness and modality?

---
## Measures of spread: Variance and standard deviation

- **Red distribution**: concentrated closely near the mean
- **Blue distribution**: more widely spread out from the mean
- They have the same mean, skewness, modality

---
## Measures of spread: Variance and standard deviation
- **Standard deviation** measures how far data values are from their mean

- **Deviation** is the distance of an observation from its mean, `$x_i - \bar{x}$`

- **Sample variance**: "Take the square of deviations and find the mean"
  - `$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n - 1}$`
  - For the denominator, use `$n-1$` instead of `$n$` to make it an *unbiased estimator of the population mean*

- **Sample standard deviation**, `$s = \sqrt{s^2}$`

- In R, `sd()` for sample standard deviation, `var()` for sample variance

- **Population** variance and standard deviation are often denoted by `$\sigma^2$` and `$\sigma$`

---
## Measures of spread: Variance and standard deviation

- Standard deviation can roughly be interpreted as the mean distance from mean

- **Rules of thumb** for symmetric, bell-shaped distributions: 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively

---
## Measures of spread: Variance and standard deviation
<img src="img/sdRules.png" width="60%" style="display: block; margin: auto;" />

Example: Wait times at a restaurant have a mean of 25 minutes and standard deviation of 5 minutes.

David waits for 35 minutes. This wait time is longer than roughly __% of wait times.

David waits for 10 minutes. This wait time is shorter than roughly __% of wait times.

---
## Measures of spread: Range and interquartile range

- **Percentile**: a number that divides ordered data into hundredths
  - Median = 50th percentile

- **Quartile**: a number that divides ordered data into quarters 
  - First quartile = 25th percentile
  - Second quartile = Median = 50th percentile
  - Third quartile = 75th percentile

- **Interquartile range** (IQR) = 3rd - 1st quartile
  - `IQR()` in R

- **Range** = Max - min

- **Five-number summary**: Min, 1Q, Median, 3Q, Max
  - `summary()` in R (also gives mean)

---
## Loans data
.tiny[

``` r
sd(loans$loan_amount)
```

```
## [1] 10301.96
```

``` r
var(loans$loan_amount)
```

```
## [1] 106130313
```

``` r
sqrt(var(loans$loan_amount))
```

```
## [1] 10301.96
```

``` r
summary(loans$loan_amount)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    8000   14500   16362   24000   40000
```

``` r
IQR(loans$loan_amount)
```

```
## [1] 16000
```
]

`homeownership` is a factor variable with three levels, `MORTGAGE`, `OWN` and `RENT`. How do we calculate the variance for each type of home ownership status?

---

## Percentiles
Vertical lines for 5th, 25th percentile, median, mean, 75th and 95th percentiles

.panelset[
.panel[.panel-name[Plot]
<img src="lecture10_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram() +
  geom_vline(xintercept = median(loans$loan_amount),
             col = "blue") +
  geom_vline(xintercept = quantile(loans$loan_amount, .05),
             col = "lightblue") +
  geom_vline(xintercept = quantile(loans$loan_amount, .25),
           col = "lightblue") + 
  geom_vline(xintercept = quantile(loans$loan_amount, .75),
           col = "lightblue") +
  geom_vline(xintercept = quantile(loans$loan_amount, .95),
           col = "lightblue") +
  geom_vline(xintercept = mean(loans$loan_amount),
             col = "red")
```
]
]

---

## Boxplots 
<img src="img/boxHist.png" width="80%" style="display: block; margin: auto;" />

---
## Boxplots

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_boxplot() +
  labs(x = "Loan amount") +
  scale_y_continuous(breaks = NULL)
```

- Lower whisker, box (1Q, median, 3Q), upper whisker
- Total length of the box is IQR
- The length of each whisker is up to 1.5*IQR
- Any points beyond that are **outliers**, observations that are unusually far from the rest of the data
- Outliers appear as points

---

## Box plot and outliers

Income data are often skewed (right or left?)

``` r
ggplot(loans, aes(x = annual_income)) +
  geom_boxplot()+
  scale_y_continuous(breaks = NULL)
```

---
## Boxplots in base R
.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_boxplot()
```

<img src="lecture10_files/figure-html/unnamed-chunk-26-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[

``` r
boxplot(loans$loan_amount)
```

<img src="lecture10_files/figure-html/unnamed-chunk-27-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Another way to remove y-axis labels

We saw `scale_y_continuous(breaks = NULL)` earlier

.panelset[
.panel[.panel-name[Plot]
<img src="lecture10_files/figure-html/unnamed-chunk-28-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_boxplot() +
  labs(
    x = "Loan amount ($)",
    y = NULL,
    title = "Loan amounts of Lending Club loans"
  ) +
  theme( #<<
    axis.ticks.y = element_blank(), #<<
    axis.text.y = element_blank() #<<
  ) #<<
```
]
]

---

## Density plot

Density plots are an alternative to histograms

.tiny[
.pull-left[

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_density()
```

<img src="lecture10_files/figure-html/unnamed-chunk-29-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[

``` r
plot(density(loans$loan_amount))
```

<img src="lecture10_files/figure-html/unnamed-chunk-30-1.png" width="100%" style="display: block; margin: auto;" />
]
]
---

## Density plots and adjusting bandwidth

.panelset[
.panel[.panel-name[adjust = 0.5]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 0.5)
```

<img src="lecture10_files/figure-html/unnamed-chunk-31-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[adjust = 1]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 1) # default bandwidth
```

<img src="lecture10_files/figure-html/unnamed-chunk-32-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[adjust = 2]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2)
```

<img src="lecture10_files/figure-html/unnamed-chunk-33-1.png" width="50%" style="display: block; margin: auto;" />
]
]

---

## Title and labels

.panelset[
.panel[.panel-name[Plot]
<img src="lecture10_files/figure-html/unnamed-chunk-34-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2) +
  labs( #<<
    x = "Loan amount ($)", #<<
    y = "Density", #<<
    title = "Amounts of Lending Club loans" #<<
  ) #<<
```
]
]

---
## Relationships between numerical variables

- Paired or bivariate data

- Scatterplot

- Hexplot

- Correlation
  
  - Line graph

---
## Summary

- Describing numerical distributions

- Histograms
  
  - Measures of central tendency: mean, median, mode

- Shape: skewness and modality
  
  - Spread: variance and standard deviation, range and interquartile range

- Boxplots
    - Unusual observations
  
  - Density plot