Fundamentals of R: More Data Structures, Visualization

class: center, middle, inverse, title-slide

.title[
# Fundamentals of R: More Data Structures, Visualization
]
.subtitle[
## <br><br> STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 6, 2023
]

---

layout: true

---

## Recap
--

- Arrays
  - Accessing array values
  - Operating on arrays
  - Array functions

- Matrices
  - Matrix multiplication
  - Other matrix operators

- Introduction to lists 
  - Accessing pieces of lists 
  - Working with lists

---
## Reminders

- First homework will be posted this afternoon on course website

- Due next Thursday at 9pm

- Guidelines are the same as for labs:
  - PDF files only
  - Submission through Gradescope (accessible through Canvas)
  - If you collaborate with others, write their names in your submission

---
## Today
- Lists (continued)

- Data frames, or more generally "data sets"

- Exploratory data analysis

---
## Course content

1. Fundamentals of R
  - **Overview of data types and structures**
  - **Data manipulation and data visualization tools**

2. Descriptive statistics for numerical and categorical data

3. Probability
  - Rules of probability computation; conditional probability
  - Basic probability models: Binomial, Normal and Poisson

4. Statistical inference
  - Sampling distributions of sample mean and sample proportion 
  - Hypothesis testing and confidence intervals for population mean and population proportion

---
## Naming list elements
- We saw how to name elements of a list while constructing them

- We can also add names later on:

```r
my.distribution <- list("exponential", 7, FALSE)
names(my.distribution) <- c("family", "mean", "is.symmetric")
my.distribution
```

```
## $family
## [1] "exponential"
## 
## $mean
## [1] 7
## 
## $is.symmetric
## [1] FALSE
```
---

Lists have a special shortcut for accessing elements by name, `$`. Like `[[ ]]`, it returns the element itself, dropping the name and the list structure; `[ ]` keeps both:

```r
my.distribution[["family"]]
```

```
## [1] "exponential"
```

```r
my.distribution$family
```

```
## [1] "exponential"
```

```r
my.distribution[1]
```

```
## $family
## [1] "exponential"
```
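
To make the difference concrete, here is a minimal sketch comparing what each accessor returns. It re-creates `my.distribution` (with names set at construction) only so that it runs on its own; the variable names `by.dollar`, `by.double`, and `by.single` are made up for illustration.

```r
# Re-create the list from the earlier slide so this block is self-contained
my.distribution <- list(family = "exponential", mean = 7, is.symmetric = FALSE)

by.dollar <- my.distribution$family       # the element itself
by.double <- my.distribution[["family"]]  # same result as $
by.single <- my.distribution[1]           # a sub-list of length 1

class(by.dollar)                 # "character"
class(by.single)                 # "list"
names(by.single)                 # "family"
identical(by.dollar, by.double)  # TRUE
```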

---
## Names in lists

Creating a list with names:

```r
another.distribution <- list(family="gaussian", mean = 7, 
                             sd = 1, is.symmetric = TRUE)
```

Adding named elements:

```r
my.distribution$was.estimated <- FALSE
my.distribution[["last.updated"]] <- "2011-08-30"
```

Removing a named list element, by assigning it the value `NULL`:

```r
my.distribution$was.estimated <- NULL
```
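
As a quick check of the steps above, the following sketch records the names before and after the `NULL` assignment. It rebuilds the list so it is self-contained; `names.before` and `names.after` are illustrative names, not part of the original slides.

```r
# Rebuild the list, then add two named elements as on this slide
my.distribution <- list(family = "exponential", mean = 7, is.symmetric = FALSE)
my.distribution$was.estimated <- FALSE
my.distribution[["last.updated"]] <- "2011-08-30"
names.before <- names(my.distribution)

# Assigning NULL removes the element entirely (it does not store a NULL)
my.distribution$was.estimated <- NULL
names.after <- names(my.distribution)

setdiff(names.before, names.after)  # "was.estimated"
length(my.distribution)             # 4
```
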
---
## Structure of lists
- We saw the output of `str()` with arrays earlier on

- `str()` is particularly useful for lists, since it allows us to easily get an idea of what is in the list.

```r
str(my.distribution)
```

```
## List of 4
##  $ family      : chr "exponential"
##  $ mean        : num 7
##  $ is.symmetric: logi FALSE
##  $ last.updated: chr "2011-08-30"
```

---
## `lapply()`
When each list element has the same structure, a particularly useful function is `lapply()`, which applies a function to every element and returns the results as a list

```r
myList <- replicate(8, rnorm(n = 10), simplify = FALSE)
str(myList)
```

```
## List of 8
##  $ : num [1:10] -1.207 0.277 1.084 -2.346 0.429 ...
##  $ : num [1:10] -0.4772 -0.9984 -0.7763 0.0645 0.9595 ...
##  $ : num [1:10] 0.134 -0.491 -0.441 0.46 -0.694 ...
##  $ : num [1:10] 1.102 -0.476 -0.709 -0.501 -1.629 ...
##  $ : num [1:10] 1.449 -1.069 -0.855 -0.281 -0.994 ...
##  $ : num [1:10] -1.806 -0.582 -1.109 -1.015 -0.162 ...
##  $ : num [1:10] 0.6566 2.549 -0.0348 -0.6696 -0.0076 ...
##  $ : num [1:10] 0.00689 -0.45547 -0.36652 0.64829 2.07027 ...
```
---

```r
lapply(myList, mean)
```

```
## [[1]]
## [1] -0.3831574
## 
## [[2]]
## [1] -0.1181707
## 
## [[3]]
## [1] -0.3879468
## 
## [[4]]
## [1] -0.7661931
## 
## [[5]]
## [1] -0.6097971
## 
## [[6]]
## [1] -0.2788647
## 
## [[7]]
## [1] 0.6165922
## 
## [[8]]
## [1] -0.04230209
```
---
## `lapply()`
Another useful function is `unlist()`, which removes the list structure

```r
unlist(lapply(myList, mean), use.names = FALSE)
```

```
## [1] -0.38315741 -0.11817071 -0.38794682 -0.76619306 -0.60979706
## [6] -0.27886474  0.61659223 -0.04230209
```
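
For comparison, base R also offers `sapply()` and `vapply()`, which apply the function and simplify the result in one step. These are not used in the slides above; the sketch below only shows that, under these assumptions, all three approaches give the same numeric vector. The seed is arbitrary and is set only for reproducibility.

```r
set.seed(1)  # arbitrary seed; the values will differ from those on the slide
myList <- replicate(8, rnorm(n = 10), simplify = FALSE)

means.unlist <- unlist(lapply(myList, mean), use.names = FALSE)
means.sapply <- sapply(myList, mean)                          # simplifies automatically
means.vapply <- vapply(myList, mean, FUN.VALUE = numeric(1))  # checks each result is one number

all.equal(means.unlist, means.sapply)  # TRUE
all.equal(means.unlist, means.vapply)  # TRUE
```
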
---
## Concept of key-value pairs

- Lists give us a way to **store and look up data** by _name_, rather than by _position_

- This is a **useful programming concept** with many names: 
  - Key-value pairs
  - Dictionaries
  - Associative arrays
  - Hashes

- E.g., if all our distributions have components named `family`, we can look that up by name, without worrying about where it is in the list
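
A minimal sketch of this idea: the two lists below store `family` in different positions, yet the same lookup by name works for both. The objects `dist.a`, `dist.b`, and `distributions` are hypothetical examples.

```r
# Two distributions whose components appear in a different order
dist.a <- list(family = "exponential", mean = 7, is.symmetric = FALSE)
dist.b <- list(mean = 0, sd = 1, family = "gaussian", is.symmetric = TRUE)
distributions <- list(dist.a, dist.b)

# Look up the value stored under the key "family" in each list
families <- sapply(distributions, function(d) d[["family"]])
families  # "exponential" "gaussian"
```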

---
## Data frames
- A **data frame** is a special **list** containing vectors of equal length

- Data frame = the classic data table, `$n$` rows for observations, `$p$` columns for variables

- Lots of the statistical parts of R presume data frames

- Not just a matrix because **columns can have different types**

- Many **matrix functions** also work for data frames (`rowSums()`, `summary()`, `apply()`)

<small>but no matrix multiplying data frames, even if all columns are numeric</small>
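
The points above can be checked with a small sketch. The data frame `df` is a made-up example; `tryCatch()` is used only to capture the error message instead of stopping the script.

```r
df <- data.frame(v1 = c(35, 8), v2 = c(10, 4))

is.list(df)   # TRUE: a data frame is a list of columns
length(df)    # 2: one list element per column
rowSums(df)   # many matrix functions work directly

# Matrix multiplication does not: capture the error message as a string
mult.attempt <- tryCatch(df %*% df, error = function(e) conditionMessage(e))

# Converting to a matrix first makes %*% available
m <- as.matrix(df)
product <- m %*% m
```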

---
## Creating data frames

Here we start with a matrix and turn it into a data frame:

```r
a.matrix <- matrix(c(35, 8, 10, 4), nrow = 2)
colnames(a.matrix) <- c("v1", "v2")
a.matrix
```

```
##      v1 v2
## [1,] 35 10
## [2,]  8  4
```

```r
a.matrix[, "v1"]  
```

```
## [1] 35  8
```
<small>Does `a.matrix$v1` work?</small>

---

```r
(a.data.frame <- data.frame(a.matrix))
```

```
##   v1 v2
## 1 35 10
## 2  8  4
```

```r
a.data.frame$v1 # now this works 
```

```
## [1] 35  8
```

```r
a.data.frame[, "v1"]
```

```
## [1] 35  8
```

```r
a.data.frame[1, ]
```

```
##   v1 v2
## 1 35 10
```

```r
colMeans(a.data.frame)
```

```
##   v1   v2 
## 21.5  7.0
```

---
## Adding rows and columns
We can add columns during construction of the data frame:

```r
(a.data.frame <- data.frame(a.matrix, logicals = c(TRUE, FALSE)))
```

```
##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE
```

We can also add columns by name

```r
a.data.frame$newCol <- 1:2
a.data.frame
```

```
##   v1 v2 logicals newCol
## 1 35 10     TRUE      1
## 2  8  4    FALSE      2
```

Now remove `newCol`

```r
a.data.frame <- a.data.frame[, -4]
```
---
## Adding rows and columns
We can also add rows or columns to an array or data frame with `rbind()` and `cbind()`, but be careful about forced type conversions

```r
rbind(a.data.frame, list(v1 = -3, v2 = -5, logicals = TRUE))
```

```
##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE
## 3 -3 -5     TRUE
```

```r
rbind(a.data.frame, c(3, 4, 6))
```

```
##   v1 v2 logicals
## 1 35 10        1
## 2  8  4        0
## 3  3  4        6
```
<small>What happened here?</small>
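
One way to investigate is to compare the column classes before and after each `rbind()`. The sketch below rebuilds `a.data.frame` so it runs on its own; `with.list` and `with.vector` are illustrative names.

```r
a.data.frame <- data.frame(v1 = c(35, 8), v2 = c(10, 4),
                           logicals = c(TRUE, FALSE))

# A list can hold a different type per element, so column types are preserved
with.list <- rbind(a.data.frame, list(v1 = -3, v2 = -5, logicals = TRUE))

# c(3, 4, 6) is a numeric vector, so the logical column is converted to numeric
with.vector <- rbind(a.data.frame, c(3, 4, 6))

sapply(a.data.frame, class)  # logicals is "logical"
sapply(with.list, class)     # logicals is still "logical"
sapply(with.vector, class)   # logicals is now "numeric"
```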

---
## More complicated data structures: structures of structures
- Internally, a data frame is basically a **list of vectors**  
- List elements can even be other lists, 
  - which may contain other data structures, including other lists,  
  - which may contain other data structures...

- This **recursion** lets us build arbitrarily complicated data structures from the basic ones

---
## More complicated data structures: structures of structures

Most complicated objects are (usually) lists of data structures

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
str(a)
```

```
## List of 4
##  $ a: int [1:3] 1 2 3
##  $ b: chr "a string"
##  $ c: num 3.14
##  $ d:List of 2
##   ..$ : num -1
##   ..$ : num -5
```
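
Elements of a nested structure can be reached by chaining the accessors seen earlier. A short sketch using the same list `a`:

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

a$d[[2]]       # -5: second element of the inner list
a[["d"]][[1]]  # -1: same idea, using [[ ]] throughout
a$a[3]         # 3: third entry of the integer vector stored in a
class(a$d)     # "list"
```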

---

## Data frames, data sets

- We've seen data frames. This is the data structure we most commonly get after reading a data set into R.

- In a data set in general, 
  - Each row is an **observation** (`$n$` rows in total)
  - Each column is a **variable** (`$p$` columns in total)

- Often, the **first things we want to do** when given a data set are to figure out
  1. What is in it (what dimensions, what variables)
  2. What the main characteristics of the variables are.

- We've seen a few tools and functions for working with data frames in "base R"; now we will look at some tools from `dplyr`
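
As a base R illustration of these two first steps, the sketch below inspects the built-in `mtcars` data set. It is used here only as a stand-in because it ships with R; it does not appear elsewhere in these slides.

```r
cars <- datasets::mtcars  # built-in data frame, used as a stand-in example

# 1. What is in it?
dim(cars)      # 32 rows (observations), 11 columns (variables)
names(cars)    # variable names
str(cars)      # type and first few values of each variable

# 2. What are the main characteristics of the variables?
summary(cars$mpg)  # five-number summary and mean of one variable
head(cars, 3)      # first three observations
```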

---
<img src="img/tidyverse.png" width="100%" style="display: block; margin: auto;" />
https://www.tidyverse.org/
- What we've seen so far: "base R"
- `ggplot2` for plotting, `dplyr` for data manipulation

---
## First question: What's in a data set?

### Example: Star Wars data

- `starwars` data set in the `dplyr` package

```r
dplyr::starwars
```

```
## # A tibble: 87 × 14
##   name  height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender
##   <chr>  <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr> 
## 1 Luke…    172    77 blond   fair    blue       19   male  mascu…
## 2 C-3PO    167    75 <NA>    gold    yellow    112   none  mascu…
## 3 R2-D2     96    32 <NA>    white,… red        33   none  mascu…
## 4 Dart…    202   136 none    white   yellow     41.9 male  mascu…
## 5 Leia…    150    49 brown   light   brown      19   fema… femin…
## 6 Owen…    178   120 brown,… light   blue       52   male  mascu…
## # … with 81 more rows, 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names
## #   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year
```

(A `tibble` is the `tidyverse` version of the data frame.)

---

We've seen `str()`. `dplyr::glimpse()` produces cleaner output in this case:

```r
dplyr::glimpse(starwars)
```

```
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth V…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 1…
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, …
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, gr…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "lig…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", …
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, N…
## $ sex        <chr> "male", "none", "none", "male", "female", "m…
## $ gender     <chr> "masculine", "masculine", "masculine", "masc…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine",…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human",…
## $ films      <list> <"The Empire Strikes Back", "Revenge of the…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <…
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TI…
```

---

How many rows and columns does this data set have? What does each row represent? What does each column represent?

```r
?starwars
```

---

How many rows and columns does this data set have?

```r
nrow(starwars) # number of rows
```

```
## [1] 87
```

```r
ncol(starwars) # number of columns
```

```
## [1] 14
```

```r
dim(starwars)  # dimensions (rows, columns)
```

```
## [1] 87 14
```

As we've seen, columns (variables) in data frames can be accessed with `$`:

```r
dataframe$var_name
```

---

## Second question: what are the main characteristics of the data?

**Exploratory data analysis** (EDA) is an approach to summarizing the **main characteristics** of a data set

---

##  Exploratory data analysis

- Often, this is **visual**

- We might also calculate **summary statistics**, e.g., mean, median

- We might also **manipulate or transform** the data before visualizing or calculating summary statistics
  - e.g., filter certain values, group continuous variables into buckets, take log-transformation

- We will first introduce **visual summaries** and tools for data manipulation, then talk about **numerical summaries**.

- We saw a visualization example in the first lecture. Here are a few more.
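
The manipulations listed above can be sketched in base R as follows. This is only an illustration on the built-in `mtcars` data (an assumption; the course goes on to use `dplyr` for this), and the bucket boundaries are arbitrary.

```r
cars <- datasets::mtcars  # built-in data frame, used as a stand-in example

# Filter: keep only cars with automatic transmission (am == 0)
automatic <- cars[cars$am == 0, ]

# Bucket a continuous variable: horsepower into three arbitrary groups
hp.bucket <- cut(cars$hp, breaks = c(0, 100, 200, Inf),
                 labels = c("low", "medium", "high"))
table(hp.bucket)  # number of cars in each bucket

# Transform: natural log of horsepower
log.hp <- log(cars$hp)
```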

---

## Visualization example 1: Mass vs. height in Star Wars data

- How would you describe the **relationship** between mass and height of Star Wars characters?
- What other variables would help us understand data points that don't follow the **overall trend**?
- Who is the not-so-tall but much heavier character?
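
The plot itself is not reproduced here. The following is one plausible way to draw it with `ggplot2`, not necessarily the code used for the original slide; it assumes the `dplyr` and `ggplot2` packages are installed, and the object names `p` and `heaviest` are made up.

```r
library(dplyr)    # provides the starwars data, filter(), and pull()
library(ggplot2)

# Scatterplot of mass against height; print(p) draws it
p <- ggplot(starwars, aes(x = height, y = mass)) +
  geom_point(na.rm = TRUE) +  # drop characters with missing height or mass
  labs(x = "Height (cm)", y = "Mass (kg)",
       title = "Mass vs. height of Star Wars characters")

# Name of the character with the largest recorded mass
heaviest <- starwars %>%
  filter(mass == max(mass, na.rm = TRUE)) %>%
  pull(name)
```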

---

## Jabba!

---

## Visualization Example 2: Anscombe's quartet

.small[
.pull-left[

```
##    set  x     y
## 1    I 10  8.04
## 2    I  8  6.95
## 3    I 13  7.58
## 4    I  9  8.81
## 5    I 11  8.33
## 6    I 14  9.96
## 7    I  6  7.24
## 8    I  4  4.26
## 9    I 12 10.84
## 10   I  7  4.82
## 11   I  5  5.68
## 12  II 10  9.14
## 13  II  8  8.14
## 14  II 13  8.74
## 15  II  9  8.77
## 16  II 11  9.26
## 17  II 14  8.10
## 18  II  6  6.13
## 19  II  4  3.10
## 20  II 12  9.13
## 21  II  7  7.26
## 22  II  5  4.74
```
] 
.pull-right[

```
##    set  x     y
## 23 III 10  7.46
## 24 III  8  6.77
## 25 III 13 12.74
## 26 III  9  7.11
## 27 III 11  7.81
## 28 III 14  8.84
## 29 III  6  6.08
## 30 III  4  5.39
## 31 III 12  8.15
## 32 III  7  6.42
## 33 III  5  5.73
## 34  IV  8  6.58
## 35  IV  8  5.76
## 36  IV  8  7.71
## 37  IV  8  8.84
## 38  IV  8  8.47
## 39  IV  8  7.04
## 40  IV  8  5.25
## 41  IV 19 12.50
## 42  IV  8  5.56
## 43  IV  8  7.91
## 44  IV  8  6.89
```
]
]
---

## Summary statistics are (nearly) identical

```r
library(dplyr)  # provides %>%, group_by(), and summarize()

Tmisc::quartet %>%
  group_by(set) %>%
  summarize(
    mean_x = mean(x), 
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    r = cor(x, y)
  )
```

```
## # A tibble: 4 × 6
##   set   mean_x mean_y  sd_x  sd_y     r
##   <fct>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 I          9   7.50  3.32  2.03 0.816
## 2 II         9   7.50  3.32  2.03 0.816
## 3 III        9   7.5   3.32  2.03 0.816
## 4 IV         9   7.50  3.32  2.03 0.817
```

(Don't worry if you don't know what a standard deviation or correlation is; we will come back to this)

---

## Visualizing Anscombe's quartet
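
The figure for this slide is not reproduced here. As a sketch under stated assumptions, the code below uses the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`) rather than `Tmisc::quartet`, and base graphics rather than `ggplot2`. The helper `plot_quartet()` is a hypothetical name introduced for illustration.

```r
quartet <- datasets::anscombe  # columns x1..x4 and y1..y4

# Least-squares intercept and slope for each set: all close to 3 and 0.5
fits <- sapply(1:4, function(i) {
  x <- quartet[[paste0("x", i)]]
  y <- quartet[[paste0("y", i)]]
  coef(lm(y ~ x))
})

# Draw the four sets in a 2 x 2 grid, each with its fitted line
plot_quartet <- function() {
  old.par <- par(mfrow = c(2, 2))
  on.exit(par(old.par))  # restore the plotting layout afterwards
  set.labels <- c("I", "II", "III", "IV")
  for (i in 1:4) {
    x <- quartet[[paste0("x", i)]]
    y <- quartet[[paste0("y", i)]]
    plot(x, y, main = paste("Set", set.labels[i]),
         xlim = c(3, 20), ylim = c(2, 14))
    abline(lm(y ~ x))
  }
}
```

Calling `plot_quartet()` draws the four panels. Although the fitted lines are nearly the same, the point patterns look very different, which is the reason to plot the data rather than rely on summary statistics alone.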

---

## Visualization Example 3: Facebook visits
.question[ 
How are people reporting lower vs. higher values of FB visits?
]

---
## Summary
--

- Lists (continued)
  - Names in lists
  - `lapply()`

- Data frames, or more generally "data sets"
  - Creating data frames
  - `tidyverse` and `dplyr`

- Exploratory data analysis
  - Some visualization examples