class: center, middle, inverse, title-slide

.title[
# Fundamentals of R: More Data Structures
]
.subtitle[
## STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 4, 2024
]

---

layout: true

<!-- <div class="my-footer"> -->
<!-- <span> -->
<!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> -->
<!-- </span> -->
<!-- </div> -->

---

<style type="text/css">
.small .remark-code {
  font-size: 80%;
}
.tiny .remark-code {
  font-size: 50%;
}
</style>

## Announcements

- First homework will be posted this afternoon on course website
- Due next Thursday at 9pm

---

## Today

- Lists
- Data frames, or more generally "data sets"

---

## Lists

- Lists are a generic container
- Sequence of values, _not_ necessarily all of the same type

``` r
my.distribution <- list("exponential", 7, FALSE)
my.distribution
```

```
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE
```

- Most of what you can do with vectors you can also do with lists
- This is an unnamed list

---

## Lists

- Elements can be vectors of any type, or other data structures like matrices
- This is a named list

``` r
l <- list(
  x = 1:4,
  y = c("hi", "hello", "jello"),
  z = matrix(c(TRUE, FALSE, FALSE, FALSE), nrow = 2)
)
l
```

```
## $x
## [1] 1 2 3 4
## 
## $y
## [1] "hi"    "hello" "jello"
## 
## $z
##       [,1]  [,2]
## [1,]  TRUE FALSE
## [2,] FALSE FALSE
```

---

## Lists

Make an empty list to fill in later

``` r
myList <- vector(mode = "list", length = 4)
myList
```

```
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
```

---

## Accessing pieces of lists

Can use `[ ]` as with vectors or use `[[ ]]`, but only with a single index

`[[ ]]` drops names and structures, `[ ]` does not

``` r
l[1]
```

```
## $x
## [1] 1 2 3 4
```

``` r
l[[1]]
```

```
## [1] 1 2 3 4
```

<small>Does `l[[1:2]]` work?</small>

---

## Accessing pieces of lists

Helpful illustration from R for Data Science (first edition, Chapter 20.5.3):

.pull-left[
<img src="img/pepperShaker1.png" width="110%" style="display: block; margin: auto;" />
]

.pull-right[
<img
src="img/pepperShaker2.png" width="110%" style="display: block; margin: auto;" />
]

---

## Working with lists

.pull-left[

``` r
my.distribution
```

```
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE
```

]

.pull-right[

``` r
is.character(my.distribution)
```

```
## [1] FALSE
```

``` r
is.character(my.distribution[[1]])
```

```
## [1] TRUE
```

``` r
my.distribution[[2]]^2
```

```
## [1] 49
```

]

<small>What happens if you try `my.distribution[2]^2`?</small>

<small>What happens if you try `[[ ]]` on a vector?</small>

---

## Filling in lists

``` r
myList[[1]] <- 1:10
myList
```

```
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
```

<small>What happens if you try `myList[1] <- 1:10`?</small>

---

## Expanding and contracting lists

Add to lists with `c()` (also works with vectors):

``` r
my.distribution <- c(my.distribution, 7)
my.distribution
```

```
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE
## 
## [[4]]
## [1] 7
```

---

## Naming list elements

- We saw how to name elements of a list while constructing them
- We can also add names later on:

``` r
my.distribution <- list("exponential", 7, FALSE)
names(my.distribution) <- c("family", "mean", "is.symmetric")
my.distribution
```

```
## $family
## [1] "exponential"
## 
## $mean
## [1] 7
## 
## $is.symmetric
## [1] FALSE
```

---

Lists have a special shortcut for accessing elements by name, `$` (which, like `[[ ]]`, drops names and structures):

``` r
my.distribution[["family"]]
```

```
## [1] "exponential"
```

``` r
my.distribution$family
```

```
## [1] "exponential"
```

``` r
my.distribution[1]
```

```
## $family
## [1] "exponential"
```

---

## Names in lists

Creating a list with names:

``` r
another.distribution <- list(family = "gaussian", mean = 7, sd = 1, is.symmetric = TRUE)
```

Adding named elements:

``` r
my.distribution$was.estimated <- FALSE
my.distribution[["last.updated"]] <- "2011-08-30"
```

Removing a named list element, by
assigning it the value `NULL`:

``` r
my.distribution$was.estimated <- NULL
```

---

## Structure of lists

- We saw the output of `str()` with arrays earlier on
- `str()` is particularly useful for lists, since it allows us to easily get an idea of what is in the list.

``` r
str(my.distribution)
```

```
## List of 4
##  $ family      : chr "exponential"
##  $ mean        : num 7
##  $ is.symmetric: logi FALSE
##  $ last.updated: chr "2011-08-30"
```

---

## `lapply()`

When each list element has the same structure, a particularly useful function is `lapply()`

``` r
myList <- replicate(8, rnorm(n = 10), simplify = FALSE)
str(myList)
```

```
## List of 8
##  $ : num [1:10] -1.207 0.277 1.084 -2.346 0.429 ...
##  $ : num [1:10] -0.4772 -0.9984 -0.7763 0.0645 0.9595 ...
##  $ : num [1:10] 0.134 -0.491 -0.441 0.46 -0.694 ...
##  $ : num [1:10] 1.102 -0.476 -0.709 -0.501 -1.629 ...
##  $ : num [1:10] 1.449 -1.069 -0.855 -0.281 -0.994 ...
##  $ : num [1:10] -1.806 -0.582 -1.109 -1.015 -0.162 ...
##  $ : num [1:10] 0.6566 2.549 -0.0348 -0.6696 -0.0076 ...
##  $ : num [1:10] 0.00689 -0.45547 -0.36652 0.64829 2.07027 ...
```

---

``` r
lapply(myList, mean)
```

```
## [[1]]
## [1] -0.3831574
## 
## [[2]]
## [1] -0.1181707
## 
## [[3]]
## [1] -0.3879468
## 
## [[4]]
## [1] -0.7661931
## 
## [[5]]
## [1] -0.6097971
## 
## [[6]]
## [1] -0.2788647
## 
## [[7]]
## [1] 0.6165922
## 
## [[8]]
## [1] -0.04230209
```

---

## `lapply()`

Another useful function is `unlist()`, which removes the list structure and returns a plain vector

``` r
unlist(lapply(myList, mean), use.names = FALSE)
```

```
## [1] -0.38315741 -0.11817071 -0.38794682 -0.76619306 -0.60979706
## [6] -0.27886474  0.61659223 -0.04230209
```

---

## Concept of key-value pairs

- Lists give us a way to **store and look up data** by _name_, rather than by _position_
- This is a **useful programming concept** with many names:
    - Key-value pairs
    - Dictionaries
    - Associative arrays
    - Hashes
- E.g., if all our distributions have components named `family`, we can look that up by name, without worrying about where it is in the list

---

## Data frames

- A **data frame** is a special **list** containing vectors of equal length
- Data frame = the classic data table, `\(n\)` rows for observations, `\(p\)` columns for variables
- Lots of the statistical parts of R presume data frames
- Not just a matrix because **columns can have different types**
- Many **matrix functions** also work for data frames (`rowSums()`, `summary()`, `apply()`)

<small>but data frames cannot be matrix-multiplied, even if all columns are numeric</small>

---

## Creating data frames

Here we start with a matrix and turn it into a data frame:

``` r
a.matrix <- matrix(c(35, 8, 10, 4), nrow = 2)
colnames(a.matrix) <- c("v1", "v2")
a.matrix
```

```
##      v1 v2
## [1,] 35 10
## [2,]  8  4
```

``` r
a.matrix[, "v1"]
```

```
## [1] 35  8
```

<small>Does `a.matrix$v1` work?</small>

---

``` r
(a.data.frame <- data.frame(a.matrix))
```

```
##   v1 v2
## 1 35 10
## 2  8  4
```

``` r
a.data.frame$v1 # now this works
```

```
## [1] 35  8
```

``` r
a.data.frame[, "v1"]
```

```
## [1] 35  8
```

``` r
a.data.frame[1, ]
```

```
##   v1 v2
## 1 35 10
```

``` r
colMeans(a.data.frame)
```

```
##   v1   v2 
## 21.5  7.0
```

---

## Adding rows and columns

We can add columns during construction of the data frame:

``` r
(a.data.frame <- data.frame(a.matrix, logicals = c(TRUE, FALSE)))
```

```
##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE
```

We can also add columns by name

``` r
a.data.frame$newCol <- 1:2
a.data.frame
```

```
##   v1 v2 logicals newCol
## 1 35 10     TRUE      1
## 2  8  4    FALSE      2
```

Now remove `newCol`

``` r
a.data.frame <- a.data.frame[, -4]
```

---

## Adding rows and columns

We can also add rows or columns to an array or data frame with `rbind()` and `cbind()`, but be careful about forced type conversions

``` r
rbind(a.data.frame, list(v1 = -3, v2 = -5, logicals = TRUE))
```

```
##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE
## 3 -3 -5     TRUE
```

``` r
rbind(a.data.frame, c(3, 4, 6))
```

```
##   v1 v2 logicals
## 1 35 10        1
## 2  8  4        0
## 3  3  4        6
```

<small>What happened here?</small>

---

## Data frames, data sets

- We've seen data frames. This is a commonly used data structure that we get after reading a data set into R.
- In a data set in general,
    - Each row is an **observation** (`\(n\)` rows)
    - Each column is a **variable** (`\(p\)` columns)
- Often, the **first things we want to do** when given a data set are to figure out
    1. What is in it (what dimensions, what variables)
    2. What the main characteristics of the variables are.
- We've seen a few tools and functions for working with data frames in "base R"; next we will look at some tools from `dplyr`

---

<img src="img/tidyverse.png" width="100%" style="display: block; margin: auto;" />

https://www.tidyverse.org/

- What we've seen so far: "base R"
- `ggplot2` for plotting, `dplyr` for data manipulation

---

## Summary

--

- Lists (continued)
    - Names in lists
    - `lapply()`
- Data frames, or more generally "data sets"
    - Creating data frames
- `tidyverse` and `dplyr`
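
---

## Extra: looking up list elements by name

As a recap of the key-value idea from the list slides, here is a minimal sketch (not part of the original lecture code). It rebuilds the two distributions used earlier as named lists, reverses the order of one of them, and extracts `family` from both by name. The name `all.distributions` is hypothetical and introduced only for this illustration.

``` r
# Two distributions stored as named lists (same values as in the earlier slides)
my.distribution <- list(family = "exponential", mean = 7, is.symmetric = FALSE)
another.distribution <- list(family = "gaussian", mean = 7, sd = 1, is.symmetric = TRUE)

# Reverse one list so that "family" is no longer the first element
another.distribution <- rev(another.distribution)

# all.distributions is a hypothetical name used only for this sketch
all.distributions <- list(my.distribution, another.distribution)

# Look up "family" by name in every list, then flatten the result to a vector
families <- unlist(lapply(all.distributions, function(d) d[["family"]]))
families  # "exponential" "gaussian"
```

Because the lookup goes through the name rather than the position, reversing `another.distribution` does not change the result.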