class: center, middle, inverse, title-slide .title[ # Fundamentals of R: More Data Structures, Visualization ] .subtitle[ ##
STA35A: Statistical Data Science 1 ] .author[ ### Xiao Hui Tai ] .date[ ### October 6, 2023 ] --- layout: true <!-- <div class="my-footer"> --> <!-- <span> --> <!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> --> <!-- </span> --> <!-- </div> --> --- <style type="text/css"> .small .remark-code { font-size: 80%; } .tiny .remark-code { font-size: 50%; } </style> ## Recap -- - Arrays - Accessing array values - Operating on arrays - Array functions - Matrices - Matrix multiplication - Other matrix operators - Introduction to lists - Accessing pieces of lists - Working with lists --- ## Reminders - First homework will be posted this afternoon on course website - Due next Thursday at 9pm - Guidelines are the same as for labs: - PDF files only - Submission through Gradescope (accessible through Canvas) - If you collaborate with others, write their names in your submission --- ## Today - Lists (continued) - Data frames, or more generally "data sets" - Exploratory data analysis --- ## Course content 1. Fundamentals of R - **Overview of data types and structures** - **Data manipulation and data visualization tools** 2. Descriptive statistics for numerical and categorical data 3. Probability - Rules of probability computation; conditional probability - Basic probability models: Binomial, Normal and Poisson 4. Statistical inference - Sampling distributions of sample mean and sample proportion - Hypothesis testing and confidence intervals for population mean and population proportion --- ## Naming list elements - We saw how to name elements of a list while constructing them - We can also add names later on: ```r my.distribution <- list("exponential", 7, FALSE) names(my.distribution) <- c("family", "mean", "is.symmetric") my.distribution ``` ``` ## $family ## [1] "exponential" ## ## $mean ## [1] 7 ## ## $is.symmetric ## [1] FALSE ``` --- Lists have a special short-cut way of using names, `$` (which removes names and structures): ```r my.distribution[["family"]] ``` ``` ## [1] "exponential" ``` ```r my.distribution$family ``` ``` ## [1] "exponential" ``` ```r my.distribution[1] ``` ``` ## $family ## [1] "exponential" ``` --- ## Names in lists Creating a list with names: ```r another.distribution <- list(family="gaussian", mean = 7, sd = 1, is.symmetric = TRUE) ``` Adding named elements: ```r my.distribution$was.estimated <- FALSE my.distribution[["last.updated"]] <- "2011-08-30" ``` Removing a named list element, by assigning it the value `NULL`: ```r my.distribution$was.estimated <- NULL ``` --- ## Structure of lists - We saw the output of `str()` with arrays earlier on - `str()` is particularly useful for lists, since it allows us to easily get an idea of what is in the list. ```r str(my.distribution) ``` ``` ## List of 4 ## $ family : chr "exponential" ## $ mean : num 7 ## $ is.symmetric: logi FALSE ## $ last.updated: chr "2011-08-30" ``` --- ## `lapply()` When each list element has the same structure, a particularly useful function is `lapply()` ```r myList <- replicate(8, rnorm(n = 10), simplify = FALSE) str(myList) ``` ``` ## List of 8 ## $ : num [1:10] -1.207 0.277 1.084 -2.346 0.429 ... ## $ : num [1:10] -0.4772 -0.9984 -0.7763 0.0645 0.9595 ... ## $ : num [1:10] 0.134 -0.491 -0.441 0.46 -0.694 ... ## $ : num [1:10] 1.102 -0.476 -0.709 -0.501 -1.629 ... ## $ : num [1:10] 1.449 -1.069 -0.855 -0.281 -0.994 ... ## $ : num [1:10] -1.806 -0.582 -1.109 -1.015 -0.162 ... ## $ : num [1:10] 0.6566 2.549 -0.0348 -0.6696 -0.0076 ... ## $ : num [1:10] 0.00689 -0.45547 -0.36652 0.64829 2.07027 ... ``` --- ```r lapply(myList, mean) ``` ``` ## [[1]] ## [1] -0.3831574 ## ## [[2]] ## [1] -0.1181707 ## ## [[3]] ## [1] -0.3879468 ## ## [[4]] ## [1] -0.7661931 ## ## [[5]] ## [1] -0.6097971 ## ## [[6]] ## [1] -0.2788647 ## ## [[7]] ## [1] 0.6165922 ## ## [[8]] ## [1] -0.04230209 ``` --- ## `lapply()` Another useful function is `unlist()`, which removes the list structure ```r unlist(lapply(myList, mean), use.names = FALSE) ``` ``` ## [1] -0.38315741 -0.11817071 -0.38794682 -0.76619306 -0.60979706 ## [6] -0.27886474 0.61659223 -0.04230209 ``` --- ## Concept of key-value pairs - Lists give us a way to **store and look up data** by _name_, rather than by _position_ - This is a **useful programming concept** with many names: - Key-value pairs - Dictionaries - Associative arrays - Hashes - E.g., if all our distributions have components named `family`, we can look that up by name, without worrying about where it is in the list --- ## Data frames - A **data frame** is a special **list** containing vectors of equal length - Data frame = the classic data table, `\(n\)` rows for observations, `\(p\)` columns for variables - Lots of the statistical parts of R presume data frames - Not just a matrix because **columns can have different types** - Many **matrix functions** also work for data frames (`rowSums()`, `summary()`, `apply()`) <small>but no matrix multiplying data frames, even if all columns are numeric</small> --- ## Creating data frames Here we start with a matrix and turn it into a data frame: ```r a.matrix <- matrix(c(35, 8, 10, 4), nrow = 2) colnames(a.matrix) <- c("v1", "v2") a.matrix ``` ``` ## v1 v2 ## [1,] 35 10 ## [2,] 8 4 ``` ```r a.matrix[, "v1"] ``` ``` ## [1] 35 8 ``` <small>Does `a.matrix$v1` work?</small> --- ```r (a.data.frame <- data.frame(a.matrix)) ``` ``` ## v1 v2 ## 1 35 10 ## 2 8 4 ``` ```r a.data.frame$v1 # now this works ``` ``` ## [1] 35 8 ``` ```r a.data.frame[, "v1"] ``` ``` ## [1] 35 8 ``` ```r a.data.frame[1, ] ``` ``` ## v1 v2 ## 1 35 10 ``` ```r colMeans(a.data.frame) ``` ``` ## v1 v2 ## 21.5 7.0 ``` --- ## Adding rows and columns We can add columns during construction of the data frame: ```r (a.data.frame <- data.frame(a.matrix, logicals = c(TRUE, FALSE))) ``` ``` ## v1 v2 logicals ## 1 35 10 TRUE ## 2 8 4 FALSE ``` We can also add columns by name ```r a.data.frame$newCol <- 1:2 a.data.frame ``` ``` ## v1 v2 logicals newCol ## 1 35 10 TRUE 1 ## 2 8 4 FALSE 2 ``` Now remove `newCol` ```r a.data.frame <- a.data.frame[, -4] ``` --- ## Adding rows and columns We can also add rows or columns to an array or data-frame with `rbind()` and `cbind()`, but be careful about forced type conversions ```r rbind(a.data.frame, list(v1 = -3, v2 = -5, logicals = TRUE)) ``` ``` ## v1 v2 logicals ## 1 35 10 TRUE ## 2 8 4 FALSE ## 3 -3 -5 TRUE ``` ```r rbind(a.data.frame, c(3, 4, 6)) ``` ``` ## v1 v2 logicals ## 1 35 10 1 ## 2 8 4 0 ## 3 3 4 6 ``` <small>What happened here?</small> --- ## More complicated data structures: structures of structures - Internally, a data frame is basically a **list of vectors** - List elements can even be other lists, - which may contain other data structures, including other lists, - which may contain other data structures... - This **recursion** lets us build arbitrarily complicated data structures from the basic ones --- ## More complicated data structures: structures of structures Most complicated objects are (usually) lists of data structures ```r a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5)) str(a) ``` ``` ## List of 4 ## $ a: int [1:3] 1 2 3 ## $ b: chr "a string" ## $ c: num 3.14 ## $ d:List of 2 ## ..$ : num -1 ## ..$ : num -5 ``` --- ## Data frames, data sets - We've seen data frames. This is a commonly used data structure that we get after reading in a data set into R. - In a data set in general, - Each row is an **observation**, `\(n\)` - Each column is a **variable**, `\(p\)` - Often, the **first things we want to do** when given a data set are to figure out 1. What is in it (what dimensions, what variables) 2. What the main characteristics of the variables are. - We've seen a few tools and functions for working with data frames in "base R," now we will look at some tools from `dplyr` --- <img src="img/tidyverse.png" width="100%" style="display: block; margin: auto;" /> https://www.tidyverse.org/ - What we've seen so far: "base R" - `ggplot2` for plotting, `dplyr` for data manipulation --- ## First question: What's in a data set? ### Example: Star Wars data - `starwars` data set in the `dplyr` package ```r dplyr::starwars ``` ``` ## # A tibble: 87 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke… 172 77 blond fair blue 19 male mascu… ## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… ## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… ## 4 Dart… 202 136 none white yellow 41.9 male mascu… ## 5 Leia… 150 49 brown light brown 19 fema… femin… ## 6 Owen… 178 120 brown,… light blue 52 male mascu… ## # … with 81 more rows, 5 more variables: homeworld <chr>, ## # species <chr>, films <list>, vehicles <list>, ## # starships <list>, and abbreviated variable names ## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year ``` (A `tibble` is the `tidyverse` version of the data frame.) --- We've seen `str()`. `dplyr::glimpse()` produces cleaner output in this case: ```r dplyr::glimpse(starwars) ``` ``` ## Rows: 87 ## Columns: 14 ## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth V… ## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 1… ## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, … ## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, gr… ## $ skin_color <chr> "fair", "gold", "white, blue", "white", "lig… ## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", … ## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, N… ## $ sex <chr> "male", "none", "none", "male", "female", "m… ## $ gender <chr> "masculine", "masculine", "masculine", "masc… ## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine",… ## $ species <chr> "Human", "Droid", "Droid", "Human", "Human",… ## $ films <list> <"The Empire Strikes Back", "Revenge of the… ## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <… ## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TI… ``` --- How many rows and columns does this data set have? What does each row represent? What does each column represent? ```r ?starwars ``` <img src="img/starwars-help.png" width="100%" style="display: block; margin: auto;" /> --- How many rows and columns does this data set have? ```r nrow(starwars) # number of rows ``` ``` ## [1] 87 ``` ```r ncol(starwars) # number of columns ``` ``` ## [1] 14 ``` ```r dim(starwars) # dimensions (row column) ``` ``` ## [1] 87 14 ``` As we've seen, columns (variables) in data frames can be accessed with `$`: ```r dataframe$var_name ``` --- ## Second question: what are the main characteristics of the data? **Exploratory data analysis** (EDA) is an approach to summarizing the **main characteristics** of a data set <img src="img/elephant.jpg" width="60%" style="display: block; margin: auto;" /> --- ## Exploratory data analysis - Often, this is **visual** - We might also calculate **summary statistics**, e.g., mean, median - We might also **manipulate or transform** the data before visualizing or calculating summary statistics - e.g., filter certain values, group continuous variables into buckets, take log-transformation - We will first introduce **visual summaries** and tools for data manipulation, then talk about **numerical summaries**. - We saw a visualization example in the first lecture. Here are a few more. --- ## Visualization example 1: Mass vs. height in Star Wars data How would you describe the **relationship** between mass and height of Starwars characters? What other variables would help us understand data points that don't follow the **overall trend**? Who is the not so tall but much heavier character? <img src="lecture5_files/figure-html/unnamed-chunk-27-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Jabba! <img src="lecture5_files/figure-html/unnamed-chunk-28-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Visualization Example 2: Anscombe's quartet .small[ .pull-left[ ``` ## set x y ## 1 I 10 8.04 ## 2 I 8 6.95 ## 3 I 13 7.58 ## 4 I 9 8.81 ## 5 I 11 8.33 ## 6 I 14 9.96 ## 7 I 6 7.24 ## 8 I 4 4.26 ## 9 I 12 10.84 ## 10 I 7 4.82 ## 11 I 5 5.68 ## 12 II 10 9.14 ## 13 II 8 8.14 ## 14 II 13 8.74 ## 15 II 9 8.77 ## 16 II 11 9.26 ## 17 II 14 8.10 ## 18 II 6 6.13 ## 19 II 4 3.10 ## 20 II 12 9.13 ## 21 II 7 7.26 ## 22 II 5 4.74 ``` ] .pull-right[ ``` ## set x y ## 23 III 10 7.46 ## 24 III 8 6.77 ## 25 III 13 12.74 ## 26 III 9 7.11 ## 27 III 11 7.81 ## 28 III 14 8.84 ## 29 III 6 6.08 ## 30 III 4 5.39 ## 31 III 12 8.15 ## 32 III 7 6.42 ## 33 III 5 5.73 ## 34 IV 8 6.58 ## 35 IV 8 5.76 ## 36 IV 8 7.71 ## 37 IV 8 8.84 ## 38 IV 8 8.47 ## 39 IV 8 7.04 ## 40 IV 8 5.25 ## 41 IV 19 12.50 ## 42 IV 8 5.56 ## 43 IV 8 7.91 ## 44 IV 8 6.89 ``` ] ] --- ## Summary statistics are identical ```r Tmisc::quartet %>% group_by(set) %>% summarize( mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y), r = cor(x, y) ) ``` ``` ## # A tibble: 4 × 6 ## set mean_x mean_y sd_x sd_y r ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 I 9 7.50 3.32 2.03 0.816 ## 2 II 9 7.50 3.32 2.03 0.816 ## 3 III 9 7.5 3.32 2.03 0.816 ## 4 IV 9 7.50 3.32 2.03 0.817 ``` (Don't worry if you don't know what a standard deviation or correlation is; we will come back to this) --- ## Visualizing Anscombe's quartet <img src="lecture5_files/figure-html/quartet-plot-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Visualization Example 3: Facebook visits .question[ How are people reporting lower vs. higher values of FB visits? ] <img src="lecture5_files/figure-html/unnamed-chunk-29-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Summary -- - Lists (continued) - Names in lists - `lapply()` - Data frames, or more generally "data sets" - Creating data frames - `tidyverse` and `dplyr` - Exploratory data analysis - Some visualization examples