class: center, middle, inverse, title-slide

.title[
# Fundamentals of R: More Data Structures
]
.subtitle[
## STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### September 30, 2024
]

---

layout: true

<!-- <div class="my-footer"> -->
<!-- <span> -->
<!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> -->
<!-- </span> -->
<!-- </div> -->

---

<style type="text/css">
.tiny .remark-code {
  font-size: 60%;
}
.small .remark-code {
  font-size: 80%;
}
</style>

## Recap

--

- **Data types**: Logical/Booleans, Integers, Double, Characters, Complex, Raw

- **Operators**: unary, binary, arithmetic, logical, assignment

- **Data structures** group related values together

- **Vectors** group values of the same type

  - Creating and accessing vectors

---

## Style guide

- https://style.tidyverse.org/index.html

---

## Today

- More on vectors:
  - Creating vectors
  - Attributes
  - Vector arithmetic
  - Other functions on vectors
- Arrays

---

## Creating vectors

`vector(length = 7)` returns a logical vector of length 7, initialized with `FALSE`s; helpful for filling things up later (see help page: `vector(mode = "logical", length = 0)`). Assigning a number to one element coerces the whole vector to numeric, so the remaining `FALSE`s print as 0.

``` r
weeklyHours <- vector(length = 7)
weeklyHours[5] <- 8
weeklyHours
```

```
## [1] 0 0 0 0 8 0 0
```

The colon operator produces a sequence

``` r
mySeq <- 2:5
mySeq
```

```
## [1] 2 3 4 5
```

---

## Creating vectors

Many other ways to produce sequences, e.g.,

``` r
(mySeq <- seq(from = 1, to = 10, by = 2))
```

```
## [1] 1 3 5 7 9
```

<small>(Enclosing an assignment statement in parentheses prints the result)</small>

---

## Vectors with additional attributes: Factors

- **Factors** are built on top of integer vectors
- These have a **fixed and known set of possible values**.
- Factors have **two components**: level numbers (integers) and level labels (characters)

``` r
tmp <- factor(c("BS", "MS", "PhD", "MS"))
tmp
```

```
## [1] BS  MS  PhD MS
## Levels: BS MS PhD
```

``` r
as.integer(tmp)
```

```
## [1] 1 2 3 2
```

---

## Vectors with additional attributes: Dates

- **Dates and date-times** are built on top of numeric vectors
- Dates are represented internally as the number of days since the origin, 1 Jan 1970

``` r
z <- as.Date("2020-01-01")
z
```

```
## [1] "2020-01-01"
```

``` r
typeof(z)
```

```
## [1] "double"
```

``` r
str(z)
```

```
##  Date[1:1], format: "2020-01-01"
```

---

## Vectors with additional attributes: Dates

``` r
z
```

```
## [1] "2020-01-01"
```

``` r
as.integer(z)
```

```
## [1] 18262
```

``` r
as.integer(z) / 365 # roughly 50 yrs
```

```
## [1] 50.03288
```

We will talk more about packages later on in the class, but the `lubridate` package is particularly useful for dealing with dates.

---

## Vector arithmetic

Operators apply to vectors "pairwise" or "elementwise":

``` r
x <- c(7, 8, 10, 45)
y <- c(-7, -8, -10, -45)
x + y
```

```
## [1] 0 0 0 0
```

``` r
x * y
```

```
## [1]   -49   -64  -100 -2025
```

``` r
x^c(1, 0, -1, 0.5)
```

```
## [1] 7.000000 1.000000 0.100000 6.708204
```

---

## Vector arithmetic

R will implicitly coerce the types of vectors to be compatible. E.g.:

``` r
TRUE + 4
```

```
## [1] 5
```

---

## Recycling

- R will also implicitly coerce the length of vectors.
- This is called vector **recycling**:
  - When a shorter vector is combined with a longer one, elements of the shorter vector are repeated or recycled, to make it the same length as the longer vector.

``` r
x <- c(7, 8, 10, 45)
x + c(-7, -8)
```

```
## [1]  0  0  3 37
```

Single numbers are vectors of length 1 for purposes of recycling:

``` r
2*x
```

```
## [1] 14 16 20 90
```

---

## Exercises

What do we expect?
``` r
myVec <- c(4, 6, 12, 2, 0)
myVec + 1:5
myVec + 1 # recycling
myVec * 5
rep(FALSE, 4) + 1:4 # Coercion
```

---

## Vectorized functions

Most built-in functions are **vectorized**, meaning that they will operate on a vector of numbers:

``` r
sample(1:10) + 100
```

```
##  [1] 110 106 105 104 101 108 102 107 109 103
```

<small>(what does `sample()` do?)</small>

``` r
x
```

```
## [1]  7  8 10 45
```

``` r
x > 9 # pairwise comparisons, where the scalar 9 is recycled
```

```
## [1] FALSE FALSE  TRUE  TRUE
```

---

## Vectorized functions

Lots of functions take vectors as arguments:

- `mean()`, `median()`, `sd()`, `var()`, `max()`, `min()`, `length()`, `sum()`: return single numbers
- `sort()` returns a new vector
- `hist()` takes a vector of numbers and produces a histogram
- `summary()` gives a six-number summary of numerical vectors (the five-number summary plus the mean)
- `any()` and `all()` are useful on Boolean vectors

---

## Comparison operators

Boolean operators work elementwise:

``` r
x
```

```
## [1]  7  8 10 45
```

``` r
(x > 9) & (x < 20)
```

```
## [1] FALSE FALSE  TRUE FALSE
```

To get the number of components that satisfy a certain condition:

``` r
sum(x > 9) # another example of coercion
```

```
## [1] 2
```

---

## Comparison operators

To compare whole vectors, it is best to use `identical()` or `all.equal()`:

``` r
x; y
```

```
## [1]  7  8 10 45
```

```
## [1]  -7  -8 -10 -45
```

``` r
x == -y
```

```
## [1] TRUE TRUE TRUE TRUE
```

``` r
identical(x, -y)
```

```
## [1] TRUE
```

``` r
all.equal(x, -y)
```

```
## [1] TRUE
```

---

``` r
identical(c(0.5 - 0.3, 0.3 - 0.1), c(0.3 - 0.1, 0.5 - 0.3))
```

```
## [1] FALSE
```

``` r
all.equal(c(0.5 - 0.3, 0.3 - 0.1), c(0.3 - 0.1, 0.5 - 0.3))
```

```
## [1] TRUE
```

To see what's going on:

``` r
print(.5 - .3, digits = 20)
```

```
## [1] 0.2000000000000000111
```

``` r
print(.3 - .1, digits = 20)
```

```
## [1] 0.19999999999999998335
```

Most decimal fractions cannot be represented exactly in binary floating-point arithmetic.
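A minimal sketch of comparing doubles within a tolerance rather than exactly. The helper name `nearlyEqual()` and the tolerance `1e-8` are assumptions made for illustration and are not part of the slides:

``` r
# Hypothetical helper (not from the slides): TRUE where two numeric
# vectors differ by less than `tol`, element by element.
nearlyEqual <- function(a, b, tol = 1e-8) {
  abs(a - b) < tol
}

a <- 0.5 - 0.3
b <- 0.3 - 0.1

a == b                    # exact comparison: the two doubles differ
nearlyEqual(a, b)         # comparison within a tolerance
isTRUE(all.equal(a, b))   # all.equal() wrapped so it is safe inside if()
```

When its arguments differ, `all.equal()` returns a character description rather than `FALSE`, which is why it is usually wrapped in `isTRUE()` before being used as a condition.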
For more, see https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal.

---

## Combining indexing and comparison operators

``` r
x; y
```

```
## [1]  7  8 10 45
```

```
## [1]  -7  -8 -10 -45
```

``` r
x > 9
```

```
## [1] FALSE FALSE  TRUE  TRUE
```

``` r
x[x > 9]
```

```
## [1] 10 45
```

``` r
y[x > 9]
```

```
## [1] -10 -45
```

---

## Combining indexing and comparison operators

`which()` turns a Boolean vector into a vector of the indices of its `TRUE` elements:

``` r
x; y
```

```
## [1]  7  8 10 45
```

```
## [1]  -7  -8 -10 -45
```

``` r
(places <- which(x > 9))
```

```
## [1] 3 4
```

``` r
y[places]
```

```
## [1] -10 -45
```

---

Note that `which()` and logical indexing behave differently when there are `NA` values: `which()` drops them, whereas logical indexing returns an `NA` for each one:

``` r
myNAvec <- c(1, 2, NA)
which(myNAvec > 1)
```

```
## [1] 2
```

``` r
myNAvec > 1
```

```
## [1] FALSE  TRUE    NA
```

``` r
myNAvec[which(myNAvec > 1)]
```

```
## [1] 2
```

``` r
myNAvec[myNAvec > 1]
```

```
## [1]  2 NA
```

---

## Aside: counting the number of missing values

``` r
myNAvec
```

```
## [1]  1  2 NA
```

``` r
is.na(myNAvec)
```

```
## [1] FALSE FALSE  TRUE
```

``` r
sum(is.na(myNAvec))
```

```
## [1] 1
```

---

## Exercises: comparison operators

What do we expect?
``` r
myVec <- c(4, 6, 12, 2, 0)
myVec < 10
(myVec < 10) | (myVec > 1)
myVec + 1 # recycling
myVec * 5
```

---

## Exercises

Get familiar with a few built-in R functions:

``` r
myVec <- c(4, 6, 12, 2, 0)
mean(myVec)
median(myVec)
table(myVec)
summary(myVec)
sort(myVec)
```

---

## Named components

You can give names to elements or components of vectors

.small[

``` r
x
```

```
## [1]  7  8 10 45
```

``` r
(names(x) <- c("v1", "v2", "v3", "fred"))
```

```
## [1] "v1"   "v2"   "v3"   "fred"
```

``` r
x[c("fred", "v1")]
```

```
## fred   v1
##   45    7
```

``` r
x[c(4, 1)]
```

```
## fred   v1
##   45    7
```

]

Note that the labels appear in what R prints but are not part of the values themselves

---

`names(x)` is just another vector (of characters):

``` r
names(y) <- names(x)
sort(names(x))
```

```
## [1] "fred" "v1"   "v2"   "v3"
```

``` r
which(names(x) == "fred")
```

```
## [1] 4
```

---

## Arrays

- Many data structures in R are made by adding bells and whistles to vectors
- **Arrays** are vectors with *dimensions*
- For example a two-dimensional array:

``` r
x <- c(7, 8, 10, 45)
x.arr <- array(x, dim = c(2, 2))
x.arr
```

```
##      [,1] [,2]
## [1,]    7   10
## [2,]    8   45
```

---

## Arrays

- Filled column-wise (by columns)
- `dim` says how many rows and columns

``` r
dim(x.arr)
```

```
## [1] 2 2
```

---

## Arrays with more than two dimensions

- Arrays can have three dimensions ( `\(r \times c \times h\)`; think about stacking `\(r \times c\)` matrices)
- Can also have `\(4, 5, \ldots n\)` dimensional arrays
- `dim` is a length `\(n\)` vector
  - Says the size/number of indices of each component
  - e.g., a `\(4 \times 3\)` array has `dim` length 2, elements are 4 and 3

.small[

``` r
myArr <- array(1:12, dim = c(4, 3))
myArr
```

```
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
```

``` r
dim(myArr)
```

```
## [1] 4 3
```

]

---

## Arrays with more than two dimensions

- A `\(4 \times 3 \times 2\)` array has `dim` length 3, elements are 4, 3 and 2

``` r
myArr <- array(1:24, dim = c(4, 3, 2))
myArr
```

```
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]   13   17   21
## [2,]   14   18   22
## [3,]   15   19   23
## [4,]   16   20   24
```

---

## Arrays with more than two dimensions

- A `\(4 \times 3 \times 2\)` array has `dim` length 3, elements are 4, 3 and 2

``` r
dim(myArr)
```

```
## [1] 4 3 2
```

Some other properties of the array:

``` r
is.vector(myArr)
```

```
## [1] FALSE
```

``` r
is.array(myArr)
```

```
## [1] TRUE
```

---

``` r
typeof(myArr)
```

```
## [1] "integer"
```

``` r
str(myArr)
```

```
##  int [1:4, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
```

``` r
attributes(myArr)
```

```
## $dim
## [1] 4 3 2
```

`typeof()` returns the type of the _elements_

`str()` gives the **structure**: here, an integer array, with three dimensions, the index range of each dimension, and then the first few values

---

## Summary

--

- Vectors:
  - Additional attributes: Factors, dates
  - Vector arithmetic
  - Other functions on vectors
  - Comparison operators
  - Indexing operators
  - Named components
- Arrays
  - Vectors with dimension
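
---

## Aside: arrays as vectors with dimensions

A minimal sketch of the idea that an array is a vector plus a `dim` attribute. The variable name `v` is made up for illustration and does not appear in the slides above:

``` r
# Start from a plain integer vector (hypothetical name `v`).
v <- 1:24
is.array(v)            # no dim attribute yet

# Attaching a dim attribute turns the same values into a 4 x 3 x 2 array.
dim(v) <- c(4, 3, 2)
is.array(v)

# Values are filled column-wise, then slice by slice, so element [i, j, k]
# sits at linear position i + (j - 1) * 4 + (k - 1) * 12.
v[2, 3, 2]
v[2 + (3 - 1) * 4 + (2 - 1) * 12]
```

The last two expressions pick out the same element, once by its three indices and once by its position in the underlying vector.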