Fundamentals of R: More Data Structures

class: center, middle, inverse, title-slide

.title[
# Fundamentals of R: More Data Structures
]
.subtitle[
## <br><br> STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### September 30, 2024
]

---

layout: true

---

## Recap

- **Data types**: Logical/Booleans, Integers, Double, Characters, Complex, Raw

- **Operators**: unary, binary, arithmetic, logical, assignment

- **Data structures** group related values together

- **Vectors** group values of the same type

- Creating and accessing vectors

---
## Style guide

- https://style.tidyverse.org/index.html

---
## Today
- More on vectors: 
  - Creating vectors 
  - Attributes
  - Vector arithmetic
  - Other functions on vectors
  
- Arrays

---
## Creating vectors
`vector(length = 7)` returns a logical vector of length 7, initialized with FALSEs; helpful for filling things up later (see help page: `vector(mode = "logical", length = 0)`)

``` r
weeklyHours <- vector(length = 7)
weeklyHours[5] <- 8
weeklyHours
```

```
## [1] 0 0 0 0 8 0 0
```

The colon operator produces a sequence

``` r
mySeq <- 2:5
mySeq
```

```
## [1] 2 3 4 5
```

---
## Creating vectors
Many other ways to produce sequences, e.g.,

``` r
(mySeq <- seq(from = 1, to = 10, by = 2))
```

```
## [1] 1 3 5 7 9
```
<small>(Enclosing an assignment statement in parentheses prints the result)</small>

---
## Vectors with additional attributes: Factors
- **Factors** are built on top of integer vectors

- These have a **fixed and known set of possible values**.

- Factors have **two components**: level numbers (integers) and level labels (characters)

``` r
tmp <- factor(c("BS", "MS", "PhD", "MS"))
tmp
```

```
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
```

``` r
as.integer(tmp)
```

```
## [1] 1 2 3 2
```

---

## Vectors with additional attributes: Dates

- **Dates and date-times** are built on top of numeric vectors

- Dates are represented internally as the number of days since the origin, 1 Jan 1970

``` r
z <- as.Date("2020-01-01")
z
```

```
## [1] "2020-01-01"
```

``` r
typeof(z)
```

```
## [1] "double"
```

``` r
str(z)
```

```
##  Date[1:1], format: "2020-01-01"
```

---

## Vectors with additional attributes: Dates

``` r
z
```

```
## [1] "2020-01-01"
```

``` r
as.integer(z)
```

```
## [1] 18262
```

``` r
as.integer(z) / 365 # roughly 50 yrs
```

```
## [1] 50.03288
```

We will talk more about packages later on in the class, but the `lubridate` package is particularly useful for dealing with dates.

---
## Vector arithmetic

Operators apply to vectors "pairwise" or "elementwise":

``` r
x <- c(7, 8, 10, 45)
y <- c(-7, -8, -10, -45)
x + y
```

```
## [1] 0 0 0 0
```

``` r
x * y
```

```
## [1]   -49   -64  -100 -2025
```

``` r
x^c(1, 0, -1, 0.5)
```

```
## [1] 7.000000 1.000000 0.100000 6.708204
```

---
## Vector arithmetic

R will implicitly coerce the types of vectors to be compatible.  E.g.:

``` r
TRUE + 4
```

```
## [1] 5
```
---
## Recycling
- R will also implicitly coerce the length of vectors.

- This is called vector **recycling**:

- When a shorter vector is combined with a longer one, elements of the shorter vector are repeated or recycled, to make it the same length as the longer vector.

``` r
x <- c(7, 8, 10, 45)
x + c(-7, -8)
```

```
## [1]  0  0  3 37
```

Single numbers are vectors of length 1 for purposes of recycling:

``` r
2*x
```

```
## [1] 14 16 20 90
```

---
## Exercises
What do we expect?

``` r
myVec <- c(4, 6, 12, 2, 0)
myVec + 1:5
myVec + 1 # recycling
myVec * 5
rep(FALSE, 4) + 1:4 # Coercion
```

---
## Vectorized functions
Most built-in functions are **vectorized**, meaning that they will operate on a vector of numbers:

``` r
sample(1:10) + 100
```

```
##  [1] 110 106 105 104 101 108 102 107 109 103
```

<small>(what does `sample()` do?)</small>

``` r
x
```

```
## [1]  7  8 10 45
```

``` r
x > 9 # pairwise comparisons, where the scalar 9 is recycled
```

```
## [1] FALSE FALSE  TRUE  TRUE
```

---
## Vectorized functions

Lots of functions take vectors as arguments:

- `mean()`, `median()`, `sd()`, `var()`, `max()`, `min()`, `length()`, `sum()`: return single numbers

- `sort()` returns a new vector

- `hist()` takes a vector of numbers and produces a histogram

- `summary()` gives a five-number summary of numerical vectors

- `any()` and `all()` are useful on Boolean vectors

---
## Comparison operators
Boolean operators work elementwise:

``` r
x
```

```
## [1]  7  8 10 45
```

``` r
(x > 9) & (x < 20)
```

```
## [1] FALSE FALSE  TRUE FALSE
```

To get the number of components that satisfy a certain condition:

``` r
sum(x > 9) # another example of coercion
```

```
## [1] 2
```

---
## Comparison operators

To compare whole vectors, best to use `identical()` or `all.equal()`:

``` r
x; y
```

```
## [1]  7  8 10 45
```

```
## [1]  -7  -8 -10 -45
```

``` r
x == -y
```

```
## [1] TRUE TRUE TRUE TRUE
```

``` r
identical(x, -y)
```

```
## [1] TRUE
```

``` r
all.equal(x, -y)
```

```
## [1] TRUE
```

---

``` r
identical(c(0.5 - 0.3, 0.3 - 0.1), c(0.3 - 0.1, 0.5 - 0.3))
```

```
## [1] FALSE
```

``` r
all.equal(c(0.5 - 0.3, 0.3 - 0.1), c(0.3 - 0.1, 0.5 - 0.3))
```

```
## [1] TRUE
```

To see what's going on:

``` r
print(.5 - .3, digits = 20)
```

```
## [1] 0.2000000000000000111
```

``` r
print(.3 - .1, digits = 20)
```

```
## [1] 0.19999999999999998335
```

Decimal numbers are not represented exactly in computer arithmetic. For more, see https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal.
---
## Combining indexing and comparison operators

``` r
x; y
```

```
## [1]  7  8 10 45
```

```
## [1]  -7  -8 -10 -45
```

``` r
x > 9
```

```
## [1] FALSE FALSE  TRUE  TRUE
```

``` r
x[x > 9]
```

```
## [1] 10 45
```

``` r
y[x > 9]
```

```
## [1] -10 -45
```

---
## Combining indexing and comparison operators
`which()` turns a Boolean vector into a vector of TRUE indices:

``` r
x; y
```

```
## [1]  7  8 10 45
```

```
## [1]  -7  -8 -10 -45
```

``` r
(places <- which(x > 9))
```

```
## [1] 3 4
```

``` r
y[places]
```

```
## [1] -10 -45
```

---
Note that the behavior of `which()` and the indexing operator is different when there are `NA` values:

``` r
myNAvec <- c(1, 2, NA)
which(myNAvec > 1)
```

```
## [1] 2
```

``` r
myNAvec > 1
```

```
## [1] FALSE  TRUE    NA
```

``` r
myNAvec[which(myNAvec > 1)]
```

```
## [1] 2
```

``` r
myNAvec[myNAvec > 1]
```

```
## [1]  2 NA
```

---
## Aside: counting the number of missing values

``` r
myNAvec 
```

```
## [1]  1  2 NA
```

``` r
is.na(myNAvec)
```

```
## [1] FALSE FALSE  TRUE
```

``` r
sum(is.na(myNAvec))
```

```
## [1] 1
```

---
## Exercises: comparison operators
What do we expect?

``` r
myVec <- c(4, 6, 12, 2, 0)
myVec < 10
(myVec < 10) | (myVec > 1)
myVec + 1 # recycling
myVec * 5
```

---
## Exercises

Get familiar with a few built-in R functions:

``` r
myVec <- c(4, 6, 12, 2, 0)
mean(myVec)
median(myVec)
table(myVec)
summary(myVec)
sort(myVec)
```

---
## Named components

You can give names to elements or components of vectors

.small[

``` r
x
```

```
## [1]  7  8 10 45
```

``` r
(names(x) <- c("v1", "v2", "v3", "fred"))
```

```
## [1] "v1"   "v2"   "v3"   "fred"
```

``` r
x[c("fred", "v1")]
```

```
## fred   v1 
##   45    7
```

``` r
x[c(4, 1)]
```

```
## fred   v1 
##   45    7
```
]

Note the labels are in what R prints; not actually part of the value

---
`names(x)` is just another vector (of characters):

``` r
names(y) <- names(x)
sort(names(x))
```

```
## [1] "fred" "v1"   "v2"   "v3"
```

``` r
which(names(x) == "fred")
```

```
## [1] 4
```

---
## Arrays

- Many data structures in R are made by adding bells and whistles to vectors

- **Arrays** are vectors with *dimensions*

- For example a two-dimensional array:

``` r
x <- c(7, 8, 10, 45)
x.arr <- array(x, dim = c(2, 2))
x.arr
```

```
##      [,1] [,2]
## [1,]    7   10
## [2,]    8   45
```

---
## Arrays

- Filled column-wise (by columns)

- `dim` says how many rows and columns

``` r
dim(x.arr)
```

```
## [1] 2 2
```
---

## Arrays with more than two dimensions
- Arrays can have three dimensions ( `$r \times c \times h$`; think about stacking `$r \times c$` matrices)

- Can also have `$4, 5, \ldots n$` dimensional arrays

- `dim` is a length `$n$` vector
  - Says the size/number of indices of each component
  - e.g., a `$4 \times 3$` array has `dim` length 2, elements are 4 and 3

.small[

``` r
myArr <- array(1:12, dim = c(4, 3))
myArr
```

```
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
```

``` r
dim(myArr)
```

```
## [1] 4 3
```
]

---

## Arrays with more than two dimensions
- A `$4 \times 3 \times 2$` array has `dim` length 3, elements are 4, 3 and 2

``` r
myArr <- array(1:24, dim = c(4, 3, 2))
myArr
```

```
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   13   17   21
## [2,]   14   18   22
## [3,]   15   19   23
## [4,]   16   20   24
```

---

## Arrays with more than two dimensions
- A `$4 \times 3 \times 2$` array has `dim` length 3, elements are 4, 3 and 2

``` r
dim(myArr)
```

```
## [1] 4 3 2
```

Some other properties of the array:

``` r
is.vector(myArr)
```

```
## [1] FALSE
```

``` r
is.array(myArr)
```

```
## [1] TRUE
```

---

``` r
typeof(myArr)
```

```
## [1] "integer"
```

``` r
str(myArr)
```

```
##  int [1:4, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
```

``` r
attributes(myArr)
```

```
## $dim
## [1] 4 3 2
```
`typeof()` returns the type of the _elements_

`str()` gives the **structure**: here, a numeric array, with three dimensions, size/indices, and then the actual numbers

---
## Summary

--
- Vectors: 
  - Additional attributes: Factors, dates
  - Vector arithmetic
  - Other functions on vectors
    - Comparison operators 
    - Indexing operators
  - Named components

- Arrays
  - Vectors with dimension