class: center, middle, inverse, title-slide

.title[
# Fundamentals of R: Data Manipulation
]
.subtitle[
## <br><br> STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 7, 2024
]

---

layout: true

---

## Today
- Exploratory data analysis
  - Visualization examples 
  
- Data manipulation tools

---

## Data frames, data sets

- We've seen data frames. A data frame is a commonly used data structure, and it is what we typically get after reading a data set into R.

- In general, in a data set:
  - Each row is an **observation** (there are `$n$` of them)
  - Each column is a **variable** (there are `$p$` of them)

- Often, the **first things we want to do** when given a data set are to figure out
  1. What is in it (what dimensions, what variables)
  2. What the main characteristics of the variables are.

- We've seen a few tools and functions for working with data frames in "base R." Next, we will look at some tools from `dplyr`.
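
As a quick refresher, the base R tools mentioned above can answer both questions for any data frame. The sketch below uses a small made-up data frame; `toy` and its values are hypothetical and chosen only for illustration.

``` r
# A tiny made-up data frame: 4 observations (rows), 3 variables (columns)
toy <- data.frame(
  name   = c("a", "b", "c", "d"),
  height = c(172, 167, 96, 202),
  mass   = c(77, 75, 32, 136)
)

dim(toy)      # n = 4 rows, p = 3 columns
names(toy)    # variable names
str(toy)      # type and first values of each variable
summary(toy)  # basic summaries of each variable
```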

---
<img src="img/tidyverse.png" width="100%" style="display: block; margin: auto;" />
https://www.tidyverse.org/
- What we've seen so far: "base R"
- `ggplot2` for plotting, `dplyr` for data manipulation

---
## First question: What's in a data set?

### Example: Star Wars data

- `starwars` data set in the `dplyr` package

``` r
dplyr::starwars
```

```
## # A tibble: 87 × 14
##   name    height  mass hair_color skin_color eye_color birth_year
##   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl>
## 1 Luke S…    172    77 blond      fair       blue            19  
## 2 C-3PO      167    75 <NA>       gold       yellow         112  
## 3 R2-D2       96    32 <NA>       white, bl… red             33  
## 4 Darth …    202   136 none       white      yellow          41.9
## 5 Leia O…    150    49 brown      light      brown           19  
## 6 Owen L…    178   120 brown, gr… light      blue            52  
## # ℹ 81 more rows
## # ℹ 7 more variables: sex <chr>, gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>,
## #   starships <list>
```

(A `tibble` is the `tidyverse` version of the data frame.)

---

We've seen `str()`. `dplyr::glimpse()` produces cleaner output in this case:

``` r
dplyr::glimpse(starwars)
```

```
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth V…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 1…
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, …
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, gr…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "lig…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", …
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, N…
## $ sex        <chr> "male", "none", "none", "male", "female", "m…
## $ gender     <chr> "masculine", "masculine", "masculine", "masc…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine",…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human",…
## $ films      <list> <"A New Hope", "The Empire Strikes Back", "…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <…
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TI…
```

---

How many rows and columns does this data set have? What does each row represent? What does each column represent?

``` r
?starwars
```

---

How many rows and columns does this data set have?

``` r
nrow(starwars) # number of rows
```

```
## [1] 87
```

``` r
ncol(starwars) # number of columns
```

```
## [1] 14
```

``` r
dim(starwars)  # dimensions (rows, columns)
```

```
## [1] 87 14
```

As we've seen, columns (variables) in data frames can be accessed with `$`:

``` r
dataframe$var_name
```
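
For instance, with the `starwars` data (assuming `dplyr` is installed so that the data set is available), `$` pulls out one column as a plain vector that can be passed to other functions:

``` r
library(dplyr)  # provides the starwars data set

heights <- starwars$height   # extract the height column as a vector
class(heights)               # "integer"
length(heights)              # 87, one value per row
mean(heights, na.rm = TRUE)  # average height, ignoring missing values
```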

---

## Second question: What are the main characteristics of the data?

**Exploratory data analysis** (EDA) is an approach to summarizing the **main characteristics** of a data set.

---

##  Exploratory data analysis

- Often, this is **visual**

- We might also calculate **summary statistics**, e.g., mean, median

- We might also **manipulate or transform** the data before visualizing or calculating summary statistics
  - e.g., filter out certain values, group a continuous variable into buckets, apply a log transformation

- We will first introduce **visual summaries** and tools for data manipulation, then talk about **numerical summaries**.

- We saw a visualization example in the first lecture. Here are a few more.
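
A minimal base R sketch of two of the transformations mentioned above, using the six `mass` values printed earlier for the first rows of `starwars`. The cut points and labels are arbitrary choices for illustration.

``` r
mass <- c(77, 75, 32, 136, 49, 120)  # first six mass values shown earlier

log_mass <- log(mass)  # natural log transformation

# Group the continuous variable into buckets: (0, 50], (50, 100], (100, Inf]
mass_group <- cut(
  mass,
  breaks = c(0, 50, 100, Inf),
  labels = c("light", "medium", "heavy")
)
table(mass_group)  # two values fall in each bucket
```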

---

## Visualization example 1: Mass vs. height in Star Wars data

How would you describe the **relationship** between mass and height of Star Wars characters?
What other variables would help us understand data points that don't follow the **overall trend**?
Who is the not-so-tall but much heavier character?
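
A plot like the one discussed here could be produced along the following lines. This is a sketch assuming `ggplot2` and `dplyr` are installed; the labels are illustrative, and it may differ from the code behind the original figure.

``` r
library(dplyr)    # for the starwars data
library(ggplot2)

p <- ggplot(starwars, aes(x = height, y = mass)) +
  geom_point() +
  labs(
    x = "Height (cm)",
    y = "Mass (kg)",
    title = "Mass vs. height of Star Wars characters"
  )
p  # rows with missing height or mass are dropped with a warning
```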

---

## Jabba!
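
One way to check this in the data, using `dplyr` functions that are introduced later in this lecture (a sketch, not necessarily how the slide was made):

``` r
library(dplyr)

heaviest <- starwars %>%
  arrange(desc(mass)) %>%      # heaviest first; missing values go last
  slice(1) %>%                 # keep the top row
  select(name, height, mass)
heaviest
```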

---

## Visualization example 2: Anscombe's quartet

.small[
.pull-left[

```
##    set  x     y
## 1    I 10  8.04
## 2    I  8  6.95
## 3    I 13  7.58
## 4    I  9  8.81
## 5    I 11  8.33
## 6    I 14  9.96
## 7    I  6  7.24
## 8    I  4  4.26
## 9    I 12 10.84
## 10   I  7  4.82
## 11   I  5  5.68
## 12  II 10  9.14
## 13  II  8  8.14
## 14  II 13  8.74
## 15  II  9  8.77
## 16  II 11  9.26
## 17  II 14  8.10
## 18  II  6  6.13
## 19  II  4  3.10
## 20  II 12  9.13
## 21  II  7  7.26
## 22  II  5  4.74
```
] 
.pull-right[

```
##    set  x     y
## 23 III 10  7.46
## 24 III  8  6.77
## 25 III 13 12.74
## 26 III  9  7.11
## 27 III 11  7.81
## 28 III 14  8.84
## 29 III  6  6.08
## 30 III  4  5.39
## 31 III 12  8.15
## 32 III  7  6.42
## 33 III  5  5.73
## 34  IV  8  6.58
## 35  IV  8  5.76
## 36  IV  8  7.71
## 37  IV  8  8.84
## 38  IV  8  8.47
## 39  IV  8  7.04
## 40  IV  8  5.25
## 41  IV 19 12.50
## 42  IV  8  5.56
## 43  IV  8  7.91
## 44  IV  8  6.89
```
]
]
---

## Summary statistics are identical

``` r
Tmisc::quartet %>%
  group_by(set) %>%
  summarize(
    mean_x = mean(x), 
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    r = cor(x, y)
  )
```

```
## # A tibble: 4 × 6
##   set   mean_x mean_y  sd_x  sd_y     r
##   <fct>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 I          9   7.50  3.32  2.03 0.816
## 2 II         9   7.50  3.32  2.03 0.816
## 3 III        9   7.5   3.32  2.03 0.816
## 4 IV         9   7.50  3.32  2.03 0.817
```

(Don't worry if you don't know what a standard deviation or correlation is; we will come back to this)

---

## Visualizing Anscombe's quartet

---

## Visualization example 3: Facebook visits
.question[ 
How are people reporting lower vs. higher values of FB visits?
]

---
## Data manipulation using `dplyr`

.pull-left[
<img src="img/dplyr-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" />
]
.pull-right[
.midi[
- `select`: pick columns by name
- `arrange`: reorder rows
- `slice`: pick rows using index(es)
- `filter`: pick rows matching criteria
- `distinct`: filter for unique rows
- `mutate`: add new variables
- `summarize`: reduce variables to values
- `group_by`: for grouped operations
- ... (many more)
]
]

As we go over the examples, think about how you would do these in base R.

---

## Rules of `dplyr` functions

- **First argument** is always a data frame

- Subsequent arguments say **what to do** with that data frame

- Always **return a data frame**

- **Don't modify in place**
  - Meaning that you need an assignment operation if you want an "updated" version of the data frame
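
A small sketch of the last rule, using a made-up tibble (`df` and `df_sorted` are hypothetical names):

``` r
library(dplyr)

df <- tibble(x = c(3, 1, 2))

arrange(df, x)   # returns a sorted copy, which is printed and then discarded
df$x             # still 3 1 2: the original is unchanged

df_sorted <- arrange(df, x)  # assign the result to keep it
df_sorted$x                  # 1 2 3
```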

---

## Data: Hotel bookings

- Data from two hotels: one resort and one city hotel

- **Observations**: Each **row** represents a hotel booking

- **Goal** of the original data collection: Development of prediction models to classify a hotel booking's likelihood of being cancelled ([Antonio et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5))

``` r
hotels <- readr::read_csv("data/hotels.csv")
```

.footnote[
Source: [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)
]

---

## First question: What is in the data set?

.tiny[

``` r
dplyr::glimpse(hotels)
```

```
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort …
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14,…
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, …
## $ arrival_date_month             <chr> "July", "July", "July", …
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, …
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, …
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, …
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <chr> "BB", "BB", "BB", "BB", …
## $ country                        <chr> "PRT", "PRT", "GBR", "GB…
## $ market_segment                 <chr> "Direct", "Direct", "Dir…
## $ distribution_channel           <chr> "Direct", "Direct", "Dir…
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A",…
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A",…
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, …
## $ deposit_type                   <chr> "No Deposit", "No Deposi…
## $ agent                          <chr> "NULL", "NULL", "NULL", …
## $ company                        <chr> "NULL", "NULL", "NULL", …
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type                  <chr> "Transient", "Transient"…
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00…
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, …
## $ reservation_status             <chr> "Check-Out", "Check-Out"…
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01,…
```
]
---

## `select()`: Select a single column

View only `lead_time` (number of days between booking and arrival date):

``` r
select(hotels, lead_time)
```

```
## # A tibble: 119,390 × 1
##   lead_time
##       <dbl>
## 1       342
## 2       737
## 3         7
## 4        13
## 5        14
## 6        14
## # ℹ 119,384 more rows
```

- **First argument**: data frame we're working with, `hotels`
- **Second argument**: variable we want to select, `lead_time`
- **Result**: data frame with 119390 rows and 1 column
- This is similar to `hotels$lead_time`, except that `$` returns a vector rather than a data frame
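
The difference between `select()` and `$` can be seen on a small example. The tibble `bookings` below is hypothetical toy data standing in for `hotels`.

``` r
library(dplyr)

# Hypothetical toy data standing in for `hotels`
bookings <- tibble(
  hotel     = c("Resort Hotel", "Resort Hotel", "City Hotel"),
  lead_time = c(342, 737, 6)
)

class(select(bookings, lead_time))  # a tibble (data frame) with one column
class(bookings$lead_time)           # "numeric": a plain vector
pull(bookings, lead_time)           # dplyr's equivalent of `$`
```
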
---

## Select multiple columns

View only the `hotel` type and `lead_time` columns:

``` r
select(hotels, hotel, lead_time)
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel       342
## 2 Resort Hotel       737
## 3 Resort Hotel         7
## 4 Resort Hotel        13
## 5 Resort Hotel        14
## 6 Resort Hotel        14
## # ℹ 119,384 more rows
```

---
## `select()` to exclude variables

- We saw earlier that `select()` keeps variables
- `select()` can also exclude variables, using the `-` sign

.small[

``` r
hotels %>%
  select(-agent) 
```

```
## # A tibble: 119,390 × 31
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## 6 Resort Hotel           0        14              2015
## # ℹ 119,384 more rows
## # ℹ 27 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>, …
```
]

---

## `select()` a range of variables

- Instead of writing out all the variable names, `select()` also accepts a **range of variables**

- This follows the order they are listed in the data frame

``` r
hotels %>%
  select(hotel:arrival_date_month) 
```

```
## # A tibble: 119,390 × 5
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## 6 Resort Hotel           0        14              2015
## # ℹ 119,384 more rows
## # ℹ 1 more variable: arrival_date_month <chr>
```

---

## `select()` variables with certain characteristics

``` r
hotels %>%
  select(starts_with("arrival"))
```

```
## # A tibble: 119,390 × 4
##   arrival_date_year arrival_date_month arrival_date_week_number
##               <dbl> <chr>                                 <dbl>
## 1              2015 July                                     27
## 2              2015 July                                     27
## 3              2015 July                                     27
## 4              2015 July                                     27
## 5              2015 July                                     27
## 6              2015 July                                     27
## # ℹ 119,384 more rows
## # ℹ 1 more variable: arrival_date_day_of_month <dbl>
```

---

## `select()` variables with certain characteristics

``` r
hotels %>%
  select(ends_with("type")) 
```

```
## # A tibble: 119,390 × 4
##   reserved_room_type assigned_room_type deposit_type
##   <chr>              <chr>              <chr>       
## 1 C                  C                  No Deposit  
## 2 C                  C                  No Deposit  
## 3 A                  C                  No Deposit  
## 4 A                  A                  No Deposit  
## 5 A                  A                  No Deposit  
## 6 A                  A                  No Deposit  
## # ℹ 119,384 more rows
## # ℹ 1 more variable: customer_type <chr>
```

---

## Select helpers

- `starts_with()`: Starts with a prefix
- `ends_with()`: Ends with a suffix
- `contains()`: Contains a literal string
- `num_range()`: Matches a numerical range like x01, x02, x03
- `one_of()`: Matches variable names in a character vector (superseded by `any_of()` and `all_of()` in recent versions)
- `everything()`: Matches all variables
- `last_col()`: Select last variable, possibly with an offset
- `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)

.footnote[
See help for any of these functions for more info, e.g. `?everything`.
]
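
A sketch of a few of these helpers in action. The tibble `bookings` is hypothetical toy data that borrows some column names from `hotels`.

``` r
library(dplyr)

# Hypothetical toy data with a few of the `hotels` column names
bookings <- tibble(
  hotel              = c("Resort Hotel", "City Hotel"),
  arrival_date_year  = c(2015, 2016),
  arrival_date_month = c("July", "August"),
  deposit_type       = c("No Deposit", "No Deposit"),
  customer_type      = c("Transient", "Contract")
)

bookings %>% select(contains("date"))             # arrival_date_year, arrival_date_month
bookings %>% select(matches("_(year|month)$"))    # same two columns, via a regular expression
bookings %>% select(last_col())                   # customer_type
bookings %>% select(customer_type, everything())  # move customer_type to the front
```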

---

## `select()`, then `arrange()`

What if we wanted to select the `hotel` and `lead_time` columns, and then arrange the rows in order of lead time?

``` r
hotels %>%
  select(hotel, lead_time) %>%
  arrange(lead_time)
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel         0
## 2 Resort Hotel         0
## 3 Resort Hotel         0
## 4 Resort Hotel         0
## 5 Resort Hotel         0
## 6 Resort Hotel         0
## # ℹ 119,384 more rows
```

---

## Pipes

In programming, a pipe is a technique for **passing information from one process to another**. In R, the pipe operator is `%>%`; R 4.1 and later also have a native pipe, `|>`.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
]
.pull-right[
.small[

``` r
hotels %>% 
  select(hotel, lead_time) %>%
  arrange(lead_time)
```
]
]

---

## Pipes

In programming, a pipe is a technique for **passing information from one process to another**. In R, the pipe operator is `%>%`; R 4.1 and later also have a native pipe, `|>`.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
- then we select the variables `hotel` and `lead_time`,
]
.pull-right[
.small[

``` r
hotels %>%
  select(hotel, lead_time) %>% #<<
  arrange(lead_time)
```
]
]

---

## Pipes

In programming, a pipe is a technique for **passing information from one process to another**. In R, the pipe operator is `%>%`; R 4.1 and later also have a native pipe, `|>`.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
- then we select the variables `hotel` and `lead_time`,
- and then we arrange the data frame by `lead_time`.
]
.pull-right[
.small[

``` r
hotels %>%
  select(hotel, lead_time) %>% 
  arrange(lead_time) #<<
```
]
]

Note that the `%>%` operator is implemented in the `magrittr` package, but it becomes available automatically when we use `library(dplyr)` or `library(tidyverse)`.

---

## How does a pipe work?

- You can think about the following **sequence of actions** - find keys, start car, drive to work, park.

- Expressed as a set of **nested functions** in R pseudocode this would look like:

``` r
park(drive(start_car(find("keys")), to = "work"))
```

- Writing it out using pipes gives it a more natural (and easier to read) structure:

``` r
find("keys") %>%
  start_car() %>%
  drive(to = "work") %>%
  park()
```

---
## Simple example
- We can write `exp(1)` with pipes as `1 %>% exp`, and `log(exp(1))` as `1 %>% exp %>% log`

``` r
exp(1)
```

```
## [1] 2.718282
```

``` r
1 %>% exp
```

```
## [1] 2.718282
```

``` r
1 %>% exp %>% log
```

```
## [1] 1
```

- Tidyverse functions are at their best when composed together using the pipe operator
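
The pipe inserts the left-hand side as the first argument of the function on the right, so any further arguments are written as usual. A minimal sketch (the variable names are made up, and the last line assumes R 4.1 or later for the native pipe):

``` r
library(magrittr)  # provides %>%; library(dplyr) also makes it available

nested   <- log(sqrt(16), base = 2)          # nested call: log base 2 of 4, i.e. 2
piped    <- 16 %>% sqrt() %>% log(base = 2)  # same computation with %>%
native   <- 16 |> sqrt() |> log(base = 2)    # same computation with the native pipe

c(nested, piped, native)
```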

---
## `arrange()` in ascending or descending order

- We saw earlier that `arrange()` defaults to ascending order

- For descending order, use `desc()`

.pull-left[

``` r
hotels %>%
  select(hotel, lead_time) %>% 
  arrange(lead_time)
```

``` r
hotels %>%
  select(hotel, lead_time) %>% 
  arrange(desc(lead_time))
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel       737
## 2 Resort Hotel       709
## 3 City Hotel         629
## 4 City Hotel         629
## 5 City Hotel         629
## 6 City Hotel         629
## # ℹ 119,384 more rows
```
]
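
`arrange()` also accepts several variables, with later ones breaking ties in earlier ones. A sketch on hypothetical toy data (`bookings` is made up for illustration):

``` r
library(dplyr)

# Hypothetical toy data standing in for `hotels`
bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(6, 342, 88, 7)
)

sorted <- bookings %>%
  arrange(hotel, desc(lead_time))  # by hotel (A to Z), then longest lead time first
sorted
```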

---

## `slice()` for certain row numbers

This is an alternative to indexing with `hotels[1:5, ]`

``` r
hotels %>%
  slice(1:5) 
```

```
## # A tibble: 5 × 32
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## # ℹ 28 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>, …
```

---

## Reminder: comments in R

- Any text following `#` will be printed as is, and won't be run as code

- This is useful for leaving comments and for temporarily disabling 
certain lines of code (for debugging, trying out different things)

.tiny[

``` r
hotels %>%
  # slice the first five rows  # this line is a comment
  #select(hotel) %>%           # this one doesn't run
  slice(1:5)                   # this line runs
```
]

---

## `filter()` to select a subset of rows

.small[

``` r
# bookings in City Hotels
hotels %>%
  filter(hotel == "City Hotel") 
```

```
## # A tibble: 79,330 × 32
##   hotel      is_canceled lead_time arrival_date_year
##   <chr>            <dbl>     <dbl>             <dbl>
## 1 City Hotel           0         6              2015
## 2 City Hotel           1        88              2015
## 3 City Hotel           1        65              2015
## 4 City Hotel           1        92              2015
## 5 City Hotel           1       100              2015
## 6 City Hotel           1        79              2015
## # ℹ 79,324 more rows
## # ℹ 28 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>, …
```
]

<small>What was the base R alternative that we saw?</small>

---

## `filter()` for many conditions at once

``` r
hotels %>%
  filter( 
    adults == 0,     
    children >= 1    
    ) %>% 
  select(adults, babies, children)
```

```
## # A tibble: 223 × 3
##   adults babies children
##    <dbl>  <dbl>    <dbl>
## 1      0      0        3
## 2      0      0        2
## 3      0      0        2
## 4      0      0        2
## 5      0      0        2
## 6      0      0        3
## # ℹ 217 more rows
```

---

## `filter()` for more complex conditions

``` r
# bookings with no adults and some children or babies in the room
hotels %>%
  filter( 
    adults == 0,     
    children >= 1 | babies >= 1   
    ) %>%
  select(adults, babies, children)
```
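
To see how these conditions combine, here is a sketch on a small made-up tibble (`guests` and its values are hypothetical). Conditions separated by commas are combined with AND, and rows where a condition evaluates to `NA` are dropped.

``` r
library(dplyr)

guests <- tibble(
  adults   = c(0, 0, 0, 2, 0),
  children = c(2, 0, 0, 1, NA),
  babies   = c(0, 1, 0, 0, 0)
)

# Comma-separated conditions are combined with AND, so these two are equivalent
and_comma <- guests %>% filter(adults == 0, children >= 1)
and_amp   <- guests %>% filter(adults == 0 & children >= 1)

# No adults AND (some children OR some babies)
and_or <- guests %>% filter(adults == 0, children >= 1 | babies >= 1)

and_comma  # row 1 only
and_or     # rows 1 and 2; row 5 is dropped because NA >= 1 | FALSE is NA
```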

---

## Reminder: Logical operators in R

<br>

operator    | definition                   || operator     | definition
------------|------------------------------||--------------|----------------
`<`         | less than                    ||`x`&nbsp;&#124;&nbsp;`y`     | `x` OR `y` 
`<=`        |	less than or equal to        ||`is.na(x)`    | test if `x` is `NA`
`>`         | greater than                 ||`!is.na(x)`   | test if `x` is not `NA`
`>=`        |	greater than or equal to     ||`x %in% y`    | test if `x` is in `y`
`==`        |	exactly equal to             ||`!(x %in% y)` | test if `x` is not in `y`
`!=`        |	not equal to                 ||`!x`          | not `x`
`x & y`     | `x` AND `y`                  ||              |
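
These operators work element by element on vectors, which is what makes them usable inside `filter()`. A short sketch on a made-up vector `x`, with the result of each line in the comment:

``` r
x <- c(1, NA, 3, 5)

x >= 3             # FALSE    NA  TRUE  TRUE
x != 3             #  TRUE    NA FALSE  TRUE
is.na(x)           # FALSE  TRUE FALSE FALSE
!is.na(x) & x > 1  # FALSE FALSE  TRUE  TRUE
x %in% c(3, 5)     # FALSE FALSE  TRUE  TRUE
!(x %in% c(3, 5))  #  TRUE  TRUE FALSE FALSE
```

Note that comparisons involving `NA` return `NA`, while `is.na()` and `%in%` always return `TRUE` or `FALSE`.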

---
## Summary
--

- Data manipulation tools

  - `select()`: selects columns by name

  - `arrange()`: reorders rows

  - `slice()`: selects rows using index(es)

  - `filter()`: selects rows matching criteria