Fundamentals of R: Data Manipulation

class: center, middle, inverse, title-slide

.title[
# Fundamentals of R: Data Manipulation
]
.subtitle[
## <br><br> STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 9, 2024
]

---

layout: true

---

## Reminders/Announcements

- HW 1 is due tomorrow at 9 PM

- HW 2 will be posted on Friday afternoon on the course website

- Schedule for next week (week 4):
  - Monday: regular lecture; lab due 9 PM
  - Wednesday: Oscar Rivera will do review during regular lecture time (same room)
  - Thursday: **no lab**, instead 12-1 PM OH (XHT; virtual, will post link on Piazza); 3-4 PM OH (OR); HW 2 due
  - Friday: midterm during regular lecture time (same room); **no homework**; 1-2 PM OH (OR)

---
## Midterm
- Midterm will cover material until Monday, Oct 14

- Closed-book

- You don't need your computers

- No make-up exams

- Drop policy for exams: 1 midterm may be dropped

---
## Today
- Data manipulation tools

---

## Data: Hotel bookings

- Data from two hotels: one resort and one city hotel

- **Observations**: Each **row** represents a hotel booking

- **Goal** for original data collection: Development of prediction models to classify a hotel booking's likelihood of being cancelled ([Antonio et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5))

``` r
hotels <- readr::read_csv("data/hotels.csv")
```

.footnote[
Source: [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)
]

---

## First question: What is in the data set?

.tiny[

``` r
dplyr::glimpse(hotels)
```

```
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort …
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14,…
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, …
## $ arrival_date_month             <chr> "July", "July", "July", …
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, …
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, …
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, …
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <chr> "BB", "BB", "BB", "BB", …
## $ country                        <chr> "PRT", "PRT", "GBR", "GB…
## $ market_segment                 <chr> "Direct", "Direct", "Dir…
## $ distribution_channel           <chr> "Direct", "Direct", "Dir…
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A",…
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A",…
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, …
## $ deposit_type                   <chr> "No Deposit", "No Deposi…
## $ agent                          <chr> "NULL", "NULL", "NULL", …
## $ company                        <chr> "NULL", "NULL", "NULL", …
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type                  <chr> "Transient", "Transient"…
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00…
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, …
## $ reservation_status             <chr> "Check-Out", "Check-Out"…
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01,…
```
]

---

## `select()`: Select a single column

View only `lead_time` (number of days between booking and arrival date):

``` r
select(hotels, lead_time)
```

```
## # A tibble: 119,390 × 1
##   lead_time
##       <dbl>
## 1       342
## 2       737
## 3         7
## 4        13
## 5        14
## 6        14
## # ℹ 119,384 more rows
```

- **First argument**: the data frame we're working with, `hotels`
- **Second argument**: the variable we want to select, `lead_time`
- **Result**: a data frame with 119,390 rows and 1 column
- This is similar to `hotels$lead_time`, except that `select()` returns a data frame rather than a vector
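
To see the difference in return types, here is a minimal sketch. It uses a small made-up tibble, `toy_hotels`, as a hypothetical stand-in for the real data, and also shows `pull()`, which extracts a column as a vector and works inside a pipeline.

``` r
library(dplyr)

# Hypothetical stand-in for the hotels data (values are made up)
toy_hotels <- tibble(
  hotel     = c("Resort Hotel", "City Hotel", "City Hotel"),
  lead_time = c(342, 7, 14)
)

as_df    <- select(toy_hotels, lead_time)  # 3 x 1 tibble
as_vec   <- toy_hotels$lead_time           # numeric vector of length 3
via_pull <- pull(toy_hotels, lead_time)    # same vector as `$`

is.data.frame(as_df)         # TRUE
is.data.frame(as_vec)        # FALSE
identical(as_vec, via_pull)  # TRUE
```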
---

## Select multiple columns

View only the `hotel` type and `lead_time` columns:

``` r
select(hotels, hotel, lead_time)
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel       342
## 2 Resort Hotel       737
## 3 Resort Hotel         7
## 4 Resort Hotel        13
## 5 Resort Hotel        14
## 6 Resort Hotel        14
## # ℹ 119,384 more rows
```

---
## `select()` to exclude variables

- We saw earlier that `select()` keeps variables
- `select()` can also exclude variables, using the `-` sign

.small[

``` r
hotels %>%
  select(-agent) 
```

```
## # A tibble: 119,390 × 31
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## 6 Resort Hotel           0        14              2015
## # ℹ 119,384 more rows
## # ℹ 27 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>, …
```
]

---

## `select()` a range of variables

- Instead of writing out all the variable names, `select()` also accepts a **range of variables**

- This follows the order they are listed in the data frame

``` r
hotels %>%
  select(hotel:arrival_date_month) 
```

```
## # A tibble: 119,390 × 5
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## 6 Resort Hotel           0        14              2015
## # ℹ 119,384 more rows
## # ℹ 1 more variable: arrival_date_month <chr>
```

---

## `select()` variables with certain characteristics

``` r
hotels %>%
  select(starts_with("arrival"))
```

```
## # A tibble: 119,390 × 4
##   arrival_date_year arrival_date_month arrival_date_week_number
##               <dbl> <chr>                                 <dbl>
## 1              2015 July                                     27
## 2              2015 July                                     27
## 3              2015 July                                     27
## 4              2015 July                                     27
## 5              2015 July                                     27
## 6              2015 July                                     27
## # ℹ 119,384 more rows
## # ℹ 1 more variable: arrival_date_day_of_month <dbl>
```

---

## `select()` variables with certain characteristics

``` r
hotels %>%
  select(ends_with("type")) 
```

```
## # A tibble: 119,390 × 4
##   reserved_room_type assigned_room_type deposit_type
##   <chr>              <chr>              <chr>       
## 1 C                  C                  No Deposit  
## 2 C                  C                  No Deposit  
## 3 A                  C                  No Deposit  
## 4 A                  A                  No Deposit  
## 5 A                  A                  No Deposit  
## 6 A                  A                  No Deposit  
## # ℹ 119,384 more rows
## # ℹ 1 more variable: customer_type <chr>
```

---

## Select helpers

- `starts_with()`: Starts with a prefix
- `ends_with()`: Ends with a suffix
- `contains()`: Contains a literal string
- `num_range()`: Matches a numerical range like x01, x02, x03
- `one_of()`: Matches variable names in a character vector (superseded by `any_of()` and `all_of()`)
- `everything()`: Matches all variables
- `last_col()`: Select last variable, possibly with an offset
- `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)

.footnote[
See help for any of these functions for more info, e.g. `?everything`.
]
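
A few of these helpers in action, as a minimal sketch. The tibble `toy` is hypothetical; its column names were chosen only to exercise the helpers, and `any_of()` is used in place of `one_of()`.

``` r
library(dplyr)

# Hypothetical one-row tibble; column names chosen to exercise the helpers
toy <- tibble(
  arrival_date_year  = 2015,
  arrival_date_month = "July",
  deposit_type       = "No Deposit",
  customer_type      = "Transient",
  x01 = 1,
  x02 = 2
)

names(select(toy, contains("date")))                # both arrival_date_* columns
names(select(toy, matches("^arrival.*year$")))      # regular expression: arrival_date_year
names(select(toy, num_range("x", 1:2, width = 2)))  # x01 and x02
names(select(toy, last_col()))                      # x02
# any_of() skips names that are not in the data instead of raising an error
names(select(toy, any_of(c("deposit_type", "not_a_column"))))
```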

---

## `select()`, then `arrange()`

What if we wanted to select these columns, and then arrange the data in order of lead time?

``` r
hotels %>%
  select(hotel, lead_time) %>%
  arrange(lead_time)
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel         0
## 2 Resort Hotel         0
## 3 Resort Hotel         0
## 4 Resort Hotel         0
## 5 Resort Hotel         0
## 6 Resort Hotel         0
## # ℹ 119,384 more rows
```

---

## Pipes

In programming, a pipe is a technique for **passing information from one process to another**. In R, the pipe operator is written `%>%`; R 4.1 and later also provide a native pipe, `|>`.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
]
.pull-right[
.small[

``` r
hotels %>% 
  select(hotel, lead_time) %>%
  arrange(lead_time)
```
]
]

---

## Pipes

In programming, a pipe is a technique for **passing information from one process to another**. In R, the pipe operator is written `%>%`; R 4.1 and later also provide a native pipe, `|>`.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
- then we select the variables `hotel` and `lead_time`,
]
.pull-right[
.small[

``` r
hotels %>%
  select(hotel, lead_time) %>% #<<
  arrange(lead_time)
```
]
]

---

## Pipes

In programming, a pipe is a technique for **passing information from one process to another**. In R, the pipe operator is written `%>%`; R 4.1 and later also provide a native pipe, `|>`.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
- then we select the variables `hotel` and `lead_time`,
- and then we arrange the data frame by `lead_time`.
]
.pull-right[
.small[

``` r
hotels %>%
  select(hotel, lead_time) %>% 
  arrange(lead_time) #<<
```
]
]

Note that the `%>%` operator is implemented in the `magrittr` package, but it becomes available automatically when we run `library(dplyr)` or `library(tidyverse)`.

---

## How does a pipe work?

- Consider the following **sequence of actions**: find keys, start car, drive to work, park.

- Expressed as a set of **nested functions** in R pseudocode this would look like:

``` r
park(drive(start_car(find("keys")), to = "work"))
```

- Writing it out using pipes gives it a more natural (and easier to read) structure:

``` r
find("keys") %>%
  start_car() %>%
  drive(to = "work") %>%
  park()
```

---
## Simple example
- We can write `exp(1)` with pipes as `1 %>% exp`, and `log(exp(1))` as `1 %>% exp %>% log`

``` r
exp(1)
```

```
## [1] 2.718282
```

``` r
1 %>% exp
```

```
## [1] 2.718282
```

``` r
1 %>% exp %>% log
```

```
## [1] 1
```

- Tidyverse functions are at their best when composed together using the pipe operator
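
The equivalence between nested calls and pipes can be checked directly. This sketch assumes R 4.1 or later so that the native pipe `|>` is available; unlike `%>%`, the native pipe requires parentheses after the function name.

``` r
library(dplyr)  # makes %>% available

nested <- log(exp(1))
piped  <- 1 %>% exp() %>% log()
native <- 1 |> exp() |> log()   # native pipe: `1 |> exp` would be an error

all.equal(nested, piped)   # TRUE
all.equal(nested, native)  # TRUE
```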

---
## `arrange()` in ascending or descending order

- We saw earlier that `arrange()` defaults to ascending order

- For descending order, use `desc()`

.pull-left[

``` r
hotels %>%
  select(hotel, lead_time) %>% 
  arrange(lead_time)
```

``` r
hotels %>%
  select(hotel, lead_time) %>% 
  arrange(desc(lead_time))
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel       737
## 2 Resort Hotel       709
## 3 City Hotel         629
## 4 City Hotel         629
## 5 City Hotel         629
## 6 City Hotel         629
## # ℹ 119,384 more rows
```
]
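
`arrange()` also accepts several sort keys, and `desc()` can be applied to each key separately. A minimal sketch on a made-up tibble (`toy_hotels` is hypothetical), sorting by hotel and then by longest lead time within each hotel:

``` r
library(dplyr)

# Hypothetical stand-in for the hotels data (values are made up)
toy_hotels <- tibble(
  hotel     = c("Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(342, 7, 629, 737)
)

sorted <- toy_hotels %>%
  arrange(hotel, desc(lead_time))  # hotel ascending, lead_time descending within hotel

sorted$lead_time  # 629 7 737 342
```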

---

## `slice()` for certain row numbers

This is an alternative to base R indexing with `hotels[1:5, ]`

``` r
hotels %>%
  slice(1:5) 
```

```
## # A tibble: 5 × 32
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## # ℹ 28 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>, …
```
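
A minimal sketch comparing `slice()` with base R indexing, using a made-up tibble (`toy_hotels` is hypothetical). It also shows that, as in base R, negative indices drop rows.

``` r
library(dplyr)

# Hypothetical stand-in for the hotels data (values are made up)
toy_hotels <- tibble(
  hotel     = c("Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel"),
  lead_time = c(342, 737, 7, 13)
)

first_two_dplyr <- toy_hotels %>% slice(1:2)  # rows 1 and 2
first_two_base  <- toy_hotels[1:2, ]          # same rows with base R indexing

all.equal(first_two_dplyr, first_two_base)

# Negative indices drop rows: everything except the first row
without_first <- toy_hotels %>% slice(-1)
nrow(without_first)  # 3
```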

---

## Reminder: comments in R

- Any text following `#` on a line is a comment: it is shown as is, and is not run as code

- This is useful for leaving comments and for temporarily disabling 
certain lines of code (for debugging, trying out different things)

.tiny[

``` r
hotels %>%
  # slice the first five rows  # this line is a comment
  #select(hotel) %>%           # this one doesn't run
  slice(1:5)                   # this line runs
```
]

---

## `filter()` to select a subset of rows

.small[

``` r
# bookings in City Hotels
hotels %>%
  filter(hotel == "City Hotel") 
```

```
## # A tibble: 79,330 × 32
##   hotel      is_canceled lead_time arrival_date_year
##   <chr>            <dbl>     <dbl>             <dbl>
## 1 City Hotel           0         6              2015
## 2 City Hotel           1        88              2015
## 3 City Hotel           1        65              2015
## 4 City Hotel           1        92              2015
## 5 City Hotel           1       100              2015
## 6 City Hotel           1        79              2015
## # ℹ 79,324 more rows
## # ℹ 28 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>, …
```
]

<small>What was the base R alternative that we saw?</small>

---

## `filter()` for many conditions at once

``` r
hotels %>%
  filter( 
    adults == 0,     
    children >= 1    
    ) %>% 
  select(adults, babies, children)
```

```
## # A tibble: 223 × 3
##   adults babies children
##    <dbl>  <dbl>    <dbl>
## 1      0      0        3
## 2      0      0        2
## 3      0      0        2
## 4      0      0        2
## 5      0      0        2
## 6      0      0        3
## # ℹ 217 more rows
```

---

## `filter()` for more complex conditions

``` r
# bookings with no adults and some children or babies in the room
hotels %>%
  filter( 
    adults == 0,     
    children >= 1 | babies >= 1   
    ) %>%
  select(adults, babies, children)
```
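
Conditions separated by commas in `filter()` are combined with AND. The sketch below, on a made-up tibble (`toy` is hypothetical), shows the same filter written as a single expression; the parentheses are needed there because `&` binds more tightly than `|`.

``` r
library(dplyr)

# Hypothetical bookings (values are made up)
toy <- tibble(
  adults   = c(0, 0, 2, 0),
  children = c(2, 0, 1, 0),
  babies   = c(0, 1, 0, 0)
)

# Comma-separated conditions are combined with AND
with_commas <- toy %>%
  filter(adults == 0, children >= 1 | babies >= 1)

# The same filter as one expression
with_and <- toy %>%
  filter(adults == 0 & (children >= 1 | babies >= 1))

all.equal(with_commas, with_and)
nrow(with_commas)  # 2: the first two rows are kept
```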

---

## Reminder: Logical operators in R

<br>

operator    | definition                   || operator     | definition
------------|------------------------------||--------------|----------------
`<`         | less than                    ||`x`&nbsp;&#124;&nbsp;`y`     | `x` OR `y` 
`<=`        |	less than or equal to        ||`is.na(x)`    | test if `x` is `NA`
`>`         | greater than                 ||`!is.na(x)`   | test if `x` is not `NA`
`>=`        |	greater than or equal to     ||`x %in% y`    | test if `x` is in `y`
`==`        |	exactly equal to             ||`!(x %in% y)` | test if `x` is not in `y`
`!=`        |	not equal to                 ||`!x`          | not `x`
`x & y`     | `x` AND `y`                  ||              |
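
A few of these operators inside `filter()`, as a minimal sketch on a made-up tibble (`toy` is hypothetical). Note that `filter()` keeps only rows where the condition is `TRUE`, so rows where a comparison evaluates to `NA` are dropped.

``` r
library(dplyr)

# Hypothetical bookings (values are made up); one missing value in children
toy <- tibble(
  country  = c("PRT", "GBR", "USA", "ESP"),
  children = c(0, NA, 2, 1)
)

toy %>% filter(country %in% c("PRT", "ESP"))     # PRT and ESP rows
toy %>% filter(!(country %in% c("PRT", "ESP")))  # GBR and USA rows
toy %>% filter(!is.na(children))                 # drops the GBR row
toy %>% filter(children >= 1)                    # USA and ESP; the NA row is dropped too
```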

---
## `distinct()` to filter for unique rows

.small[
.pull-left[

``` r
hotels %>% 
  distinct(market_segment) 
```

```
## # A tibble: 8 × 1
##   market_segment
##   <chr>         
## 1 Direct        
## 2 Corporate     
## 3 Online TA     
## 4 Offline TA/TO 
## 5 Complementary 
## 6 Groups        
## # ℹ 2 more rows
```
]

.pull-left[
Recall: `arrange()` to order alphabetically

``` r
hotels %>% 
  distinct(market_segment) %>%
  arrange(market_segment)
```

```
## # A tibble: 8 × 1
##   market_segment
##   <chr>         
## 1 Aviation      
## 2 Complementary 
## 3 Corporate     
## 4 Direct        
## 5 Groups        
## 6 Offline TA/TO 
## # ℹ 2 more rows
```
]
]

---
## `distinct()` using more than one variable

``` r
hotels %>% 
  distinct(hotel, market_segment) %>% #<<
  arrange(hotel, market_segment)
```

```
## # A tibble: 14 × 2
##   hotel      market_segment
##   <chr>      <chr>         
## 1 City Hotel Aviation      
## 2 City Hotel Complementary 
## 3 City Hotel Corporate     
## 4 City Hotel Direct        
## 5 City Hotel Groups        
## 6 City Hotel Offline TA/TO 
## # ℹ 8 more rows
```

---
## `mutate()` to add a new variable

``` r
hotels %>%
  mutate(little_ones = children + babies) %>% 
  select(children, babies, little_ones) %>%
  arrange(desc(little_ones))
```

```
## # A tibble: 119,390 × 3
##   children babies little_ones
##      <dbl>  <dbl>       <dbl>
## 1       10      0          10
## 2        0     10          10
## 3        0      9           9
## 4        2      1           3
## 5        2      1           3
## 6        2      1           3
## # ℹ 119,384 more rows
```

<small>What are these functions doing? How would you do the same in base R?</small>
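
One point worth checking: `mutate()` returns a new data frame and leaves the original unchanged unless the result is assigned. A minimal sketch on a made-up tibble (`toy` is hypothetical), followed by one possible base R approach:

``` r
library(dplyr)

# Hypothetical bookings (values are made up)
toy <- tibble(children = c(0, 2, 1), babies = c(0, 1, 0))

# mutate() returns a new data frame; `toy` itself is unchanged
with_total <- toy %>%
  mutate(little_ones = children + babies)

names(toy)         # "children" "babies"
names(with_total)  # "children" "babies" "little_ones"

# One possible base R approach: assign a new column on a copy
toy_base <- toy
toy_base$little_ones <- toy_base$children + toy_base$babies
```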

---

## `count()` to create frequency tables

.pull-left[

``` r
# alphabetical order by default
hotels %>%
  count(market_segment) #<<
```

```
## # A tibble: 8 × 2
##   market_segment     n
##   <chr>          <int>
## 1 Aviation         237
## 2 Complementary    743
## 3 Corporate       5295
## 4 Direct         12606
## 5 Groups         19811
## 6 Offline TA/TO  24219
## # ℹ 2 more rows
```
]

.pull-right[

``` r
# descending frequency order
hotels %>%
  count(market_segment, 
        sort = TRUE) #<<
```

```
## # A tibble: 8 × 2
##   market_segment     n
##   <chr>          <int>
## 1 Online TA      56477
## 2 Offline TA/TO  24219
## 3 Groups         19811
## 4 Direct         12606
## 5 Corporate       5295
## 6 Complementary    743
## # ℹ 2 more rows
```
]

- Base R version: `table()`
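
A minimal side-by-side sketch on a made-up tibble (`toy` is hypothetical): `count()` returns a tibble with a column `n`, while `table()` returns a named table object that can be sorted and converted to a data frame.

``` r
library(dplyr)

# Hypothetical bookings (values are made up)
toy <- tibble(
  market_segment = c("Direct", "Groups", "Direct", "Online TA", "Direct")
)

counts_dplyr <- toy %>% count(market_segment, sort = TRUE)  # tibble: market_segment, n
counts_base  <- table(toy$market_segment)                   # named table object

counts_dplyr
counts_base

# Sorted base R counts, converted to a data frame
as.data.frame(sort(counts_base, decreasing = TRUE))
```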

---

## `count()` and `arrange()`

.pull-left[

``` r
# ascending frequency order
hotels %>%
  count(market_segment) %>%
  arrange(n) #<<
```

```
## # A tibble: 8 × 2
##   market_segment     n
##   <chr>          <int>
## 1 Undefined          2
## 2 Aviation         237
## 3 Complementary    743
## 4 Corporate       5295
## 5 Direct         12606
## 6 Groups         19811
## # ℹ 2 more rows
```
]
.pull-right[

``` r
# descending frequency order
# just like adding sort = TRUE
hotels %>%
  count(market_segment) %>%
  arrange(desc(n)) #<<
```
]

---

## `count()` for multiple variables

``` r
hotels %>%
  count(hotel, market_segment) 
```

```
## # A tibble: 14 × 3
##   hotel      market_segment     n
##   <chr>      <chr>          <int>
## 1 City Hotel Aviation         237
## 2 City Hotel Complementary    542
## 3 City Hotel Corporate       2986
## 4 City Hotel Direct          6093
## 5 City Hotel Groups         13975
## 6 City Hotel Offline TA/TO  16747
## # ℹ 8 more rows
```

---

## `summarize()` for summary stats

``` r
# mean average daily rate for all bookings
hotels %>%
  summarize(mean_adr = mean(adr)) 
```

```
## # A tibble: 1 × 1
##   mean_adr
##      <dbl>
## 1     102.
```

- `summarize()` **changes the data frame** entirely

- **Rows are collapsed** into a single summary statistic

- **Columns that are irrelevant** to the calculation are **removed**
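
These three points can be verified on a small example. In this sketch `toy` is a hypothetical tibble with made-up values:

``` r
library(dplyr)

# Hypothetical bookings (values are made up)
toy <- tibble(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel"),
  adr   = c(100, 60, 80)
)

overall <- toy %>% summarize(mean_adr = mean(adr))

dim(toy)        # 3 rows, 2 columns
dim(overall)    # 1 row, 1 column
names(overall)  # only "mean_adr"; `hotel` and `adr` are gone
```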

---
## `summarize()` is often used with `group_by()`

- For **grouped operations**

- There are **two types** of `hotel`, city and resort hotels

- We want the mean daily rate for bookings at **city vs. resort** hotels

``` r
hotels %>%
  group_by(hotel) %>% 
  summarize(mean_adr = mean(adr))
```

```
## # A tibble: 2 × 2
##   hotel        mean_adr
##   <chr>           <dbl>
## 1 City Hotel      105. 
## 2 Resort Hotel     95.0
```

- `group_by()` can be used with **more than one grouping variable**
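
A minimal sketch of grouping by two variables, on a made-up tibble (`toy` is hypothetical). The result has one row per combination that occurs in the data; `.groups = "drop"` returns an ungrouped tibble.

``` r
library(dplyr)

# Hypothetical bookings (values are made up)
toy <- tibble(
  hotel       = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"),
  is_canceled = c(0, 1, 0, 0, 1),
  adr         = c(100, 80, 120, 60, 90)
)

by_hotel_status <- toy %>%
  group_by(hotel, is_canceled) %>%
  summarize(mean_adr = mean(adr), .groups = "drop")  # one row per (hotel, is_canceled) pair

by_hotel_status  # 3 rows: City/0, City/1, Resort/0
```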

---

## Multiple summary statistics

`summarize()` can also compute multiple summary statistics at once

``` r
hotels %>%
  summarize(
    n = n(), # frequencies
    min_adr = min(adr),
    mean_adr = mean(adr),
    median_adr = median(adr),
    max_adr = max(adr)
    )
```

```
## # A tibble: 1 × 5
##        n min_adr mean_adr median_adr max_adr
##    <int>   <dbl>    <dbl>      <dbl>   <dbl>
## 1 119390   -6.38     102.       94.6    5400
```

---
## Data manipulation using `dplyr`

.pull-left[
<img src="img/dplyr-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" />
]
.pull-right[
.midi[
- `select`: pick columns by name
- `arrange`: reorder rows
- `slice`: pick rows using index(es)
- `filter`: pick rows matching criteria
- `distinct`: filter for unique rows
- `mutate`: add new variables
- `summarize`: reduce variables to values
- `group_by`: for grouped operations
- ... (many more)
]
]

---
## Exercise: NYC Flights data
This data frame contains data on all 336,776 flights that departed from New York City in 2013. It is available as part of the `nycflights13` package.

``` r
nycflights13::flights
```

```
## # A tibble: 336,776 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## 4  2013     1     1      544            545        -1     1004
## 5  2013     1     1      554            600        -6      812
## 6  2013     1     1      554            558        -4      740
## # ℹ 336,770 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
```

---
## Exercise: NYC Flights data

Select the `carrier` column.

Select the `carrier` and `tailnum` columns.

Sort the data by origin.

Filter only flights with carrier `OO`.

Filter only flights with carrier `OO`, originating in `LGA`.

Create a new variable that indicates whether or not the flight departed late.

Create a new variable for the mean departure delay by day.

Repeat all the operations using base R.

---
## Summary
--

- Data manipulation tools

  - `select()`: selects columns by name

  - `arrange()`: reorders rows

  - `slice()`: selects rows using index(es)

  - `filter()`: selects rows matching criteria

  - `distinct()`: filters for unique rows

  - `mutate()`: adds new variables

  - `summarize()`: reduces variables to values

  - `group_by()`: for grouped operations