class: center, middle, inverse, title-slide

.title[
# Fundamentals of R: Data Manipulation
]
.subtitle[
## STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 7, 2024
]

---
layout: true

---

<style type="text/css">
.small .remark-code {
  font-size: 80%;
}
.tiny .remark-code {
  font-size: 50%;
}
</style>

## Today

- Exploratory data analysis
- Visualization examples
- Data manipulation tools

---

## Data frames, data sets

- We've seen data frames. This is a commonly used data structure that we get after reading a data set into R.
- In a data set in general,
  - Each row is an **observation**; there are `\(n\)` of them
  - Each column is a **variable**; there are `\(p\)` of them
- Often, the **first things we want to do** when given a data set are to figure out
  1. What is in it (what dimensions, what variables)
  2. What the main characteristics of the variables are.
- We've seen a few tools and functions for working with data frames in "base R"; next we will look at some tools from `dplyr`

---

<img src="img/tidyverse.png" width="100%" style="display: block; margin: auto;" />

https://www.tidyverse.org/

- What we've seen so far: "base R"
- `ggplot2` for plotting, `dplyr` for data manipulation

---

## First question: What's in a data set?
### Example: Star Wars data

- `starwars` data set in the `dplyr` package

``` r
dplyr::starwars
```

```
## # A tibble: 87 × 14
##   name    height  mass hair_color skin_color eye_color birth_year
##   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl>
## 1 Luke S…    172    77 blond      fair       blue            19
## 2 C-3PO      167    75 <NA>       gold       yellow         112
## 3 R2-D2       96    32 <NA>       white, bl… red             33
## 4 Darth …    202   136 none       white      yellow          41.9
## 5 Leia O…    150    49 brown      light      brown           19
## 6 Owen L…    178   120 brown, gr… light      blue            52
## # ℹ 81 more rows
## # ℹ 7 more variables: sex <chr>, gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>,
## #   starships <list>
```

(A `tibble` is the `tidyverse` version of the data frame.)

---

We've seen `str()`. `dplyr::glimpse()` produces cleaner output in this case:

``` r
dplyr::glimpse(starwars)
```

```
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth V…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 1…
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, …
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, gr…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "lig…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", …
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, N…
## $ sex        <chr> "male", "none", "none", "male", "female", "m…
## $ gender     <chr> "masculine", "masculine", "masculine", "masc…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine",…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human",…
## $ films      <list> <"A New Hope", "The Empire Strikes Back", "…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <…
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TI…
```

---

How many rows and columns does this data set have? What does each row represent? What does each column represent?
``` r
?starwars
```

<img src="img/starwars-help.png" width="100%" style="display: block; margin: auto;" />

---

How many rows and columns does this data set have?

``` r
nrow(starwars) # number of rows
```

```
## [1] 87
```

``` r
ncol(starwars) # number of columns
```

```
## [1] 14
```

``` r
dim(starwars) # dimensions (rows, columns)
```

```
## [1] 87 14
```

As we've seen, columns (variables) in data frames can be accessed with `$`:

``` r
dataframe$var_name
```

---

## Second question: what are the main characteristics of the data?

**Exploratory data analysis** (EDA) is an approach to summarizing the **main characteristics** of a data set

<img src="img/elephant.jpg" width="60%" style="display: block; margin: auto;" />

---

## Exploratory data analysis

- Often, this is **visual**
- We might also calculate **summary statistics**, e.g., mean, median
- We might also **manipulate or transform** the data before visualizing or calculating summary statistics
  - e.g., filter certain values, group continuous variables into buckets, apply a log transformation
- We will first introduce **visual summaries** and tools for data manipulation, then talk about **numerical summaries**.
- We saw a visualization example in the first lecture. Here are a few more.

---

## Visualization example 1: Mass vs. height in Star Wars data

How would you describe the **relationship** between mass and height of Star Wars characters? What other variables would help us understand data points that don't follow the **overall trend**? Who is the not-so-tall but much heavier character?

<img src="lecture6_files/figure-html/unnamed-chunk-11-1.png" width="50%" style="display: block; margin: auto;" />

---

## Jabba!
<img src="lecture6_files/figure-html/unnamed-chunk-12-1.png" width="80%" style="display: block; margin: auto;" />

---

## Visualization Example 2: Anscombe's quartet

.small[
.pull-left[

```
##    set  x     y
## 1    I 10  8.04
## 2    I  8  6.95
## 3    I 13  7.58
## 4    I  9  8.81
## 5    I 11  8.33
## 6    I 14  9.96
## 7    I  6  7.24
## 8    I  4  4.26
## 9    I 12 10.84
## 10   I  7  4.82
## 11   I  5  5.68
## 12  II 10  9.14
## 13  II  8  8.14
## 14  II 13  8.74
## 15  II  9  8.77
## 16  II 11  9.26
## 17  II 14  8.10
## 18  II  6  6.13
## 19  II  4  3.10
## 20  II 12  9.13
## 21  II  7  7.26
## 22  II  5  4.74
```

]
.pull-right[

```
##    set  x     y
## 23 III 10  7.46
## 24 III  8  6.77
## 25 III 13 12.74
## 26 III  9  7.11
## 27 III 11  7.81
## 28 III 14  8.84
## 29 III  6  6.08
## 30 III  4  5.39
## 31 III 12  8.15
## 32 III  7  6.42
## 33 III  5  5.73
## 34  IV  8  6.58
## 35  IV  8  5.76
## 36  IV  8  7.71
## 37  IV  8  8.84
## 38  IV  8  8.47
## 39  IV  8  7.04
## 40  IV  8  5.25
## 41  IV 19 12.50
## 42  IV  8  5.56
## 43  IV  8  7.91
## 44  IV  8  6.89
```

]
]

---

## Summary statistics are identical

``` r
Tmisc::quartet %>%
  group_by(set) %>%
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    r = cor(x, y)
  )
```

```
## # A tibble: 4 × 6
##   set   mean_x mean_y  sd_x  sd_y     r
##   <fct>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 I          9   7.50  3.32  2.03 0.816
## 2 II         9   7.50  3.32  2.03 0.816
## 3 III        9   7.5   3.32  2.03 0.816
## 4 IV         9   7.50  3.32  2.03 0.817
```

(Don't worry if you don't know what a standard deviation or correlation is; we will come back to this)

---

## Visualizing Anscombe's quartet

<img src="lecture6_files/figure-html/quartet-plot-1.png" width="80%" style="display: block; margin: auto;" />

---

## Visualization Example 3: Facebook visits

.question[
How are people reporting lower vs. higher values of FB visits?
]

<img src="lecture6_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" />

---

## Data manipulation using `dplyr`

.pull-left[
<img src="img/dplyr-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
.midi[
- `select`: pick columns by name
- `arrange`: reorder rows
- `slice`: pick rows using index(es)
- `filter`: pick rows matching criteria
- `distinct`: filter for unique rows
- `mutate`: add new variables
- `summarize`: reduce variables to values
- `group_by`: for grouped operations
- ... (many more)
]
]

As we go over the examples, think about how you would do these in base R.

---

## Rules of `dplyr` functions

- **First argument** is always a data frame
- Subsequent arguments say **what to do** with that data frame
- Always **return a data frame**
- **Don't modify in place**
  - Meaning that you need an assignment operation if you want an "updated" version of the data frame

---

## Data: Hotel bookings

- Data from two hotels: one resort and one city hotel
- **Observations**: Each **row** represents a hotel booking
- **Goal** for original data collection: Development of prediction models to classify a hotel booking's likelihood to be cancelled ([Antonio et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5))

``` r
hotels <- readr::read_csv("data/hotels.csv")
```

.footnote[
Source: [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)
]

---

## First question: What is in the data set?
.tiny[

``` r
dplyr::glimpse(hotels)
```

```
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort …
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14,…
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, …
## $ arrival_date_month             <chr> "July", "July", "July", …
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, …
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, …
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, …
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <chr> "BB", "BB", "BB", "BB", …
## $ country                        <chr> "PRT", "PRT", "GBR", "GB…
## $ market_segment                 <chr> "Direct", "Direct", "Dir…
## $ distribution_channel           <chr> "Direct", "Direct", "Dir…
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A",…
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A",…
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, …
## $ deposit_type                   <chr> "No Deposit", "No Deposi…
## $ agent                          <chr> "NULL", "NULL", "NULL", …
## $ company                        <chr> "NULL", "NULL", "NULL", …
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type                  <chr> "Transient", "Transient"…
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00…
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, …
## $ reservation_status             <chr> "Check-Out", "Check-Out"…
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01,…
```

]

---

## `select()`: Select a single column

View only `lead_time` (number of days between booking and arrival date):

``` r
select(hotels, lead_time)
```

```
## # A tibble: 119,390 × 1
##   lead_time
##       <dbl>
## 1       342
## 2       737
## 3         7
## 4        13
## 5        14
## 6        14
## # ℹ 119,384 more rows
```

- **First argument**: data frame we're working with, `hotels`
- **Second argument**: variable we want to select, `lead_time`
- **Result**: data frame with 119,390 rows and 1 column
- This is an alternative to `hotels$lead_time`

---

## Select multiple columns

View only the `hotel` type and `lead_time` columns:

``` r
select(hotels, hotel, lead_time)
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel       342
## 2 Resort Hotel       737
## 3 Resort Hotel         7
## 4 Resort Hotel        13
## 5 Resort Hotel        14
## 6 Resort Hotel        14
## # ℹ 119,384 more rows
```

---

## `select()` to exclude variables

- We saw earlier that `select()` keeps variables
- `select()` can also exclude variables, using the `-` sign

.small[

``` r
hotels %>%
  select(-agent)
```

```
## # A tibble: 119,390 × 31
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## 6 Resort Hotel           0        14              2015
## # ℹ 119,384 more rows
## # ℹ 27 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>, …
```

]

---

## `select()` a range of variables

- Instead of writing out all the variable names, `select()` also accepts a **range of variables**
- This follows the order they are listed in the data frame

``` r
hotels %>%
  select(hotel:arrival_date_month)
```

```
## # A tibble: 119,390 × 5
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## 6 Resort Hotel           0        14              2015
## # ℹ 119,384 more rows
## # ℹ 1 more variable: arrival_date_month <chr>
```

---

## `select()` variables with certain characteristics

``` r
hotels %>%
  select(starts_with("arrival"))
```

```
## # A tibble: 119,390 × 4
##   arrival_date_year arrival_date_month arrival_date_week_number
##               <dbl> <chr>                                 <dbl>
## 1              2015 July                                     27
## 2              2015 July                                     27
## 3              2015 July                                     27
## 4              2015 July                                     27
## 5              2015 July                                     27
## 6              2015 July                                     27
## # ℹ 119,384 more rows
## # ℹ 1 more variable: arrival_date_day_of_month <dbl>
```

---

## `select()` variables with certain characteristics

``` r
hotels %>%
  select(ends_with("type"))
```

```
## # A tibble: 119,390 × 4
##   reserved_room_type assigned_room_type deposit_type
##   <chr>              <chr>              <chr>
## 1 C                  C                  No Deposit
## 2 C                  C                  No Deposit
## 3 A                  C                  No Deposit
## 4 A                  A                  No Deposit
## 5 A                  A                  No Deposit
## 6 A                  A                  No Deposit
## # ℹ 119,384 more rows
## # ℹ 1 more variable: customer_type <chr>
```

---

## Select helpers

- `starts_with()`: Starts with a prefix
- `ends_with()`: Ends with a suffix
- `contains()`: Contains a literal string
- `num_range()`: Matches a numerical range like x01, x02, x03
- `one_of()`: Matches variable names in a character vector
- `everything()`: Matches all variables
- `last_col()`: Select last variable, possibly with an offset
- `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)

.footnote[
See help for any of these functions for more info, e.g. `?everything`.
]

---

## `select()`, then `arrange()`

What if we wanted to select these columns, and then arrange the data in order of lead time?

``` r
hotels %>%
  select(hotel, lead_time) %>%
  arrange(lead_time)
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel         0
## 2 Resort Hotel         0
## 3 Resort Hotel         0
## 4 Resort Hotel         0
## 5 Resort Hotel         0
## 6 Resort Hotel         0
## # ℹ 119,384 more rows
```

---

## Pipes

In programming, a pipe is a technique for **passing information from one process to another**.
In R, the symbol is `%>%`. Also: `|>`.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
]
.pull-right[
.small[

``` r
hotels %>%
  select(hotel, lead_time) %>%
  arrange(lead_time)
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel         0
## 2 Resort Hotel         0
## 3 Resort Hotel         0
## 4 Resort Hotel         0
## 5 Resort Hotel         0
## 6 Resort Hotel         0
## # ℹ 119,384 more rows
```

]
]

---

## Pipes

In programming, a pipe is a technique for **passing information from one process to another**. In R, the symbol is `%>%`. Also: `|>`.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
- then we select the variables `hotel` and `lead_time`,
]
.pull-right[
.small[

``` r
hotels %>%
  select(hotel, lead_time) %>% #<<
  arrange(lead_time)
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel         0
## 2 Resort Hotel         0
## 3 Resort Hotel         0
## 4 Resort Hotel         0
## 5 Resort Hotel         0
## 6 Resort Hotel         0
## # ℹ 119,384 more rows
```

]
]

---

## Pipes

In programming, a pipe is a technique for **passing information from one process to another**. In R, the symbol is `%>%`. Also: `|>`.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
- then we select the variables `hotel` and `lead_time`,
- and then we arrange the data frame by `lead_time`.
]
.pull-right[
.small[

``` r
hotels %>%
  select(hotel, lead_time) %>%
  arrange(lead_time) #<<
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel         0
## 2 Resort Hotel         0
## 3 Resort Hotel         0
## 4 Resort Hotel         0
## 5 Resort Hotel         0
## 6 Resort Hotel         0
## # ℹ 119,384 more rows
```

]
]

Note that the `%>%` pipe operator is implemented in the package `magrittr`, but it is made available automatically when we use `library(dplyr)` or `library(tidyverse)`. The native pipe `|>` is built into R (version 4.1 and later) and needs no package.

---

## How does a pipe work?

- You can think about the following **sequence of actions**
  - find keys, start car, drive to work, park.
- Expressed as a set of **nested functions** in R pseudocode this would look like:

``` r
park(drive(start_car(find("keys")), to = "work"))
```

- Writing it out using pipes gives it a more natural (and easier to read) structure:

``` r
find("keys") %>%
  start_car() %>%
  drive(to = "work") %>%
  park()
```

---

## Simple example

- We can write `exp(1)` with pipes as `1 %>% exp`, and `log(exp(1))` as `1 %>% exp %>% log`

``` r
exp(1)
```

```
## [1] 2.718282
```

``` r
1 %>% exp
```

```
## [1] 2.718282
```

``` r
1 %>% exp %>% log
```

```
## [1] 1
```

- Tidyverse functions are at their best when composed together using the pipe operator

---

## `arrange()` in ascending or descending order

- We saw earlier that `arrange()` defaults to ascending order
- For descending order, use `desc()`

.pull-left[

``` r
hotels %>%
  select(hotel, lead_time) %>%
  arrange(lead_time)
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel         0
## 2 Resort Hotel         0
## 3 Resort Hotel         0
## 4 Resort Hotel         0
## 5 Resort Hotel         0
## 6 Resort Hotel         0
## # ℹ 119,384 more rows
```

]
.pull-right[

``` r
hotels %>%
  select(hotel, lead_time) %>%
  arrange(desc(lead_time))
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel       737
## 2 Resort Hotel       709
## 3 City Hotel         629
## 4 City Hotel         629
## 5 City Hotel         629
## 6 City Hotel         629
## # ℹ 119,384 more rows
```

]

---

## `slice()` for certain row numbers

This is an alternative to base R indexing with `hotels[1:5, ]`

``` r
hotels %>%
  slice(1:5)
```

```
## # A tibble: 5 × 32
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## # ℹ 28 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>, …
```

---

## Reminder: comments in R

- Any text following `#` will be printed as is, and won't be run as code
- This is useful for leaving comments and for temporarily disabling certain lines of code (for debugging, trying out different things)

.tiny[

``` r
hotels %>%
  # slice the first five rows  # this line is a comment
  #select(hotel) %>%           # this one doesn't run
  slice(1:5)                   # this line runs
```

```
## # A tibble: 5 × 32
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## # ℹ 28 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>, …
```

]

---

## `filter()` to select a subset of rows

.small[

``` r
# bookings in City Hotels
hotels %>%
  filter(hotel == "City Hotel")
```

```
## # A tibble: 79,330 × 32
##   hotel      is_canceled lead_time arrival_date_year
##   <chr>            <dbl>     <dbl>             <dbl>
## 1 City Hotel           0         6              2015
## 2 City Hotel           1        88              2015
## 3 City Hotel           1        65              2015
## 4 City Hotel           1        92              2015
## 5 City Hotel           1       100              2015
## 6 City Hotel           1        79              2015
## # ℹ 79,324 more rows
## # ℹ 28 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>, …
```

]

<small>What was the base R alternative that we saw?</small>

---

## `filter()` for many conditions at once

``` r
hotels %>%
  filter(
    adults == 0,
    children >= 1
  ) %>%
  select(adults, babies, children)
```

```
## # A tibble: 223 × 3
##   adults babies children
##    <dbl>  <dbl>    <dbl>
## 1      0      0        3
## 2      0      0        2
## 3      0      0        2
## 4      0      0        2
## 5      0      0        2
## 6      0      0        3
## # ℹ 217 more rows
```

---

## `filter()` for more complex conditions

``` r
# bookings with no adults and some children or babies in the room
hotels %>%
  filter(
    adults == 0,
    children >= 1 | babies >= 1 # | means OR
  ) %>%
  select(adults, babies, children)
```

```
## # A tibble: 223 × 3
##   adults babies children
##    <dbl>  <dbl>    <dbl>
## 1      0      0        3
## 2      0      0        2
## 3      0      0        2
## 4      0      0        2
## 5      0      0        2
## 6      0      0        3
## # ℹ 217 more rows
```

---

## Reminder: Logical operators in R

<br>

operator    | definition                   || operator       | definition
------------|------------------------------||----------------|----------------
`<`         | less than                    ||`x` &#124; `y`  | `x` OR `y`
`<=`        | less than or equal to        ||`is.na(x)`      | test if `x` is `NA`
`>`         | greater than                 ||`!is.na(x)`     | test if `x` is not `NA`
`>=`        | greater than or equal to     ||`x %in% y`      | test if `x` is in `y`
`==`        | exactly equal to             ||`!(x %in% y)`   | test if `x` is not in `y`
`!=`        | not equal to                 ||`!x`            | not `x`
`x & y`     | `x` AND `y`                  ||                |

---

## Summary

--

- Data manipulation tools
  - `select()`: selects columns by name
  - `arrange()`: reorders rows
  - `slice()`: selects rows using index(es)
  - `filter()`: selects rows matching criteria
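
---

## Summary example: the four verbs together

The four verbs in the summary can be chained in one pipeline. The sketch below is only an illustration: `bookings` is a small made-up tibble (not the real `hotels.csv`) that borrows a few column names from the hotel data, and the object names `top_city`, `city`, and `top_city_base` are hypothetical. It also shows one possible base R version of the same steps, as a way to think about the earlier "how would you do these in base R" question.

```r
library(dplyr)

# Hypothetical toy data (assumption: NOT the real hotels.csv),
# reusing a few column names from the hotel bookings example
bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel",
                "Resort Hotel", "City Hotel"),
  lead_time = c(6, 342, 88, 7, 0),
  adults    = c(2, 2, 0, 1, 2),
  children  = c(0, 0, 2, 0, 1)
)

# dplyr: keep City Hotel rows, keep two columns,
# sort by lead_time (largest first), then take the first two rows
top_city <- bookings %>%
  filter(hotel == "City Hotel") %>%
  select(hotel, lead_time) %>%
  arrange(desc(lead_time)) %>%
  slice(1:2)

# One possible base R version of the same steps
city <- bookings[bookings$hotel == "City Hotel", c("hotel", "lead_time")]
city <- city[order(city$lead_time, decreasing = TRUE), ]
top_city_base <- city[1:2, ]

top_city
```

Both versions keep the two City Hotel bookings with the longest lead times (88 and 6 days). Because `dplyr` functions don't modify in place, `bookings` still has all five rows afterwards; the result only exists because it was assigned to `top_city`.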