class: center, middle, inverse, title-slide

.title[
# Fundamentals of R: Data Manipulation and Visualization
]
.subtitle[
## STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### October 11, 2023
]

---

layout: true

<!-- <div class="my-footer"> -->
<!-- <span> -->
<!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> -->
<!-- </span> -->
<!-- </div> -->

---

<style type="text/css">
.tiny .remark-code { font-size: 50%; }
.small .remark-code { font-size: 80%; }
</style>

## Reminders

- HW 1 is due tomorrow at 9pm
- HW 2 will be posted on Friday afternoon on the course website
- Schedule for next week (week 4):
  - Monday: regular lecture; 2:30-3:30 OH (XHT); lab 3 due 9pm
  - Tuesday: 10-11 OH (JH)
  - Wednesday: Jed Harwood will do review during regular lecture time (same room)
  - Thursday: 10-11 OH (JH); **no lab**, instead 2-3PM OH (XHT; virtual, will post link on Piazza)
  - Friday: midterm during regular lecture time (same room); **no homework**

---

## Midterm

- Midterm will cover material until Monday, Oct 16
- Closed-book
- You don't need your computers
- No make-up exams
- Drop policy for exams: 1 midterm may be dropped

---

## Recap

--

- Exploratory data analysis
  - Manipulate or transform the data before visualizing or calculating summary statistics
- Data manipulation tools
  - `select()`
  - `arrange()`
  - `slice()`
  - `filter()`

---

## Today

- Data manipulation tools
  - `distinct()`: filter for unique rows
  - `mutate()`: adds new variables
  - `summarize()`: reduces variables to values
  - `group_by()`: for grouped operations
- Data visualization using `ggplot2`

---

## Data: Hotel bookings

- Data from two hotels: one resort and one city hotel
- **Observations**: Each **row** represents a hotel booking
- **Goal** for original data collection: Development of prediction models to classify a hotel booking's likelihood to be cancelled ([Antonio et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5))

```r
hotels <- readr::read_csv("data/hotels.csv")
```

.footnote[
Source:
[TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)
]

---

## First question: What is in the data set?

.tiny[
```r
dplyr::glimpse(hotels)
```

```
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort …
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14,…
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, …
## $ arrival_date_month             <chr> "July", "July", "July", …
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, …
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, …
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, …
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <chr> "BB", "BB", "BB", "BB", …
## $ country                        <chr> "PRT", "PRT", "GBR", "GB…
## $ market_segment                 <chr> "Direct", "Direct", "Dir…
## $ distribution_channel           <chr> "Direct", "Direct", "Dir…
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A",…
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A",…
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, …
## $ deposit_type                   <chr> "No Deposit", "No Deposi…
## $ agent                          <chr> "NULL", "NULL", "NULL", …
## $ company                        <chr> "NULL", "NULL", "NULL", …
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type                  <chr> "Transient", "Transient"…
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00…
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, …
## $ reservation_status             <chr> "Check-Out", "Check-Out"…
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01,…
```
]

---

## `distinct()` to filter for unique rows

.small[
.pull-left[
```r
hotels %>%
  distinct(market_segment)
```

```
## # A tibble: 8 × 1
##   market_segment
##   <chr>
## 1 Direct
## 2 Corporate
## 3 Online TA
## 4 Offline TA/TO
## 5 Complementary
## 6 Groups
## # … with 2 more rows
```
]
.pull-right[
Recall: `arrange()` to order alphabetically
```r
hotels %>%
  distinct(market_segment) %>%
  arrange(market_segment)
```

```
## # A tibble: 8 × 1
##   market_segment
##   <chr>
## 1 Aviation
## 2 Complementary
## 3 Corporate
## 4 Direct
## 5 Groups
## 6 Offline TA/TO
## # … with 2 more rows
```
]
]

---

## `distinct()` using more than one variable

```r
hotels %>%
  distinct(hotel, market_segment) %>% #<<
  arrange(hotel, market_segment)
```

```
## # A tibble: 14 × 2
##   hotel      market_segment
##   <chr>      <chr>
## 1 City Hotel Aviation
## 2 City Hotel Complementary
## 3 City Hotel Corporate
## 4 City Hotel Direct
## 5 City Hotel Groups
## 6 City Hotel Offline TA/TO
## # … with 8 more rows
```

---

## `mutate()` to add a new variable

```r
hotels %>%
  mutate(little_ones = children + babies) %>%
  select(children, babies, little_ones) %>%
  arrange(desc(little_ones))
```

```
## # A tibble: 119,390 × 3
##   children babies little_ones
##      <dbl>  <dbl>       <dbl>
## 1       10      0          10
## 2        0     10          10
## 3        0      9           9
## 4        2      1           3
## 5        2      1           3
## 6        2      1           3
## # … with 119,384 more rows
```

<small>What are these functions doing?
How to do the same in base R?</small>

---

## `count()` to create frequency tables

.pull-left[
```r
# alphabetical order by default
hotels %>%
  count(market_segment) #<<
```

```
## # A tibble: 8 × 2
##   market_segment     n
##   <chr>          <int>
## 1 Aviation         237
## 2 Complementary    743
## 3 Corporate       5295
## 4 Direct         12606
## 5 Groups         19811
## 6 Offline TA/TO  24219
## # … with 2 more rows
```
]
.pull-right[
```r
# descending frequency order
hotels %>%
  count(market_segment, sort = TRUE) #<<
```

```
## # A tibble: 8 × 2
##   market_segment     n
##   <chr>          <int>
## 1 Online TA      56477
## 2 Offline TA/TO  24219
## 3 Groups         19811
## 4 Direct         12606
## 5 Corporate       5295
## 6 Complementary    743
## # … with 2 more rows
```
]

- Base R version: `table()`

---

## `count()` and `arrange()`

.pull-left[
```r
# ascending frequency order
hotels %>%
  count(market_segment) %>%
  arrange(n) #<<
```

```
## # A tibble: 8 × 2
##   market_segment     n
##   <chr>          <int>
## 1 Undefined          2
## 2 Aviation         237
## 3 Complementary    743
## 4 Corporate       5295
## 5 Direct         12606
## 6 Groups         19811
## # … with 2 more rows
```
]
.pull-right[
```r
# descending frequency order
# just like adding sort = TRUE
hotels %>%
  count(market_segment) %>%
  arrange(desc(n)) #<<
```

```
## # A tibble: 8 × 2
##   market_segment     n
##   <chr>          <int>
## 1 Online TA      56477
## 2 Offline TA/TO  24219
## 3 Groups         19811
## 4 Direct         12606
## 5 Corporate       5295
## 6 Complementary    743
## # … with 2 more rows
```
]

---

## `count()` for multiple variables

```r
hotels %>%
  count(hotel, market_segment)
```

```
## # A tibble: 14 × 3
##   hotel      market_segment     n
##   <chr>      <chr>          <int>
## 1 City Hotel Aviation         237
## 2 City Hotel Complementary    542
## 3 City Hotel Corporate       2986
## 4 City Hotel Direct          6093
## 5 City Hotel Groups         13975
## 6 City Hotel Offline TA/TO  16747
## # … with 8 more rows
```

---

## `summarize()` for summary stats

```r
# mean average daily rate for all bookings
hotels %>%
  summarize(mean_adr = mean(adr))
```

```
## # A tibble: 1 × 1
##   mean_adr
##      <dbl>
## 1     102.
```

- `summarize()` **changes the data frame** entirely
  - **Rows are collapsed** into a single summary statistic
  - **Columns that are irrelevant** to the calculation are **removed**

---

## `summarize()` is often used with `group_by()`

- For **grouped operations**
- There are **two types** of `hotel`, city and resort hotels
- We want the mean daily rate for bookings at **city vs. resort** hotels

```r
hotels %>%
  group_by(hotel) %>%
  summarize(mean_adr = mean(adr))
```

```
## # A tibble: 2 × 2
##   hotel        mean_adr
##   <chr>           <dbl>
## 1 City Hotel      105.
## 2 Resort Hotel     95.0
```

- `group_by()` can be used with **more than one group**

---

## Multiple summary statistics

`summarize` can be used for multiple summary statistics as well

```r
hotels %>%
  summarize(
    n = n(), # frequencies
    min_adr = min(adr),
    mean_adr = mean(adr),
    median_adr = median(adr),
    max_adr = max(adr)
  )
```

```
## # A tibble: 1 × 5
##        n min_adr mean_adr median_adr max_adr
##    <int>   <dbl>    <dbl>      <dbl>   <dbl>
## 1 119390   -6.38     102.       94.6    5400
```

---

## Data manipulation using `dplyr`

.pull-left[
<img src="img/dplyr-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" />
]
.pull-right[
.midi[
- `select`: pick columns by name
- `arrange`: reorder rows
- `slice`: pick rows using index(es)
- `filter`: pick rows matching criteria
- `distinct`: filter for unique rows
- `mutate`: add new variables
- `summarize`: reduce variables to values
- `group_by`: for grouped operations
- ... (many more)
]
]

---

## Exercise: NYC Flights data

This data frame contains data on all 336,776 flights that departed from New York City in 2013. It is available as part of the `nycflights13` package.
```r
nycflights13::flights
```

```
## # A tibble: 336,776 × 19
##    year month   day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
##   <int> <int> <int>    <int>        <int>   <dbl>   <int>   <int>
## 1  2013     1     1      517          515       2     830     819
## 2  2013     1     1      533          529       4     850     830
## 3  2013     1     1      542          540       2     923     850
## 4  2013     1     1      544          545      -1    1004    1022
## 5  2013     1     1      554          600      -6     812     837
## 6  2013     1     1      554          558      -4     740     728
## # … with 336,770 more rows, 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable
## #   names ¹sched_dep_time, ²dep_delay, ³arr_time,
## #   ⁴sched_arr_time
```

---

## Exercise: NYC Flights data

- Select the `carrier` column.
- Select the `carrier` and `tailnum` columns.
- Sort the data by origin.
- Filter only flights with carrier `OO`.
- Filter only flights with carrier `OO`, originating in `LGA`.
- Create a new variable that indicates whether or not the flight departed late.
- Create a new variable for the mean departure delay by day.
- Repeat all the operations using base R.

---

## Data visualization using `ggplot2`

.pull-left[
<img src="img/ggplot2-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
- ggplot2 is the tidyverse's **data visualization package**
- `gg` in "ggplot2" stands for Grammar of Graphics
- Inspired by the book Grammar of Graphics by Leland Wilkinson
- We will also look at some **plotting functions in base R**
]

---

## `ggplot2` vs. base R

Mass vs. height from Star Wars data set that we've seen

.tiny[
.pull-left[
```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
```

<img src="lecture7_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
```r
plot(starwars$height, starwars$mass,
     main = "Mass vs. height of Starwars characters",
     xlab = "Height (cm)", ylab = "Weight (kg)")
```

<img src="lecture7_files/figure-html/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Grammar of Graphics

.pull-left-narrow[
A grammar of graphics is a tool that enables us to concisely describe the **components of a graphic**
]
.pull-right-wide[
<img src="img/grammar-of-graphics.png" width="75%" style="display: block; margin: auto;" />
]

How these are implemented in `ggplot2`: https://ggplot2.tidyverse.org/reference/

.footnote[
Source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html)]

---

## Mass vs. height from Star Wars data set

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

<img src="lecture7_files/figure-html/mass-height-1.png" width="50%" style="display: block; margin: auto;" />

---

## Mass vs. height from Star Wars data set

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

- What are the functions doing the plotting?
- What is the data set being plotted?
- Which variables map to which features (aesthetics) of the plot?
- What does the warning mean?
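
On the last question: a point needs both an x and a y value, so rows where either one is missing cannot be drawn and are dropped with a warning. A minimal base R sketch of that count, using a small made-up data frame (`toy` and its values are hypothetical, not the `starwars` data):

```r
# Hypothetical toy data: two of the five rows have a missing value
toy <- data.frame(
  height = c(172, 167, NA, 202, 150),
  mass   = c(77, 75, 32, NA, 49)
)

# A row cannot be plotted as a point if either coordinate is NA
incomplete <- is.na(toy$height) | is.na(toy$mass)

# Number of rows a scatterplot of mass vs. height would have to drop
n_dropped <- sum(incomplete)
n_dropped
```

Applying the same expression to `starwars$height` and `starwars$mass` should reproduce the number of rows reported in the warning.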

---

## ggplot2

- `ggplot()` is the **main function** in the `ggplot2` package
- Plots are constructed in **layers**
- **Structure of the code** for plots can be summarized as

```r
ggplot(data = [dataset],
       mapping = aes(x = [x-variable], y = [y-variable])) +
  geom_xxx() +
  other options
```

---

## More involved example: Palmer Penguins

Data contains information on 344 penguins, including: penguin species, island in Palmer Archipelago, size (flipper length, body mass, bill dimensions), and sex.

<img src="img/penguins.png" width="40%" style="display: block; margin: auto;" />

---

```r
library(palmerpenguins)
dplyr::glimpse(penguins)
```

```
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adeli…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torg…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 362…
## $ sex               <fct> male, female, female, NA, female, mal…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2…
```

---

.panelset[
.panel[.panel-name[Plot]
<img src="lecture7_files/figure-html/unnamed-chunk-27-1.png" width="70%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]
.small[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm, y = bill_length_mm,
                     color = species)) +
  geom_point() +
  labs(title = "Bill depth and length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       x = "Bill depth (mm)", y = "Bill length (mm)",
       color = "Species",
       caption = "Source: Palmer Station LTER / palmerpenguins package") +
  scale_color_viridis_d()
```

```
## Warning: Removed 2 rows containing missing values (geom_point).
```
]
]
]

---

.midi[
> **Start with the `penguins` data frame**
]

.tiny[
.pull-left[
```r
ggplot(data = penguins) #<<
```
]
]
.pull-right[
<img src="lecture7_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.midi[
> Start with the `penguins` data frame,
> **map bill depth to the x-axis**
]

.tiny[
.pull-left[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm)) #<<
```
]
]
.pull-right[
<img src="lecture7_files/figure-html/unnamed-chunk-29-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.midi[
> Start with the `penguins` data frame,
> map bill depth to the x-axis
> **and map bill length to the y-axis.**
]

.tiny[
.pull-left[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm)) #<<
```
]
]
.pull-right[
<img src="lecture7_files/figure-html/unnamed-chunk-30-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.midi[
> Start with the `penguins` data frame,
> map bill depth to the x-axis
> and map bill length to the y-axis.
> **Represent each observation with a point**
]

.tiny[
.pull-left[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm)) +
  geom_point() #<<
```
]
]
.pull-right[
<img src="lecture7_files/figure-html/unnamed-chunk-31-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.midi[
> Start with the `penguins` data frame,
> map bill depth to the x-axis
> and map bill length to the y-axis.
> Represent each observation with a point
> **and map species to the color of each point.**
]

.tiny[
.pull-left[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     color = species)) + #<<
  geom_point()
```
]
]
.pull-right[
<img src="lecture7_files/figure-html/unnamed-chunk-32-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.midi[
> Start with the `penguins` data frame,
> map bill depth to the x-axis
> and map bill length to the y-axis.
> Represent each observation with a point
> and map species to the color of each point.
> **Title the plot "Bill depth and length"**
]

.tiny[
.pull-left[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     color = species)) +
  geom_point() +
  labs(title = "Bill depth and length") #<<
```
]
]
.pull-right[
<img src="lecture7_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.midi[
> Start with the `penguins` data frame,
> map bill depth to the x-axis
> and map bill length to the y-axis.
> Represent each observation with a point
> and map species to the color of each point.
> Title the plot "Bill depth and length",
> **add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins"**
]

.tiny[
.pull-left[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     color = species)) +
  geom_point() +
  labs(title = "Bill depth and length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins") #<<
```
]
]
.pull-right[
<img src="lecture7_files/figure-html/unnamed-chunk-34-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.midi[
> Start with the `penguins` data frame,
> map bill depth to the x-axis
> and map bill length to the y-axis.
> Represent each observation with a point
> and map species to the color of each point.
> Title the plot "Bill depth and length",
> add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
> **label the x and y axes as "Bill depth (mm)" and "Bill length (mm)", respectively**
]

.tiny[
.pull-left[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     color = species)) +
  geom_point() +
  labs(title = "Bill depth and length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       x = "Bill depth (mm)", y = "Bill length (mm)") #<<
```
]
]
.pull-right[
<img src="lecture7_files/figure-html/unnamed-chunk-35-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.midi[
> Start with the `penguins` data frame,
> map bill depth to the x-axis
> and map bill length to the y-axis.
> Represent each observation with a point
> and map species to the color of each point.
> Title the plot "Bill depth and length",
> add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
> label the x and y axes as "Bill depth (mm)" and "Bill length (mm)", respectively,
> **label the legend "Species"**
]

.tiny[
.pull-left[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     color = species)) +
  geom_point() +
  labs(title = "Bill depth and length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       x = "Bill depth (mm)", y = "Bill length (mm)",
       color = "Species") #<<
```
]
]
.pull-right[
<img src="lecture7_files/figure-html/unnamed-chunk-36-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.midi[
> Start with the `penguins` data frame,
> map bill depth to the x-axis
> and map bill length to the y-axis.
> Represent each observation with a point
> and map species to the color of each point.
> Title the plot "Bill depth and length",
> add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
> label the x and y axes as "Bill depth (mm)" and "Bill length (mm)", respectively,
> label the legend "Species",
> **and add a caption for the data source.**
]

.tiny[
.pull-left[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     color = species)) +
  geom_point() +
  labs(title = "Bill depth and length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       x = "Bill depth (mm)", y = "Bill length (mm)",
       color = "Species",
       caption = "Source: Palmer Station LTER / palmerpenguins package") #<<
```
]
]
.pull-right[
<img src="lecture7_files/figure-html/unnamed-chunk-37-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.midi[
> Start with the `penguins` data frame,
> map bill depth to the x-axis
> and map bill length to the y-axis.
> Represent each observation with a point
> and map species to the color of each point.
> Title the plot "Bill depth and length",
> add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
> label the x and y axes as "Bill depth (mm)" and "Bill length (mm)", respectively,
> label the legend "Species",
> and add a caption for the data source.
> **Finally, use a discrete color scale that is designed to be perceived by viewers with common forms of color blindness.**
]

.tiny[
.pull-left[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     color = species)) +
  geom_point() +
  labs(title = "Bill depth and length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       x = "Bill depth (mm)", y = "Bill length (mm)",
       color = "Species",
       caption = "Source: Palmer Station LTER / palmerpenguins package") +
  scale_color_viridis_d() #<<
```
]
]
.pull-right[
<img src="lecture7_files/figure-html/unnamed-chunk-38-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.panelset[
.panel[.panel-name[Plot]
<img src="lecture7_files/figure-html/unnamed-chunk-39-1.png" width="70%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]
.small[
```r
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm, y = bill_length_mm,
                     color = species)) +
  geom_point() +
  labs(title = "Bill depth and length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       x = "Bill depth (mm)", y = "Bill length (mm)",
       color = "Species",
       caption = "Source: Palmer Station LTER / palmerpenguins package") +
  scale_color_viridis_d()
```

```
## Warning: Removed 2 rows containing missing values (geom_point).
```
]
]
.panel[.panel-name[Narrative]
.pull-left-wide[
.midi[
Start with the `penguins` data frame,
map bill depth to the x-axis
and map bill length to the y-axis.
Represent each observation with a point
and map species to the color of each point.
Title the plot "Bill depth and length",
add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
label the x and y axes as "Bill depth (mm)" and "Bill length (mm)", respectively,
label the legend "Species",
and add a caption for the data source.
Finally, use a discrete color scale that is designed to be perceived by viewers with common forms of color blindness.
]
]
]
]

---

## Summary

--

- Data visualization using `ggplot2`
  - `ggplot2` vs. base R
  - Basic structure of `ggplot2` code
      - Data, mapping, geom, labels
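
As a recap of the data, mapping, geom, labels structure, here is a minimal sketch on R's built-in `mtcars` data set. `mtcars` is not used elsewhere in this lecture; it is chosen here only so the code runs without any extra files, and the title and axis labels are illustrative.

```r
library(ggplot2)

# Data: mtcars (ships with R)
# Mapping: car weight on the x-axis, fuel efficiency on the y-axis
# Geom: one point per car
# Labels: title and axis labels
p <- ggplot(data = mtcars,
            mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel efficiency vs. weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon")

# Printing the object draws the plot
p
```

Each `+` adds a layer or option to the plot object, so the plot can be built up and inspected one line at a time.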