Instructions

Upload a PDF file, named with your UC Davis email ID and homework number (e.g., xtai_hw1.pdf), to Gradescope (accessible through Canvas). Give the commands that answer each question in their own code block; the output they produce will be embedded automatically in the output file. Support each answer with written statements as well as the code used.

All code used to produce your results must be shown in your PDF file (e.g., do not use echo = FALSE or include = FALSE as options anywhere). Rmd files do not need to be submitted, but may be requested by the TA and must be available when the assignment is submitted.

Students may choose to collaborate with each other on the homework, but must clearly indicate with whom they collaborated.

Problem 1: Syntax and class-typing (continued) (5 points)

For the following series of commands, either explain their results or explain why they produce errors.

dataframe3 <- data.frame(z1 = "5", z2 = 7, z3 = 12)
dataframe3[1, 2] + dataframe3[1, 3]
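As a general aid for questions of this kind, the following minimal sketch shows one way to inspect column classes before doing arithmetic on data frame cells. The data frame `toy` and its columns `a` and `b` are hypothetical and are not the objects from the problem.

```r
# Hypothetical toy data frame (not the one from the problem).
toy <- data.frame(a = "3", b = 4, stringsAsFactors = FALSE)

# Each column keeps its own class, so check before doing arithmetic.
col_classes <- sapply(toy, class)
print(col_classes)

# Indexing a single cell returns a value of that column's class.
is_num_a <- is.numeric(toy[1, 1])
is_num_b <- is.numeric(toy[1, 2])

# A character cell has to be converted explicitly before it can be added.
total <- as.numeric(toy[1, 1]) + toy[1, 2]
print(total)
```

Checking `sapply(toy, class)` first is usually quicker than reading an error message after the fact.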

Problem 2: Working with a data frame (50 points)

The data set here records hourly rainfall at a certain location in Canada, every day from 1960 to 1980.

  1. Read the data set into R using the command read.table(). Use the help function to learn what arguments this function takes. Once you have the necessary input, read the data set into R and make it a data frame called rain.df.

  2. What is the structure of the data set? How many rows and columns does rain.df have? (If there are not 5070 rows and 27 columns, something is wrong; check the previous part to see what might have gone wrong.)

  3. What are the names of the columns of rain.df?

  4. What is the value of row 5, column 7 of rain.df?

  5. Display the second row of rain.df in its entirety.

  6. Explain what this command does:

names(rain.df) <- c("year", "month", "day", paste0("hour", seq(0, 23)))

by running it on your data and examining the object. (You may find the display functions head() and tail() useful here.)

  7. Create a new column in the data frame for the sum of the rightmost 24 columns. Do this once using base R, calling the column daily1, and once using dplyr, calling the column daily2. Check that the two columns are the same.

  8. Create a histogram of the values in your new column of daily rainfall values (read the documentation and use the hist() command in base R). What is wrong with this picture?

  9. Create a new data frame rain.df.fixed that takes the original data frame and corrects the apparent flaw you have discovered. Having done this, produce a new histogram with the corrected data and explain why this is more reasonable.
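The reading, renaming, and row-sum steps in this problem can be tried out on a small stand-in before touching the real file. The sketch below is only an illustration under stated assumptions: `sample_text`, `toy.df`, and the column `daily` are hypothetical, the toy data has three hourly columns rather than 24, and the arguments needed for the actual rainfall file may differ.

```r
# Hypothetical three-line sample standing in for the real file.
sample_text <- "60 4 1 0 1 2
60 4 2 3 0 0
60 4 3 1 1 1"

# read.table() accepts a file path or, as here, literal text via `text =`.
toy.df <- read.table(text = sample_text, header = FALSE)

dim(toy.df)    # number of rows, then number of columns
names(toy.df)  # default names are V1, V2, ...

# Rename columns with the same paste0()/seq() pattern as the names() command
# in this problem, shortened to three hourly columns for the toy example.
names(toy.df) <- c("year", "month", "day", paste0("hour", seq(0, 2)))

# Row-wise total over the hourly columns, in base R.
hour_cols <- grep("^hour", names(toy.df))
toy.df$daily <- rowSums(toy.df[, hour_cols])
print(toy.df)
```

Selecting the hourly columns by name pattern rather than by position keeps the sum correct even after new columns are appended to the data frame.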

Problem 3: Data manipulation and visualization (45 points)

R includes a number of pre-specified data objects as part of its default installation. We will load and manipulate one of these, a data frame of 93 cars with model year 1993. Begin by ensuring that you can load this data with the commands

library(MASS)
data(Cars93)

  1. What is the structure of the data set? How many rows and columns does Cars93 have?

  2. What types of drive trains are in the data set, and how many of each type? What is the mean price of cars in each of these groups? (Use dplyr.)

  3. Assuming that these cars are exactly as fuel efficient as this table indicates, find the cars that can travel the maximum, minimum, and median distance in highway driving. You will need at least two columns to work this out; why those two?

  4. Create a binary variable for whether the car can take 7 or fewer passengers. For only the cars that take 7 or fewer passengers, use ggplot2 to create a scatterplot of Horsepower vs. Price. Make the color depend on the variable Type. Give all the points a transparency of 0.5. Add labels for the title and the x and y axes. What can we learn from this plot about the relationship between horsepower, price, and type of car?
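The grouped summaries and the styled scatterplot asked for in this problem follow a common pattern, sketched below on hypothetical toy data rather than on Cars93. The names `toy`, `group`, `x`, and `value` are assumptions made for illustration, and the dplyr and ggplot2 parts run only if those packages happen to be installed.

```r
# Hypothetical toy data (not Cars93): a group label and two numeric columns.
toy <- data.frame(
  group = c("a", "b", "a", "c", "b", "a"),
  x     = c(1, 2, 3, 4, 5, 6),
  value = c(10, 20, 14, 30, 24, 12)
)

# Base R: count per group and mean per group.
base_counts <- table(toy$group)
base_means  <- tapply(toy$value, toy$group, mean)
print(base_counts)
print(base_means)

# dplyr equivalent, run only if the package is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  summary_tbl <- toy %>%
    group_by(group) %>%
    summarise(n = n(), mean_value = mean(value))
  print(summary_tbl)
}

# ggplot2 scatterplot with color by group, semi-transparent points, and labels.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = x, y = value, color = group)) +
    geom_point(alpha = 0.5) +
    labs(title = "Toy scatterplot", x = "x", y = "value")
  # print(p) would draw the plot on the active graphics device.
}
```

Setting `alpha` inside `geom_point()` but outside `aes()` applies one fixed transparency to every point, whereas mapping `color` inside `aes()` lets it vary with the data.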

Appendix

sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MASS_7.3-60.2
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.36     R6_2.5.1          fastmap_1.2.0     xfun_0.46        
##  [5] cachem_1.1.0      knitr_1.48        htmltools_0.5.8.1 rmarkdown_2.27   
##  [9] lifecycle_1.0.4   cli_3.6.2         sass_0.4.9        jquerylib_0.1.4  
## [13] compiler_4.4.0    rstudioapi_0.16.0 tools_4.4.0       evaluate_0.24.0  
## [17] bslib_0.8.0       yaml_2.3.10       rlang_1.1.4       jsonlite_1.8.8