Instructions

Upload a PDF file, named with your UC Davis email ID and homework number (e.g., xtai_hw1.pdf), to Gradescope (accessible through Canvas). Give the commands that answer each question in their own code block; the output they produce will be embedded automatically in the output file. Support each answer with written statements as well as the code used.

All code used to produce your results must be shown in your PDF file (e.g., do not use echo = FALSE or include = FALSE as options anywhere). Rmd files do not need to be submitted, but may be requested by the TA and must be available when the assignment is submitted.

Students may choose to collaborate with each other on the homework, but must clearly indicate with whom they collaborated.

Housing stock data

The data set here contains information about the housing stock of California and Pennsylvania, as of 2011. Information is aggregated into “Census tracts”, geographic regions of a few thousand people which are supposed to be fairly homogeneous economically and socially. Each row in the data set is a Census tract.

Problem 1: Loading and cleaning (25 points)

  1. Load the data into a dataframe called ca_pa. (Hint: one way is to use read.csv().)

  2. How many rows and columns does the dataframe have?

  3. Run this command, and explain, in words, what this does:

colSums(is.na(ca_pa))

  4. Remove any row containing an NA value. There are many ways to do this; one possibility is the function na.omit(), which takes a dataframe and returns a new dataframe, omitting any row containing an NA value. You may also use dplyr operations.

  5. How many rows did part 4 eliminate?

  6. Are your answers in parts 3 and 5 compatible? Explain. (Hint: part 3 looks at missingness by column and part 5 by row.)
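The mechanics of these steps can be tried on a toy table before touching the real data. The sketch below is only an illustration: the inline CSV text and the names toy and toy_clean are invented, and the file or URL that holds the actual ca_pa data is not assumed here.

```r
# Toy data, invented for illustration; this is not the housing data.
toy <- read.csv(text = "id,income,value
1,50,200
2,NA,300
3,NA,NA
4,70,NA")

# Number of rows, then number of columns.
dim(toy)

# is.na() gives a logical matrix the same shape as the data frame;
# colSums() then counts the TRUE entries in each column.
colSums(is.na(toy))

# na.omit() returns a new data frame without any row that contains an NA.
toy_clean <- na.omit(toy)

# Rows dropped by na.omit().
nrow(toy) - nrow(toy_clean)
```

A row with NA values in two columns is counted twice by the column totals but dropped only once, which is worth keeping in mind when comparing the two counts.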

Problem 2: This Very New House (25 points)

  1. The variable Built_2005_or_later indicates the percentage of houses in each Census tract built since 2005. Plot Median_house_value against this variable (Median_house_value should be on the y-axis). Is there overplotting? How can you improve on this scatterplot? Produce this plot.

  2. Make a new plot, or pair of plots, which breaks the plot in part 1 out by state (use your improved version of the scatterplot), for just California and Pennsylvania. Note that the state is recorded in the STATEFP variable, with California being state 6 and Pennsylvania state 42. What do you learn from this figure? Is there a difference between the two states?

  3. What is the median percentage of houses built in 2005 or later (in the entire data set, i.e., California and Pennsylvania)? Create a new binary variable for whether the Census tract has percentage greater or less than this median. Make a visualization for the median house prices, broken down by this new variable. What do you learn from this figure?
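Splitting observations at a median and comparing groups can be sketched on made-up numbers. Everything below is hypothetical (the vectors pct_new and house_value and their values); it shows one possible approach with base R, not the required one.

```r
# Toy values, invented for illustration.
pct_new     <- c(0, 1.5, 3, 0.5, 10, 2)        # percent of new housing
house_value <- c(100, 150, 300, 120, 500, 250)  # house values

# Median of the whole vector, then a TRUE/FALSE flag per observation.
overall_median <- median(pct_new)
above_median   <- pct_new > overall_median

# Median house value within each group.
group_medians <- tapply(house_value, above_median, median)

# Side-by-side boxplots, one box per group.
boxplot(house_value ~ above_median,
        xlab = "Above the median share of new housing",
        ylab = "House value")
```

With a strict `>`, observations exactly equal to the median land in the FALSE group; whether that is acceptable depends on how many ties the real data has.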

Problem 3: Nobody Home (25 points)

The vacancy rate is the fraction of housing units which are not occupied. The dataframe contains columns giving the total number of housing units for each Census tract, and the number of vacant housing units.

  1. Add a new column to the dataframe which contains the vacancy rate. What is the mean vacancy rate?

  2. What is the standard deviation of the vacancy rate? Calculate this using just basic arithmetic operations (+ or sum(), -, …) and length(), then use the sd() function to make sure that you get the same result.
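As a rough sketch of the arithmetic, the snippet below computes a rate and its standard deviation by hand on a few invented numbers. The vectors total_units and vacant_units are hypothetical stand-ins; the real column names in the dataframe are not assumed here. Note that sd() divides by n - 1, not n.

```r
# Toy counts, invented for illustration.
total_units  <- c(100, 200, 50, 400)
vacant_units <- c(10, 10, 5, 0)

# Vacancy rate: fraction of units that are not occupied.
rate <- vacant_units / total_units

# Mean from sum() and length().
n         <- length(rate)
rate_mean <- sum(rate) / n

# Sample standard deviation: squared deviations summed, divided by n - 1,
# then square-rooted (sqrt(x) is the same as x^0.5).
manual_sd <- sqrt(sum((rate - rate_mean)^2) / (n - 1))

# Compare with the built-in, allowing for floating-point rounding.
all.equal(manual_sd, sd(rate))
```

Using all.equal() instead of == avoids a spurious mismatch caused by floating-point rounding.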

Problem 4: County Investigation (25 points)

The column COUNTYFP contains a numerical code for counties within each state.

  1. We are interested in Alameda County (county 1 in California), Yolo County (county 113 in California), and Allegheny County (county 3 in Pennsylvania). Create a new data frame with just the relevant rows. What is the median home value in Yolo County?

  2. For Alameda, Yolo and Allegheny Counties, what were the average percentages of housing built since 2005?

  3. What is the (Pearson) correlation coefficient between median house value and the median household income in (i) the whole data, (ii) all of California, (iii) all of Pennsylvania, (iv) Alameda County? First make scatterplots and guess, then compute these in R. What do you learn about the relationship between median house values and median household income?
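Because county codes are only unique within a state, a row filter generally needs both STATEFP and COUNTYFP. The sketch below illustrates this on an invented data frame: STATEFP, COUNTYFP and Median_house_value are names from the assignment, while the income column and all values are hypothetical.

```r
# Toy data, invented for illustration. County codes 1, 3 and 113
# appear in both states on purpose.
toy <- data.frame(
  STATEFP            = c(6, 6, 6, 42, 42, 42),
  COUNTYFP           = c(1, 113, 3, 3, 1, 113),
  Median_house_value = c(600, 400, 350, 150, 120, 100),
  income             = c(90, 65, 60, 50, 45, 40)  # hypothetical column name
)

# One logical condition per county, each pairing state and county code.
is_alameda   <- toy$STATEFP == 6  & toy$COUNTYFP == 1
is_yolo      <- toy$STATEFP == 6  & toy$COUNTYFP == 113
is_allegheny <- toy$STATEFP == 42 & toy$COUNTYFP == 3

# Keep rows matching any of the three conditions.
three_counties <- toy[is_alameda | is_yolo | is_allegheny, ]

# Pearson correlation (the default method of cor()).
r_all <- cor(toy$Median_house_value, toy$income)
```

Filtering this toy table on COUNTYFP alone would keep all six rows, which is why the state code is part of each condition.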

Appendix

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS  10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.33   R6_2.5.1        jsonlite_1.7.1  magrittr_2.0.3 
##  [5] evaluate_0.16   stringi_1.7.8   cachem_1.0.5    rlang_1.1.1    
##  [9] cli_3.3.0       rstudioapi_0.13 jquerylib_0.1.4 bslib_0.5.1    
## [13] rmarkdown_2.24  tools_4.0.2     stringr_1.4.1   xfun_0.40      
## [17] yaml_2.2.1      fastmap_1.1.1   compiler_4.0.2  htmltools_0.5.6
## [21] knitr_1.40      sass_0.4.1