Upload a PDF file, named with your UC Davis email ID and homework number (e.g., xtai_hw1.pdf), to Gradescope (accessible through Canvas). Give the commands for each question in its own code block; the output they produce is embedded automatically in the output file. Each answer must be supported by written statements as well as the code used.
All code used to produce your results must be shown in your PDF file (e.g., do not use echo = FALSE or include = FALSE as options anywhere). qmd/Rmd files do not need to be submitted, but may be requested by the TA and must be available when the assignment is submitted.
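One way to keep the code of every chunk visible is to set the chunk options once in a setup chunk instead of on each chunk. The lines below are a minimal sketch that assumes the file is rendered with knitr; the values shown are simply the knitr defaults written out explicitly.

```r
# Setup-chunk sketch: assumes the knitr package is installed.
# echo = TRUE    -> the source code of each chunk appears in the output
# include = TRUE -> the chunk and its results are kept in the output
knitr::opts_chunk$set(echo = TRUE, include = TRUE)

# Check the values currently in effect
knitr::opts_chunk$get(c("echo", "include"))
```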
Students may choose to collaborate with each other on the homework, but must clearly indicate with whom they collaborated.
Please assign the pages with your answers to the corresponding questions when submitting your homework on Gradescope. Points will be taken off if you fail to do so.
Assume that 70% of 18- to 20-year-olds consume alcoholic beverages in any given year. Consider a random sample of 500 18- to 20-year-olds. For each person, we record whether or not they have consumed alcoholic beverages in the past year.
(a) Define a random variable \(X_i\) that describes the binary outcome recorded for person \(i\) (to be clear, we have 500 observations from this distribution). What distribution does your random variable follow? What is (are) the parameter(s)? What is the mean? What is the variance?
(b) Calculating the fraction of 1’s in our sample gives us a single observation from the sampling distribution of the sample proportion. What is the approximate form of this sampling distribution? Now, what is the approximate distribution of \(\sum_{i = 1}^n X_i\), where \(n = 500\)? Use this to calculate the probability that more than 361 of the sampled individuals consumed alcoholic beverages.
(c) A different way to think about \(\sum_{i = 1}^n X_i\) is as a single draw from a different distribution. What distribution is this, and what are the parameters? Use R to calculate the exact probability that more than 361 of the sampled individuals consumed alcoholic beverages. Is this different from your answer in (b)? Is this what you would expect? Please explain.
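As a reminder of the mechanics only, the sketch below compares a normal approximation with an exact binomial tail probability in R. The values of n, p, and k are hypothetical and deliberately differ from those in the question; the continuity-corrected line is an optional refinement, not something the question asks for.

```r
# Hypothetical values, chosen only for illustration (not the homework's)
n <- 200      # number of Bernoulli trials
p <- 0.4      # success probability
k <- 90       # threshold: we want P(S > k), where S = X_1 + ... + X_n

# Normal approximation: S is roughly N(n * p, n * p * (1 - p))
mu    <- n * p
sigma <- sqrt(n * p * (1 - p))
approx_prob <- pnorm(k, mean = mu, sd = sigma, lower.tail = FALSE)

# Same approximation with a continuity correction (S > k means S >= k + 1)
approx_cc <- pnorm(k + 0.5, mean = mu, sd = sigma, lower.tail = FALSE)

# Exact probability: S ~ Binomial(n, p); lower.tail = FALSE gives P(S > k)
exact_prob <- pbinom(k, size = n, prob = p, lower.tail = FALSE)

c(normal = approx_prob, normal_cc = approx_cc, exact = exact_prob)
```

Note that pbinom(k, ...) returns P(S <= k), so "more than k" corresponds to lower.tail = FALSE (or 1 - pbinom(k, ...)).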
Let’s go back to the Palmer Penguins example. This data set contains information on 344 penguins in the Palmer Archipelago. Assume these are independent samples from the population of penguins in the Palmer Archipelago. You can load the data set using the following code.
library(palmerpenguins)
dplyr::glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
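The output above shows NA entries (for example, in the fourth row), and R summary functions return NA unless told to drop missing values. The toy vector below is made up for illustration and is not the penguin data; it sketches how na.rm = TRUE changes the result.

```r
# Toy vector with one missing value (made-up numbers, not the penguin data)
x <- c(3750, 3800, NA, 3450)

mean(x)                # NA: missing values propagate by default
mean(x, na.rm = TRUE)  # drops the NA before averaging
var(x, na.rm = TRUE)   # sample variance (divides by n - 1)
sum(!is.na(x))         # number of non-missing observations
```

When a formula needs the sample size, the count of non-missing values is usually the relevant n, not the number of rows.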
(a) What is a point estimate for the population mean body mass of penguins in the Palmer Archipelago? What about for the population variance?
(b) Find a 99% confidence interval for the population mean body mass of penguins in the Palmer Archipelago.
(c) Does your confidence interval in (b) require any additional assumptions on the distribution of body mass of penguins in the Palmer Archipelago? Why or why not?
(d) What is the interpretation of your confidence interval in (b)?
(e) Construct a binary variable for whether the bill length is greater than 45 mm.
(f) Find a 95% confidence interval for the population proportion of penguins with bills longer than 45 mm.
(g) If we had a sample size of 1000, would the confidence intervals be narrower or wider? Why?
(h) Construct a 90% confidence interval for the population mean flipper length of penguins in the Palmer Archipelago. What is the interpretation of your confidence interval?
(i) We are interested in testing the hypothesis that the population mean flipper length is 200 mm. Conduct a hypothesis test at the 10% level.
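The general mechanics behind these questions can be sketched on a different data set. The example below uses R's built-in faithful data as a stand-in; the cutoff of 70 and the null value of 70 are hypothetical, and the Wald interval for the proportion is only one of several large-sample choices. Treat it as an illustration of the function calls, not as a worked answer.

```r
# Built-in data used as a stand-in; these are not the penguin measurements
x <- faithful$waiting          # waiting times (minutes) between eruptions
n <- sum(!is.na(x))            # number of non-missing observations

# Point estimates of the population mean and variance
xbar <- mean(x, na.rm = TRUE)
s2   <- var(x, na.rm = TRUE)

# 99% t-based interval for the mean, computed by hand ...
alpha <- 0.01
half_width <- qt(1 - alpha / 2, df = n - 1) * sqrt(s2 / n)
ci_manual <- c(xbar - half_width, xbar + half_width)

# ... and the same interval from t.test()
ci_ttest <- t.test(x, conf.level = 0.99)$conf.int

# Binary indicator and a large-sample (Wald) 95% interval for a proportion
long_wait <- x > 70            # hypothetical cutoff
phat <- mean(long_wait, na.rm = TRUE)
ci_prop <- phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)

# Two-sided test of H0: mu = 70 at the 10% level (70 is a made-up null value)
test <- t.test(x, mu = 70)
reject <- test$p.value < 0.10

list(ci_manual = ci_manual, ci_ttest = as.numeric(ci_ttest),
     ci_prop = ci_prop, p_value = test$p.value, reject = reject)
```

Comparing the hand-computed interval with the t.test() output is a quick check that the degrees of freedom and the standard error were set up as intended.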
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] palmerpenguins_0.1.1
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.2 knitr_1.48 rlang_1.1.4
## [5] xfun_0.46 generics_0.1.3 jsonlite_1.8.8 glue_1.7.0
## [9] htmltools_0.5.8.1 sass_0.4.9 fansi_1.0.6 rmarkdown_2.27
## [13] evaluate_0.24.0 jquerylib_0.1.4 tibble_3.2.1 fastmap_1.2.0
## [17] yaml_2.3.10 lifecycle_1.0.4 compiler_4.4.0 dplyr_1.1.4
## [21] pkgconfig_2.0.3 rstudioapi_0.16.0 digest_0.6.36 R6_2.5.1
## [25] tidyselect_1.2.1 utf8_1.2.4 pillar_1.9.0 magrittr_2.0.3
## [29] bslib_0.8.0 tools_4.4.0 cachem_1.1.0