class: center, middle, inverse, title-slide

.title[
# Introduction and R Basics
]
.subtitle[
## STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### September 25, 2024
]

---
layout: true

<!-- <div class="my-footer"> -->
<!-- <span> -->
<!-- <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> -->
<!-- </span> -->
<!-- </div> -->

---

<style type="text/css">
.small .remark-code {
  font-size: 80%;
}
</style>

## Agenda

- Course overview
- Course logistics
- Fundamentals of R

---

## Statistical data science

--

- Statistics is the study of how to collect, analyze, and draw conclusions from **data**.
- Data science is typically seen as an **interdisciplinary field** that combines statistical thinking with elements traditionally associated with other fields, such as programming, database management, and optimization.
- It places a stronger focus on the **practical aspects** of working with data, in particular **computing**, as well as **applications** in different domains, such as the sciences, business, sports, and government.
- This is a course on **introduction to data science**, with an emphasis on statistical thinking.
- It is an **intro statistics** course, with a focus on computing.

---

## Examples of data science in practice

<img src="img/target.png" width="50%" style="display: block; margin: auto;" />

---

<img src="img/strava.png" width="60%" style="display: block; margin: auto;" />

---

<img src="img/googleMobility.png" width="85%" style="display: block; margin: auto;" />

---

<img src="img/pools.png" width="60%" style="display: block; margin: auto;" />

---

## Data science life cycle

<!-- https://sta199-fa21-003.netlify.app/slides/01-intro.html#11 -->
<!-- https://sta199-fa21-003.netlify.app/appex/appex01-unvotes.html -->

<img src="img/data-science-cycle.006.png" width="90%" style="display: block; margin: auto;" />

---

## Course content

1. **Fundamentals of R**
    - Overview of data types and structures
    - Data manipulation and data visualization tools
2. **Descriptive statistics** for numerical and categorical data
3. **Probability**
    - Rules of probability computation; conditional probability
    - Basic probability models: Binomial, Normal and Poisson
4. **Statistical inference**
    - Sampling distributions of sample mean and sample proportion
    - Hypothesis testing and confidence intervals for population mean and population proportion

- No statistics, data science or programming knowledge presumed

---

## Course logistics

- **Lectures** Monday, Wednesday and Friday
- Thursday **lab** (run by Oscar Rivera)
- **Office hours**
    - Oscar Rivera: Thursday 3-4 PM and Friday 11 AM-12 PM at MSB 1117
    - Xiao Hui Tai: Wednesdays 1-2 PM at MSB 4242
- **Course website**: https://xhtai.github.io/statdatasci/
    - Lecture notes, homework, supplementary materials, etc.
- **Canvas** for lab materials, turning in labs and homework (through Gradescope), solutions and grade-book
- **Piazza** for announcements and discussion
- **Email** for personal matters only (**do not** send me messages on Canvas)

---

## Course logistics

- **Waitlist**:
    - If you are no longer interested in taking the course, please drop sooner rather than later; there are many students on the waitlist
    - I have no control over the waitlist

---

## Grading

- 10% labs
    - Due at 9pm, Monday after lab
- 24% homework (roughly weekly; 6 × 4%)
    - Assigned on Friday afternoon, due Thursday 9pm
    - Except last week
    - One homework can be dropped
- 1% participation
- 30% midterms (two midterms, one dropped)
- 35% final
- See syllabus on course webpage for full details

---

## Software

<img src="img/excel.png" width="80%" style="display: block; margin: auto;" />

---

.pull-left[
<img src="img/R_logo.svg.png" width="25%" style="display: block; margin: auto auto auto 0;" />

<img src="img/r.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<br>
<br>

- R is a free, open-source **programming language** for statistical computing
- It is also an **interactive environment** for doing data science
- Data science teams often use a **mix of languages**, including R, Python, Julia, ...
]

---

.pull-left[
<img src="img/R_logo.svg.png" width="25%" style="display: block; margin: auto auto auto 0;" />

<img src="img/r.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<br>
<br>

- R **Console**: Basic interaction with R is by typing in the console, a.k.a. terminal or command-line
- You type in commands, R gives back answers (or errors)
- R is easily **extensible** with packages
- **Menus and other graphical interfaces** are extras built on top of the console
]

---

## Quick R demonstration

- Arithmetic

``` r
1 + 2
```

```
[1] 3
```

- Comparisons

``` r
1 == 2
```

```
[1] FALSE
```

---

.pull-left[
<img src="img/RStudio-Logo-Flat.png" width="55%" style="display: block; margin: auto auto auto 0;" />

<img src="img/rstudio.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<br>
<br>

- **RStudio** is a free, open-source R **programming environment**
- It is called an **integrated development environment**, or IDE, for R programming
- It contains a built-in code editor and many features that make working with R easier, and it works the same way across different operating systems.
]

---

## Example of a data visualization

<img src="lecture1_files/figure-html/unnamed-chunk-17-1.png" width="100%" style="display: block; margin: auto;" />

---

<br><br>

.small[

``` r
library(tidyverse) # dplyr verbs, the %>% pipe, and ggplot2
library(unvotes)   # assumed source of un_votes, un_roll_calls, un_roll_call_issues

un_votes %>%
  filter(country %in% c("United States", "United Kingdom", "China", "Singapore")) %>%
  inner_join(un_roll_calls, by = "rcid") %>%
  inner_join(un_roll_call_issues, by = "rcid") %>%
  mutate(year = lubridate::year(date)) %>%
  group_by(country, year, issue) %>%
  summarize(votes = n(), percent_yes = mean(vote == "yes")) %>%
  filter(votes > 5) %>% # Only use records where there are more than 5 votes
  ggplot(mapping = aes(x = year, y = percent_yes, color = country)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", se = FALSE) +
  facet_wrap(~issue) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Percentage of 'Yes' votes in the UN General Assembly",
    subtitle = "1946 to 2019",
    y = "% Yes",
    x = "Year",
    color = "Country"
  ) +
  scale_color_viridis_d() +
  theme(text = element_text(size = 9))
```

]

---

## Computing setup

- UC Davis **JupyterHub** (https://jupyterhub.ucdavis.edu/) has RStudio set up
- Alternatively, you can install R and RStudio on **your own computer**
- You will need **regular, reliable access to a computer** either with a working browser, or running an up-to-date version of R and RStudio
    - If this is a problem, please let us know right away. There are resources available to support you.
- Labs will be at TLC 2212; either use the computers available in the lab, or bring your own laptop (make sure it is charged before class)

---

## JupyterHub

<img src="img/jupyter1.png" width="70%" style="display: block; margin: auto;" />

Clicking the "Sign in with Google" button will take you to a login page. Log in using your UC Davis credentials.
---

## JupyterHub

<img src="img/jupyter2.png" width="80%" style="display: block; margin: auto;" />

---

## JupyterHub

<img src="img/jupyter3.png" width="100%" style="display: block; margin: auto;" />

---

## Tour: RStudio

<img src="img/jupyter3annotated.png" width="80%" style="display: block; margin: auto;" />

To upload `qmd/Rmd` files or data files, download them to your computer (from the course webpage or from Canvas), and then upload them to the server using the Upload button in the Files pane.

---

## A short list of R essentials

- **Functions** are (most often) verbs, followed by what they will be applied to in parentheses:

``` r
do_this(to_this)
do_that(to_this, to_that, with_those)
```

- **Packages** are extensions to R
    - They include functions, documentation, and sample data
    - They are hosted on **CRAN** (the Comprehensive R Archive Network)
    - They are installed once using `install.packages()` and then loaded using `library()` once per session:

``` r
install.packages("package_name")
library(package_name)
```

Alternatively, you can call a function from a package without loading it, using `package_name::do_this()`.

---

## A short list of R essentials

- Object **documentation** can be accessed with `?` or `help()`

``` r
?mean
```

- The **environment** is where R finds the value associated with a name
- **Comments** are lines or text that start with the hash symbol `#`
    - These will not be executed

---

## Quarto/R Markdown

<img src="img/rmarkdown.png" width="30%" style="display: block; margin: auto;" />

- Quarto and R Markdown (Quarto's predecessor) are tools to **integrate code and written prose** in reproducible computational documents
- Quarto files have the `qmd` extension. R Markdown files have the `Rmd` extension. Each time you "knit," the analysis is run from the beginning.
- To learn more, go to [quarto.org](https://quarto.org/) or [rmarkdown.rstudio.com](https://rmarkdown.rstudio.com/)
- Labs will be completed in Quarto or R Markdown
- Code goes in **chunks**, delimited by three backticks; narrative goes outside the chunks

---

## Tour: R Markdown

<img src="img/tour-rmarkdown.png" width="90%" style="display: block; margin: auto;" />
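
---

## A short list of R essentials

The essentials from the earlier slides (function calls, `package_name::do_this()`, the environment, comments, and documentation) can be sketched in a few lines. This is only an illustration: the vector `x` and the names `avg`, `scaled`, and `med` are made up here and are not course data. It uses only functions that ship with R, so nothing needs to be installed.

```r
# Comments start with # and are not executed

# Functions: a verb, followed by what it is applied to in parentheses
x <- c(4, 8, 15, 16, 23, 42)          # illustrative values (an assumption, not course data)
avg <- mean(x)                        # do_this(to_this)
scaled <- round(avg / 7, digits = 2)  # do_that(to_this, with_those)

# Call a function from a package without loading it: package_name::do_this()
# The stats package ships with R, so no install.packages() call is needed
med <- stats::median(x)

# The environment is where R finds the value associated with a name
exists("avg")  # TRUE, because `avg` was assigned above
avg            # prints the stored value, 18

# Documentation: uncomment either line to open the help page
# ?mean
# help("median")
```

Typing a name such as `avg` on its own line prints its value, which is a quick way to check what the environment currently holds.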