Fundamentals of R: Overview and data structures

class: center, middle, inverse, title-slide

.title[
# Fundamentals of R: Overview and data structures
]
.subtitle[
## STA35A: Statistical Data Science 1
]
.author[
### Xiao Hui Tai
]
.date[
### September 27, 2024
]

---

layout: true

---

## Recap

--
.pull-left[
<img src="img/r-logo.png" width="25%" style="display: block; margin: auto;" />
- R is a free, open-source **statistical programming language**

- R is also an environment for statistical computing and graphics

- It is easily extensible with **packages**
]
.pull-right[
<img src="img/rstudio-logo.png" width="50%" style="display: block; margin: auto;" />
- **RStudio** is a convenient interface for R called an **IDE** (integrated development environment)

- RStudio is not a requirement for programming with R, but it's very commonly used by R programmers and data scientists

]

---
## Quarto/R Markdown
<img src="img/rmarkdown.png" width="30%" style="display: block; margin: auto;" />

- Quarto and R Markdown (its predecessor) are tools to **integrate code and written prose** in reproducible computational documents

- Quarto files have the `qmd` extension; R Markdown files have the `Rmd` extension. Each time you render (Quarto) or "knit" (R Markdown), the analysis is run from the beginning.

- To learn more, go to [quarto.org](https://quarto.org/) or [rmarkdown.rstudio.com](https://rmarkdown.rstudio.com/)

- Labs will be completed in Quarto or R Markdown

- Code goes in **chunks**, delimited by three backticks; narrative goes outside of chunks

---
## Quarto/R Markdown

Hopefully you have made your first Quarto/R Markdown document and first visualization!

---
## Today

- Data types

- Operators

- Data structures

- Vectors

---
## Course content

1. **Fundamentals of R**
  - **Overview of data types and structures**
  - Data manipulation and data visualization tools

2. Descriptive statistics for numerical and categorical data

3. Probability
  - Rules of probability computation; conditional probability
  - Basic probability models: Binomial, Normal and Poisson

4. Statistical inference
  - Sampling distributions of sample mean and sample proportion 
  - Hypothesis testing and confidence intervals for population mean and population proportion

---
## Data types

- **Logical/Booleans**: binary values: `TRUE` or `FALSE` in R

- **Integers**: whole numbers (positive, negative or zero)

- **Double**: a floating point number, like `\(1.87 \times {10}^{6}\)`

- **Characters**: e.g., letters; R displays them in double quotes. **Strings** are sequences of characters

- **Complex**: complex numbers, e.g., 1 + 2i (rarely used in data analysis)

- **Raw**: raw bytes (very rare)

--
### Note: 
- Most of these data types can hold missing values (`NA`); doubles can also be ill-defined (`NaN`, e.g., `0/0`)
- Integers and doubles are both *numeric*
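
As a minimal sketch, the base R function `typeof()` reports the type of each kind of value listed above. The `L` suffix for integer literals and `as.raw()` are standard base R but are not otherwise covered in these slides.

``` r
typeof(TRUE)        # "logical"
typeof(7L)          # "integer": the L suffix asks for an integer
typeof(7)           # "double": plain numbers are doubles by default
typeof("seven")     # "character"
typeof(1 + 2i)      # "complex"
typeof(as.raw(255)) # "raw"

# Missing and ill-defined values
typeof(NA)          # "logical": a bare NA is logical by default
typeof(NaN)         # "double"
0 / 0               # NaN, "not a number"
```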

---
## Functions related to data types

`typeof()` function returns the type

`is.`_foo_`()` functions return Booleans for whether the argument is of type _foo_

`as.`_foo_`()` (tries to) "cast" its argument to type _foo_ --- to translate it sensibly into a _foo_-type value

`is.na()` checks if the value is NA

``` r
typeof(7)
```

```
## [1] "double"
```

``` r
is.numeric(7)
```

```
## [1] TRUE
```
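
The `as.`_foo_`()` and `is.na()` functions can be sketched the same way. The variable name `notANumber` is only for illustration, and `suppressWarnings()` is used here just to hide the warning that R would otherwise print for the failed cast.

``` r
as.integer("7")       # 7, as an integer
as.numeric("7") + 1   # 8
as.character(7)       # "7"
as.logical("TRUE")    # TRUE

# A cast that cannot be done sensibly gives NA (normally with a warning)
notANumber <- suppressWarnings(as.numeric("seven"))
notANumber            # NA
is.na(notANumber)     # TRUE
```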

---
## Functions related to data types

``` r
is.character(7)
```

```
## [1] FALSE
```

``` r
is.character("7")
```

```
## [1] TRUE
```

``` r
is.character("seven")
```

```
## [1] TRUE
```

``` r
is.na("7")
```

```
## [1] FALSE
```

---
## Functions related to data types

``` r
as.character(5/6)
```

```
## [1] "0.833333333333333"
```

``` r
as.numeric(as.character(5/6))
```

```
## [1] 0.8333333
```

``` r
6*as.numeric(as.character(5/6))
```

```
## [1] 5
```

``` r
5/6 == as.numeric(as.character(5/6))
```

```
## [1] FALSE
```
(why is that last FALSE?)

---
## Functions related to data types
To see what's going on:

``` r
print(5/6, digits = 20)
```

```
## [1] 0.83333333333333337034
```

``` r
print(as.numeric(as.character(5/6)), digits = 20)
```

```
## [1] 0.83333333333333303727
```

`as.character()` rounds to 15 significant digits. For more, see https://stackoverflow.com/questions/64131808/can-as-numericas-characterx-where-x-is-originally-a-numeric-ever-change-x.

``` r
all.equal(5/6, as.numeric(as.character(5/6)))
```

```
## [1] TRUE
```
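
The same issue appears with simple decimal arithmetic. The sketch below compares `==` with a tolerance-based check; the name `tolerance` and the value `1e-8` are arbitrary choices for illustration. Wrapping `all.equal()` in `isTRUE()` is the usual way to use it in a condition, since `all.equal()` returns a description of the difference rather than `FALSE` when the values differ.

``` r
0.1 + 0.2 == 0.3                     # FALSE: the two sides differ in the last bits
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE: equal up to a small tolerance

# The same idea written out by hand
tolerance <- 1e-8
abs((0.1 + 0.2) - 0.3) < tolerance   # TRUE
```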

---
## Operators

- **Unary** operators take a single input, e.g., `-` for arithmetic negation, `!` for Boolean negation
- **Binary** operators take two inputs and give an output, e.g., the usual arithmetic operators, modulo
- See https://stat.ethz.ch/R-manual/R-devel/library/base/html/Arithmetic.html for more details

``` r
7 + 5
```

```
## [1] 12
```

``` r
7 - 5
```

```
## [1] 2
```

``` r
7*5
```

```
## [1] 35
```

---

``` r
7^5
```

```
## [1] 16807
```

``` r
7/5
```

```
## [1] 1.4
```

``` r
7 %% 5
```

```
## [1] 2
```

``` r
7 %/% 5
```

```
## [1] 1
```
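
The modulo (`%%`) and integer division (`%/%`) operators fit together: the quotient times the divisor, plus the remainder, gives back the original number. A small sketch, with variable names chosen only for illustration:

``` r
dividend <- 7
divisor <- 5

quotient <- dividend %/% divisor    # 1
remainder <- dividend %% divisor    # 2
quotient * divisor + remainder      # 7, the original dividend

# With a negative dividend, the remainder takes the sign of the divisor
-7 %% 5                             # 3
-7 %/% 5                            # -2
```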

---
## Arithmetic Operators vs. Logical Operators

- Operators on previous slides were arithmetic operators

- **Logical operators** include **comparisons** (also binary operators; they take two objects, like numbers, and return a Boolean)

``` r
7 > 5
```

```
## [1] TRUE
```

``` r
7 < 5
```

```
## [1] FALSE
```

``` r
7 >= 7
```

```
## [1] TRUE
```

---

``` r
7 <= 5
```

```
## [1] FALSE
```

``` r
7 == 5
```

```
## [1] FALSE
```

``` r
7 != 5
```

```
## [1] TRUE
```

- "and" (`&`) and "or" (`|`) are also logical operators

``` r
(5 > 7) & (6*7 == 42)
```

```
## [1] FALSE
```

``` r
(5 > 7) | (6*7 == 42)
```

```
## [1] TRUE
```
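
The unary `!` operator mentioned earlier combines with "and" and "or" in the same way. A brief sketch reusing the comparisons above:

``` r
!(5 > 7)                    # TRUE: 5 > 7 is FALSE, and ! flips it
!(5 > 7) & (6*7 == 42)      # TRUE: both sides are now TRUE
!((5 > 7) | (6*7 == 42))    # FALSE: the "or" is TRUE, and ! flips it
```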

---
## Assignment operators: Data can have names

We can give names to data objects; these give us **variables**

A few variables are built in:

``` r
pi
```

```
## [1] 3.141593
```

``` r
letters
```

```
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
## [16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
```

Variables can be arguments to functions or operators, just like constants:

``` r
pi*10
```

```
## [1] 31.41593
```

---

Most variables are created with the **assignment operator**, `<-` or `=`

``` r
myVar <- 100
myVar
```

```
## [1] 100
```

``` r
myVar2 <- 10
myVar*myVar2
```

```
## [1] 1000
```

---
The assignment operator also changes values:

``` r
myVar3 <- myVar*myVar2
myVar3
```

```
## [1] 1000
```

``` r
myVar3 <- 30
myVar3
```

```
## [1] 30
```
--

Using names and variables makes code easier to design, easier to debug, less prone to bugs, easier to improve, and easier for others to read

Avoid "magic constants" (using numbers directly in source code); use named variables
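
As an illustration of the difference, here is a small sketch; the variable names and values are made up for this example.

``` r
# With magic constants: what do 40 and 15.5 mean?
40 * 15.5

# With named variables: the intent is clear, and each value is set in one place
hoursPerWeek <- 40
hourlyWage <- 15.5
weeklyPay <- hoursPerWeek * hourlyWage
weeklyPay   # 620
```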

---
## The workspace/environment

What names have you defined values for?

``` r
ls()
```

```
## [1] "myVar"  "myVar2" "myVar3"
```

``` r
objects()
```

```
## [1] "myVar"  "myVar2" "myVar3"
```

---
## The workspace/environment

Getting rid of variables:

``` r
rm("myVar3")
ls()
```

```
## [1] "myVar"  "myVar2"
```

``` r
rm(list = ls()) # removes all variables in workspace
ls()
```

```
## character(0)
```

---

## Aside: R Markdown
The environment of your R Markdown document is separate from the Console!

.pull-left[
First, run the following in the console

.small[

``` r
x <- 2
x * 3
```
]
]

.pull-right[
Then, add the following in an R chunk in your R Markdown document

.small[

``` r
x * 3
```
]

.question[
What happens? Why the error?
]
]

---
## First data structure: vectors

Group related data values into one object, a **data structure**

A **vector** is a sequence of values, *all of the same type*
(what are data types?)

``` r
x <- c(7, 8, 10, 45)
x
```

```
## [1]  7  8 10 45
```

``` r
is.vector(x)
```

```
## [1] TRUE
```

`c()` function returns a vector containing all its arguments in order
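
Because all values in a vector must share one type, `c()` converts its arguments to a common type when they differ. A minimal sketch (the name `mixed` is only for illustration):

``` r
mixed <- c(7, "eight", TRUE)
mixed                 # "7" "eight" "TRUE": everything became a character
typeof(mixed)         # "character"

typeof(c(1L, 2.5))    # "double": integer and double combine to double
typeof(c(TRUE, 2L))   # "integer": logical and integer combine to integer
```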

---
## Indexing and length
`x[1]` is the first element, `x[4]` is the 4th element  
`x[-4]` is a vector containing all but the fourth element

Vector of indices:

``` r
x
```

```
## [1]  7  8 10 45
```

``` r
x[c(2, 4)]
```

```
## [1]  8 45
```

Vector of negative indices

``` r
x[c(-1, -3)]
```

```
## [1]  8 45
```
(why that, and not `8 10`?)
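
Negative indices say which elements to leave out, not which to keep. The sketch below shows this alongside two other indexing behaviors; the copy `y` is introduced only so that `x` itself stays unchanged.

``` r
x <- c(7, 8, 10, 45)

x[-4]          # 7 8 10: everything except the 4th element
x[c(-1, -3)]   # 8 45: drops the 1st and 3rd elements, keeps the 2nd and 4th
x[5]           # NA: there is no 5th element

# Indexing also works on the left-hand side of an assignment
y <- x
y[2] <- 80
y              # 7 80 10 45
```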

---
## Indexing and length

To get the length of the vector, use `length()`

``` r
length(x)
```

```
## [1] 4
```
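
One common use of `length()` is to reach the last element without hard-coding its position. A short sketch:

``` r
x <- c(7, 8, 10, 45)
x[length(x)]            # 45: the last element
length(character(0))    # 0: an empty character vector
length("seven")         # 1: a single string is a character vector of length 1
```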

---
## Creating vectors
`vector(length = 7)` returns a logical vector of length 7, initialized with FALSEs; helpful for filling things up later (see help page: `vector(mode = "logical", length = 0)`)

``` r
weeklyHours <- vector(length = 7)
weeklyHours[5] <- 8
weeklyHours
```

```
## [1] 0 0 0 0 8 0 0
```
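
The output shows `0`s rather than `FALSE`s because assigning a number into a logical vector converts the whole vector to numeric. The sketch below checks this with `typeof()`; `weeklyHours2` is a hypothetical name used to show the alternative of asking for a numeric vector from the start.

``` r
weeklyHours <- vector(length = 7)
typeof(weeklyHours)    # "logical"
weeklyHours[5] <- 8
typeof(weeklyHours)    # "double"

# Asking for a numeric vector up front avoids the conversion
weeklyHours2 <- vector(mode = "numeric", length = 7)
weeklyHours2           # 0 0 0 0 0 0 0
```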

The colon operator produces a sequence

``` r
mySeq <- 2:5
mySeq
```

```
## [1] 2 3 4 5
```
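
A few more things the colon operator does, as a sketch. `seq()` is a base R function not covered in these slides; it is included only to show a step size other than 1.

``` r
typeof(2:5)           # "integer"
5:2                   # 5 4 3 2: counts down when the first number is larger

x <- c(7, 8, 10, 45)
x[2:3]                # 8 10: a sequence can be used as a vector of indices

seq(2, 10, by = 2)    # 2 4 6 8 10
```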

---
## Summary

- Functions are used to **manipulate data**

- The basic data types represent **Booleans, numbers, and characters**

- **Data structures** group related values together

- **Vectors** group values of the same type