Data Literacy: Introduction to R

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Data Types, Import & Export
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-05-23
]

---

## Getting data into `R`
Thus far, we've already learned what `R` and `RStudio` are. There's one essential prerequisite:

---

## Content of this session

- What are `R`'s internal data types?
- How to work with different data types?
- How to import data in different formats?
- How to export data in different formats?

---

## Data built in

``` r
data()
```

---

## It all boils down to...

.pull-left[
**How your data are stored (data types)**
- 'Numbers' (Integers & Doubles)
- Character Strings
- Logical
- Factors
- ...
- There's more, e.g., expressions, but let's leave it at that
]

.pull-right[
**Where your data are stored (data formats)**
- Vectors
- Matrices
- Arrays
- Data frames / Tibbles
- Lists
]

---

## Numeric data
.small[
*Integers* are whole numbers, i.e., values without a decimal part. To tell `R` explicitly that a number is an integer, place an `L` directly after the value.

``` r
1L
```

```
## [1] 1
```

By contrast, *doubles* are numbers that can have a decimal part. Without the `L` suffix, `R` stores numbers as doubles.

``` r
1.1
```

```
## [1] 1.1
```

We can check data types by using the `typeof()` function.

``` r
typeof(1L) 
```

```
## [1] "integer"
```

``` r
typeof(1.1)
```

```
## [1] "double"
```
]
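
A short sketch of how the two numeric types behave when they meet; the values are made up for illustration.

``` r
# A number typed without the L suffix is stored as a double,
# even if it has no decimal part
typeof(1)

# Mixing an integer and a double in a calculation yields a double
mixed_sum <- 1L + 1.5
typeof(mixed_sum)

# as.integer() drops the decimal part (it truncates, it does not round)
truncated <- as.integer(3.9)
truncated
```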

---

## Character strings
At first glance, a *character* is a letter somewhere between a-z. *String* in this context might mean that we have a series of characters. However, numbers and other symbols can be part of a *character string*, which can then be, e.g., part of a text. In `R`, character strings are wrapped in quotation marks.

``` r
"Hi. I am a character string, the 1st of its kind!"
```

```
## [1] "Hi. I am a character string, the 1st of its kind!"
```

*Note*: There are no values associated with the content of character strings unless we change that, e.g., with factors.

---

## Factors

Factors are data types that assume that their values are not continuous, e.g., as in [ordinal](https://en.wikipedia.org/wiki/Level_of_measurement#Ordinal_scale) or [nominal](https://en.wikipedia.org/wiki/Level_of_measurement#Nominal_level) data.

``` r
factor(1.1)
```

```
## [1] 1.1
## Levels: 1.1
```

``` r
factor("Hi. I am a character string, the 1st of its kind!")
```

```
## [1] Hi. I am a character string, the 1st of its kind!
## Levels: Hi. I am a character string, the 1st of its kind!
```

Factors take *numeric* data or *character* strings as input as they simply convert them into so-called **levels**.
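
To make the idea of levels more tangible, here is a minimal sketch with hypothetical survey answers (the object names are made up). It shows the default level order and how to set an explicit order for ordinal data.

``` r
# Hypothetical survey answers stored as character strings
answers <- c("low", "high", "medium", "low", "high")

# Without further arguments, levels are sorted alphabetically
answers_factor <- factor(answers)
levels(answers_factor)

# For ordinal data, state the order of the levels explicitly
answers_ordered <- factor(
  answers,
  levels = c("low", "medium", "high"),
  ordered = TRUE
)
levels(answers_ordered)

# Internally, each value is stored as the integer position of its level
as.integer(answers_ordered)
```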

---

## Logical values

Logical values are basically either `TRUE` or `FALSE` values. These values are produced by making logical requests on your data.

``` r
2 > 1
```

```
## [1] TRUE
```

``` r
2 < 1
```

```
## [1] FALSE
```

Logical values are at the heart of creating loops. For this purpose, however, we need more logical operators to request `TRUE` or `FALSE` values.

---

## Logical operators

There are quite a few logical operators in `R`:

.pull-left[
- `<` 	less than
- `<=` 	less than or equal to
- `>` 	greater than
- `>=` 	greater than or equal to
- `==` 	exactly equal to
- `!=` 	not equal to
]

.pull-right[
- `!x` 	Not x
- `x | y` 	x OR y
- `x & y` 	x AND y
- `isTRUE(x)` 	test if x is TRUE
- `isFALSE(x)` 	test if x is FALSE
]

Moreover, there are some more `is.PROPERTY_ASKED_FOR()` functions, such as `is.numeric()`, which also return `TRUE` or `FALSE` values.
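
The operators above can be tried out on a small vector. This is only a sketch with made-up ages; note that comparisons work element by element, whereas `isTRUE()` expects a single value.

``` r
ages <- c(12, 25, 41, 67)

# Comparisons work element by element and return a logical vector
is_adult <- ages >= 18
is_adult

# Combine conditions with & (AND), | (OR), and ! (NOT)
working_age <- ages >= 18 & ages < 65
working_age
!working_age

# is.*() functions test a property of the whole object
is.numeric(ages)
is.character(ages)

# isTRUE() expects a single value, so pair it with all() or any()
isTRUE(all(is_adult))
```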

---

## `R`'s data formats

`R`'s different data types can be put into 'containers'.

---

## Vectors

Vectors are built by enclosing your content with `c()` ("c" for "concatenate").

``` r
numeric_vector   <- c(1, 2, 3, 4)
character_vector <- c("a", "b", "c", "d")

numeric_vector
```

```
## [1] 1 2 3 4
```

``` r
character_vector
```

```
## [1] "a" "b" "c" "d"
```

Vectors are really like vectors in mathematics. Initially, it doesn't matter if you look at them as column or row vectors.

---

## ...but it matters when you combine vectors

Using the functions `cbind()` and `rbind()`, you can combine vectors column-wise or row-wise. The result is a matrix.

``` r
cbind(numeric_vector, character_vector)
```

```
##      numeric_vector character_vector
## [1,] "1"            "a"             
## [2,] "2"            "b"             
## [3,] "3"            "c"             
## [4,] "4"            "d"
```

``` r
rbind(numeric_vector, character_vector)
```

```
##                  [,1] [,2] [,3] [,4]
## numeric_vector   "1"  "2"  "3"  "4" 
## character_vector "a"  "b"  "c"  "d"
```

.small[
*Note*: The numeric values are [coerced](https://www.oreilly.com/library/view/r-in-a/9781449358204/ch05s08.html) into strings here.
]

---

## Matrices

Matrices are the basic rectangular data format in `R`.

``` r
fancy_matrix <- matrix(1:16, nrow = 4)

fancy_matrix
```

```
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
```

A matrix cannot hold multiple data types, such as strings and numeric values, at the same time. If you try, your data get coerced to a common type, as seen on the previous slide. This already happens within vectors:

``` r
c(1, 2, "evil string")
```

```
## [1] "1"           "2"           "evil string"
```
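
Coercion follows a fixed order: logical, then integer, then double, then character. The following sketch illustrates this with made-up values and also shows what happens when you convert back explicitly.

``` r
# Logical and integer: logical values become integers (TRUE -> 1L)
typeof(c(TRUE, 2L))

# Integer and double: integers become doubles
typeof(c(2L, 3.5))

# Anything combined with a character string becomes character
typeof(c(TRUE, 2L, 3.5, "text"))

# Explicit conversion back to numbers: strings that are not numbers
# turn into NA (R also issues a warning, which is suppressed here)
converted <- suppressWarnings(as.numeric(c("1", "2", "evil string")))
converted
```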

---

## Data frames

``` r
library(randomNames) # a name generator package

fancy_data <-
  data.frame( 
    who = 
      randomNames(n = 10, which.names = "first"),
    age = 
      sample(14:49, 10, replace = TRUE), # you see what we are doing here?   
    salary_2018 = 
      sample(15:100, 10, replace = TRUE),  
    salary_2019 = 
      sample(15:100, 10, replace = TRUE)
  )
 
fancy_data
```

---
class: middle

```
##         who age salary_2018 salary_2019
## 1    Joseph  27          30          93
## 2      Asha  37          40          99
## 3     Emily  23          86          18
## 4  Michaela  23          77          68
## 5    Jordan  38          52          66
## 6   Burhaan  40          92          48
## 7  Lasandra  36          45          41
## 8     Caleb  31          28          15
## 9  Angelica  39          91          80
## 10   Alfred  38          97          60
```
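
Unlike a matrix, a data frame can hold a different data type in each column. Here is a minimal sketch with made-up values that needs no extra package; it assumes `R` 4.0 or later, where character strings are not converted into factors automatically.

``` r
# A small data frame with made-up values; each column is a vector
small_data <- data.frame(
  who = c("Ada", "Ben", "Cleo"),
  age = c(31L, 45L, 28L),
  employed = c(TRUE, FALSE, TRUE)
)

# Each column keeps its own data type
sapply(small_data, typeof)

# All columns have the same length, which is the number of rows
nrow(small_data)
```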

---

## Tibbles

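.pull-left[
Tibbles are a modern take on data frames from the `tidyverse`. Compared to regular data frames:
- only the first ten observations are printed
  - the output is tidier!
- you get some additional metadata about rows and columns that you would normally only get when using `dim()` and other functions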

You can check the [tibble vignette](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) for technical details.
]

.pull-right[
<img src="data:image/png;base64,#../img/tibble.png" width="60%" style="display: block; margin: auto;" />
]

---

## Tibble conversion

``` r
library(tibble)
as_tibble(fancy_data)
```

```
## # A tibble: 10 × 4
##    who        age salary_2018 salary_2019
##    <chr>    <int>       <int>       <int>
##  1 Joseph      27          30          93
##  2 Asha        37          40          99
##  3 Emily       23          86          18
##  4 Michaela    23          77          68
##  5 Jordan      38          52          66
##  6 Burhaan     40          92          48
##  7 Lasandra    36          45          41
##  8 Caleb       31          28          15
##  9 Angelica    39          91          80
## 10 Alfred      38          97          60
```

---

## One last type you should know: lists

Lists are perfect for storing numerous and potentially diverse pieces of information in one place.

``` r
fancy_list <- 
  list(
    numeric_vector,
    character_vector,
    fancy_matrix,
    fancy_data
  )

fancy_list
```

---
class: middle
.tinyish[

```
## [[1]]
## [1] 1 2 3 4
## 
## [[2]]
## [1] "a" "b" "c" "d"
## 
## [[3]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## 
## [[4]]
##         who age salary_2018 salary_2019
## 1    Joseph  27          30          93
## 2      Asha  37          40          99
## 3     Emily  23          86          18
## 4  Michaela  23          77          68
## 5    Jordan  38          52          66
## 6   Burhaan  40          92          48
## 7  Lasandra  36          45          41
## 8     Caleb  31          28          15
## 9  Angelica  39          91          80
## 10   Alfred  38          97          60
```
]

---

## Nested lists

``` r
fancy_nested_list <-
  list(
    fancy_vectors = list(numeric_vector, character_vector),
    data_stuff = list(fancy_matrix, fancy_data)
  )

fancy_nested_list
```

---
class: middle
.tinyish[

```
## $fancy_vectors
## $fancy_vectors[[1]]
## [1] 1 2 3 4
## 
## $fancy_vectors[[2]]
## [1] "a" "b" "c" "d"
## 
## 
## $data_stuff
## $data_stuff[[1]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## 
## $data_stuff[[2]]
##         who age salary_2018 salary_2019
## 1    Joseph  27          30          93
## 2      Asha  37          40          99
## 3     Emily  23          86          18
## 4  Michaela  23          77          68
## 5    Jordan  38          52          66
## 6   Burhaan  40          92          48
## 7  Lasandra  36          45          41
## 8     Caleb  31          28          15
## 9  Angelica  39          91          80
## 10   Alfred  38          97          60
```
]

---

## Accessing elements by index

Generally, the logic of <span style="background-color: yellow;" > `[index_number]`</span> is to access only a subset of information in an object, no matter if we have vectors or data frames.

Say, we want to extract the 2nd element of our `character_vector` object, we could do that like this:

``` r
character_vector[2]
```

```
## [1] "b"
```
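
Indexing is not limited to a single position. The sketch below uses a made-up vector to show a few common variants; keep in mind that positions in `R` start at 1.

``` r
letters_vector <- c("a", "b", "c", "d")

# Several positions at once
letters_vector[c(1, 3)]

# A range of positions
letters_vector[2:4]

# Negative indices drop elements instead of selecting them
letters_vector[-1]

# A logical vector keeps the elements where it is TRUE
letters_vector[letters_vector != "b"]
```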

---

## More complicated cases: matrices

Matrices have two dimensions, so you often want information from a specific row and column.

``` r
# general pattern: replace the placeholders with actual numbers
a_wonderful_matrix[number_of_row, number_of_column]
```

*Note*: You can do the same indexing with `data.frame`s.

---

## Matrices and subscripts (as in mathematical notation)

Identifying rows, columns, or elements using subscripts is similar to matrix notation:

``` r
fancy_matrix[, 4] # 4th column of matrix
fancy_matrix[3, ] # 3rd row of matrix
fancy_matrix[2:4, 1:3] # rows 2, 3, 4 of columns 1, 2, 3
```

---

## The case of data frames

A nice feature of `data.frames` or `tibbles` is that their columns are named, just like variables in an ordinary data set.

``` r
fancy_data$who
```

```
##  [1] "Joseph"   "Asha"     "Emily"    "Michaela" "Jordan"   "Burhaan" 
##  [7] "Lasandra" "Caleb"    "Angelica" "Alfred"
```

Just place a `$`-sign between the data object and the variable name.

---

## `[]` in data frames

Sometimes we have to rely on positions or character strings as input, e.g., when iterating over data. For such cases, we can use `[]` to access variables by position or by name.

``` r
fancy_data[1]
```

```
##         who
## 1    Joseph
## 2      Asha
## 3     Emily
## 4  Michaela
## 5    Jordan
## 6   Burhaan
## 7  Lasandra
## 8     Caleb
## 9  Angelica
## 10   Alfred
```

``` r
fancy_data["who"]
```

```
##         who
## 1    Joseph
## 2      Asha
## 3     Emily
## 4  Michaela
## 5    Jordan
## 6   Burhaan
## 7  Lasandra
## 8     Caleb
## 9  Angelica
## 10   Alfred
```
 
---

## Difference between `[]` and `[[]]`

<img src="data:image/png;base64,#../img/index.png" width="1832" style="display: block; margin: auto;" />
https://twitter.com/hadleywickham/status/643381054758363136
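
In code, the difference can be sketched as follows, using made-up example data: single brackets return a smaller container of the same kind, while double brackets return the content itself.

``` r
# Made-up example data
pets <- data.frame(
  name = c("Mittens", "Rex"),
  legs = c(4, 4)
)

# Single brackets return a smaller data frame
class(pets["name"])

# Double brackets (and $) return the column itself as a vector
class(pets[["name"]])
identical(pets[["name"]], pets$name)

# The same logic applies to lists
pet_list <- list(cat = "Mittens", dog = "Rex")
class(pet_list["cat"])   # still a list with one element
class(pet_list[["cat"]]) # the element itself
```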

---

## Data frame check 1, 2, 1, 2!

The most high-level information you can get is about the object type and its dimensions.

``` r
# object type
class(fancy_data)
```

```
## [1] "data.frame"
```

``` r
# number of rows and columns
dim(fancy_data)
```

```
## [1] 10  4
```

``` r
# number of rows
nrow(fancy_data)
```

```
## [1] 10
```

``` r
# number of columns
ncol(fancy_data)
```

```
## [1] 4
```

---

## Data frame check 1, 2, 1, 2!

You can also print the first 6 lines of the data frame with `head()`. You can easily change the number of lines by providing the number as the second argument to the `head()` function.

``` r
head(fancy_data, 3)
```

```
##      who age salary_2018 salary_2019
## 1 Joseph  27          30          93
## 2   Asha  37          40          99
## 3  Emily  23          86          18
```

---

## Data frame check 1, 2, 1, 2!

If we want some more (detailed) information about the data set or object, we can use the `base R` function `str()`.

``` r
str(fancy_data)
```

```
## 'data.frame':	10 obs. of  4 variables:
##  $ who        : chr  "Joseph" "Asha" "Emily" "Michaela" ...
##  $ age        : int  27 37 23 23 38 40 36 31 39 38
##  $ salary_2018: int  30 40 86 77 52 92 45 28 91 97
##  $ salary_2019: int  93 99 18 68 66 48 41 15 80 60
```

---

## Data frame check 1, 2, 1, 2!

If you want to have a look at your full data set, you can use the `View()` function. In *RStudio*, this will open a new tab in the source pane through which you can explore the data set (including a search function). You can also click on the small spreadsheet symbol on the right side of the object in the environment tab to open this view.

``` r
View(fancy_data)
```

---

## Viewing and changing names

We can print all names of an object using the `names()` function...

``` r
names(fancy_data)
```

```
## [1] "who"         "age"         "salary_2018" "salary_2019"
```

...and we can also change names with it.

``` r
names(fancy_data) <- c("name", "age", "salary_2018", "salary_2019")

names(fancy_data)
```

```
## [1] "name"        "age"         "salary_2018" "salary_2019"
```
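
Overwriting all names at once is error-prone when a data set has many columns. As a possible alternative, the following sketch (with made-up data) renames a single column by combining `names()` with indexing.

``` r
# Made-up example data
staff <- data.frame(who = c("Ada", "Ben"), age = c(31, 45))

# Rename only the column called "who", wherever it is located
names(staff)[names(staff) == "who"] <- "name"

names(staff)
```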

---

# [Exercise](https://rawcdn.githack.com/nika-akin/r-intro/9d05476f895e390df08662eecbefd4137f67acf4/exercises/Exercise_1_2_1_Data_Types.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://rawcdn.githack.com/nika-akin/r-intro/9d05476f895e390df08662eecbefd4137f67acf4/solutions/Exercise_1_2_1_Data_Types.html)

---

## TidyTuesday Dataset

.left-column[
<img src="data:image/png;base64,#../img/tidy.png" width="1365" style="display: block; margin: auto;" />
]

.right-column[
Data: [Stack Overflow Annual Developer Survey 2024](https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-09-03/readme.md).

``` r
# Option 1: tidytuesdayR package 
## install.packages("tidytuesdayR")

tuesdata <- tidytuesdayR::tt_load('2024-09-03')

qname_levels_single_response_crosswalk <- tuesdata$qname_levels_single_response_crosswalk
stackoverflow_survey_questions <- tuesdata$stackoverflow_survey_questions
stackoverflow_survey_single_response <- tuesdata$stackoverflow_survey_single_response

# Option 2: Read directly from GitHub

qname_levels_single_response_crosswalk <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-03/qname_levels_single_response_crosswalk.csv')
stackoverflow_survey_questions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-03/stackoverflow_survey_questions.csv')
stackoverflow_survey_single_response <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-03/stackoverflow_survey_single_response.csv')
```
]

---

## Gapminder Data
.left-column[ 
<img src="data:image/png;base64,#../img/gapminder_logo.png" width="100%" style="display: block; margin: auto;" />
]

.right-column[
We will also use [data from *Gapminder*](https://www.gapminder.org/data/). During the course and the exercises, we work with data we have downloaded from their website. There also is an `R` package that bundles some of the *Gapminder* data: `install.packages("gapminder")`.

This `R` package provides ["[a]n excerpt of the data available at Gapminder.org. For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007."](https://cran.r-project.org/web/packages/gapminder/index.html)
]

---

## How to use the data in general

To code along and be able to do the exercises, you should store the data files for the *tuesdata* in a folder called `./data` in the same folder as the other materials for this course.

---

## `R` is data-agnostic
<img src="data:image/png;base64,#../img/Datenimport.PNG" width="65%" style="display: block; margin: auto;" />

---

## Data formats & packages

**What you will learn**
- Getting the most common data formats into `R`
  - e.g., CSV, *Stata*, *SPSS*, or *Excel* spreadsheets
- Using the different methods of doing that
- We will rely a lot on packages and functions from the `tidyverse` instead of using `base R`

---

## Before writing any code: *RStudio* functionality for importing data
You can use the *RStudio* GUI for importing data via `Environment - Import data set - Choose file type`.

---

## Where to find data

**Browse Button in `RStudio`**
<img src="data:image/png;base64,#../img/importBrowse.PNG" width="75%" style="display: block; margin: auto;" />

**Code preview in `RStudio`**
<img src="data:image/png;base64,#../img/codepreview.PNG" width="75%" style="display: block; margin: auto;" />

---

## Simple vs. not so simple file formats

Basic file formats, such as CSV (comma-separated value file), can directly be imported into `R`
- they are 'flat'
- few metadata
- basically text files

Other file formats, particularly the proprietary ones, require the use of additional packages
- they are complex
- a lot of metadata (think of all the labels in an *SPSS* file)
- they are binary (1110101)
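
That a CSV file is basically a text file can be checked directly. The sketch below writes a made-up data frame to a temporary file (a throwaway location chosen for this demo) and reads the raw lines back.

``` r
# Made-up example data
small_data <- data.frame(id = 1:2, city = c("Cologne", "Mannheim"))

# Write it to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(small_data, csv_path, row.names = FALSE)

# Reading the raw lines shows that a CSV is just text:
# one header line, then one line per row
raw_lines <- readLines(csv_path)
raw_lines
```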

---

## Disclaimer

**In the following slides, we'll jump right into importing data. We use a lot of different packages for this purpose, and you don't have to remember everything. It's just for making a point of how agnostic `R` actually is regarding the file type. Later on, we will dive more into the specifics of importing.**

---

## Importing a CSV file using `base R`

``` r
titanic <- read.csv("./data/titanic.csv")

titanic
```

```
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q
```

---

## A `readr` example: `CSV` files

``` r
library(readr)

titanic <- read_csv("./data/titanic.csv")
```

---
class: middle

``` r
titanic
```

```
## # A tibble: 891 × 12
##    PassengerId Survived Pclass Name       Sex     Age SibSp Parch Ticket
##          <dbl>    <dbl>  <dbl> <chr>      <chr> <dbl> <dbl> <dbl> <chr> 
##  1           1        0      3 Braund, M… male     22     1     0 A/5 2…
##  2           2        1      1 Cumings, … fema…    38     1     0 PC 17…
##  3           3        1      3 Heikkinen… fema…    26     0     0 STON/…
##  4           4        1      1 Futrelle,… fema…    35     1     0 113803
##  5           5        0      3 Allen, Mr… male     35     0     0 373450
##  6           6        0      3 Moran, Mr… male     NA     0     0 330877
##  7           7        0      1 McCarthy,… male     54     0     0 17463 
##  8           8        0      3 Palsson, … male      2     3     1 349909
##  9           9        1      3 Johnson, … fema…    27     0     2 347742
## 10          10        1      2 Nasser, M… fema…    14     1     0 237736
## # ℹ 881 more rows
## # ℹ 3 more variables: Fare <dbl>, Cabin <chr>, Embarked <chr>
```

Note the column specifications: `readr` 'guesses' them based on the first 1000 observations (we will come back to this later).

---

## Importing *Excel* files with `readxl`

``` r
library(readxl)

unicorns <- read_xlsx("./data/observations.xlsx")
```

No output ☹️

---
class: middle

``` r
unicorns
```

```
## # A tibble: 42 × 3
##    countryname  year   pop
##    <chr>       <dbl> <dbl>
##  1 Austria      1670    85
##  2 Austria      1671    83
##  3 Austria      1674    75
##  4 Austria      1675    82
##  5 Austria      1676    79
##  6 Austria      1677    70
##  7 Austria      1678    81
##  8 Austria      1680    80
##  9 France       1673    70
## 10 France       1674    79
## # ℹ 32 more rows
```

---

## Other data import options

These were just some very first examples of applying functions for data import from the different packages. There are many more...

.pull-left[
`readr`
- `read_csv()`
- `read_tsv()`
- `read_delim()`
- `read_fwf()`
- `read_table()`
- `read_log()`
]

---

## Data type specifications for `tibbles`

- characters
  - indicated by `<chr>`
  - specified by `col_character()`
- integers
  - indicated by `<int>`
  - specified by `col_integer()`
- doubles
  - indicated by `<dbl>`
  - specified by `col_double()`
- factors
  - indicated by `<fct>`
  - specified by `col_factor()`
- logical
  - indicated by `<lgl>`
  - specified by `col_logical()`

---

## Changing variable types

As mentioned before, `read_csv()` 'guesses' the variable types by scanning the first 1000 observations. **NB**: This can go wrong!

Luckily, we can change the variable type...
- before/while loading the data
- and after loading the data
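
Type guessing is not specific to `read_csv()`. To illustrate how it can go wrong, here is a sketch that uses `base R`'s `read.csv()` so that it runs without extra packages; the CSV content is made up, and the guessing rules of `readr` may differ for this particular case.

``` r
# Made-up CSV content with postal codes that start with a zero
csv_text <- "id,zip\n1,01234\n2,99999"

# Guessing: zip looks like a number, so the leading zero is lost
guessed <- read.csv(text = csv_text)
guessed$zip

# Stating the type up front keeps the codes intact
specified <- read.csv(text = csv_text, colClasses = c(zip = "character"))
specified$zip
```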

---

## While loading the data in `read_csv`

``` r
titanic <-
  read_csv(
    "./data/titanic.csv",
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_character(),
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )

titanic
```

---

## While loading the data in `read_csv`

``` r
titanic <-
  read_csv(
    "./data/titanic.csv",
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_factor(), # This one changed!
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )

titanic
```

---

```
## # A tibble: 891 × 12
##    PassengerId Survived Pclass Name       Sex     Age SibSp Parch Ticket
##          <dbl>    <dbl>  <dbl> <chr>      <fct> <dbl> <dbl> <dbl> <chr> 
##  1           1        0      3 Braund, M… male     22     1     0 A/5 2…
##  2           2        1      1 Cumings, … fema…    38     1     0 PC 17…
##  3           3        1      3 Heikkinen… fema…    26     0     0 STON/…
##  4           4        1      1 Futrelle,… fema…    35     1     0 113803
##  5           5        0      3 Allen, Mr… male     35     0     0 373450
##  6           6        0      3 Moran, Mr… male     NA     0     0 330877
##  7           7        0      1 McCarthy,… male     54     0     0 17463 
##  8           8        0      3 Palsson, … male      2     3     1 349909
##  9           9        1      3 Johnson, … fema…    27     0     2 347742
## 10          10        1      2 Nasser, M… fema…    14     1     0 237736
## # ℹ 881 more rows
## # ℹ 3 more variables: Fare <dbl>, Cabin <chr>, Embarked <chr>
```
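
You do not have to spell out every column. In the following sketch, only one column is specified and the others are still guessed. The CSV content is made up and passed as literal data via `I()`, which assumes `readr` 2.0 or later.

``` r
library(readr)

# Made-up CSV content, passed as literal data via I()
csv_text <- I("Name,Sex,Age\nOwen,male,22\nLaina,female,26\nJames,male,NA")

passengers <- read_csv(
  csv_text,
  col_types = cols(
    Sex = col_factor() # only this column is specified; the rest is guessed
  )
)

class(passengers$Sex)
levels(passengers$Sex)
```

Without explicit levels, `col_factor()` takes the levels in the order in which they appear in the data.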

---

## After loading the data

``` r
titanic <-
  type_convert(
    titanic,
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_factor(),
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )
```
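
Note that `type_convert()` re-parses character columns only. For converting individual columns, `base R`'s `as.*()` functions are a simple alternative. A minimal sketch with made-up data:

``` r
# Made-up example data with all-character columns
passengers <- data.frame(
  Sex = c("male", "female", "female"),
  Age = c("22", "38", "26")
)

# Convert single columns after loading
passengers$Sex <- as.factor(passengers$Sex)
passengers$Age <- as.numeric(passengers$Age)

str(passengers)
```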

---

# [Exercise](https://rawcdn.githack.com/nika-akin/r-intro/9d05476f895e390df08662eecbefd4137f67acf4/exercises/Exercise_1_2_2_Flat_Files.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://rawcdn.githack.com/nika-akin/r-intro/9d05476f895e390df08662eecbefd4137f67acf4/solutions/Exercise_1_2_2_Flat_Files.html)

---

## Exporting data

Sometimes our data have to leave `R`, for example, if we...
- share data with colleagues who do not use `R`
- want to continue where we left off
  - particularly if data wrangling took a long time
  
For such purposes, we also need a way to export our data.

All of the packages we have discussed in this session also have designated functions for that.

---

## Examples: CSV and Stata files

``` r
write_csv(titanic, "titanic_own.csv")
```

---

## `R`'s native file formats

There are 2 native 'file formats' to choose from. The advantage of using them is that they are compressed files, so that they don't occupy unnecessarily large disk space. These two formats are `.Rdata`/`.rda` and `.rds`.
**The key difference between them is that `.rds` can only hold one object, whereas `.Rdata`/`.rda` can also be used for storing several objects in one file.**

---

## `.Rdata`/`.rda`

Saving

``` r
save(mydata, file = "mydata.RData")
```

Loading

``` r
load("mydata.RData")
```
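
A round trip with several objects might look like the following sketch. The objects are made up, and the file is written to a temporary location for this demo.

``` r
# Two made-up objects
scores <- c(90, 85, 77)
groups <- c("a", "b", "c")

# Store both objects in one file
rdata_path <- tempfile(fileext = ".RData")
save(scores, groups, file = rdata_path)

# Remove them from the workspace, then bring them back
rm(scores, groups)
restored_names <- load(rdata_path)

# load() recreates the objects under their original names
# and returns those names
restored_names
scores
```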

---

## `.rds`

Saving

``` r
saveRDS(mydata, "mydata.rds")
```

Loading

``` r
mydata <- readRDS("mydata.rds")
```
  
*Note*: A nice property of `saveRDS()` is that it just saves a representation of the object, which means you can name it whatever you want when loading.
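
The following sketch shows this with a made-up object and a temporary file: the object is saved under one name and read back under another.

``` r
# A made-up object
scores <- c(90, 85, 77)

# Save the single object to a temporary .rds file
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, rds_path)

# readRDS() returns the object, so we choose the name ourselves
exam_results <- readRDS(rds_path)

identical(scores, exam_results)
```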

---

## Saving just everything

If you have not changed the General Global Options in *RStudio* as suggested in the *Getting Started* session, you may have noticed that, when closing *RStudio*, by default, the program asks you whether you want to save the workspace image.

You can also do that whenever you want using the `save.image()` function:

``` r
save.image()
```

.small[
*Note*: As we've said before, though, this is not something we'd recommend as a workflow. Instead, you should (explicitly and separately) save your `R` scripts and data sets (in appropriate formats).
]

---

## Other packages for data import

For data import (and export) in general, there are even more options, such as...

- [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) or [`fst`](https://www.fstpackage.org/) for large data sets

- [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/index.html) for `.json` files

---

## Reminder regarding file paths

In general, you should avoid using absolute file paths to keep your code reproducible and future-proof. We already talked about this in the introduction, but this is particularly important for importing and exporting data.

As a reminder: Absolute file paths look like this (on different OS):

``` r
# Windows
load("C:/Users/cool_user/data/fancy_data.Rdata")

# Mac
load("/Users/cool_user/data/fancy_data.Rdata")

# GNU/Linux
load("/home/cool_user/data/fancy_data.Rdata")
```

---

## Use relative paths

Instead of using absolute paths, it is recommended to use relative file paths. The general principle is to start from your current working directory (see `getwd()`), which is typically the folder containing your script or project, and to navigate from there to your target location. Say we are in the "C:/Users/cool_user/" location on a Windows machine. To load our data, we would use:

``` r
load("./data/fancy_data.Rdata")
```

If we were in a different folder, e.g., "C:/Users/cool_user/cat_pics/mittens/", we would use:

``` r
load("../../data/fancy_data.Rdata") 
```
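
Relative paths can also be built and checked in code. This is only a sketch: the folder and file names are the ones from the example above and may not exist on your machine, which is why the code checks before loading.

``` r
# The folder that relative paths start from
getwd()

# file.path() joins path components with "/", which works on every OS
data_path <- file.path("data", "fancy_data.Rdata")
data_path

# Check that the file is there before trying to load it
if (file.exists(data_path)) {
  load(data_path)
} else {
  message("No file found at ", data_path)
}
```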