class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Data Wrangling - Part 2
]
.author[
### Veronika Batzdorfer
]
.date[
### 2026-05-08
]

---
layout: true

---

## Data wrangling continued 🤠

Previously, we focused on the structure of our data by **selecting**, **renaming**, and **relocating** columns and by **filtering** and **arranging** rows. In this part, we focus on:

- creating and computing new variables
- recoding the values of a variable
- dealing with missing values

---

## Creating & transforming variables

You can change the data type of an existing variable with base R:

``` r
stackoverflow_survey_single_response$country <- as.factor(
  stackoverflow_survey_single_response$country
)

typeof(stackoverflow_survey_single_response$country)
```

```
## [1] "integer"
```

*Note*: `typeof()` reports the internal storage type, and factors are stored as integers; `class()` would confirm that the variable is now a factor.

---

## Creating & transforming variables

The `dplyr` package provides `mutate()`, which you can use to create a new variable that is a constant, ...

``` r
library(conflicted)
library(tidyverse)

# Specify which version of a function to use when there's a conflict
conflict_prefer("filter", "dplyr")
conflict_prefer("mutate", "dplyr")

#----------- create constant -------------------------------
tuesdata_2024 <- stackoverflow_survey_single_response %>%
  dplyr::mutate(year = 2024)

tuesdata_2024 %>%
  dplyr::select(year) %>%
  head()
```

```
## # A tibble: 6 × 1
##    year
##   <dbl>
## 1  2024
## 2  2024
## 3  2024
## 4  2024
## 5  2024
## 6  2024
```

---

## Creating & transforming variables

... applies a simple transformation to an existing variable, ...

``` r
tuesdata_2024 <- stackoverflow_survey_single_response %>%
  dplyr::mutate(freq_new = so_part_freq - 1)

tuesdata_2024 %>%
  dplyr::select(starts_with("freq")) %>%
  head()
```

```
## # A tibble: 6 × 1
##   freq_new
##      <dbl>
## 1       NA
## 2        5
## 3        5
## 4       NA
## 5        5
## 6        5
```

---

## Creating & transforming variables

... or changes the data type of an existing variable.
``` r
tuesdata_2024 <- tuesdata_2024 %>%
  dplyr::mutate(age_fac = as.factor(age))

tuesdata_2024 %>%
  dplyr::select(age, age_fac) %>%
  glimpse()
```

```
## Rows: 65,437
## Columns: 2
## $ age     <dbl> 8, 3, 4, 1, 1, 8, 3, 1, 4, 3, 3, 4, 3, 3, 2, 4, 8, 1, 2, 3, 2, 3, 4, 5, 3, 4, 3, 2, 2, 2, 7, 2…
## $ age_fac <fct> 8, 3, 4, 1, 1, 8, 3, 1, 4, 3, 3, 4, 3, 3, 2, 4, 8, 1, 2, 3, 2, 3, 4, 5, 3, 4, 3, 2, 2, 2, 7, 2…
```

---

## Recoding values

Sometimes the numeric coding of a variable does not match its meaning. E.g., we may want to recode `org_size` so that higher values represent larger organization sizes. We can combine `mutate()` with `recode()`. See `qname_levels_single_response_crosswalk` for the value labels.

``` r
stackoverflow_survey_single_response <- stackoverflow_survey_single_response %>%
  dplyr::mutate(org_size_R = dplyr::recode(org_size,
                                           `5` = 1, # `old value` = new value
                                           `2` = 2,
                                           `6` = 3,
                                           `4` = 4,
                                           `8` = 5,
                                           `1` = 6,
                                           `7` = 7,
                                           `3` = 8,
                                           `9` = 99))
```

*Note*: Values you do not mention in `recode()` are left unchanged.

---

## Wrangling missing values

When preparing data for analysis, we typically need to do one of two things with missing values:

- define specific values as missing (i.e., set them to `NA`)
- recode `NA` values into something else

---

## The missings of `naniar` 🦁

The `naniar` package is useful for handling missing data. `replace_with_na_all()` recodes every value in our data set that is < 0 as `NA`.

``` r
tuesdata <- stackoverflow_survey_single_response %>%
  replace_with_na_all(condition = ~.x < 0)
```

**Caution**: This checks every column. With `replace_with_na_at()` and `replace_with_na_if()`, we can also recode values as `NA` for a selection or a specific type of variables (e.g., all numeric variables).

---

## Dealing with missing values in `R`

A good starting point is the [chapter on missing values in the work-in-progress 2nd edition of *R for Data Science*](https://r4ds.hadley.nz/missing-values.html).

There are also various packages for different imputation techniques.
A popular one is the [`mice` package](https://amices.org/mice/). However, we won't cover the topic of imputation in this course.

---

## Excluding cases with missing values

You can use `!is.na(variable_name)` with your filtering method of choice. However, there are also methods for keeping only complete cases (i.e., cases without missing data).

The `base R` function for that is `na.omit()`:

``` r
tuesdata_complete <- na.omit(stackoverflow_survey_single_response)

nrow(tuesdata_complete)
```

```
## [1] 10166
```

*NB*: Of course, the number of excluded/included cases depends on how you have defined your missing values before.

---

## Excluding cases with missing values

The `tidyverse` equivalent of `na.omit()` is `drop_na()` from the `tidyr` package. You can use this function to remove cases that have missings on any variable in a data set or only on specific variables.

``` r
stackoverflow_survey_single_response %>%
  drop_na() %>%
  nrow()
```

```
## [1] 10166
```

``` r
stackoverflow_survey_single_response %>%
  drop_na(ai_threat) %>%
  nrow()
```

```
## [1] 44689
```

*NB*: Of course, the number of excluded/included cases depends on how you have defined your missing values before.

---

## Recode `NA` into something else

An easy option for replacing `NA` with another value for a single variable is the `replace_na()` function from the `tidyr` package in combination with `mutate()`.

``` r
tuesdata <- stackoverflow_survey_single_response %>%
  mutate(ai_threat = replace_na(ai_threat, -99))
```

---

## Conditional variable transformation

Sometimes we need to make the values of a new variable conditional on the values of one or more other variables.
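As a toy sketch (the vector `x` below is hypothetical, not from the survey data), the general pattern looks like this — the following slides apply the same idea to the survey data:

``` r
library(dplyr)

x <- c(2, 7, NA, 5)

# one condition, two outcomes
ifelse(x > 4, "high", "low")
# "low" "high" NA "high"

# several conditions, evaluated in order
case_when(
  is.na(x) ~ NA_character_,
  x > 6    ~ "high",
  x > 4    ~ "medium",
  TRUE     ~ "low"
)
# "low" "high" NA "medium"
```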
---

## Simple conditional transformation

The simplest version uses `ifelse()`:

``` r
stackoverflow_survey_single_response <- stackoverflow_survey_single_response %>%
  dplyr::mutate(
    ed_char = ifelse(ed_level == 1, "professional", "beginner")
  )

stackoverflow_survey_single_response %>%
  dplyr::select(ed_level, ed_char) %>%
  dplyr::sample_n(5) # randomly sample 5 cases from the df
```

```
## # A tibble: 5 × 2
##   ed_level ed_char
##      <dbl> <chr>
## 1        6 beginner
## 2        7 beginner
## 3       NA <NA>
## 4        2 beginner
## 5        2 beginner
```

.small[
*Note*: A more versatile option for creating dummy variables is the [`fastDummies` package](https://jacobkap.github.io/fastDummies/).
]

---

## Advanced conditional transformation

For more flexible (or complex) conditions, use the `case_when()` function from `dplyr`:

``` r
stackoverflow_survey_single_response <- stackoverflow_survey_single_response %>%
  dplyr::mutate(ed_level_cat = dplyr::case_when(
    dplyr::between(ed_level, 2, 4) ~ "beginner",
    dplyr::between(ed_level, 0, 1) ~ "expert",
    ed_level > 5 ~ "other"
  ))

stackoverflow_survey_single_response %>%
  dplyr::select(ed_level, ed_level_cat) %>%
  dplyr::sample_n(5)
```

```
## # A tibble: 5 × 2
##   ed_level ed_level_cat
##      <dbl> <chr>
## 1        7 other
## 2        2 beginner
## 3        3 beginner
## 4        2 beginner
## 5        2 beginner
```

---

## `dplyr::case_when()`

A few things to note about `case_when()`:

- you can have multiple conditions per value
- conditions are evaluated *consecutively* (order matters!)
- when none of the specified conditions is met, by default, the new variable will be **`NA`**
- if you want some other value in the new variable when the specified conditions are not met, add .highlight[`TRUE ~ value`] as the last argument of the `case_when()` call
- to explore the full range of options for `case_when()`, check out its [online documentation](https://dplyr.tidyverse.org/reference/case_when.html)

---

## Recode values `across()` defined variables

We can also use .highlight[`across()`] to recode multiple variables. Here, we want to reverse-code the items measuring trust.

In this case, we create new variables. We can do so by using the `.names` argument of the `across()` function.

``` r
stackoverflow_survey_single_response <- stackoverflow_survey_single_response %>%
  dplyr::mutate(
    #-- apply the same operation across multiple columns --
    across(
      ai_acc:ai_complex,
      #-- recode values within the selected columns (reverse the Likert scale) --
      ~dplyr::recode(
        .x, # .x is the current column's value
        `5` = 1, # old value 5 becomes 1
        `4` = 2,
        `3` = 3,
        `2` = 4,
        `1` = 5),
      #-- name the new columns by appending _R to the original name --
      .names = "{.col}_R"))
```

---

## Other options for using `across()`

- logical conditions such as `is.numeric()`
- `dplyr` selection helpers such as `starts_with()`

See also the [documentation for the `across()` function](https://dplyr.tidyverse.org/reference/across.html).

---

## Aggregate variables

`dplyr` operations are applied **per column**. Creating aggregate variables (e.g., scale means) requires a function that works per row.

---

## Aggregate variables

The most common types of aggregate variables are sum and mean scores.<sup>1</sup>

An easy way to create those is combining the `base R` functions `rowSums()` and `rowMeans()` with `across()` from `dplyr`.

.small[
.footnote[
[1] Of course, `R` offers many other options for dimension reduction, such as PCA, factor analysis, etc. However, we won't cover those in this course.
]
]

---

## Mean score

In this example, we create a mean score for trust. `across()` selects the columns; `rowMeans()` computes the mean for each respondent:

``` r
stackoverflow_survey_single_response <- stackoverflow_survey_single_response %>%
  dplyr::mutate(
    mean_ai_trust = rowMeans(
      dplyr::across(ai_acc:ai_threat),
      na.rm = TRUE)
  )
```

Why `na.rm = TRUE`? If a respondent has a missing value on any item, `rowMeans()` returns `NA` unless we explicitly ignore missings.

---

## More options for aggregate variables

If you need other aggregates than `mean()` or `sum()`, use the [`rowwise()` function from `dplyr`](https://dplyr.tidyverse.org/articles/rowwise.html) in combination with [`c_across()`](https://dplyr.tidyverse.org/reference/c_across.html), which is a special variant of the `dplyr` function `across()` for row-wise operations.

---

## Outlook: Other variable types

In the examples in this session, we almost exclusively worked with numeric variables. There are, however, other variable types that occur frequently in data sets in the social sciences:

- factors
- strings
- times and dates

---

## Factors

Factors are a special type of variable in `R` that represent categorical data. Internally, factors are stored as integers, but they have (character) labels (so-called *levels*).

Notably, as factors are a data type native to `R`, they do not cause the issues that labelled variables often do.

---

## Factors

Factors in `R` can be **unordered** - in which case they are similar to **nominal**-level variables - or **ordered** - in which case they are similar to **ordinal**-level variables.

Using factors can be necessary for certain statistical analyses and plots (e.g., if you want to compare groups).

Working with factors in `R` is a big topic, and we will only briefly touch upon it in this workshop. For a more in-depth discussion of factors in `R` you can, e.g., have a look at the [chapter on factors](https://r4ds.had.co.nz/factors.html) in *R for Data Science*.
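---

## Ordered factors: a quick sketch

A minimal `base R` sketch (toy data, not from the survey): an ordered factor encodes the ranking of its levels, which enables comparisons between values.

``` r
education <- factor(
  c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high"), # defines the ordering
  ordered = TRUE
)

levels(education)
# "low" "medium" "high"

education[1] < education[2] # comparisons respect the level order
# TRUE
```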
---

## Factors 4 🐱s

There are many functions for working with factors in `base R`, such as `factor()` or `as.factor()`. However, a generally more versatile and easier-to-use option is the [`forcats` package](https://forcats.tidyverse.org/) from the `tidyverse`.

<img src="data:image/png;base64,#https://forcats.tidyverse.org/logo.png" width="25%" style="display: block; margin: auto;" />

*Note*: There is a good [introduction to working with factors using `forcats` by Vebash Naidoo](https://sciencificity-blog.netlify.app/posts/2021-01-30-control-your-factors-with-forcats/), and *RStudio* also offers a [`forcats` cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/factors.pdf).

---

## Unordered factor

Using the `recode_factor()` function from `dplyr` (together with `mutate()`), we can create a factor from a numeric (or a character) variable.

``` r
tuesdata_ai_threat <- stackoverflow_survey_single_response %>%
  dplyr::mutate(
    ai_threat_fac = dplyr::recode_factor(
      ai_threat,
      `1` = "I'm not sure",
      `2` = "No",
      `3` = "Yes")
  )

tuesdata_ai_threat %>%
  dplyr::select(ai_threat, ai_threat_fac) %>%
  dplyr::filter(!is.na(ai_threat)) %>%
  dplyr::sample_n(5)
```

```
## # A tibble: 5 × 2
##   ai_threat ai_threat_fac
##       <dbl> <fct>
## 1         2 No
## 2         3 Yes
## 3         2 No
## 4         2 No
## 5         2 No
```

---

## Summary statistics

To make sense of quantitative data, we can reduce them to a few informative values.

--

.center[ ~ **That's a simple definition of summary statistics** ~]

--

As such, we can use summarizing functions of

- location (e.g., the mean),
- spread (e.g., the standard deviation),
- the shape of the distribution (e.g., skewness), and
- relations between variables (e.g., correlation coefficients)

---

## Summary statistics: `summary()`

A quick check for your data set is the `base R` function `summary()`, which can be applied to individual variables...

``` r
summary(stackoverflow_survey_single_response$ai_acc)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
NA's ## 1.000 3.000 4.000 3.847 5.000 5.000 28135 ``` as well as whole data frames: ``` r summary(stackoverflow_survey_single_response) ``` .right[↪️] --- class: middle .small[ ``` ## response_id main_branch age remote_work ed_level years_code years_code_pro ## Min. : 1 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000 Min. : 0.0 Min. : 0.00 ## 1st Qu.:16360 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.00 1st Qu.:2.000 1st Qu.: 6.0 1st Qu.: 3.00 ## Median :32719 Median :1.000 Median :2.000 Median :2.00 Median :3.000 Median :11.0 Median : 7.00 ## Mean :32719 Mean :1.503 Mean :2.629 Mean :1.96 Mean :3.513 Mean :14.2 Mean :10.18 ## 3rd Qu.:49078 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:3.00 3rd Qu.:5.000 3rd Qu.:20.0 3rd Qu.:15.00 ## Max. :65437 Max. :5.000 Max. :8.000 Max. :3.00 Max. :8.000 Max. :51.0 Max. :51.00 ## NA's :10631 NA's :4653 NA's :5568 NA's :13827 ## dev_type org_size purchase_influence buildvs_buy ## Min. : 1.00 Min. : 1.000 Min. :1.000 Min. :1.0 ## 1st Qu.:12.00 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.:1.0 ## Median :16.00 Median : 5.000 Median :2.000 Median :1.0 ## Mean :17.17 Mean : 4.774 Mean :2.188 Mean :1.6 ## 3rd Qu.:19.00 3rd Qu.: 6.000 3rd Qu.:3.000 3rd Qu.:2.0 ## Max. :34.00 Max. :10.000 Max. :3.000 Max. :3.0 ## NA's :5992 NA's :17957 NA's :18031 NA's :22079 ## country currency comp_total ## United States of America :11095 Length:65437 Min. : 0.000e+00 ## Germany : 4947 Class :character 1st Qu.: 6.000e+04 ## India : 4231 Mode :character Median : 1.100e+05 ## United Kingdom of Great Britain and Northern Ireland: 3224 Mean :2.964e+145 ## Ukraine : 2672 3rd Qu.: 2.500e+05 ## (Other) :32761 Max. :1.000e+150 ## NA's : 6507 NA's :31697 ## so_visit_freq so_account so_part_freq so_comm ai_select ai_sent ai_acc ## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000 Min. 
:1.000 ## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.000 ## Median :2.000 Median :3.000 Median :5.000 Median :3.00 Median :3.000 Median :2.000 Median :4.000 ## Mean :2.511 Mean :2.601 Mean :4.016 Mean :3.36 Mean :2.375 Mean :2.386 Mean :3.847 ## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:5.00 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:5.000 ## Max. :5.000 Max. :3.000 Max. :6.000 Max. :6.00 Max. :3.000 Max. :6.000 Max. :5.000 ## NA's :5901 NA's :5877 NA's :20200 NA's :6274 NA's :4530 NA's :19564 NA's :28135 ## ai_complex ai_threat survey_length survey_ease converted_comp_yearly r_used ## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. : 1 Min. :0.000 ## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 32712 1st Qu.:0.000 ## Median :2.000 Median :2.000 Median :1.000 Median :2.000 Median : 65000 Median :0.000 ## Mean :2.232 Mean :1.922 Mean :1.326 Mean :2.409 Mean : 86155 Mean :0.043 ## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 107972 3rd Qu.:0.000 ## Max. :5.000 Max. :3.000 Max. :3.000 Max. :3.000 Max. :16256603 Max. :1.000 ## NA's :28416 NA's :20748 NA's :9255 NA's :9199 NA's :42002 NA's :5692 ## r_want_to_use org_size_R ed_char ed_level_cat ai_acc_R ai_complex_R ## Min. :0.000 Min. : 1.000 Length:65437 Length:65437 Min. :1.000 Min. :1.000 ## 1st Qu.:0.000 1st Qu.: 3.000 Class :character Class :character 1st Qu.:1.000 1st Qu.:3.000 ## Median :0.000 Median : 4.000 Mode :character Mode :character Median :2.000 Median :4.000 ## Mean :0.039 Mean : 6.723 Mean :2.153 Mean :3.768 ## 3rd Qu.:0.000 3rd Qu.: 6.000 3rd Qu.:3.000 3rd Qu.:5.000 ## Max. :1.000 Max. :99.000 Max. :5.000 Max. :5.000 ## NA's :9685 NA's :17957 NA's :28135 NA's :28416 ## mean_ai_trust ## Min. :1.000 ## 1st Qu.:2.000 ## Median :2.667 ## Mean :2.540 ## 3rd Qu.:3.000 ## Max. 
:5.000 ## NA's :19733 ``` ] --- ## Summary statistics with the `datawizard` package 🧙 The [`datawizard` package](https://easystats.github.io/datawizard/) provides a function for summary statistics. ``` r library(datawizard) stackoverflow_survey_single_response %>% select(where(is.numeric)) %>% describe_distribution(quartiles = TRUE) ``` .right[↪️] --- class: middle .small[ ``` ## Variable | Mean | SD | IQR | Range | Quartiles | Skewness ## ---------------------------------------------------------------------------------------------------------- ## response_id | 32719.00 | 18890.18 | 32719 | [1.00, 65437.00] | 16360.00, 49078.00 | 0.00 ## main_branch | 1.50 | 1.02 | 0 | [1.00, 5.00] | 1.00, 1.00 | 1.93 ## age | 2.63 | 1.58 | 1 | [1.00, 8.00] | 2.00, 3.00 | 1.69 ## remote_work | 1.96 | 0.89 | 2 | [1.00, 3.00] | 1.00, 3.00 | 0.08 ## ed_level | 3.51 | 1.93 | 3 | [1.00, 8.00] | 2.00, 5.00 | 0.89 ## years_code | 14.20 | 10.66 | 14 | [0.00, 51.00] | 6.00, 20.00 | 1.19 ## years_code_pro | 10.18 | 9.11 | 12 | [0.00, 51.00] | 3.00, 15.00 | 1.33 ## dev_type | 17.17 | 7.75 | 7 | [1.00, 34.00] | 12.00, 19.00 | 0.61 ## org_size | 4.77 | 2.48 | 3 | [1.00, 10.00] | 3.00, 6.00 | 0.36 ## purchase_influence | 2.19 | 0.77 | 1 | [1.00, 3.00] | 2.00, 3.00 | -0.33 ## buildvs_buy | 1.60 | 0.80 | 1 | [1.00, 3.00] | 1.00, 2.00 | 0.84 ## comp_total | 2.96e+145 | 5.44e+147 | 190000 | [0.00, 1.00e+150] | 60000.00, 2.50e+05 | ## so_visit_freq | 2.51 | 1.27 | 1 | [1.00, 5.00] | 2.00, 3.00 | 0.64 ## so_account | 2.60 | 0.75 | 0 | [1.00, 3.00] | 3.00, 3.00 | -1.50 ## so_part_freq | 4.02 | 1.43 | 1 | [1.00, 6.00] | 4.00, 5.00 | -1.25 ## so_comm | 3.36 | 1.84 | 3 | [1.00, 6.00] | 2.00, 5.00 | 0.26 ## ai_select | 2.37 | 0.85 | 1 | [1.00, 3.00] | 2.00, 3.00 | -0.80 ## ai_sent | 2.39 | 1.68 | 3 | [1.00, 6.00] | 1.00, 4.00 | 0.77 ## ai_acc | 3.85 | 1.21 | 2 | [1.00, 5.00] | 3.00, 5.00 | -0.86 ## ai_complex | 2.23 | 1.11 | 2 | [1.00, 5.00] | 1.00, 3.00 | 0.62 ## ai_threat | 1.92 | 0.56 | 0 | [1.00, 3.00] 
| 2.00, 2.00 | -0.02 ## survey_length | 1.33 | 0.50 | 1 | [1.00, 3.00] | 1.00, 2.00 | 1.12 ## survey_ease | 2.41 | 0.55 | 1 | [1.00, 3.00] | 2.00, 3.00 | -0.15 ## converted_comp_yearly | 86155.29 | 1.87e+05 | 75288 | [1.00, 1.63e+07] | 32712.00, 1.08e+05 | 52.92 ## r_used | 0.04 | 0.20 | 0 | [0.00, 1.00] | 0.00, 0.00 | 4.48 ## r_want_to_use | 0.04 | 0.19 | 0 | [0.00, 1.00] | 0.00, 0.00 | 4.76 ## org_size_R | 6.72 | 14.22 | 3 | [1.00, 99.00] | 3.00, 6.00 | 6.14 ## ai_acc_R | 2.15 | 1.21 | 2 | [1.00, 5.00] | 1.00, 3.00 | 0.86 ## ai_complex_R | 3.77 | 1.11 | 2 | [1.00, 5.00] | 3.00, 5.00 | -0.62 ## mean_ai_trust | 2.54 | 0.64 | 1 | [1.00, 5.00] | 2.00, 3.00 | -0.19 ## ## Variable | Kurtosis | n | n_Missing ## ---------------------------------------------------- ## response_id | -1.20 | 65437 | 0 ## main_branch | 2.62 | 65437 | 0 ## age | 3.22 | 65437 | 0 ## remote_work | -1.74 | 54806 | 10631 ## ed_level | -0.70 | 60784 | 4653 ## years_code | 0.90 | 59869 | 5568 ## years_code_pro | 1.60 | 51610 | 13827 ## dev_type | 0.12 | 59445 | 5992 ## org_size | -0.51 | 47480 | 17957 ## purchase_influence | -1.23 | 47406 | 18031 ## buildvs_buy | -0.92 | 43358 | 22079 ## comp_total | | 33740 | 31697 ## so_visit_freq | -0.53 | 59536 | 5901 ## so_account | 0.45 | 59560 | 5877 ## so_part_freq | 0.11 | 45237 | 20200 ## so_comm | -1.33 | 59163 | 6274 ## ai_select | -1.14 | 60907 | 4530 ## ai_sent | -1.11 | 45873 | 19564 ## ai_acc | -0.04 | 37302 | 28135 ## ai_complex | -0.47 | 37021 | 28416 ## ai_threat | 0.13 | 44689 | 20748 ## survey_length | 0.07 | 56182 | 9255 ## survey_ease | -0.98 | 56238 | 9199 ## converted_comp_yearly | 3950.78 | 23435 | 42002 ## r_used | 18.07 | 59745 | 5692 ## r_want_to_use | 20.65 | 55752 | 9685 ## org_size_R | 36.92 | 47480 | 17957 ## ai_acc_R | -0.04 | 37302 | 28135 ## ai_complex_R | -0.47 | 37021 | 28416 ## mean_ai_trust | 0.21 | 45704 | 19733 ``` ] --- ## Summary statistics with `dplyr` `dplyr` provides a helpful function for creating summary statistics: 
`summarize()`

`summarize()` is a [vectorized](https://win-vector.com/2019/01/03/what-does-it-mean-to-write-vectorized-code-in-r/) function that can be used to create summary statistics for variables using functions like...

- `mean()`
- `sd()`
- `min()`
- `max()`
- etc.

---

## Summary statistics with `dplyr`

While creating summary statistics using `summarize()` from `dplyr` requires writing more code, it is the most flexible option.

Another nice benefit of `summarize()` is that it produces a .highlight[`tibble`], which can be used for further analyses or for creating plots or tables.

---

## Exercises

- A. Load the dataset (gapminder) from the library `gapminder` and determine the mean life expectancy in Asia.
- B. Determine the mean and median GDP per capita in Europe.
- C. Determine the mean life expectancy in 2007 for the Americas.
- D. Determine the top 5 countries in 2007 with the highest GDP per capita among those with a life expectancy over 75.

---

## Solutions

``` r
library(gapminder)

# Mean life expectancy in Asia
gapminder %>%
  filter(continent == "Asia") %>%
  summarize(mean_lifeExp = mean(lifeExp))
```

```
## # A tibble: 1 × 1
##   mean_lifeExp
##          <dbl>
## 1         60.1
```

``` r
# Mean and median GDP per capita in Europe
gapminder %>%
  filter(continent == "Europe") %>%
  summarize(mean_gdp = mean(gdpPercap),
            median_gdp = median(gdpPercap))
```

```
## # A tibble: 1 × 2
##   mean_gdp median_gdp
##      <dbl>      <dbl>
## 1   14469.     12082.
```

``` r
# Mean life expectancy in 2007 in the Americas
gapminder %>%
  filter(continent == "Americas", year == 2007) %>%
  summarize(mean_lifeExp = mean(lifeExp))
```

```
## # A tibble: 1 × 1
##   mean_lifeExp
##          <dbl>
## 1         73.6
```

---

## Solutions

``` r
# Top 5 countries in 2007 with the highest GDP per capita
# among those with a life expectancy over 75
gapminder %>%
  filter(year == 2007, lifeExp > 75) %>%
  arrange(desc(gdpPercap)) %>%
  slice_head(n = 5)
```

```
## # A tibble: 5 × 6
##   country       continent  year lifeExp       pop gdpPercap
##   <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Norway        Europe     2007    80.2   4627926    49357.
## 2 Kuwait        Asia       2007    77.6   2505559    47307.
## 3 Singapore     Asia       2007    80.0   4553009    47143.
## 4 United States Americas   2007    78.2 301139947    42952.
## 5 Ireland       Europe     2007    78.9   4109086    40676.
```

---

## Outlook: Working with strings in `R`

As stated before, we won't be able to cover the specifics of working with strings in `R` in this course. However, it may be good to know that the `tidyverse` package [`stringr`](https://stringr.tidyverse.org/index.html) offers a collection of convenient functions for working with strings.

<img src="data:image/png;base64,#https://stringr.tidyverse.org/logo.png" width="25%" style="display: block; margin: auto;" />

The `stringr` package provides a good [introduction vignette](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html), and the book *R for Data Science* has a whole section on [strings with `stringr`](https://r4ds.had.co.nz/strings.html).

---

## Outlook: Times and dates

[Working with times and dates can be quite a pain in programming](https://www.youtube.com/watch?v=-5wpm-gesOY) (as well as in data analysis). Luckily, there are a couple of neat options for working with times and dates in `R` that can reduce the headache.

---

## Outlook: Times and dates

If you want/need to work with times and dates in `R`, you may want to look into the [`lubridate` package](https://lubridate.tidyverse.org/), which is part of the `tidyverse`, and for which *RStudio* also provides a [cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/lubridate.pdf).
<img src="data:image/png;base64,#https://lubridate.tidyverse.org/logo.png" width="25%" style="display: block; margin: auto;" /> *Note*: If you work with time series data, it is also worth checking out the [`tsibble` package](https://tsibble.tidyverts.org/) for your wrangling tasks.
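
---

## Times and dates: a quick sketch

A minimal `lubridate` example (assuming the package is installed; the date is toy input, not from the survey): parser functions like `ymd()` turn strings into proper date objects, and accessor functions extract components.

``` r
library(lubridate)

d <- ymd("2026-05-08")  # parse a "year-month-day" string into a date
year(d)                 # 2026
month(d, label = TRUE)  # May
wday(d, label = TRUE)   # Fri
```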