class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Data Wrangling - Part 2
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-11-07
]

---

layout: true

---

## Data wrangling continued 🤠

Before, we focused on the structure of our data by **selecting**, **renaming**, and **relocating** columns and **filtering** and **arranging** rows. In this part, we focus on:

- creating and computing new variables (in various ways)
- recoding the values of a variable
- dealing with missing values

---

## Creating & transforming variables

We can transform an existing variable in place, e.g., by changing its data type.

``` r
stackoverflow_survey_single_response$country <- as.factor(stackoverflow_survey_single_response$country)

typeof(stackoverflow_survey_single_response$country)
```

```
## [1] "integer"
```

*Note*: `typeof()` returns `"integer"` here because factors are stored internally as integers; `class()` would return `"factor"`.

---

## Creating & transforming variables

The `dplyr` package provides `mutate()`, which you can use to create a new variable that is a constant, ...

``` r
library(conflicted)
library(tidyverse)

# Specify which version of a function to use when there's a conflict
conflict_prefer("filter", "dplyr")
conflict_prefer("mutate", "dplyr")

#-----------mutate---------------------------------------
tuesdata_2024 <- stackoverflow_survey_single_response %>% 
  dplyr::mutate(year = 2024)

tuesdata_2024 %>% 
  dplyr::select(year) %>% 
  head()
```

```
## # A tibble: 6 × 1
##    year
##   <dbl>
## 1  2024
## 2  2024
## 3  2024
## 4  2024
## 5  2024
## 6  2024
```

---

## Creating & transforming variables

... applies a simple transformation to an existing variable, ...

``` r
tuesdata_2024 <- stackoverflow_survey_single_response %>% 
  dplyr::mutate(freq_new = so_part_freq - 1)

tuesdata_2024 %>% 
  dplyr::select(starts_with("freq")) %>% 
  head()
```

```
## # A tibble: 6 × 1
##   freq_new
##      <dbl>
## 1       NA
## 2        5
## 3        5
## 4       NA
## 5        5
## 6        5
```

---

## Creating & transforming variables

... or changes the data type of an existing variable.
``` r
tuesdata_2024 <- tuesdata_2024 %>% 
  dplyr::mutate(age_fac = as.factor(age))

tuesdata_2024 %>% 
  dplyr::select(age, age_fac) %>% 
  glimpse()
```

```
## Rows: 65,437
## Columns: 2
## $ age     <dbl> 8, 3, 4, 1, 1, 8, 3, 1, 4, 3, 3, 4, 3, 3, 2, 4, 8, 1, 2, 3, 2, 3,…
## $ age_fac <fct> 8, 3, 4, 1, 1, 8, 3, 1, 4, 3, 3, 4, 3, 3, 2, 4, 8, 1, 2, 3, 2, 3,…
```

---

## Recoding values

For example, we want to recode the item on organisation size (`org_size`) so that higher values represent larger organisation sizes. For that purpose, we can combine the two `dplyr` functions `mutate()` and `recode()`. See `qname_levels_single_response_crosswalk`

``` r
stackoverflow_survey_single_response <- stackoverflow_survey_single_response %>% 
  dplyr::mutate(org_size_R = dplyr::recode(org_size,
                                           `5` = 1, # `old value` = new value
                                           `2` = 2,
                                           `6` = 3,
                                           `4` = 4,
                                           `8` = 5,
                                           `1` = 6,
                                           `7` = 7,
                                           `3` = 8,
                                           `9` = 99))
```

---

## Wrangling missing values

When we prepare our data for analysis, there are generally two things we might want/have to do with regard to missing values:

- define specific values as missings (i.e., set them to `NA`)

- recode `NA` values into something else

---

## The missings of `naniar` 🦁

The `naniar` package is useful for handling missing data in `R`. We can use the function `replace_with_na_all()` to code every value in our data set that is < 0 as `NA`.

``` r
tuesdata <- stackoverflow_survey_single_response %>% 
  replace_with_na_all(condition = ~.x < 0)
```

Using the functions `replace_with_na_at()` and `replace_with_na_if()`, we can also recode values as `NA` for a selection or a specific type of variables (e.g., all numeric variables).

---

## Dealing with missing values in `R`

As with everything in `R`, there are also many online resources on dealing with missing data. A fairly new and interesting one is the [chapter on missing values in the work-in-progress 2nd edition of *R for Data Science*](https://r4ds.hadley.nz/missing-values.html).
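Before turning to imputation, here is a small toy sketch of the `replace_with_na_at()` variant mentioned earlier (the data, column names, and missing-value codes below are made up for illustration):

``` r
library(dplyr)  # for the pipe and tibble()
library(naniar)

# Toy data where -9 is used as a "don't know" code in both columns
toy <- tibble(
  age    = c(25, -9, 41),
  income = c(-9, 50000, 62000)
)

# Recode negative codes as NA, but only in the age column
toy_clean <- toy %>% 
  replace_with_na_at(.vars = "age",
                     condition = ~.x < 0)
```

Here, `income` keeps its `-9`, which illustrates the difference from `replace_with_na_all()`.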
There are also various packages for different imputation techniques. A popular one is the [`mice` package](https://amices.org/mice/). However, we won't cover the topic of imputation in this course.

---

## Excluding cases with missing values

You can use `!is.na(variable_name)` with your filtering method of choice. However, there are also methods for only keeping complete cases (i.e., cases without missing data).

The `base R` function for that is `na.omit()`:

``` r
tuesdata_complete <- na.omit(stackoverflow_survey_single_response)

nrow(tuesdata_complete)
```

```
## [1] 10166
```

*NB*: Of course, the number of excluded/included cases depends on how you have defined your missing values before.

---

## Excluding cases with missing values

The `tidyverse` equivalent of `na.omit()` is `drop_na()` from the `tidyr` package. You can use this function to remove cases that have missings on any variable in a data set or only on specific variables.

``` r
stackoverflow_survey_single_response %>% 
  drop_na() %>% 
  nrow()
```

```
## [1] 10166
```

``` r
stackoverflow_survey_single_response %>% 
  drop_na(ai_threat) %>% 
  nrow()
```

```
## [1] 44689
```

*NB*: Of course, the number of excluded/included cases depends on how you have defined your missing values before.

---

## Recode `NA` into something else

An easy option for replacing `NA` with another value for a single variable is the `replace_na()` function from the `tidyr` package in combination with `mutate()`.

``` r
tuesdata <- stackoverflow_survey_single_response %>% 
  mutate(ai_threat = replace_na(ai_threat, -99))
```

---

## Conditional variable transformation

Sometimes we need to make the values of a new variable conditional on the values of (multiple) other variables. Such cases require conditional transformations.

---

## Simple conditional transformation

The simplest version of a conditional variable transformation is using an `ifelse()` statement.
``` r
stackoverflow_survey_single_response <- stackoverflow_survey_single_response %>% 
  dplyr::mutate(ed_char = ifelse(ed_level == 1, "professional", "beginner"))

stackoverflow_survey_single_response %>% 
  dplyr::select(ed_level, ed_char) %>% 
  dplyr::sample_n(5) # randomly sample 5 cases from the df
```

```
## # A tibble: 5 × 2
##   ed_level ed_char 
##      <dbl> <chr>   
## 1        3 beginner
## 2        3 beginner
## 3        2 beginner
## 4        3 beginner
## 5       NA <NA>
```

.small[
*Note*: A more versatile option for creating dummy variables is the [`fastDummies` package](https://jacobkap.github.io/fastDummies/).
]

---

## Advanced conditional transformation

For more flexible (or complex) conditional transformations, the `case_when()` function from `dplyr` is a powerful tool.

``` r
stackoverflow_survey_single_response <- stackoverflow_survey_single_response %>% 
  dplyr::mutate(ed_level_cat = dplyr::case_when(
    dplyr::between(ed_level, 2, 4) ~ "beginner",
    dplyr::between(ed_level, 0, 1) ~ "expert",
    ed_level > 5 ~ "other"
  ))

stackoverflow_survey_single_response %>% 
  dplyr::select(ed_level, ed_level_cat) %>% 
  dplyr::sample_n(5)
```

```
## # A tibble: 5 × 2
##   ed_level ed_level_cat
##      <dbl> <chr>       
## 1        2 beginner    
## 2        3 beginner    
## 3        2 beginner    
## 4       NA <NA>        
## 5        5 <NA>
```

---

## `dplyr::case_when()`

A few things to note about `case_when()`:

- you can have multiple conditions per value
- conditions are evaluated *consecutively*
- when none of the specified conditions are met for an observation, by default, the new variable will have a missing value **`NA`** for that case
- if you want some other value in the new variable when the specified conditions are not met, you need to add .highlight[`TRUE ~ value`] as the last argument of the `case_when()` call
- to explore the full range of options for `case_when()` check out its [online documentation](https://dplyr.tidyverse.org/reference/case_when.html)

---

## Recode values `across()` defined variables

We can also use .highlight[`across()`] to recode multiple
variables. Here, we want to recode the items measuring trust so that they reflect distrust instead.

In this case, we probably want to create new variables. We can do so by using the `.names` argument of the `across()` function.

``` r
stackoverflow_survey_single_response <- stackoverflow_survey_single_response %>% 
  dplyr::mutate(
    #-- apply the same operation across multiple columns--
    across(
      ai_acc:ai_complex,
      #-- recode values within selected columns (reverse likert scale)--
      ~dplyr::recode(
        .x,      # .x is the current column value
        `5` = 1, # old value 5 becomes 1
        `4` = 2,
        `3` = 3,
        `2` = 4,
        `1` = 5),
      #-- name the new columns by appending _R to the original name--
      .names = "{.col}_R"))
```

---

## Other options for using `across()`

It can be used with predicate functions (such as `is.numeric()`, wrapped in `where()`) or the `dplyr` selection helpers we encountered in the previous session (such as `starts_with()`) to apply transformations to variables of a specific type or meeting some other criteria (as well as to all variables in a data set).

To explore more options, you can check the [documentation for the `across()` function](https://dplyr.tidyverse.org/reference/across.html).

---

## Aggregate variables

What is important to keep in mind here is that `dplyr` operations are applied **per column**. This is a common source of confusion and errors, as creating aggregate variables requires transformations to be applied per row (respondent).

---

## Aggregate variables

The most common types of aggregate variables are sum and mean scores.<sup>1</sup> An easy way to create those is combining the `base R` functions `rowSums()` and `rowMeans()` with `across()` from `dplyr`.

.small[
.footnote[
[1] Of course, `R` offers many other options for dimension reduction, such as PCA, factor analysis, etc. However, we won't cover those in this course.
]
]

---

## Mean score

In this example, we create a mean score for trust in AI in the workflow.
``` r
stackoverflow_survey_single_response <- stackoverflow_survey_single_response %>% 
  dplyr::mutate(mean_ai_trust = rowMeans(dplyr::across(ai_acc:ai_threat),
                                         na.rm = TRUE))
```

---

## More options for aggregate variables

If you want to use functions other than `mean()` or `sum()` for creating aggregate variables, you need to use the [`rowwise()` function from `dplyr`](https://dplyr.tidyverse.org/articles/rowwise.html) in combination with [`c_across()`](https://dplyr.tidyverse.org/reference/c_across.html), which is a special variant of the `dplyr` function `across()` for row-wise operations/aggregations.

---

## Outlook: Other variable types

In the examples in this session, we almost exclusively worked with numeric variables. There are, however, other variable types that occur frequently in data sets in the social sciences:

- factors
- strings
- times and dates

---

## Factors

Factors are a special type of variable in `R` that represent categorical data. Before `R` version `4.0.0`, the `base R` default was to import all character variables as factors.

Internally, factors are stored as integers, but they have (character) labels (so-called *levels*) associated with them.

Notably, as factors are a native data type in `R`, they do not cause the issues that labelled variables often do (as labels represent an additional attribute, making them a special class that many functions cannot work with).

---

## Factors

Factors in `R` can be

**unordered** - in which case they are similar to **nominal** level variables

or

**ordered** - in which case they are similar to **ordinal** level variables

Using factors can be necessary for certain statistical analyses and plots (e.g., if you want to compare groups).

Working with factors in `R` is a big topic, and we will only briefly touch upon it in this workshop.
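As a minimal sketch (with made-up toy values), an ordered factor can be created in `base R` like this:

``` r
# Toy example: an ordinal satisfaction rating
rating <- factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

rating < "high" # comparisons respect the level order
```

Because the factor is ordered, `rating < "high"` returns `TRUE FALSE TRUE TRUE` here.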
For a more in-depth discussion of factors in `R` you can, e.g., have a look at the [chapter on factors](https://r4ds.had.co.nz/factors.html) in *R for Data Science*.

---

## Factors 4 🐱s

There are many functions for working with factors in `base R`, such as `factor()` or `as.factor()`. However, a generally more versatile and easier-to-use option is the [`forcats` package](https://forcats.tidyverse.org/) from the `tidyverse`.

<img src="data:image/png;base64,#https://forcats.tidyverse.org/logo.png" width="25%" style="display: block; margin: auto;" />

*Note*: There is a good [introduction to working with factors using `forcats` by Vebash Naidoo](https://sciencificity-blog.netlify.app/posts/2021-01-30-control-your-factors-with-forcats/) and *RStudio* also offers a [`forcats` cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/factors.pdf).

---

## Unordered factor

Using the `recode_factor()` function (together with `mutate()`) from `dplyr`, we can create a factor from a numeric (or a character) variable.

``` r
tuesdata_ai_threat <- stackoverflow_survey_single_response %>% 
  dplyr::mutate(ai_threat_fac = dplyr::recode_factor(ai_threat,
                                                     `1` = "I'm not sure",
                                                     `2` = "No",
                                                     `3` = "Yes"))

tuesdata_ai_threat %>% 
  dplyr::select(ai_threat, ai_threat_fac) %>% 
  dplyr::filter(!is.na(ai_threat)) %>% 
  dplyr::sample_n(5)
```

```
## # A tibble: 5 × 2
##   ai_threat ai_threat_fac
##       <dbl> <fct>        
## 1         2 No           
## 2         2 No           
## 3         2 No           
## 4         2 No           
## 5         2 No
```

---

## Summary statistics

To make sense of quantitative data, we can reduce their information to a small set of key values.
-- .center[ ~ **That's a simple definition of summary statistics** ~] -- As such, we can use summarizing functions of - location (e.g., the mean), - spread (e.g., standard deviation), - the shape of the distribution (e.g., skewness), and - relations between variables (e.g., correlation coefficients) --- ## Summary statistics: `summary()` A quick and easy way to check some summary statistics for your data set is the `base R` function `summary()` which can be applied to individual variables... ``` r summary(stackoverflow_survey_single_response$ai_acc) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 1.000 3.000 4.000 3.847 5.000 5.000 28135 ``` as well as whole data frames: ``` r summary(stackoverflow_survey_single_response) ``` .right[↪️] --- class: middle .small[ ``` ## response_id main_branch age remote_work ed_level ## Min. : 1 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000 ## 1st Qu.:16360 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.00 1st Qu.:2.000 ## Median :32719 Median :1.000 Median :2.000 Median :2.00 Median :3.000 ## Mean :32719 Mean :1.503 Mean :2.629 Mean :1.96 Mean :3.513 ## 3rd Qu.:49078 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:3.00 3rd Qu.:5.000 ## Max. :65437 Max. :5.000 Max. :8.000 Max. :3.00 Max. :8.000 ## NA's :10631 NA's :4653 ## years_code years_code_pro dev_type org_size ## Min. : 0.0 Min. : 0.00 Min. : 1.00 Min. : 1.000 ## 1st Qu.: 6.0 1st Qu.: 3.00 1st Qu.:12.00 1st Qu.: 3.000 ## Median :11.0 Median : 7.00 Median :16.00 Median : 5.000 ## Mean :14.2 Mean :10.18 Mean :17.17 Mean : 4.774 ## 3rd Qu.:20.0 3rd Qu.:15.00 3rd Qu.:19.00 3rd Qu.: 6.000 ## Max. :51.0 Max. :51.00 Max. :34.00 Max. :10.000 ## NA's :5568 NA's :13827 NA's :5992 NA's :17957 ## purchase_influence buildvs_buy ## Min. :1.000 Min. :1.0 ## 1st Qu.:2.000 1st Qu.:1.0 ## Median :2.000 Median :1.0 ## Mean :2.188 Mean :1.6 ## 3rd Qu.:3.000 3rd Qu.:2.0 ## Max. :3.000 Max. 
:3.0 ## NA's :18031 NA's :22079 ## country currency ## United States of America :11095 Length:65437 ## Germany : 4947 Class :character ## India : 4231 Mode :character ## United Kingdom of Great Britain and Northern Ireland: 3224 ## Ukraine : 2672 ## (Other) :32761 ## NA's : 6507 ## comp_total so_visit_freq so_account so_part_freq ## Min. : 0.000e+00 Min. :1.000 Min. :1.000 Min. :1.000 ## 1st Qu.: 6.000e+04 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:4.000 ## Median : 1.100e+05 Median :2.000 Median :3.000 Median :5.000 ## Mean :2.964e+145 Mean :2.511 Mean :2.601 Mean :4.016 ## 3rd Qu.: 2.500e+05 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:5.000 ## Max. :1.000e+150 Max. :5.000 Max. :3.000 Max. :6.000 ## NA's :31697 NA's :5901 NA's :5877 NA's :20200 ## so_comm ai_select ai_sent ai_acc ai_complex ## Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 ## 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:1.000 ## Median :3.00 Median :3.000 Median :2.000 Median :4.000 Median :2.000 ## Mean :3.36 Mean :2.375 Mean :2.386 Mean :3.847 Mean :2.232 ## 3rd Qu.:5.00 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:3.000 ## Max. :6.00 Max. :3.000 Max. :6.000 Max. :5.000 Max. :5.000 ## NA's :6274 NA's :4530 NA's :19564 NA's :28135 NA's :28416 ## ai_threat survey_length survey_ease converted_comp_yearly ## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 1 ## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 32712 ## Median :2.000 Median :1.000 Median :2.000 Median : 65000 ## Mean :1.922 Mean :1.326 Mean :2.409 Mean : 86155 ## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 107972 ## Max. :3.000 Max. :3.000 Max. :3.000 Max. :16256603 ## NA's :20748 NA's :9255 NA's :9199 NA's :42002 ## r_used r_want_to_use org_size_R ed_char ## Min. :0.000 Min. :0.000 Min. 
: 1.000 Length:65437 ## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.: 3.000 Class :character ## Median :0.000 Median :0.000 Median : 4.000 Mode :character ## Mean :0.043 Mean :0.039 Mean : 6.723 ## 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.: 6.000 ## Max. :1.000 Max. :1.000 Max. :99.000 ## NA's :5692 NA's :9685 NA's :17957 ## ed_level_cat ai_acc_R ai_complex_R mean_ai_trust ## Length:65437 Min. :1.000 Min. :1.000 Min. :1.000 ## Class :character 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.000 ## Mode :character Median :2.000 Median :4.000 Median :2.667 ## Mean :2.153 Mean :3.768 Mean :2.540 ## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:3.000 ## Max. :5.000 Max. :5.000 Max. :5.000 ## NA's :28135 NA's :28416 NA's :19733 ``` ] --- ## Summary statistics with the `datawizard` package 🧙 The [`datawizard` package](https://easystats.github.io/datawizard/) provides a function for summary statistics. ``` r library(datawizard) stackoverflow_survey_single_response %>% select(where(is.numeric)) %>% describe_distribution(quartiles = TRUE) ``` .right[↪️] --- class: middle .small[ ``` ## Variable | Mean | SD | IQR | Range ## -------------------------------------------------------------------------- ## response_id | 32719.00 | 18890.18 | 32719 | [1.00, 65437.00] ## main_branch | 1.50 | 1.02 | 0 | [1.00, 5.00] ## age | 2.63 | 1.58 | 1 | [1.00, 8.00] ## remote_work | 1.96 | 0.89 | 2 | [1.00, 3.00] ## ed_level | 3.51 | 1.93 | 3 | [1.00, 8.00] ## years_code | 14.20 | 10.66 | 14 | [0.00, 51.00] ## years_code_pro | 10.18 | 9.11 | 12 | [0.00, 51.00] ## dev_type | 17.17 | 7.75 | 7 | [1.00, 34.00] ## org_size | 4.77 | 2.48 | 3 | [1.00, 10.00] ## purchase_influence | 2.19 | 0.77 | 1 | [1.00, 3.00] ## buildvs_buy | 1.60 | 0.80 | 1 | [1.00, 3.00] ## comp_total | 2.96e+145 | 5.44e+147 | 190000 | [0.00, 1.00e+150] ## so_visit_freq | 2.51 | 1.27 | 1 | [1.00, 5.00] ## so_account | 2.60 | 0.75 | 0 | [1.00, 3.00] ## so_part_freq | 4.02 | 1.43 | 1 | [1.00, 6.00] ## so_comm | 3.36 | 1.84 | 3 | [1.00, 6.00] ## ai_select | 2.37 | 
0.85 | 1 | [1.00, 3.00] ## ai_sent | 2.39 | 1.68 | 3 | [1.00, 6.00] ## ai_acc | 3.85 | 1.21 | 2 | [1.00, 5.00] ## ai_complex | 2.23 | 1.11 | 2 | [1.00, 5.00] ## ai_threat | 1.92 | 0.56 | 0 | [1.00, 3.00] ## survey_length | 1.33 | 0.50 | 1 | [1.00, 3.00] ## survey_ease | 2.41 | 0.55 | 1 | [1.00, 3.00] ## converted_comp_yearly | 86155.29 | 1.87e+05 | 75288 | [1.00, 1.63e+07] ## r_used | 0.04 | 0.20 | 0 | [0.00, 1.00] ## r_want_to_use | 0.04 | 0.19 | 0 | [0.00, 1.00] ## org_size_R | 6.72 | 14.22 | 3 | [1.00, 99.00] ## ai_acc_R | 2.15 | 1.21 | 2 | [1.00, 5.00] ## ai_complex_R | 3.77 | 1.11 | 2 | [1.00, 5.00] ## mean_ai_trust | 2.54 | 0.64 | 1 | [1.00, 5.00] ## ## Variable | Quartiles | Skewness | Kurtosis | n | n_Missing ## ------------------------------------------------------------------------------------ ## response_id | 16360.00, 49078.00 | 0.00 | -1.20 | 65437 | 0 ## main_branch | 1.00, 1.00 | 1.93 | 2.62 | 65437 | 0 ## age | 2.00, 3.00 | 1.69 | 3.22 | 65437 | 0 ## remote_work | 1.00, 3.00 | 0.08 | -1.74 | 54806 | 10631 ## ed_level | 2.00, 5.00 | 0.89 | -0.70 | 60784 | 4653 ## years_code | 6.00, 20.00 | 1.19 | 0.90 | 59869 | 5568 ## years_code_pro | 3.00, 15.00 | 1.33 | 1.60 | 51610 | 13827 ## dev_type | 12.00, 19.00 | 0.61 | 0.12 | 59445 | 5992 ## org_size | 3.00, 6.00 | 0.36 | -0.51 | 47480 | 17957 ## purchase_influence | 2.00, 3.00 | -0.33 | -1.23 | 47406 | 18031 ## buildvs_buy | 1.00, 2.00 | 0.84 | -0.92 | 43358 | 22079 ## comp_total | 60000.00, 2.50e+05 | | | 33740 | 31697 ## so_visit_freq | 2.00, 3.00 | 0.64 | -0.53 | 59536 | 5901 ## so_account | 3.00, 3.00 | -1.50 | 0.45 | 59560 | 5877 ## so_part_freq | 4.00, 5.00 | -1.25 | 0.11 | 45237 | 20200 ## so_comm | 2.00, 5.00 | 0.26 | -1.33 | 59163 | 6274 ## ai_select | 2.00, 3.00 | -0.80 | -1.14 | 60907 | 4530 ## ai_sent | 1.00, 4.00 | 0.77 | -1.11 | 45873 | 19564 ## ai_acc | 3.00, 5.00 | -0.86 | -0.04 | 37302 | 28135 ## ai_complex | 1.00, 3.00 | 0.62 | -0.47 | 37021 | 28416 ## ai_threat | 2.00, 2.00 | -0.02 | 
0.13 | 44689 | 20748
## survey_length | 1.00, 2.00 | 1.12 | 0.07 | 56182 | 9255
## survey_ease | 2.00, 3.00 | -0.15 | -0.98 | 56238 | 9199
## converted_comp_yearly | 32712.00, 1.08e+05 | 52.92 | 3950.78 | 23435 | 42002
## r_used | 0.00, 0.00 | 4.48 | 18.07 | 59745 | 5692
## r_want_to_use | 0.00, 0.00 | 4.76 | 20.65 | 55752 | 9685
## org_size_R | 3.00, 6.00 | 6.14 | 36.92 | 47480 | 17957
## ai_acc_R | 1.00, 3.00 | 0.86 | -0.04 | 37302 | 28135
## ai_complex_R | 3.00, 5.00 | -0.62 | -0.47 | 37021 | 28416
## mean_ai_trust | 2.00, 3.00 | -0.19 | 0.21 | 45704 | 19733
```

]

---

## Summary statistics with `dplyr`

`dplyr` provides a helpful function for creating summary statistics: `summarize()`

`summarize()` collapses variables into summary values using [vectorized](https://win-vector.com/2019/01/03/what-does-it-mean-to-write-vectorized-code-in-r/) functions like...

- `mean()`
- `sd()`
- `min()`
- `max()`
- etc.

---

## Summary statistics with `dplyr`

While creating summary statistics using `summarize()` from `dplyr` requires writing more code, it is the most flexible option.

Another nice benefit of `summarize()` is that it produces a .highlight[`tibble`] which can be used for further analyses or for creating plots or tables.

---

## Exercises

- A. Load the dataset (gapminder) from the library `gapminder` and determine the mean life expectancy in Asia.
- B. Determine the mean and median GDP per capita in Europe.
- C. Determine the mean life expectancy in 2007 for the Americas.
- D. Determine the top 5 countries in 2007 with the highest GDP per capita among those with life expectancy over 75.

---

## Solutions

``` r
library(gapminder)

# Mean life expect. Asia
gapminder %>% 
  filter(continent == "Asia") %>% 
  summarize(mean_lifeExp = mean(lifeExp))
```

```
## # A tibble: 1 × 1
##   mean_lifeExp
##          <dbl>
## 1         60.1
```

``` r
# Mean, median GDP Europe
gapminder %>% 
  filter(continent == "Europe") %>% 
  summarize(mean_gdp = mean(gdpPercap),
            median_gdp = median(gdpPercap))
```

```
## # A tibble: 1 × 2
##   mean_gdp median_gdp
##      <dbl>      <dbl>
## 1   14469.     12082.
```

``` r
# Mean life expect. in 2007, Americas
gapminder %>% 
  filter(continent == "Americas", year == 2007) %>% 
  summarize(mean_lifeExp = mean(lifeExp))
```

```
## # A tibble: 1 × 1
##   mean_lifeExp
##          <dbl>
## 1         73.6
```

---

## Solutions

``` r
# Top 5 countries
gapminder %>% 
  filter(year == 2007, lifeExp > 75) %>% 
  arrange(desc(gdpPercap)) %>% 
  slice_head(n = 5)
```

```
## # A tibble: 5 × 6
##   country       continent  year lifeExp       pop gdpPercap
##   <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Norway        Europe     2007    80.2   4627926    49357.
## 2 Kuwait        Asia       2007    77.6   2505559    47307.
## 3 Singapore     Asia       2007    80.0   4553009    47143.
## 4 United States Americas   2007    78.2 301139947    42952.
## 5 Ireland       Europe     2007    78.9   4109086    40676.
```

---

## Outlook: Working with strings in `R`

As stated before, we won't be able to cover the specifics of working with strings in `R` in this course. However, it may be good to know that the `tidyverse` package [`stringr`](https://stringr.tidyverse.org/index.html) offers a collection of convenient functions for working with strings.
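As a small sketch with made-up example strings (not from the survey data), a few common `stringr` operations look like this:

``` r
library(stringr)

languages <- c("R", "Python", "Rust")

str_detect(languages, "^R")       # TRUE FALSE TRUE
str_replace(languages, "y", "i")  # "R" "Pithon" "Rust"
str_c(languages, collapse = ", ") # "R, Python, Rust"
```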
<img src="data:image/png;base64,#https://stringr.tidyverse.org/logo.png" width="25%" style="display: block; margin: auto;" />

The `stringr` package provides a good [introduction vignette](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html), and the book *R for Data Science* has a whole section on [strings with `stringr`](https://r4ds.had.co.nz/strings.html).

---

## Outlook: Times and dates

[Working with times and dates can be quite a pain in programming](https://www.youtube.com/watch?v=-5wpm-gesOY) (as well as in data analysis). Luckily, there are a couple of neat options for working with times and dates in `R` that can reduce the headache.

---

## Outlook: Times and dates

If you want/need to work with times and dates in `R`, you may want to look into the [`lubridate` package](https://lubridate.tidyverse.org/), which is part of the `tidyverse`, and for which *RStudio* also provides a [cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/lubridate.pdf).

<img src="data:image/png;base64,#https://lubridate.tidyverse.org/logo.png" width="25%" style="display: block; margin: auto;" />

*Note*: If you work with time series data, it is also worth checking out the [`tsibble` package](https://tsibble.tidyverts.org/) for your wrangling tasks.
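---

## Times and dates: a quick sketch

As a minimal, hedged example of what `lubridate` offers (using a made-up date, not data from this session):

``` r
library(lubridate)

d <- ymd("2025-11-07") # parse an ISO-formatted date string

year(d)               # 2025
month(d)              # 11
wday(d, label = TRUE) # day of the week as a labelled factor
```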