class: center, middle, inverse, title-slide .title[ # Data Literacy: Introduction to R ] .subtitle[ ## Data Wrangling - Part 1 ] .author[ ### Veronika Batzdorfer ] .date[ ### 2025-05-23 ] --- layout: true --- ## Data wrangling <img src="data:image/png;base64,#../img/wrangl.jpg" width="95%" style="display: block; margin: auto;" /> The process of re-**shaping**, re-**formatting**, and re-**arranging** raw data for analysis --- ## Steps of data wrangling Steps when working with tabular data in the social & behavioral sciences (e.g., from surveys) include: - **selecting** a subset of variables - **renaming** variables - **relocating** variables - **filtering** a subset of cases - **recoding** variables/values - **recoding** missing values - **creating/computing** new variables -- <small>The (in)famous **80/20-rule**: 80% wrangling, 20% analysis (of course, this ratio relates to the time required for writing the code, not the computing time).</small> --- ## The `tidyverse` > The `tidyverse` is a coherent system of packages for .highlight[data manipulation, exploration and visualization] that share a common design philosophy ([Rickert, 2017](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/)). --- ## Benefits of the `tidyverse` `Tidyverse` syntax is designed to increase - **human-readability**, making it **attractive for `R` novices** as it can facilitate self-efficacy (see [Robinson, 2017](http://varianceexplained.org/r/teach-tidyverse/)) - **consistency** (e.g., data frame as first argument and output) - **smarter defaults** (e.g., no partial matching of data frame and column names). --- ## The 'dark side' of the `tidyverse` `tidyverse` is not `R` as in `base R` - some routines are like using a whole different language, which... - ... can be nice when learning `R` - ...
can get difficult when searching for solutions to certain problems - Often, `tidyverse` functions are under heavy development - they change and can potentially break your code - E.g.: [Converting tables into long or wide format](https://tidyr.tidyverse.org/news/index.html#pivoting) <small>To learn more about the `tidyverse` lifecycle you can watch this [talk by Hadley Wickham](https://www.youtube.com/watch?v=izFssYRsLZs) or read the corresponding [documentation](https://lifecycle.r-lib.org/articles/stages.html#deprecated).</small> --- ## `Base R` vs. `tidyverse` Similar to other fierce academic debates over, e.g., `R` vs. `Python` or Frequentism vs. Bayesianism, people have argued [for](http://varianceexplained.org/r/teach-tidyverse/) and [against](https://blog.ephorie.de/why-i-dont-use-the-tidyverse) using/teaching the `tidyverse`. But what unites both: <img src="https://miro.medium.com/max/1280/0*ifjhcLyODu0nXjVx.jpg" width="60%" style="display: block; margin: auto;" /> .center[ <small><small>Source: https://bit.ly/3PmcL4t</small></small> ] --- ## Structure & focus of this session - focus on differences between `base R` and the `tidyverse` - our main focus will be on the use of packages (and functions) from the `tidyverse` and how they can be used to clean and transform your data. Of course, it is possible to combine `base R` and `tidyverse` code. However, in the long run, you should try to aim for consistency.
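---

## `Base R` vs. `tidyverse`: an example

As a minimal illustration of the stylistic difference - and of how both approaches can coexist in one script - here is the same task solved in `base R` and in the `tidyverse` (using the built-in `mtcars` data, chosen here purely for illustration): the mean horsepower of cars with 4 cylinders.

``` r
library(dplyr)

# base R: subsetting with [ ] and $
mean(mtcars$hp[mtcars$cyl == 4])

# tidyverse: a dplyr pipeline
mtcars %>%
  filter(cyl == 4) %>%
  summarize(mean_hp = mean(hp))
```

Both compute the same mean (`base R` returns a plain number, `dplyr` a 1 × 1 tibble); which style you prefer is largely a matter of taste and, above all, consistency.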
--- ## Lift-off into the `tidyverse` 🚀 **Install all `tidyverse` packages** (for the full list of `tidyverse` packages see [https://www.tidyverse.org/packages/](https://www.tidyverse.org/packages/)) ``` r install.packages("tidyverse") ``` **Load core `tidyverse` packages** <small>(NB: To save time and reduce namespace conflicts you can also load `tidyverse` packages individually)</small> ``` r library(tidyverse) ## load the tidyverse package ``` --- ## `tidyverse` vocabulary 101 While there is much more to the `tidyverse` than this, three important concepts that you need to be familiar with, if you want to use it, are: 1. Tidy data 2. Tibbles 3. Pipes <small>(We already discussed tibbles in the session on *Data Import & Export*, so we will focus on tidy data and pipes here.)</small> --- ## Tidy data 🤠 The 3 rules of tidy data: 1. Each **variable** is in a separate **column**. 2. Each **observation** is in a separate **row**. 3. Each **value** is in a separate **cell**. <img src="data:image/png;base64,#https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png" style="display: block; margin: auto;" /> <small><small>Source: https://r4ds.had.co.nz/tidy-data.html</small></small> *Note*: In the `tidyverse` terminology 'tidy data' usually also means data in long format (where applicable). --- ## Wide vs. long format <img src="data:image/png;base64,#https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/static/png/original-dfs-tidy.png" width="45%" style="display: block; margin: auto;" /> <small><small>Source: https://github.com/gadenbuie/tidyexplain#tidy-data</small></small> .small[ *Note*: The functions `pivot_wider()` and `pivot_longer()` from the [`tidyr` package](https://tidyr.tidyverse.org/) are easy-to-use options for changing data from long to wide format and vice versa.
] --- ## Pipes (`%>%` == "and then") Usually, in `R` we apply functions as follows: ``` r f(x) ``` In the logic of pipes this function is written as: ``` r x %>% f(.) ``` Here, object `x` is piped into function `f`, becoming (by default) its first argument (but by using *.* it can also be fed into other arguments). -- We can use pipes with more than one function: ``` r x %>% f_1() %>% f_2() %>% f_3() ``` .small[ More about pipes: https://r4ds.had.co.nz/pipes.html ] --- ## Pipes ("Chaining") - (((Onions))) vs. Pipes - The `%>%` used in the `tidyverse` is part of the [`magrittr` package](https://magrittr.tidyverse.org/) to pass data to another function. - *RStudio* offers a keyboard shortcut for inserting .highlight[**`%>%`**]: <kbd>Ctrl + Shift + M</kbd> (*Windows* & *Linux*)/<kbd>Cmd + Shift + M</kbd> (*Mac*) --- ## Data set We will use data from the [*Stack Overflow Annual Developer Survey 2024*](https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-09-03/readme.md). .highlight[Remember]: to code along / for the exercises the *tuesdata* data file should be in a sub-folder called `data` in the same folder as the other materials for this course. --- ## Note: Tidy vs. untidy data The *tuesdata* is already tidy. If you collect data yourself, the raw data may be *untidy*, e.g.: - cells may hold more than one value - a variable that should be in one column is spread across multiple columns (e.g., parts of a date or name). If you need to make your data tidy or change it from wide to long format or vice versa, the [`tidyr` package](https://tidyr.tidyverse.org/) from the `tidyverse` is a good option. --- ## Interlude 1: Citing FOSS There is a function in `R` that tells you how to cite it or any of the packages you have used (to see which packages you have loaded, use .highlight[`sessionInfo()`]). ``` r citation() ``` ``` ## To cite R in publications use: ## ## R Core Team (2023). _R: A Language and Environment for ## Statistical Computing_. 
R Foundation for Statistical ## Computing, Vienna, Austria. <https://www.R-project.org/>. ## ## A BibTeX entry for LaTeX users is ## ## @Manual{, ## title = {R: A Language and Environment for Statistical Computing}, ## author = {{R Core Team}}, ## organization = {R Foundation for Statistical Computing}, ## address = {Vienna, Austria}, ## year = {2023}, ## url = {https://www.R-project.org/}, ## } ## ## We have invested a lot of time and effort in creating R, please ## cite it when using it for data analysis. See also ## 'citation("pkgname")' for citing R packages. ``` --- ## Interlude 2: Codebook It is always advisable to consult the codebook (if there is one) before starting to work with a data set. Side note: If you want to (semi-)automatically generate a codebook for your own dataset, there are several options in `R`: - the [`codebook` package](https://rubenarslan.github.io/codebook/) which includes an *RStudio*-Addin and also offers a [web app](https://rubenarslan.ocpu.io/codebook/www/) - the `makeCodebook()` function from the [`dataReporter` package](https://github.com/ekstroem/dataReporter) (see this [blog post](http://sandsynligvis.dk/articles/18/codebook.html) for a short tutorial of the initial `dataMaid` package) --- ## Load the data The first step is loading the data into `R`. 
``` r ## install.packages("tidytuesdayR") library(tidytuesdayR) tuesdata <- tidytuesdayR::tt_load('2024-09-03') qname_levels_single_response_crosswalk <- tuesdata$qname_levels_single_response_crosswalk stackoverflow_survey_questions <- tuesdata$stackoverflow_survey_questions stackoverflow_survey_single_response <- tuesdata$stackoverflow_survey_single_response ``` ``` r ## alternatively, read the local copies from the data folder library(readr) stackoverflow_survey_questions <- read_csv("./data/stackoverflow_survey_questions.csv") stackoverflow_survey_single_response <- read_csv("./data/stackoverflow_survey_single_response.csv") qname_levels_single_response_crosswalk <- read_csv("./data/qname_levels_single_response_crosswalk.csv") ``` --- ## `dplyr` The `tidyverse` examples in the following will make use of [`dplyr` functions](https://dplyr.tidyverse.org/) that are .highlight[**verbs**] that signal an action (e.g., `group_by()`, `glimpse()`, `filter()`). Their structure is: 1. The first argument is a data frame. 2. The subsequent arguments describe what to do with the data frame. 3. The result is a new data frame (tibble). - **columns** (= variables in a tidy data frame) can be referenced without quotation marks (non-standard evaluation) - **actions** (verbs) can be applied to columns (variables) and rows (cases/observations) --- ## First look 👀 Getting a first good look at your data: the function `glimpse()` prints a data frame/tibble in a way that represents columns as rows and rows as columns and also provides some additional information about the data frame and its columns. 
``` r stackoverflow_survey_single_response %>% glimpse() ``` .right[↪️] --- class: middle .tiny[ ``` ## Rows: 65,437 ## Columns: 28 ## $ response_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1… ## $ main_branch <dbl> 1, 1, 1, 2, 1, 4, 3, 2, 4, 1, 5, 1, 1, 5… ## $ age <dbl> 8, 3, 4, 1, 1, 8, 3, 1, 4, 3, 3, 4, 3, 3… ## $ remote_work <dbl> 3, 3, 3, NA, NA, NA, 3, NA, 2, 3, 3, 2, … ## $ ed_level <dbl> 4, 2, 3, 7, 6, 4, 5, 6, 5, 3, 2, 5, 2, 2… ## $ years_code <dbl> NA, 20, 37, 4, 9, 10, 7, 1, 20, 15, 20, … ## $ years_code_pro <dbl> NA, 17, 27, NA, NA, NA, 7, NA, NA, 11, N… ## $ dev_type <dbl> NA, 16, 10, 16, 16, 33, 1, 33, 1, 16, 28… ## $ org_size <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ purchase_influence <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ buildvs_buy <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ country <chr> "United States of America", "United King… ## $ currency <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ comp_total <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ so_visit_freq <dbl> NA, 5, 5, 3, 5, 5, 3, 4, 5, 2, 2, 5, 5, … ## $ so_account <dbl> NA, 3, 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 3, … ## $ so_part_freq <dbl> NA, 6, 6, NA, 6, 6, 3, NA, 6, 5, 5, 6, 2… ## $ so_comm <dbl> NA, 5, 5, 3, 5, 5, 5, 2, 5, 6, 5, 5, 5, … ## $ ai_select <dbl> 3, 1, 1, 3, 1, 3, 1, 3, 1, 3, 3, 1, 1, 3… ## $ ai_sent <dbl> 5, NA, NA, 5, NA, 1, NA, 2, NA, 2, 1, NA… ## $ ai_acc <dbl> NA, NA, NA, 5, NA, 5, NA, 4, NA, 3, 4, N… ## $ ai_complex <dbl> NA, NA, NA, 1, NA, 2, NA, 1, NA, 1, 3, N… ## $ ai_threat <dbl> NA, NA, NA, 2, NA, 2, NA, 3, NA, 1, 2, N… ## $ survey_length <dbl> NA, NA, 1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1,… ## $ survey_ease <dbl> NA, NA, 2, 2, 2, 2, 3, 1, 3, 2, 2, 3, 2,… ## $ converted_comp_yearly <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ r_used <dbl> NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, … ## $ r_want_to_use <dbl> NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, … ``` ] --- ## Selecting variables We might want to reduce our data 
frame (or create a new one) to only include a **subset of specific variables**. E.g., select only the variables that measure attitudes towards *AI* (`ai_`) from our full data set. There are two options with .highlight[`base R`]: Option 1 .small[ ``` r tuesdata_ai <- stackoverflow_survey_single_response [, c("ai_select", "ai_sent", "ai_acc", "ai_complex", "ai_threat")] # When subsetting with [], the first value refers to rows, the second to columns # [, c("var1", "var2", ...)] means we want to select all rows but only some specific columns. ``` ] Option 2 .small[ ``` r tuesdata_ai <- subset(stackoverflow_survey_single_response, TRUE, select = c(ai_select, ai_sent, ai_acc, ai_complex, ai_threat)) # The 2nd argument refers to the rows. # Setting it to TRUE includes all rows in the subset. ``` ] --- ## Selecting variables You can also select variables based on their numeric index. ``` r tuesdata_ai <- stackoverflow_survey_single_response[, 19:23] names(tuesdata_ai) ``` ``` ## [1] "ai_select" "ai_sent" "ai_acc" "ai_complex" "ai_threat" ``` --- ## Selecting variables In .highlight[`tidyverse`], we can create a subset of variables with the `dplyr` verb .highlight[`select()`]. ``` r tuesdata_ai <- stackoverflow_survey_single_response %>% dplyr::select(ai_select, ai_sent, ai_acc, ai_complex, ai_threat) head(tuesdata_ai) ``` ``` ## # A tibble: 6 × 5 ## ai_select ai_sent ai_acc ai_complex ai_threat ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 3 5 NA NA NA ## 2 1 NA NA NA NA ## 3 1 NA NA NA NA ## 4 3 5 5 1 2 ## 5 1 NA NA NA NA ## 6 3 1 5 2 2 ``` --- ## Selecting a range of variables There also is a shorthand notation for selecting a set of consecutive columns with `select()`. 
``` r tuesdata_ai <- stackoverflow_survey_single_response %>% dplyr::select(ai_select:ai_threat) head(tuesdata_ai) ``` ``` ## # A tibble: 6 × 5 ## ai_select ai_sent ai_acc ai_complex ai_threat ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 3 5 NA NA NA ## 2 1 NA NA NA NA ## 3 1 NA NA NA NA ## 4 3 5 5 1 2 ## 5 1 NA NA NA NA ## 6 3 1 5 2 2 ``` --- ## Selecting a range of variables As with .highlight[`base R`], you can also use the numeric index of variables in combination with `select()` from `dplyr`. ``` r tuesdata_ai <- stackoverflow_survey_single_response %>% dplyr::select(19:23) names(tuesdata_ai) ``` ``` ## [1] "ai_select" "ai_sent" "ai_acc" "ai_complex" "ai_threat" ``` --- ## Unselecting variables If you just want to exclude one or a few columns/variables, it is easier to unselect those than to select all others. Again, there are two ways to do this with `base R`. Option 1 .small[ ``` r tuesdata_cut <- stackoverflow_survey_single_response[!(names(stackoverflow_survey_single_response) %in% c("dev_type", "purchase_influence", "remote_work"))] # The ! operator means "not" (i.e., it negates a condition) # The %in% operator means "is included in" (in this case the following character vector) dim(tuesdata_cut) ``` ``` ## [1] 65437 25 ``` ] --- ## Unselecting variables You can also use `select()` from `dplyr` to exclude one or more columns/variables. ``` r tuesdata_cut <- stackoverflow_survey_single_response %>% dplyr::select(-c(dev_type, purchase_influence, remote_work)) dim(tuesdata_cut) ``` ``` ## [1] 65437 25 ``` --- ## Advanced ways of selecting variables `dplyr` offers several helper functions for selecting variables. For a full list of those, you can check the [documentation for the `select()` function](https://dplyr.tidyverse.org/reference/select.html) or `?select()`. 
``` r tuesdata_ai <- stackoverflow_survey_single_response %>% dplyr::select(starts_with("ai")) tuesdata_freq <- stackoverflow_survey_single_response %>% dplyr::select(ends_with("freq")) glimpse(tuesdata_freq) ``` ``` ## Rows: 65,437 ## Columns: 2 ## $ so_visit_freq <dbl> NA, 5, 5, 3, 5, 5, 3, 4, 5, 2, 2, 5, 5, 3, 3, 1,… ## $ so_part_freq <dbl> NA, 6, 6, NA, 6, 6, 3, NA, 6, 5, 5, 6, 2, 2, 2, … ``` --- ## Advanced ways of selecting variables Another particularly useful selection helper is .highlight[`where()`] to select only variables of a specific type. ``` r tuesdata_num <- stackoverflow_survey_single_response %>% dplyr::select(where(is.numeric)) %>% print() ``` ``` ## # A tibble: 65,437 × 26 ## response_id main_branch age remote_work ed_level years_code ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 1 8 3 4 NA ## 2 2 1 3 3 2 20 ## 3 3 1 4 3 3 37 ## 4 4 2 1 NA 7 4 ## 5 5 1 1 NA 6 9 ## 6 6 4 8 NA 4 10 ## 7 7 3 3 3 5 7 ## 8 8 2 1 NA 6 1 ## 9 9 4 4 2 5 20 ## 10 10 1 3 3 3 15 ## # ℹ 65,427 more rows ## # ℹ 20 more variables: years_code_pro <dbl>, dev_type <dbl>, ## # org_size <dbl>, purchase_influence <dbl>, buildvs_buy <dbl>, ## # comp_total <dbl>, so_visit_freq <dbl>, so_account <dbl>, ## # so_part_freq <dbl>, so_comm <dbl>, ai_select <dbl>, ai_sent <dbl>, ## # ai_acc <dbl>, ai_complex <dbl>, ai_threat <dbl>, ## # survey_length <dbl>, survey_ease <dbl>, … ``` --- ## What's in a name? One thing that we need to know - and might want to change - is the names of the variables in the dataset. 
``` r names(stackoverflow_survey_single_response) ``` ``` ## [1] "response_id" "main_branch" ## [3] "age" "remote_work" ## [5] "ed_level" "years_code" ## [7] "years_code_pro" "dev_type" ## [9] "org_size" "purchase_influence" ## [11] "buildvs_buy" "country" ## [13] "currency" "comp_total" ## [15] "so_visit_freq" "so_account" ## [17] "so_part_freq" "so_comm" ## [19] "ai_select" "ai_sent" ## [21] "ai_acc" "ai_complex" ## [23] "ai_threat" "survey_length" ## [25] "survey_ease" "converted_comp_yearly" ## [27] "r_used" "r_want_to_use" ``` --- ## Renaming variables It is good practice to use consistent naming conventions. Since `R` is .highlight[case-sensitive], we might want to only use lowercase letters.<br/> As spaces in variable names can cause problems, we could, e.g., decide to use 🐍 *snake_case* (🐫 *camelCase* is a common alternative). --- ## Renaming variables Renaming columns/variables in `dplyr` works with .highlight[`rename()`]. ``` r tuesdata_rn <- stackoverflow_survey_single_response %>% dplyr::rename(ai_workflow = ai_sent, # new_name = old_name comm_member = so_comm, post_freq = so_part_freq ) names(tuesdata_rn) ``` ``` ## [1] "response_id" "main_branch" ## [3] "age" "remote_work" ## [5] "ed_level" "years_code" ## [7] "years_code_pro" "dev_type" ## [9] "org_size" "purchase_influence" ## [11] "buildvs_buy" "country" ## [13] "currency" "comp_total" ## [15] "so_visit_freq" "so_account" ## [17] "post_freq" "comm_member" ## [19] "ai_select" "ai_workflow" ## [21] "ai_acc" "ai_complex" ## [23] "ai_threat" "survey_length" ## [25] "survey_ease" "converted_comp_yearly" ## [27] "r_used" "r_want_to_use" ``` --- ## Renaming variables For some more advanced renaming options, you can use the `dplyr` function `rename_with()`. 
*Note*: The [`janitor` package](https://sfirke.github.io/janitor/) contains the function `clean_names()` that takes a data frame and creates column names that "are unique and consist only of the _ character, numbers, and letters" (from the help file for this function), with the default being 🐍 snake_case (but with support for many other cases). ``` r stackoverflow_survey_single_response %>% dplyr::rename_with(toupper) %>% names() ``` ``` ## [1] "RESPONSE_ID" "MAIN_BRANCH" ## [3] "AGE" "REMOTE_WORK" ## [5] "ED_LEVEL" "YEARS_CODE" ## [7] "YEARS_CODE_PRO" "DEV_TYPE" ## [9] "ORG_SIZE" "PURCHASE_INFLUENCE" ## [11] "BUILDVS_BUY" "COUNTRY" ## [13] "CURRENCY" "COMP_TOTAL" ## [15] "SO_VISIT_FREQ" "SO_ACCOUNT" ## [17] "SO_PART_FREQ" "SO_COMM" ## [19] "AI_SELECT" "AI_SENT" ## [21] "AI_ACC" "AI_COMPLEX" ## [23] "AI_THREAT" "SURVEY_LENGTH" ## [25] "SURVEY_EASE" "CONVERTED_COMP_YEARLY" ## [27] "R_USED" "R_WANT_TO_USE" ``` --- ## Renaming variables We can use `rename_with()` in combination with `gsub()` to remove (or change) prefixes in variable names. ``` r stackoverflow_survey_single_response %>% dplyr::select(ai_select:ai_threat) %>% dplyr::rename_with(~ gsub("ai", "ai_attid", .x, fixed = TRUE)) %>% names() ``` ``` ## [1] "ai_attid_select" "ai_attid_sent" "ai_attid_acc" ## [4] "ai_attid_complex" "ai_attid_threat" ``` --- ## Re~~wind~~name select A nice thing about the `dplyr` verb `select()` is that you can use it to select and rename variables in one step. 
.small[ ``` r tuesdata_ai <- stackoverflow_survey_single_response %>% dplyr::select(ai_workflow = ai_sent, # new_name = old_name comm_member = so_comm, post_freq = so_part_freq ) head(tuesdata_ai) ``` ``` ## # A tibble: 6 × 3 ## ai_workflow comm_member post_freq ## <dbl> <dbl> <dbl> ## 1 5 NA NA ## 2 NA 5 6 ## 3 NA 5 6 ## 4 5 3 NA ## 5 NA 5 6 ## 6 1 5 6 ``` ] --- class: center, middle # [Exercise](https://rawcdn.githack.com/nika-akin/r-intro/9d05476f895e390df08662eecbefd4137f67acf4/exercises/Exercise_2_1_1_Selecting_Renaming_Steps.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://rawcdn.githack.com/nika-akin/r-intro/9d05476f895e390df08662eecbefd4137f67acf4/solutions/Exercise_2_1_1_Selecting_Renaming_Steps.html) --- ## Filtering rows Filter rows/observations based on one or more conditions. To filter rows/observations you can use... - **comparison operators**: - **<** (less than) - **<=** (less than or equal to) - **==** (equal to) - **!=** (not equal to) - **>=** (greater than or equal to) - **>** (greater than) - **%in%** (included in) --- ## Filtering rows ... and combine comparisons with - **logical operators**: - **&** (and) - **|** (or) - **!** (not) - **xor** (either or, not both) --- ## Filtering rows Similar to selecting columns/variables, there are two options for filtering rows/observations with `base R`. Option 1 ``` r tuesdata_age <- stackoverflow_survey_single_response[which(stackoverflow_survey_single_response$age == 1), ] # 1 = 18-24 years dim(tuesdata_age) ``` ``` ## [1] 14098 28 ``` Option 2 ``` r tuesdata_age <- subset(stackoverflow_survey_single_response, age == 1) dim(tuesdata_age) ``` ``` ## [1] 14098 28 ``` --- ## Filtering rows The `dplyr` solution for filtering rows/observations is the verb `filter()`. 
``` r tuesdata_age <- stackoverflow_survey_single_response %>% dplyr::filter(age == 1) dim(tuesdata_age) ``` ``` ## [1] 14098 28 ``` --- ## Filtering rows based on multiple conditions ``` r tuesdata_filter <- stackoverflow_survey_single_response %>% dplyr::filter(org_size > 1, so_visit_freq > 2, main_branch != 1) dim(tuesdata_filter) ``` ``` ## [1] 1398 28 ``` --- ## `dplyr::filter` - multiple conditions By default, multiple conditions in `filter()` are combined with *&* (and). You can, however, also specify multiple conditions differently. **or** (cases for which at least one of the conditions is true) ``` r tuesdata_developer <- stackoverflow_survey_single_response %>% dplyr::filter(main_branch == 1 | age > 1) # 1 = developer dim(tuesdata_developer) ``` ``` ## [1] 60371 28 ``` --- ## `dplyr::filter` - multiple conditions **xor** (cases for which only one of the two conditions is true) ``` r tuesdata_developer_or_age <- stackoverflow_survey_single_response %>% dplyr::filter(xor(main_branch == 1, age > 1)) dim(tuesdata_developer_or_age) ``` ``` ## [1] 19196 28 ``` --- ## Advanced ways of filtering rows Similar to `select()`, there are some helper functions for `filter()` for advanced filtering of rows. For example, you can... - Filter rows based on a .highlight[range in a numeric variable] ``` r tuesdata_frequent_user <- stackoverflow_survey_single_response %>% dplyr::filter(dplyr::between(so_visit_freq, 2, 3)) dim(tuesdata_frequent_user) ``` ``` ## [1] 33847 28 ``` *Note*: The range specified in `between()` is inclusive (on both sides). --- ## Advanced ways of filtering rows - Filter rows based on the values of specific variables matching certain criteria ``` r tuesdata_high_engagement <- stackoverflow_survey_single_response %>% # keep rows where all variables whose names start with "so" are >= 5 dplyr::filter(dplyr::if_all(dplyr::starts_with("so"), ~ .x >= 5)) dim(tuesdata_high_engagement) ``` *Note*: The helper function `if_any()` can be used to specify that at least one of the variables needs to match a certain criterion. --- ## Selecting columns + filtering rows The `tidyverse` solution for combining the selection of columns and the filtering of rows is chaining these steps together in a pipe. Note that the order of the steps can matter here: we need to filter on `so_part_freq` *before* `select()` drops it from the data. ``` r tuesdata_freq_ai <- stackoverflow_survey_single_response %>% dplyr::filter(so_part_freq == 1) %>% dplyr::select(ai_select:ai_threat) dim(tuesdata_freq_ai) ``` ``` ## [1] 6277 5 ``` --- ## (Re-)Arranging the order of rows The `dplyr` verb for changing the order of rows in a data set is `arrange()`, which works similarly to the `base R` function `order()`: Sorting by a single variable in ascending order, ... ``` r stackoverflow_survey_single_response %>% dplyr::arrange(age) %>% dplyr::select(19:23) %>% glimpse() ``` ``` ## Rows: 65,437 ## Columns: 5 ## $ ai_select <dbl> 3, 1, 3, 2, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 1, 1, … ## $ ai_sent <dbl> 5, NA, 2, 1, NA, 1, 1, 1, 1, 1, 4, NA, 1, 5, 2, NA,… ## $ ai_acc <dbl> 5, NA, 4, NA, NA, 5, 4, 4, 5, 4, 3, NA, 4, 5, 5, NA… ## $ ai_complex <dbl> 1, NA, 1, NA, NA, 2, 4, 2, 3, 4, 1, NA, 2, 2, 1, NA… ## $ ai_threat <dbl> 2, NA, 3, 1, NA, 2, 2, 3, 2, 2, 2, NA, 1, 2, 2, NA,… ``` --- ## (Re-)Arranging the order of rows ... sorting by a single variable in descending order, ... ``` r stackoverflow_survey_single_response %>% dplyr::arrange(desc(age)) %>% dplyr::select(19:23) %>% glimpse() ``` ``` ## Rows: 65,437 ## Columns: 5 ## $ ai_select <dbl> 3, 3, 3, 3, 3, 1, 3, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, … ## $ ai_sent <dbl> 5, 1, 1, 1, 2, NA, 2, 1, NA, NA, NA, NA, NA, 2, NA,… ## $ ai_acc <dbl> NA, 5, 4, 4, 3, NA, 1, 1, NA, NA, NA, NA, NA, 3, NA… ## $ ai_complex <dbl> NA, 2, 3, 1, 2, NA, 4, 4, NA, NA, NA, NA, NA, 1, NA… ## $ ai_threat <dbl> NA, 2, 2, 2, 2, NA, 2, 3, NA, NA, NA, NA, NA, 3, NA… ```
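---

## (Re-)Arranging the order of rows

... and sorting by more than one variable: `arrange()` sorts by the first variable and breaks ties using the variables that follow. A minimal sketch on the same data (output not shown):

``` r
stackoverflow_survey_single_response %>%
  dplyr::arrange(age, desc(years_code)) %>% # within each age group: most years of coding first
  dplyr::select(age, years_code) %>%
  glimpse()
```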