Data Literacy: Introduction to R

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Data Wrangling - Part 1
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-05-23
]

---

## Data wrangling

The process of re-**shaping**, re-**formatting**, and re-**arranging** raw data for analysis.

---

## Steps of data wrangling

Steps when working with tabular data in the social & behavioral sciences (e.g., from surveys) include:

- **selecting** a subset of variables
- **renaming** variables
- **relocating** variables
- **filtering** a subset of cases

- **recoding** variables/values
- recoding **missing values**
- **creating/computing** new variables

The (in)famous **80/20-rule**: 80% wrangling, 20% analysis (of course, this ratio relates to the time required for writing the code, not the computing time).

---

## The `tidyverse`

> The `tidyverse` is a coherent system of packages for .highlight[data manipulation, exploration and visualization] that share a common design philosophy ([Rickert, 2017](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/)).

---

## Benefits of the `tidyverse`

`tidyverse` syntax is designed to offer

- **human-readability**, which makes it **attractive for `R` novices** as it can facilitate self-efficacy (see [Robinson, 2017](http://varianceexplained.org/r/teach-tidyverse/))
- **consistency** (e.g., data frame as first argument and output)
- **smarter defaults** (e.g., no partial matching of data frame and column names).

---

## The 'dark side' of the `tidyverse`

`tidyverse` is not `R` as in `base R`:

- Some routines are like using a whole different language, which...
  - ... can be nice when learning `R`
  - ... can get difficult when searching for solutions to certain problems

- Often, `tidyverse` functions are under heavy development
  - they change and can potentially break your code
  - e.g., [converting tables into long or wide format](https://tidyr.tidyverse.org/news/index.html#pivoting)

To learn more about the `tidyverse` lifecycle, you can watch this [talk by Hadley Wickham](https://www.youtube.com/watch?v=izFssYRsLZs) or read the corresponding [documentation](https://lifecycle.r-lib.org/articles/stages.html#deprecated).

---

## `Base R` vs. `tidyverse`

Similar to other fierce academic debates over, e.g., `R` vs. `Python` or Frequentism vs. Bayesianism, people have argued [for](http://varianceexplained.org/r/teach-tidyverse/) and [against](https://blog.ephorie.de/why-i-dont-use-the-tidyverse) using/teaching the `tidyverse`.

But here is what unites both sides:

<img src="https://miro.medium.com/max/1280/0*ifjhcLyODu0nXjVx.jpg" width="60%" style="display: block; margin: auto;" />
.center[
Source: https://bit.ly/3PmcL4t
]

---
## Structure & focus of this session

- focus on differences between `base R` and the `tidyverse`

- our main focus will be on the use of packages (and functions) from the `tidyverse` and how they can be used to clean and transform your data.

Of course, it is possible to combine `base R` and `tidyverse` code. However, in the long run, you should try to aim for consistency.

---

## Lift-off into the `tidyverse` 🚀

**Install all `tidyverse` packages** (for the full list of `tidyverse` packages see [https://www.tidyverse.org/packages/](https://www.tidyverse.org/packages/))

``` r
install.packages("tidyverse")
```

**Load core `tidyverse` packages** (NB: To save time and reduce namespace conflicts, you can also load `tidyverse` packages individually)

``` r
library(tidyverse) # load the core tidyverse packages
```

---

## `tidyverse` vocabulary 101

While there is much more to the `tidyverse` than this, three important concepts that you need to be familiar with, if you want to use it, are:

1. Tidy data

2. Tibbles

3. Pipes

(We already discussed tibbles in the session on *Data Import & Export*, so we will focus on tidy data and pipes here.)

---

## Tidy data 🤠

The 3 rules of tidy data:

1. Each **variable** is in a separate **column**.

2. Each **observation** is in a separate **row**.

3. Each **value** is in a separate **cell**.

<img src="https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png" style="display: block; margin: auto;" />
Source: https://r4ds.had.co.nz/tidy-data.html

*Note*: In the `tidyverse` terminology 'tidy data' usually also means data in long format (where applicable).

---

## Wide vs. long format

<img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/static/png/original-dfs-tidy.png" width="45%" style="display: block; margin: auto;" />
Source: https://github.com/gadenbuie/tidyexplain#tidy-data

.small[
*Note*: The functions `pivot_wider()` and `pivot_longer()` from the [`tidyr` package](https://tidyr.tidyverse.org/) are easy-to-use options for changing data from long to wide format and vice versa.
]
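
---

## Wide vs. long format: a small example

A minimal sketch of the two pivoting functions. The data frame `wide` and its values are made up for illustration and are not part of the survey data.

``` r
library(tidyr)

# Hypothetical toy data in wide format: one row per person, one column per year
wide <- data.frame(
  id = c(1, 2),
  score_2023 = c(10, 20),
  score_2024 = c(15, 25)
)

# Wide -> long: one row per person and year
long <- pivot_longer(
  wide,
  cols = starts_with("score_"),
  names_to = "year",
  names_prefix = "score_",
  values_to = "score"
)

# Long -> wide: back to one column per year
wide_again <- pivot_wider(
  long,
  names_from = year,
  names_prefix = "score_",
  values_from = score
)
```

In this sketch, `long` has one row per combination of `id` and `year`, and `wide_again` has the same columns as `wide`.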

---

## Pipes (`%>%` == "and then")

Usually, in `R` we apply functions as follows:

``` r
f(x)
```

In the logic of pipes this function is written as:

``` r
x %>% f(.)
```

Here, object `x` is piped into function `f`, becoming (by default) its first argument. By using the placeholder `.`, it can also be fed into other arguments.

We can use pipes with more than one function:

``` r
x %>% 
  f_1() %>% 
  f_2() %>% 
  f_3()
```
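
---

## Pipes: the placeholder

A small sketch of the default behavior and of the `.` placeholder. The objects `x`, `default_pipe`, and `with_placeholder` are made up for illustration.

``` r
library(magrittr)

x <- c(1, 4, 9)

# Default: x becomes the first argument, so this is the same as sqrt(x)
default_pipe <- x %>% sqrt()

# With the placeholder, the piped object is fed into another argument:
# this is the same as paste("hello", "world")
with_placeholder <- "world" %>% paste("hello", .)
```

When `.` appears as an argument of the call, the piped object is only inserted at that position and not additionally as the first argument.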

---

## Pipes ("Chaining")

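- (((Onions))) vs. pipes: nested function calls have to be read from the inside out, whereas pipes are read from left to right (or top to bottom)

- The `%>%` operator used in the `tidyverse` comes from the [`magrittr` package](https://magrittr.tidyverse.org/) and passes the object on its left-hand side to the function on its right-hand side.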

- *RStudio* offers a keyboard shortcut for inserting .highlight[**`%>%`**]: <kbd>Ctrl + Shift + M</kbd> (*Windows* & *Linux*)/<kbd>Cmd + Shift + M</kbd> (*Mac*)
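
---

## Pipes ("Chaining"): onion vs. pipe

The same computation written once as a nested call and once as a pipe. This is only a sketch; the vector `x` is made up for illustration.

``` r
library(magrittr)

x <- c(4, 9, 16, NA)

# "Onion": nested calls are read from the inside out
onion <- round(mean(sqrt(x), na.rm = TRUE), 1)

# Pipe: the same steps are read from top to bottom
piped <- x %>%
  sqrt() %>%
  mean(na.rm = TRUE) %>%
  round(1)

identical(onion, piped)
```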

---

## Data set

We will use data from the [*Stack Overflow Annual Developer Survey 2024*](https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-09-03/readme.md).

.highlight[Remember]: To code along and to do the exercises, the *tuesdata* data files should be in a sub-folder called `data` in the same folder as the other materials for this course.

---
## Note: Tidy vs. untidy data

The *tuesdata* is already tidy.
If you collect data yourself, the raw data may be `untidy`, e.g.:
- cells may hold more than one value
- a variable that should be in one column is spread across multiple columns (e.g., parts of a date or name).

If you need to make your data tidy or change it from wide to long format or vice versa, the [`tidyr` package](https://tidyr.tidyverse.org/) from the `tidyverse` is a good option.
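
---

## Note: Tidying untidy data

A minimal sketch of how the two problems listed before could be handled with `tidyr`. The data frame `untidy` and all its values are hypothetical.

``` r
library(tidyr)

# Hypothetical untidy data: `name` holds two values per cell
# and the date is spread across three columns
untidy <- data.frame(
  name = c("Ada Lovelace", "Alan Turing"),
  year = c(2024, 2024),
  month = c(5, 6),
  day = c(23, 1)
)

# Combine the date parts into a single column
tidy <- unite(untidy, "date", year, month, day, sep = "-")

# Split the name into one column per value
tidy <- separate(tidy, name, into = c("first_name", "last_name"), sep = " ")

names(tidy)
```

Recent versions of `tidyr` also offer `separate_wider_delim()` as an alternative to `separate()`.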

---

## Interlude 1: Citing FOSS

There is a function in `R` that tells you how to cite it or any of the packages you have used (to see which packages those are, you can use .highlight[`sessionInfo()`]).

``` r
citation()
```

```
## To cite R in publications use:
## 
## R Core Team (2023). _R: A Language and Environment for
## Statistical Computing_. R Foundation for Statistical
## Computing, Vienna, Austria. <https://www.R-project.org/>.
## 
## A BibTeX entry for LaTeX users is
## 
## @Manual{,
## title = {R: A Language and Environment for Statistical Computing},
## author = {{R Core Team}},
## organization = {R Foundation for Statistical Computing},
## address = {Vienna, Austria},
## year = {2023},
## url = {https://www.R-project.org/},
## }
## 
## We have invested a lot of time and effort in creating R, please
## cite it when using it for data analysis. See also
## 'citation("pkgname")' for citing R packages.
```

---

## Interlude 2: Codebook

It is always advisable to consult the codebook (if there is one) before starting to work with a data set.

Side note: If you want to (semi-)automatically generate a codebook for your own dataset, there are several options in `R`:

- the [`codebook` package](https://rubenarslan.github.io/codebook/) which includes an *RStudio*-Addin and also offers a [web app](https://rubenarslan.ocpu.io/codebook/www/)

- the `makeCodebook()` function from the [`dataReporter` package](https://github.com/ekstroem/dataReporter) (see this [blog post](http://sandsynligvis.dk/articles/18/codebook.html) for a short tutorial based on its predecessor, the `dataMaid` package)

---

## Load the data

The first step is loading the data into `R`.

``` r
## install.packages("tidytuesdayR")
library(tidytuesdayR)

tuesdata <- tidytuesdayR::tt_load('2024-09-03')

qname_levels_single_response_crosswalk <- tuesdata$qname_levels_single_response_crosswalk

stackoverflow_survey_questions <- tuesdata$stackoverflow_survey_questions

stackoverflow_survey_single_response <- tuesdata$stackoverflow_survey_single_response
```

Alternatively, you can read the files from the local `data` folder:

``` r
library(readr) # provides read_csv()
stackoverflow_survey_questions <- read_csv("./data/stackoverflow_survey_questions.csv")

stackoverflow_survey_single_response <- read_csv("./data/stackoverflow_survey_single_response.csv")

qname_levels_single_response_crosswalk <- read_csv("./data/qname_levels_single_response_crosswalk.csv")
```

---

## `dplyr`

The `tidyverse` examples in the following will make use of [`dplyr` functions](https://dplyr.tidyverse.org/). These are .highlight[**verbs**] that signal an action (e.g., `group_by()`, `glimpse()`, `filter()`). Their structure is:

1. The first argument is a data frame.
2. The subsequent arguments describe what to do with the data frame.
3. The result is a new data frame (tibble).

In addition:

- **columns** (= variables in a tidy data frame) can be referenced without quotation marks (non-standard evaluation)
- **actions** (verbs) can be applied to columns (variables) and rows (cases/observations)
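
---

## `dplyr`: structure of a verb

A small sketch of this structure, using `filter()` as an example. The tibble `toy` and its values are made up for illustration.

``` r
library(dplyr)

# Hypothetical toy data
toy <- tibble(id = 1:4, score = c(10, 25, 40, 55))

# 1. data frame first, 2. what to do (column referenced without quotes),
# 3. the result is again a tibble
direct <- filter(toy, score > 20)

# Because the data frame is the first argument, the call can also be piped
piped <- toy %>% filter(score > 20)

identical(direct, piped)
```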

---

## First look 👀

Getting a first good look at your data. The function `glimpse()` prints a data frame/tibble in a way that represents columns as rows and rows as columns and also provides some additional information about the data frame and its columns.

``` r
stackoverflow_survey_single_response %>% 
  glimpse()
```

---

```
## Rows: 65,437
## Columns: 28
## $ response_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ main_branch <dbl> 1, 1, 1, 2, 1, 4, 3, 2, 4, 1, 5, 1, 1, 5…
## $ age <dbl> 8, 3, 4, 1, 1, 8, 3, 1, 4, 3, 3, 4, 3, 3…
## $ remote_work <dbl> 3, 3, 3, NA, NA, NA, 3, NA, 2, 3, 3, 2, …
## $ ed_level <dbl> 4, 2, 3, 7, 6, 4, 5, 6, 5, 3, 2, 5, 2, 2…
## $ years_code <dbl> NA, 20, 37, 4, 9, 10, 7, 1, 20, 15, 20, …
## $ years_code_pro <dbl> NA, 17, 27, NA, NA, NA, 7, NA, NA, 11, N…
## $ dev_type <dbl> NA, 16, 10, 16, 16, 33, 1, 33, 1, 16, 28…
## $ org_size <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ purchase_influence <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ buildvs_buy <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ country <chr> "United States of America", "United King…
## $ currency <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ comp_total <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ so_visit_freq <dbl> NA, 5, 5, 3, 5, 5, 3, 4, 5, 2, 2, 5, 5, …
## $ so_account <dbl> NA, 3, 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 3, …
## $ so_part_freq <dbl> NA, 6, 6, NA, 6, 6, 3, NA, 6, 5, 5, 6, 2…
## $ so_comm <dbl> NA, 5, 5, 3, 5, 5, 5, 2, 5, 6, 5, 5, 5, …
## $ ai_select <dbl> 3, 1, 1, 3, 1, 3, 1, 3, 1, 3, 3, 1, 1, 3…
## $ ai_sent <dbl> 5, NA, NA, 5, NA, 1, NA, 2, NA, 2, 1, NA…
## $ ai_acc <dbl> NA, NA, NA, 5, NA, 5, NA, 4, NA, 3, 4, N…
## $ ai_complex <dbl> NA, NA, NA, 1, NA, 2, NA, 1, NA, 1, 3, N…
## $ ai_threat <dbl> NA, NA, NA, 2, NA, 2, NA, 3, NA, 1, 2, N…
## $ survey_length <dbl> NA, NA, 1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1,…
## $ survey_ease <dbl> NA, NA, 2, 2, 2, 2, 3, 1, 3, 2, 2, 3, 2,…
## $ converted_comp_yearly <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ r_used <dbl> NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ r_want_to_use <dbl> NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
```

---

## Selecting variables

We might want to reduce our data frame (or create a new one) to only include a **subset of specific variables**. E.g., select only the variables that measure attitudes towards *AI* (`ai_`) from our full data set. There are two options with .highlight[`base R`]:

Option 1
.small[

``` r
tuesdata_ai <- stackoverflow_survey_single_response[, c("ai_select", "ai_sent", "ai_acc", "ai_complex", "ai_threat")]

# When subsetting with [], the first value refers to rows, the second to columns
# [, c("var1", "var2", ...)] means we want to select all rows but only some specific columns.
```
]

Option 2
.small[

``` r
tuesdata_ai <- subset(stackoverflow_survey_single_response, TRUE, select = c(ai_select, ai_sent, ai_acc, ai_complex, ai_threat))

# The 2nd argument refers to the rows.
# Setting it to TRUE includes all rows in the subset.
```
]

---

## Selecting variables

You can also select variables based on their numeric index.

``` r
tuesdata_ai <- stackoverflow_survey_single_response[, 19:23]

names(tuesdata_ai)
```

```
## [1] "ai_select"  "ai_sent"    "ai_acc"     "ai_complex" "ai_threat"
```

---

## Selecting variables

In .highlight[`tidyverse`], we can create a subset of variables with the `dplyr` verb .highlight[`select()`].

``` r
tuesdata_ai <- stackoverflow_survey_single_response %>% 
  dplyr::select(ai_select,
                ai_sent,
                ai_acc,
                ai_complex,
                ai_threat)

head(tuesdata_ai)
```

```
## # A tibble: 6 × 5
## ai_select ai_sent ai_acc ai_complex ai_threat
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 5 NA NA NA
## 2 1 NA NA NA NA
## 3 1 NA NA NA NA
## 4 3 5 5 1 2
## 5 1 NA NA NA NA
## 6 3 1 5 2 2
```

---

## Selecting a range of variables

There also is a shorthand notation for selecting a set of consecutive columns with `select()`.

``` r
tuesdata_ai <- stackoverflow_survey_single_response %>% 
 dplyr::select(ai_select:ai_threat)

head(tuesdata_ai)
```

---

## Selecting a range of variables

As in .highlight[`base R`], you can also use the numeric index of variables in combination with `select()` from `dplyr`.

``` r
tuesdata_ai <- stackoverflow_survey_single_response %>% 
 dplyr::select(19:23)

names(tuesdata_ai)
```

```
## [1] "ai_select"  "ai_sent"    "ai_acc"     "ai_complex" "ai_threat"
```

---

## Unselecting variables

If you just want to exclude one or a few columns/variables, it is easier to unselect those than to select all others. Again, there are two ways to do this with `base R`.

Option 1
.small[

``` r
tuesdata_cut <- stackoverflow_survey_single_response[!(names(stackoverflow_survey_single_response) %in% c("dev_type", "purchase_influence", "remote_work"))]
# The ! operator means "not" (i.e., it negates a condition)
# The %in% operator means "is included in" (in this case the following character vector)

dim(tuesdata_cut)
```

```
## [1] 65437    25
```
]
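
---

## Unselecting variables

Option 2

A possible second `base R` option is `subset()` with a negated `select` argument. This is only a sketch; the data frame `toy` is made up and merely borrows some column names from the survey data.

``` r
# Hypothetical toy data
toy <- data.frame(
  id = 1:3,
  dev_type = c(16, 10, 33),
  remote_work = c(3, NA, 2),
  age = c(8, 3, 4)
)

# The minus sign in front of c() drops the listed columns
toy_cut <- subset(toy, select = -c(dev_type, remote_work))

names(toy_cut)
```

```
## [1] "id"  "age"
```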

---

## Unselecting variables

You can also use `select()` from `dplyr` to exclude one or more columns/variables.

``` r
tuesdata_cut <- stackoverflow_survey_single_response %>% 
 dplyr::select(-c(dev_type, purchase_influence, remote_work))

dim(tuesdata_cut)
```

```
## [1] 65437    25
```

---

## Advanced ways of selecting variables

`dplyr` offers several helper functions for selecting variables. For a full list of those, you can check the [documentation for the `select()` function](https://dplyr.tidyverse.org/reference/select.html) or `?select()`.

``` r
tuesdata_ai <- stackoverflow_survey_single_response %>% 
 dplyr::select(starts_with("ai"))

tuesdata_freq <- stackoverflow_survey_single_response %>% 
 dplyr::select(ends_with("freq"))

glimpse(tuesdata_freq)
```

```
## Rows: 65,437
## Columns: 2
## $ so_visit_freq <dbl> NA, 5, 5, 3, 5, 5, 3, 4, 5, 2, 2, 5, 5, 3, 3, 1,…
## $ so_part_freq <dbl> NA, 6, 6, NA, 6, 6, 3, NA, 6, 5, 5, 6, 2, 2, 2, …
```
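
---

## Advanced ways of selecting variables

Some further selection helpers, sketched with a small made-up tibble (`toy` only borrows a few column names from the survey data; the values are hypothetical).

``` r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  so_visit_freq = c(5, 3),
  so_part_freq = c(6, 2),
  ai_sent = c(5, 1),
  years_code = c(20, 4),
  years_code_pro = c(17, NA)
)

# Columns whose names contain a string
toy_code <- toy %>%
  dplyr::select(contains("code"))

# Columns whose names match a regular expression
toy_so_freq <- toy %>%
  dplyr::select(matches("^so_.*freq$"))

# Helpers can be combined with &, |, and !
toy_other <- toy %>%
  dplyr::select(!starts_with("so_") & !contains("code"))
```

Here, `toy_code` keeps the two `years_code` columns, `toy_so_freq` the two `so_` columns, and `toy_other` only `ai_sent`.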

---

## Advanced ways of selecting variables

Another particularly useful selection helper is .highlight[`where()`] to select only a specific type of variables.

``` r
tuesdata_num <- stackoverflow_survey_single_response %>% 
 dplyr::select(where(is.numeric)) %>% 
 print()
```

```
## # A tibble: 65,437 × 26
## response_id main_branch age remote_work ed_level years_code
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 8 3 4 NA
## 2 2 1 3 3 2 20
## 3 3 1 4 3 3 37
## 4 4 2 1 NA 7 4
## 5 5 1 1 NA 6 9
## 6 6 4 8 NA 4 10
## 7 7 3 3 3 5 7
## 8 8 2 1 NA 6 1
## 9 9 4 4 2 5 20
## 10 10 1 3 3 3 15
## # ℹ 65,427 more rows
## # ℹ 20 more variables: years_code_pro <dbl>, dev_type <dbl>,
## # org_size <dbl>, purchase_influence <dbl>, buildvs_buy <dbl>,
## # comp_total <dbl>, so_visit_freq <dbl>, so_account <dbl>,
## # so_part_freq <dbl>, so_comm <dbl>, ai_select <dbl>, ai_sent <dbl>,
## # ai_acc <dbl>, ai_complex <dbl>, ai_threat <dbl>,
## # survey_length <dbl>, survey_ease <dbl>, …
```

---

## What's in a name?

One thing that we need to know - and might want to change - are the names of the variables in the dataset.

``` r
names(stackoverflow_survey_single_response)
```

```
##  [1] "response_id"           "main_branch"          
##  [3] "age"                   "remote_work"          
##  [5] "ed_level"              "years_code"           
##  [7] "years_code_pro"        "dev_type"             
##  [9] "org_size"              "purchase_influence"   
## [11] "buildvs_buy"           "country"              
## [13] "currency"              "comp_total"           
## [15] "so_visit_freq"         "so_account"           
## [17] "so_part_freq"          "so_comm"              
## [19] "ai_select"             "ai_sent"              
## [21] "ai_acc"                "ai_complex"           
## [23] "ai_threat"             "survey_length"        
## [25] "survey_ease"           "converted_comp_yearly"
## [27] "r_used"                "r_want_to_use"
```

---

## Renaming variables

It is good practice to use consistent naming conventions. Since `R` is .highlight[case-sensitive], we might want to only use lowercase letters.
As spaces in variable names can cause problems, we could, e.g., decide to use 🐍 *snake_case* (🐫 *camelCase* is a common alternative).

---

## Renaming variables

Renaming columns/variables in `dplyr` with .highlight[`rename()`].

``` r
tuesdata_rn <- stackoverflow_survey_single_response %>% 
  dplyr::rename(ai_workflow = ai_sent, # new_name = old_name
                comm_member = so_comm,
                post_freq = so_part_freq)

names(tuesdata_rn)
```

```
##  [1] "response_id"           "main_branch"          
##  [3] "age"                   "remote_work"          
##  [5] "ed_level"              "years_code"           
##  [7] "years_code_pro"        "dev_type"             
##  [9] "org_size"              "purchase_influence"   
## [11] "buildvs_buy"           "country"              
## [13] "currency"              "comp_total"           
## [15] "so_visit_freq"         "so_account"           
## [17] "post_freq"             "comm_member"          
## [19] "ai_select"             "ai_workflow"          
## [21] "ai_acc"                "ai_complex"           
## [23] "ai_threat"             "survey_length"        
## [25] "survey_ease"           "converted_comp_yearly"
## [27] "r_used"                "r_want_to_use"
```

---

## Renaming variables

For some more advanced renaming options, you can use the `dplyr` function `rename_with()`.

*Note*: The [`janitor` package](https://sfirke.github.io/janitor/) contains the function `clean_names()` that takes a data frame and creates column names that "are unique and consist only of the _ character, numbers, and letters" (from the help file for this function), with the default being 🐍 snake_case (but support for many other types of cases).

``` r
stackoverflow_survey_single_response %>% 
  dplyr::rename_with(toupper) %>% 
  names()
```

```
##  [1] "RESPONSE_ID"           "MAIN_BRANCH"          
##  [3] "AGE"                   "REMOTE_WORK"          
##  [5] "ED_LEVEL"              "YEARS_CODE"           
##  [7] "YEARS_CODE_PRO"        "DEV_TYPE"             
##  [9] "ORG_SIZE"              "PURCHASE_INFLUENCE"   
## [11] "BUILDVS_BUY"           "COUNTRY"              
## [13] "CURRENCY"              "COMP_TOTAL"           
## [15] "SO_VISIT_FREQ"         "SO_ACCOUNT"           
## [17] "SO_PART_FREQ"          "SO_COMM"              
## [19] "AI_SELECT"             "AI_SENT"              
## [21] "AI_ACC"                "AI_COMPLEX"           
## [23] "AI_THREAT"             "SURVEY_LENGTH"        
## [25] "SURVEY_EASE"           "CONVERTED_COMP_YEARLY"
## [27] "R_USED"                "R_WANT_TO_USE"
```

---

## Renaming variables

We can use `rename_with()` in combination with `gsub()` to remove (or change) prefixes in variable names.

``` r
stackoverflow_survey_single_response %>% 
  dplyr::select(ai_select:ai_threat) %>% 
  dplyr::rename_with(~ gsub("ai", "ai_attid", .x,
                     fixed = TRUE)) %>% 
  names()
```

```
## [1] "ai_attid_select"  "ai_attid_sent"    "ai_attid_acc"    
## [4] "ai_attid_complex" "ai_attid_threat"
```

---

## Re~~wind~~name select

A nice thing about the `dplyr` verb `select()` is that you can use it to select and rename variables in one step.

``` r
tuesdata_ai <- stackoverflow_survey_single_response %>% 
  dplyr::select(ai_workflow = ai_sent, # new_name = old_name
                comm_member = so_comm,
                post_freq = so_part_freq)

head(tuesdata_ai)
```

```
## # A tibble: 6 × 3
## ai_workflow comm_member post_freq
## <dbl> <dbl> <dbl>
## 1 5 NA NA
## 2 NA 5 6
## 3 NA 5 6
## 4 5 3 NA
## 5 NA 5 6
## 6 1 5 6
```

---
class: center, middle

# [Exercise](https://rawcdn.githack.com/nika-akin/r-intro/9d05476f895e390df08662eecbefd4137f67acf4/exercises/Exercise_2_1_1_Selecting_Renaming_Steps.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://rawcdn.githack.com/nika-akin/r-intro/9d05476f895e390df08662eecbefd4137f67acf4/solutions/Exercise_2_1_1_Selecting_Renaming_Steps.html)

---

## Filtering rows

Filter rows/observations dependent on one or more conditions.

To filter rows/observations you can use... 
- **comparison operators**:
 - **<** (smaller than)
 - **<=** (smaller than or equal to)
 - **==** (equal to)
 - **!=** (not equal to)
 - **>=** (larger than or equal to)
 - **>** (larger than)
 - **%in%** (included in)

---

## Filtering rows

... and combine comparisons with
- **logical operators**:
    - **&** (and)
    - **|** (or)
    - **!** (not)
    - **xor** (either or, not both)
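
---

## Filtering rows: operators in action

Comparisons return logical vectors, which the logical operators then combine element by element. A small sketch with two made-up vectors (the values are hypothetical):

``` r
age <- c(1, 3, NA, 8)
main_branch <- c(1, 2, 1, 1)

older <- age > 1              # FALSE  TRUE    NA  TRUE
developer <- main_branch == 1 #  TRUE FALSE  TRUE  TRUE

age %in% c(1, 8)      #  TRUE FALSE FALSE  TRUE
older & developer     # FALSE FALSE    NA  TRUE
older | developer     #  TRUE  TRUE  TRUE  TRUE
!older                #  TRUE FALSE    NA FALSE
xor(older, developer) #  TRUE  TRUE    NA FALSE
```

Note that comparisons with `NA` result in `NA`, whereas `%in%` returns `FALSE` for a missing value that is not part of the comparison vector.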

---

## Filtering rows

Similar to selecting columns/variables, there are two options for filtering rows/observations with `base R`.

Option 1

``` r
tuesdata_age <- stackoverflow_survey_single_response[which(stackoverflow_survey_single_response$age == 1), ] # 18-24

dim(tuesdata_age)
```

```
## [1] 14098    28
```

Option 2

``` r
tuesdata_age <- subset(stackoverflow_survey_single_response, age == 1)

dim(tuesdata_age)
```

```
## [1] 14098    28
```

---

## Filtering rows

The `dplyr` solution for filtering rows/observations is the verb `filter()`.

``` r
tuesdata_age <- stackoverflow_survey_single_response %>% 
 dplyr::filter(age == 1)

dim(tuesdata_age)
```

```
## [1] 14098    28
```
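
---

## Filtering rows and missing values

The three approaches differ from plain logical subsetting in how they treat missing values in the condition. A sketch with a made-up data frame `toy` (hypothetical values):

``` r
library(dplyr)

# Hypothetical toy data with a missing value in the filter variable
toy <- data.frame(id = 1:4, age = c(1, NA, 3, 1))

# A bare logical index returns an all-NA row for each NA in the condition
n_logical <- nrow(toy[toy$age == 1, ])        # 3

# which() only returns the positions for which the condition is TRUE
n_which <- nrow(toy[which(toy$age == 1), ])   # 2

# subset() and filter() also drop rows for which the condition is NA
n_subset <- nrow(subset(toy, age == 1))       # 2
n_filter <- nrow(dplyr::filter(toy, age == 1)) # 2
```

This is why the `base R` bracket solution wraps the condition in `which()`.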

---

## Filtering rows based on multiple conditions

``` r
tuesdata_filter <- stackoverflow_survey_single_response %>% 
  dplyr::filter(org_size > 1, so_visit_freq > 2, main_branch != 1)

dim(tuesdata_filter)
```

```
## [1] 1398   28
```

---

## `dplyr::filter` - multiple conditions

By default, multiple conditions in `filter()` are combined with `&` (and). You can, however, also combine conditions differently.

**or** (cases for which at least one of the conditions is true)

``` r
tuesdata_developer <- stackoverflow_survey_single_response %>% 
  dplyr::filter(main_branch == 1 | # developer
                  age > 1)

dim(tuesdata_developer)
```

```
## [1] 60371    28
```

---

## `dplyr::filter` - multiple conditions

**xor** (cases for which only one of the two conditions is true)

``` r
tuesdata_developer_or_age <- stackoverflow_survey_single_response %>%
  dplyr::filter(xor(main_branch == 1,
                    age > 1))

dim(tuesdata_developer_or_age)
```

```
## [1] 19196    28
```

---

## Advanced ways of filtering rows

Similar to `select()` there are some helper functions for `filter()` for advanced filtering of rows. For example, you can...

- Filter rows based on a .highlight[range in a numeric variable]

``` r
tuesdata_frequent_user <- stackoverflow_survey_single_response %>% 
 dplyr::filter(dplyr::between(so_visit_freq, 2, 3))

dim(tuesdata_frequent_user)
```

```
## [1] 33847    28
```

*Note*: The range specified in `between()` is inclusive (on both sides).
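
---

## Advanced ways of filtering rows

A sketch showing that `between()` is a shortcut for two comparisons. The tibble `toy` and its values are made up for illustration.

``` r
library(dplyr)

# Hypothetical toy data
toy <- tibble(id = 1:5, so_visit_freq = c(1, 2, 3, 4, NA))

with_between <- toy %>%
  dplyr::filter(dplyr::between(so_visit_freq, 2, 3))

with_operators <- toy %>%
  dplyr::filter(so_visit_freq >= 2, so_visit_freq <= 3)

identical(with_between, with_operators)
```

Both versions keep the rows with the values 2 and 3 and drop the row with the missing value.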

---

## Advanced ways of filtering rows

- Filter rows based on the values of specific variables matching certain criteria

``` r
tuesdata_high_engagement <- stackoverflow_survey_single_response %>% 
  # keep rows for which all variables whose names start with "so_" are >= 5
  dplyr::filter(dplyr::if_all(dplyr::starts_with("so_"), ~ .x >= 5))

dim(tuesdata_high_engagement)
```

*Note*: The helper function `if_any()` can be used to specify that at least one of the variables needs to match a certain criterion.
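
---

## Advanced ways of filtering rows

A sketch contrasting `if_all()` and `if_any()`. The tibble `toy` is made up and only borrows two column names from the survey data.

``` r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  id = 1:4,
  so_visit_freq = c(5, 5, 2, 1),
  so_part_freq = c(6, 1, 3, 6)
)

# All selected variables need to be >= 5
all_high <- toy %>%
  dplyr::filter(dplyr::if_all(dplyr::starts_with("so_"), ~ .x >= 5))

# At least one of the selected variables needs to be >= 5
any_high <- toy %>%
  dplyr::filter(dplyr::if_any(dplyr::starts_with("so_"), ~ .x >= 5))
```

In this sketch, `all_high` only contains the row with `id` 1, while `any_high` contains the rows with `id` 1, 2, and 4.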

---

## Selecting columns + filtering rows

The `tidyverse` solution for combining the selection of columns and the filtering of rows is chaining these steps together in a pipe. Note that the order of the steps can matter: Here, we need to filter first because the filter variable `so_part_freq` is not among the selected columns.

``` r
tuesdata_freq_ai <- stackoverflow_survey_single_response %>% 
 dplyr::filter(so_part_freq == 1) %>% 
 dplyr::select(ai_select:ai_threat)

dim(tuesdata_freq_ai)
```

```
## [1] 6277    5
```
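
---

## Selecting columns + filtering rows

A sketch of what happens if the two steps are swapped. The tibble `toy` is made up and only borrows some column names from the survey data.

``` r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  so_part_freq = c(1, 2, 1),
  ai_select = c(3, 1, 1),
  ai_sent = c(5, NA, 2)
)

# Filter first, then select: works
filter_first <- toy %>%
  dplyr::filter(so_part_freq == 1) %>%
  dplyr::select(starts_with("ai"))

# Select first, then filter: so_part_freq is no longer available,
# so filter() throws an error (caught here with try())
select_first <- try(
  toy %>%
    dplyr::select(starts_with("ai")) %>%
    dplyr::filter(so_part_freq == 1),
  silent = TRUE
)

inherits(select_first, "try-error")
```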

---

## (Re-)Arranging the order of rows

The `dplyr` verb for changing the order of rows in a data set is `arrange()`, and you can use it in the same ways as the `base R` equivalent: sorting by a single variable in ascending order, ...

``` r
stackoverflow_survey_single_response %>% 
  dplyr::arrange(age) %>% 
  dplyr::select(19:23) %>% 
  glimpse()
```

```
## Rows: 65,437
## Columns: 5
## $ ai_select <dbl> 3, 1, 3, 2, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 1, 1, …
## $ ai_sent <dbl> 5, NA, 2, 1, NA, 1, 1, 1, 1, 1, 4, NA, 1, 5, 2, NA,…
## $ ai_acc <dbl> 5, NA, 4, NA, NA, 5, 4, 4, 5, 4, 3, NA, 4, 5, 5, NA…
## $ ai_complex <dbl> 1, NA, 1, NA, NA, 2, 4, 2, 3, 4, 1, NA, 2, 2, 1, NA…
## $ ai_threat <dbl> 2, NA, 3, 1, NA, 2, 2, 3, 2, 2, 2, NA, 1, 2, 2, NA,…
```

---

## (Re-)Arranging the order of rows

... sorting by a single variable in descending order, ...

``` r
stackoverflow_survey_single_response %>% 
  dplyr::arrange(desc(age)) %>% 
  dplyr::select(19:23) %>% 
  glimpse()
```

```
## Rows: 65,437
## Columns: 5
## $ ai_select <dbl> 3, 3, 3, 3, 3, 1, 3, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, …
## $ ai_sent <dbl> 5, 1, 1, 1, 2, NA, 2, 1, NA, NA, NA, NA, NA, 2, NA,…
## $ ai_acc <dbl> NA, 5, 4, 4, 3, NA, 1, 1, NA, NA, NA, NA, NA, 3, NA…
## $ ai_complex <dbl> NA, 2, 3, 1, 2, NA, 4, 4, NA, NA, NA, NA, NA, 1, NA…
## $ ai_threat <dbl> NA, 2, 2, 2, 2, NA, 2, 3, NA, NA, NA, NA, NA, 3, NA…
```