
Data Literacy: Introduction to R

Data Wrangling - Part 1

Veronika Batzdorfer

2025-05-23

1 / 52

Data wrangling

The process of re-shaping, re-formatting, and re-arranging raw data for analysis

2 / 52

Steps of data wrangling

Steps when working with tabular data in the social & behavioral sciences (e.g., from surveys) include:

  • selecting a subset of variables
  • renaming variables
  • relocating variables
  • filtering a subset of cases
  • recoding variables/values
  • missing values recoding
  • creating/computing new variables
3 / 52

Steps of data wrangling

Steps when working with tabular data in the social & behavioral sciences (e.g., from surveys) include:

  • selecting a subset of variables
  • renaming variables
  • relocating variables
  • filtering a subset of cases
  • recoding variables/values
  • missing values recoding
  • creating/computing new variables

The (in)famous 80/20-rule: 80% wrangling, 20% analysis (of course, this ratio relates to the time required for writing the code, not the computing time).

4 / 52

The tidyverse

The tidyverse is a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy (Rickert, 2017).

5 / 52

Benefits of the tidyverse

Tidyverse syntax is designed to increase

  • human readability, making it attractive for R novices as it can facilitate self-efficacy (see Robinson, 2017)
  • consistency (e.g., a data frame as the first argument and as the output)
  • smarter defaults (e.g., no partial matching of data frame and column names)
6 / 52

The 'dark side' of the tidyverse

The tidyverse is not R as in base R

  • some routines feel like using a whole different language, which...

    • ... can be nice when learning R
    • ... can make it difficult to search for solutions to certain problems
  • many tidyverse functions are still under heavy development

To learn more about the tidyverse lifecycle, you can watch this talk by Hadley Wickham or read the corresponding documentation.

7 / 52

Base R vs. tidyverse

Similar to other fierce academic debates over, e.g., R vs. Python or Frequentism vs. Bayesianism, people have argued for and against using/teaching the tidyverse.

But what unites both:

8 / 52

Structure & focus of this session

  • we will highlight differences between base R and the tidyverse

  • our main focus will be on packages (and functions) from the tidyverse and how they can be used to clean and transform your data.

Of course, it is possible to combine base R and tidyverse code. However, in the long run, you should try to aim for consistency.

9 / 52

Lift-off into the tidyverse 🚀

Install all tidyverse packages (for the full list of tidyverse packages see https://www.tidyverse.org/packages/)

install.packages("tidyverse")

Load core tidyverse packages (NB: To save time and reduce namespace conflicts you can also load tidyverse packages individually)

library(tidyverse) ##load the tidyverse package
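For instance, a minimal sketch of loading individual packages instead (assuming dplyr and readr are all a given script needs):

library(dplyr) # data manipulation verbs
library(readr) # reading rectangular data (e.g., CSV files)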
10 / 52

tidyverse vocabulary 101

While there is much more to the tidyverse than this, three important concepts that you need to be familiar with if you want to use it are:

  1. Tidy data

  2. Tibbles

  3. Pipes

(We already discussed tibbles in the session on Data Import & Export, so we will focus on tidy data and pipes here.)

11 / 52

Tidy data 🤠

The 3 rules of tidy data:

  1. Each variable is in a separate column.

  2. Each observation is in a separate row.

  3. Each value is in a separate cell.

Source: https://r4ds.had.co.nz/tidy-data.html

Note: In the tidyverse terminology 'tidy data' usually also means data in long format (where applicable).

12 / 52

Wide vs. long format

Source: https://github.com/gadenbuie/tidyexplain#tidy-data

Note: The functions pivot_wider() and pivot_longer() from the tidyr package are easy-to-use options for changing data from long to wide format and vice versa.
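A minimal sketch with a small made-up data frame (the object and column names are illustrative, not from the survey data used later):

library(tidyverse)

wide <- tibble(
  id = 1:3,
  wave_1 = c(2, 4, 1),
  wave_2 = c(3, 5, 2)
)

# wide -> long: one row per id-wave combination
long <- wide %>%
  pivot_longer(cols = starts_with("wave"),
               names_to = "wave",
               values_to = "score")

# long -> wide again
wide_again <- long %>%
  pivot_wider(names_from = wave, values_from = score)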

13 / 52

Pipes (%>% == and then)

Usually, in R we apply functions as follows:

f(x)

In the logic of pipes this function is written as:

x %>% f(.)

Here, object x is piped into function f, becoming (by default) its first argument (but by using . it can also be fed into other arguments).

14 / 52

Pipes (%>% == and then)

Usually, in R we apply functions as follows:

f(x)

In the logic of pipes this function is written as:

x %>% f(.)

Here, object x is piped into function f, becoming (by default) its first argument (but by using . it can also be fed into other arguments).

We can use pipes with more than one function:

x %>%
f_1() %>%
f_2() %>%
f_3()
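As a minimal sketch of the . placeholder, piping an object into an argument other than the first:

"tidyverse" %>% gsub("verse", "data", x = .)
## [1] "tidydata"
# equivalent to gsub("verse", "data", x = "tidyverse")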
15 / 52

Pipes ("Chaining")

  • (((Onions))) vs. Pipes

  • The %>% used in the tidyverse comes from the magrittr package and is used to pass data on to the next function.

  • RStudio offers a keyboard shortcut for inserting %>%: Ctrl + Shift + M (Windows & Linux)/Cmd + Shift + M (Mac)
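To illustrate the '(((Onions)))' point: nested base R calls are read from the inside out, while the piped version reads top to bottom. A minimal sketch with arbitrary numbers:

# nested ("onion") style
round(mean(sqrt(c(1, 4, 9, 16))), 1)
## [1] 2.5

# piped style: the same computation, read step by step
c(1, 4, 9, 16) %>%
  sqrt() %>%
  mean() %>%
  round(1)
## [1] 2.5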

16 / 52

Data set

We will use data from the Stack Overflow Annual Developer Survey 2024.

Remember: To code along and for the exercises, the tuesdata files should be in a sub-folder called data in the same folder as the other materials for this course.

17 / 52

Note: Tidy vs. untidy data

The tuesdata is already tidy. If you collect data yourself, the raw data may be untidy, e.g.:

  • cells may hold more than one value
  • a variable that should be in one column is spread across multiple columns (e.g., parts of a date or name).

If you need to make your data tidy or change it from wide to long format or vice versa, the tidyr package from the tidyverse is a good option.

18 / 52

Interlude 1: Citing FOSS

There is a function in R that tells you how to cite R itself or any of the packages you have used (to see which packages you have loaded, see sessionInfo()).

citation()
## To cite R in publications use:
##
## R Core Team (2023). _R: A Language and Environment for
## Statistical Computing_. R Foundation for Statistical
## Computing, Vienna, Austria. <https://www.R-project.org/>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {R: A Language and Environment for Statistical Computing},
## author = {{R Core Team}},
## organization = {R Foundation for Statistical Computing},
## address = {Vienna, Austria},
## year = {2023},
## url = {https://www.R-project.org/},
## }
##
## We have invested a lot of time and effort in creating R, please
## cite it when using it for data analysis. See also
## 'citation("pkgname")' for citing R packages.
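Citing a specific package works the same way, e.g. (assuming the tidyverse is installed):

citation("tidyverse") # prints the reference for the tidyverse meta-package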
19 / 52

Interlude 3: Codebook

It is always advisable to consult the codebook (if there is one) before starting to work with a data set.

Side note: If you want to (semi-)automatically generate a codebook for your own dataset, there are several options in R:

20 / 52

Load the data

The first step is loading the data into R.

## install.packages("tidytuesdayR")
library(tidytuesdayR)

# Option 1: download the data via the tidytuesdayR package
tuesdata <- tidytuesdayR::tt_load('2024-09-03')
qname_levels_single_response_crosswalk <- tuesdata$qname_levels_single_response_crosswalk
stackoverflow_survey_questions <- tuesdata$stackoverflow_survey_questions
stackoverflow_survey_single_response <- tuesdata$stackoverflow_survey_single_response

# Option 2: read the local CSV files from the data sub-folder
# (read_csv() comes from readr, which is loaded with the tidyverse)
stackoverflow_survey_questions <- read_csv("./data/stackoverflow_survey_questions.csv")
stackoverflow_survey_single_response <- read_csv("./data/stackoverflow_survey_single_response.csv")
qname_levels_single_response_crosswalk <- read_csv("./data/qname_levels_single_response_crosswalk.csv")
21 / 52

dplyr

The tidyverse examples in the following will make use of dplyr functions. These are verbs that signal an action (e.g., group_by(), glimpse(), filter()). Their structure is:

  1. The first argument is a data frame.
  2. The subsequent arguments describe what to do with the data frame.
  3. The result is a new data frame (tibble).

    • columns (= variables in a tidy data frame) can be referenced without quotation marks (non-standard evaluation)
    • actions (verbs) can be applied to columns (variables) and rows (cases/observations)
22 / 52

First look 👀

Get a good first look at your data: the function glimpse() prints a data frame/tibble with columns shown as rows (and rows as columns) and also provides some additional information about the data frame and its columns.

stackoverflow_survey_single_response %>%
glimpse()

↪️

23 / 52
## Rows: 65,437
## Columns: 28
## $ response_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ main_branch <dbl> 1, 1, 1, 2, 1, 4, 3, 2, 4, 1, 5, 1, 1, 5…
## $ age <dbl> 8, 3, 4, 1, 1, 8, 3, 1, 4, 3, 3, 4, 3, 3…
## $ remote_work <dbl> 3, 3, 3, NA, NA, NA, 3, NA, 2, 3, 3, 2, …
## $ ed_level <dbl> 4, 2, 3, 7, 6, 4, 5, 6, 5, 3, 2, 5, 2, 2…
## $ years_code <dbl> NA, 20, 37, 4, 9, 10, 7, 1, 20, 15, 20, …
## $ years_code_pro <dbl> NA, 17, 27, NA, NA, NA, 7, NA, NA, 11, N…
## $ dev_type <dbl> NA, 16, 10, 16, 16, 33, 1, 33, 1, 16, 28…
## $ org_size <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ purchase_influence <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ buildvs_buy <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ country <chr> "United States of America", "United King…
## $ currency <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ comp_total <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ so_visit_freq <dbl> NA, 5, 5, 3, 5, 5, 3, 4, 5, 2, 2, 5, 5, …
## $ so_account <dbl> NA, 3, 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 3, …
## $ so_part_freq <dbl> NA, 6, 6, NA, 6, 6, 3, NA, 6, 5, 5, 6, 2…
## $ so_comm <dbl> NA, 5, 5, 3, 5, 5, 5, 2, 5, 6, 5, 5, 5, …
## $ ai_select <dbl> 3, 1, 1, 3, 1, 3, 1, 3, 1, 3, 3, 1, 1, 3…
## $ ai_sent <dbl> 5, NA, NA, 5, NA, 1, NA, 2, NA, 2, 1, NA…
## $ ai_acc <dbl> NA, NA, NA, 5, NA, 5, NA, 4, NA, 3, 4, N…
## $ ai_complex <dbl> NA, NA, NA, 1, NA, 2, NA, 1, NA, 1, 3, N…
## $ ai_threat <dbl> NA, NA, NA, 2, NA, 2, NA, 3, NA, 1, 2, N…
## $ survey_length <dbl> NA, NA, 1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1,…
## $ survey_ease <dbl> NA, NA, 2, 2, 2, 2, 3, 1, 3, 2, 2, 3, 2,…
## $ converted_comp_yearly <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ r_used <dbl> NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ r_want_to_use <dbl> NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
24 / 52

Selecting variables

We might want to reduce our data frame (or create a new one) to include only a subset of specific variables, e.g., only the variables that measure attitudes towards AI (those starting with ai_) from our full data set. There are two options with base R:

Option 1

tuesdata_ai <- stackoverflow_survey_single_response[, c("ai_select", "ai_sent", "ai_acc", "ai_complex", "ai_threat")]
# When subsetting with [], the first value refers to rows, the second to columns
# [, c("var1", "var2", ...)] means we want to select all rows but only some specific columns.

Option 2

tuesdata_ai <- subset(stackoverflow_survey_single_response, TRUE, select = c(ai_select, ai_sent, ai_acc, ai_complex, ai_threat))
# The 2nd argument refers to the rows.
# Setting it to TRUE includes all rows in the subset.
25 / 52

Selecting variables

You can also select variables based on their numeric index.

tuesdata_ai <- stackoverflow_survey_single_response[, 19:23]
names(tuesdata_ai)
## [1] "ai_select" "ai_sent" "ai_acc" "ai_complex" "ai_threat"
26 / 52

Selecting variables

In the tidyverse, we can create a subset of variables with the dplyr verb select().

tuesdata_ai <- stackoverflow_survey_single_response %>%
dplyr::select(ai_select,
ai_sent,
ai_acc,
ai_complex,
ai_threat)
head(tuesdata_ai)
## # A tibble: 6 × 5
## ai_select ai_sent ai_acc ai_complex ai_threat
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 5 NA NA NA
## 2 1 NA NA NA NA
## 3 1 NA NA NA NA
## 4 3 5 5 1 2
## 5 1 NA NA NA NA
## 6 3 1 5 2 2
27 / 52

Selecting a range of variables

There also is a shorthand notation for selecting a set of consecutive columns with select().

tuesdata_ai <- stackoverflow_survey_single_response %>%
dplyr::select(ai_select:ai_threat)
head(tuesdata_ai)
## # A tibble: 6 × 5
## ai_select ai_sent ai_acc ai_complex ai_threat
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 5 NA NA NA
## 2 1 NA NA NA NA
## 3 1 NA NA NA NA
## 4 3 5 5 1 2
## 5 1 NA NA NA NA
## 6 3 1 5 2 2
28 / 52

Selecting a range of variables

As in base R, you can also use the numeric index of variables in combination with select() from dplyr.

tuesdata_ai <- stackoverflow_survey_single_response %>%
dplyr::select(19:23)
names(tuesdata_ai)
## [1] "ai_select" "ai_sent" "ai_acc" "ai_complex" "ai_threat"
29 / 52

Unselecting variables

If you just want to exclude one or a few columns/variables, it is easier to unselect those than to select all others. Again, there are two ways to do this with base R.

Option 1

tuesdata_cut <- stackoverflow_survey_single_response[!(names(stackoverflow_survey_single_response) %in% c("dev_type", "purchase_influence", "remote_work"))]
# The ! operator means "not" (i.e., it negates a condition)
# The %in% operator means "is included in" (in this case, the following character vector)
dim(tuesdata_cut)
## [1] 65437 25
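Option 2 is not shown above; a minimal sketch using subset() with a negated select argument, which should give the same result as Option 1:

tuesdata_cut <- subset(stackoverflow_survey_single_response,
                       select = -c(dev_type, purchase_influence, remote_work))
dim(tuesdata_cut)
## [1] 65437 25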
30 / 52

Unselecting variables

You can also use select() from dplyr to exclude one or more columns/variables.

tuesdata_cut <- stackoverflow_survey_single_response %>%
dplyr::select(-c(dev_type, purchase_influence, remote_work))
dim(tuesdata_cut)
## [1] 65437 25
31 / 52

Advanced ways of selecting variables

dplyr offers several helper functions for selecting variables. For a full list of those, you can check the documentation for the select() function via ?select.

tuesdata_ai <- stackoverflow_survey_single_response %>%
dplyr::select(starts_with("ai"))
tuesdata_freq <- stackoverflow_survey_single_response %>%
dplyr::select(ends_with("freq"))
glimpse(tuesdata_freq)
## Rows: 65,437
## Columns: 2
## $ so_visit_freq <dbl> NA, 5, 5, 3, 5, 5, 3, 4, 5, 2, 2, 5, 5, 3, 3, 1,…
## $ so_part_freq <dbl> NA, 6, 6, NA, 6, 6, 3, NA, 6, 5, 5, 6, 2, 2, 2, …
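Other helpers work similarly. A minimal sketch with contains(), which matches a literal substring anywhere in the name (note that this should also pick up ai_complex, because "complex" contains "comp"):

stackoverflow_survey_single_response %>%
  dplyr::select(contains("comp")) %>%
  names()
## [1] "comp_total"            "ai_complex"            "converted_comp_yearly"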
32 / 52

Advanced ways of selecting variables

Another particularly useful selection helper is where(), which selects only variables of a specific type.

tuesdata_num <- stackoverflow_survey_single_response %>%
dplyr::select(where(is.numeric)) %>%
print()
## # A tibble: 65,437 × 26
## response_id main_branch age remote_work ed_level years_code
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 8 3 4 NA
## 2 2 1 3 3 2 20
## 3 3 1 4 3 3 37
## 4 4 2 1 NA 7 4
## 5 5 1 1 NA 6 9
## 6 6 4 8 NA 4 10
## 7 7 3 3 3 5 7
## 8 8 2 1 NA 6 1
## 9 9 4 4 2 5 20
## 10 10 1 3 3 3 15
## # ℹ 65,427 more rows
## # ℹ 20 more variables: years_code_pro <dbl>, dev_type <dbl>,
## # org_size <dbl>, purchase_influence <dbl>, buildvs_buy <dbl>,
## # comp_total <dbl>, so_visit_freq <dbl>, so_account <dbl>,
## # so_part_freq <dbl>, so_comm <dbl>, ai_select <dbl>, ai_sent <dbl>,
## # ai_acc <dbl>, ai_complex <dbl>, ai_threat <dbl>,
## # survey_length <dbl>, survey_ease <dbl>, …
33 / 52

What's in a name?

One thing that we need to know - and might want to change - is the names of the variables in the dataset.

names(stackoverflow_survey_single_response)
## [1] "response_id" "main_branch"
## [3] "age" "remote_work"
## [5] "ed_level" "years_code"
## [7] "years_code_pro" "dev_type"
## [9] "org_size" "purchase_influence"
## [11] "buildvs_buy" "country"
## [13] "currency" "comp_total"
## [15] "so_visit_freq" "so_account"
## [17] "so_part_freq" "so_comm"
## [19] "ai_select" "ai_sent"
## [21] "ai_acc" "ai_complex"
## [23] "ai_threat" "survey_length"
## [25] "survey_ease" "converted_comp_yearly"
## [27] "r_used" "r_want_to_use"
34 / 52

Renaming variables

It is good practice to use consistent naming conventions. Since R is case-sensitive, we might want to use only lowercase letters.
As spaces in variable names can cause problems, we could, e.g., decide to use 🐍 snake_case (🐫 camelCase is a common alternative).

35 / 52

Renaming variables

Renaming columns/variables in dplyr with rename().

tuesdata_rn <- stackoverflow_survey_single_response %>%
dplyr::rename(ai_workflow = ai_sent, # new_name = old_name
comm_member = so_comm,
post_freq = so_part_freq
)
names(tuesdata_rn)
## [1] "response_id" "main_branch"
## [3] "age" "remote_work"
## [5] "ed_level" "years_code"
## [7] "years_code_pro" "dev_type"
## [9] "org_size" "purchase_influence"
## [11] "buildvs_buy" "country"
## [13] "currency" "comp_total"
## [15] "so_visit_freq" "so_account"
## [17] "post_freq" "comm_member"
## [19] "ai_select" "ai_workflow"
## [21] "ai_acc" "ai_complex"
## [23] "ai_threat" "survey_length"
## [25] "survey_ease" "converted_comp_yearly"
## [27] "r_used" "r_want_to_use"
36 / 52

Renaming variables

For some more advanced renaming options, you can use the dplyr function rename_with().

Note: The janitor package contains the function clean_names() that takes a data frame and creates column names that "are unique and consist only of the _ character, numbers, and letters" (from the help file for this function), with the default being 🐍 snake_case (but with support for many other types of cases).

stackoverflow_survey_single_response %>%
dplyr::rename_with(toupper) %>%
names()
## [1] "RESPONSE_ID" "MAIN_BRANCH"
## [3] "AGE" "REMOTE_WORK"
## [5] "ED_LEVEL" "YEARS_CODE"
## [7] "YEARS_CODE_PRO" "DEV_TYPE"
## [9] "ORG_SIZE" "PURCHASE_INFLUENCE"
## [11] "BUILDVS_BUY" "COUNTRY"
## [13] "CURRENCY" "COMP_TOTAL"
## [15] "SO_VISIT_FREQ" "SO_ACCOUNT"
## [17] "SO_PART_FREQ" "SO_COMM"
## [19] "AI_SELECT" "AI_SENT"
## [21] "AI_ACC" "AI_COMPLEX"
## [23] "AI_THREAT" "SURVEY_LENGTH"
## [25] "SURVEY_EASE" "CONVERTED_COMP_YEARLY"
## [27] "R_USED" "R_WANT_TO_USE"
37 / 52

Renaming variables

We can use rename_with() in combination with gsub() to remove (or change) prefixes in variable names.

stackoverflow_survey_single_response %>%
dplyr::select(ai_select:ai_threat) %>%
dplyr::rename_with(~ gsub("ai", "ai_attid", .x,
fixed = TRUE)) %>%
names()
## [1] "ai_attid_select" "ai_attid_sent" "ai_attid_acc"
## [4] "ai_attid_complex" "ai_attid_threat"
38 / 52

Select + rename

A nice thing about the dplyr verb select is that you can use it to select and rename variables in one step.

tuesdata_ai <- stackoverflow_survey_single_response %>%
dplyr::select(ai_workflow = ai_sent, # new_name = old_name
comm_member = so_comm,
post_freq = so_part_freq )
head(tuesdata_ai)
## # A tibble: 6 × 3
## ai_workflow comm_member post_freq
## <dbl> <dbl> <dbl>
## 1 5 NA NA
## 2 NA 5 6
## 3 NA 5 6
## 4 5 3 NA
## 5 NA 5 6
## 6 1 5 6
39 / 52

Exercise time 🏋️‍♀️💪🏃🚴

Solutions

40 / 52

Filtering rows

Filter rows/observations depending on one or more conditions.

To filter rows/observations you can use...

  • comparison operators:
    • < (smaller than)
    • <= (smaller than or equal to)
    • == (equal to)
    • != (not equal to)
    • >= (larger than or equal to)
    • > (larger than)
    • %in% (included in)
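Each of these comparisons returns a logical vector (TRUE/FALSE per case) that can be used for filtering. A minimal sketch of the perhaps least familiar one, %in%:

c(1, 3, 5) %in% c(1, 2)
## [1]  TRUE FALSE FALSE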
41 / 52

Filtering rows

... and combine comparisons with

  • logical operators:
    • & (and)
    • | (or)
    • ! (not)
    • xor (either or, not both)
42 / 52

Filtering rows

Similar to selecting columns/variables, there are two options for filtering rows/observations with base R.

Option 1

tuesdata_age <- stackoverflow_survey_single_response[which(stackoverflow_survey_single_response$age == 1), ] # 18-24
dim(tuesdata_age)
## [1] 14098 28

Option 2

tuesdata_age <- subset(stackoverflow_survey_single_response, age == 1)
dim(tuesdata_age)
## [1] 14098 28
43 / 52

Filtering rows

The dplyr solution for filtering rows/observations is the verb filter().

tuesdata_age <- stackoverflow_survey_single_response %>%
dplyr::filter(age == 1)
dim(tuesdata_age)
## [1] 14098 28
44 / 52

Filtering rows based on multiple conditions

tuesdata_filter <- stackoverflow_survey_single_response %>%
dplyr::filter(org_size > 1, so_visit_freq > 2, main_branch != 1)
dim(tuesdata_filter)
## [1] 1398 28
45 / 52

dplyr::filter - multiple conditions

By default, multiple conditions in filter() are combined with & (and). You can, however, also combine conditions differently.

or (cases for which at least one of the conditions is true)

tuesdata_developer <- stackoverflow_survey_single_response %>%
dplyr::filter(main_branch == 1 | #developer
age > 1)
dim(tuesdata_developer)
## [1] 60371 28
46 / 52

dplyr::filter - multiple conditions

xor (cases for which only one of the two conditions is true)

tuesdata_developer_or_age <- stackoverflow_survey_single_response %>%
dplyr::filter(xor(main_branch == 1,
age > 1))
dim(tuesdata_developer_or_age)
## [1] 19196 28
47 / 52

Advanced ways of filtering rows

Similar to select(), there are some helper functions for filter() for more advanced filtering of rows. For example, you can...

  • Filter rows based on a range in a numeric variable
tuesdata_frequent_user <- stackoverflow_survey_single_response %>%
dplyr::filter(dplyr::between(so_visit_freq, 2, 3))
dim(tuesdata_frequent_user)
## [1] 33847 28

Note: The range specified in between() is inclusive (on both sides).

48 / 52

Advanced ways of filtering rows

  • Filter rows based on the values of specific variables matching certain criteria
tuesdata_high_engagement <- stackoverflow_survey_single_response %>%
# keep rows in which all variables whose names start with "so" have values >= 5
dplyr::filter(dplyr::if_all(dplyr::starts_with("so"), ~ . >= 5))
dim(tuesdata_high_engagement)

Note: The helper function if_any() can be used to specify that at least one of the variables needs to match a certain criterion.

49 / 52

Selecting columns + filtering rows

The tidyverse solution for combining the selection of columns and the filtering of rows is chaining these steps together in a pipe. Note that the order of the steps can matter: if you select() first, the variable(s) you filter on must be among the selected columns.

tuesdata_freq_ai <- stackoverflow_survey_single_response %>%
dplyr::filter(so_part_freq == 1) %>%
dplyr::select(ai_select:ai_threat)
dim(tuesdata_freq_ai)
## [1] 6277 5
50 / 52

(Re-)Arranging the order of rows

The dplyr verb for changing the order of rows in a data set is arrange(), and you can use it in much the same way as its base R equivalent (order()): sorting by a single variable in ascending order, ...

stackoverflow_survey_single_response %>%
dplyr::arrange(age) %>%
dplyr::select(19:23) %>%
glimpse()
## Rows: 65,437
## Columns: 5
## $ ai_select <dbl> 3, 1, 3, 2, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 1, 1, …
## $ ai_sent <dbl> 5, NA, 2, 1, NA, 1, 1, 1, 1, 1, 4, NA, 1, 5, 2, NA,…
## $ ai_acc <dbl> 5, NA, 4, NA, NA, 5, 4, 4, 5, 4, 3, NA, 4, 5, 5, NA…
## $ ai_complex <dbl> 1, NA, 1, NA, NA, 2, 4, 2, 3, 4, 1, NA, 2, 2, 1, NA…
## $ ai_threat <dbl> 2, NA, 3, 1, NA, 2, 2, 3, 2, 2, 2, NA, 1, 2, 2, NA,…
51 / 52

(Re-)Arranging the order of rows

... sorting by a single variable in descending order, ...

stackoverflow_survey_single_response %>%
dplyr::arrange(desc(age)) %>%
dplyr::select(19:23) %>%
glimpse()
## Rows: 65,437
## Columns: 5
## $ ai_select <dbl> 3, 3, 3, 3, 3, 1, 3, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, …
## $ ai_sent <dbl> 5, 1, 1, 1, 2, NA, 2, 1, NA, NA, NA, NA, NA, 2, NA,…
## $ ai_acc <dbl> NA, 5, 4, 4, 3, NA, 1, 1, NA, NA, NA, NA, NA, 3, NA…
## $ ai_complex <dbl> NA, 2, 3, 1, 2, NA, 4, 4, NA, NA, NA, NA, NA, 1, NA…
## $ ai_threat <dbl> NA, 2, 2, 2, 2, NA, 2, 3, NA, NA, NA, NA, NA, 3, NA…
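arrange() also accepts several variables at once: rows are sorted by the first variable, and ties are broken by the following ones. A minimal sketch (output omitted here):

stackoverflow_survey_single_response %>%
  dplyr::arrange(desc(age), years_code) %>%
  dplyr::select(age, years_code) %>%
  glimpse()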
52 / 52
