The process of re-shaping, re-formatting, and re-arranging raw data for analysis
Steps when working with tabular data in the social & behavioral sciences (e.g., from surveys) include:
The (in)famous 80/20-rule: 80% wrangling, 20% analysis (of course, this ratio relates to the time required for writing the code, not the computing time).
tidyverse
The tidyverse is a coherent system of packages for data manipulation, exploration, and visualization that share a common design philosophy (Rickert, 2017).
tidyverse
Tidyverse
syntax is designed to increase
R
novices as it can facilitate self-efficacy (see Robinson, 2017)tidyverse
tidyverse
is not R
as in base R
some routines are like using a whole different language, which...
R
Often, tidyverse functions are under heavy development
To learn more about the tidyverse lifecycle, you can watch this talk by Hadley Wickham or read the corresponding documentation
Base R vs. tidyverse
Similar to other fierce academic debates over, e.g., R vs. Python or Frequentism vs. Bayesianism, people have argued for and against using/teaching the tidyverse.
But what unites both:
Source: https://bit.ly/3PmcL4t
focus on differences between base R and the tidyverse
our main focus will be on the use of packages (and functions) from the tidyverse and how they can be used to clean and transform your data.
Of course, it is possible to combine base R and tidyverse code. However, in the long run, you should aim for consistency.
tidyverse 🚀
Install all tidyverse packages (for the full list of tidyverse packages, see https://www.tidyverse.org/packages/):
install.packages("tidyverse")
Load the core tidyverse packages (NB: to save time and reduce namespace conflicts, you can also load tidyverse packages individually):
library(tidyverse) # load the core tidyverse packages
tidyverse vocabulary 101
While there is much more to the tidyverse than this, three important concepts that you need to be familiar with if you want to use it are:
Tidy data
Tibbles
Pipes
(We already discussed tibbles in the session on Data Import & Export, so we will focus on tidy data and pipes here.)
The 3 rules of tidy data:
Each variable is in a separate column.
Each observation is in a separate row.
Each value is in a separate cell.
Source: https://r4ds.had.co.nz/tidy-data.html
Note: In tidyverse terminology, 'tidy data' usually also means data in long format (where applicable).
Source: https://github.com/gadenbuie/tidyexplain#tidy-data
Note: The functions pivot_wider() and pivot_longer() from the tidyr package are easy-to-use options for changing data from long to wide format and vice versa.
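As a minimal sketch of these two functions, the following uses a small made-up data set (the object and column names `wide`, `wave_1`, `wave_2`, and `score` are hypothetical and not part of the course data) to go from wide to long format and back:

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per respondent, one column per survey wave
wide <- tibble(
  id     = c(1, 2),
  wave_1 = c(3, 5),
  wave_2 = c(4, 2)
)

# Wide -> long: one row per respondent-wave combination
long <- pivot_longer(
  wide,
  cols      = starts_with("wave_"),
  names_to  = "wave",
  values_to = "score"
)

# Long -> wide: back to one column per wave
wide_again <- pivot_wider(long, names_from = wave, values_from = score)
```

In the long version, each row holds exactly one measurement, which is the layout most tidyverse functions expect.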
Usually, in R we apply functions as follows:
f(x)
In the logic of pipes, this function is written as:
x %>% f(.)
Here, object x is piped into function f, becoming (by default) its first argument (but by using . it can also be fed into other arguments).
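To make the role of the dot concrete, here is a small sketch with made-up values (the objects `x`, `roots`, and `rounded` are only for illustration):

```r
library(magrittr)

x <- c(1, 4, 9)

# Default: x becomes the first argument, so this is sqrt(x)
roots <- x %>% sqrt()

# With the dot, the piped value goes elsewhere: round(3.14159, digits = 2)
rounded <- 2 %>% round(3.14159, digits = .)
```

In the second call, the piped value 2 is used as `digits` and is not inserted as the first argument.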
We can use pipes with more than one function:
x %>% f_1() %>% f_2() %>% f_3()
More about pipes: https://r4ds.had.co.nz/pipes.html
(((Onions))) vs. Pipes
The %>% operator used in the tidyverse comes from the magrittr package and passes the object on its left-hand side to the function on its right-hand side.
RStudio offers a keyboard shortcut for inserting %>%: Ctrl + Shift + M (Windows & Linux)/Cmd + Shift + M (Mac)
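The contrast between nested ("onion") calls and pipes can be sketched as follows; the vector `x` is made up for illustration:

```r
library(magrittr)

x <- c(4, 9, 16, NA)

# "Onion" style: read from the innermost call outwards
onion <- round(mean(sqrt(x), na.rm = TRUE), 1)

# Pipe style: read from left to right, one step per line
piped <- x %>%
  sqrt() %>%
  mean(na.rm = TRUE) %>%
  round(1)
```

Both versions compute the same value; the piped version lists the steps in the order in which they are carried out.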
We will use data from the Stack Overflow Annual Developer Survey 2024.
Remember: to code along and for the exercises, the tuesdata data files should be in a sub-folder called data in the same folder as the other materials for this course.
The tuesdata is already tidy.
If you collect data yourself, the raw data may be untidy, e.g.:
If you need to make your data tidy or change it from wide to long format or vice versa, the tidyr package from the tidyverse is a good option.
There is a function in R that tells you how to cite R itself or any of the packages you have used (to see which packages are loaded in your session, use sessionInfo()).
citation()
## To cite R in publications use:
## 
##   R Core Team (2023). _R: A Language and Environment for
##   Statistical Computing_. R Foundation for Statistical
##   Computing, Vienna, Austria. <https://www.R-project.org/>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2023},
##     url = {https://www.R-project.org/},
##   }
## 
## We have invested a lot of time and effort in creating R, please
## cite it when using it for data analysis. See also
## 'citation("pkgname")' for citing R packages.
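As the last line of the output suggests, citation() also accepts a package name. A short sketch, using the stats package only because it ships with every R installation (any installed package name would work in its place):

```r
# Citation for R itself, returned as a bibentry object
r_cite <- citation()

# Citation for an installed package (here: stats, which ships with R)
pkg_cite <- citation("stats")

# BibTeX version, e.g., for a reference manager
r_bibtex <- toBibtex(r_cite)
print(r_bibtex)
```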
It is always advisable to consult the codebook (if there is one) before starting to work with a data set.
Side note: If you want to (semi-)automatically generate a codebook for your own dataset, there are several options in R:
the codebook package, which includes an RStudio add-in and also offers a web app
the makeCodebook() function from the dataReporter package (see this blog post for a short tutorial of the initial dataMaid package)
The first step is loading the data into R.
## install.packages("tidytuesdayR")
library(tidytuesdayR)
tuesdata <- tidytuesdayR::tt_load('2024-09-03')
qname_levels_single_response_crosswalk <- tuesdata$qname_levels_single_response_crosswalk
stackoverflow_survey_questions <- tuesdata$stackoverflow_survey_questions
stackoverflow_survey_single_response <- tuesdata$stackoverflow_survey_single_response
library(readr) # read_csv() comes from readr (also loaded by library(tidyverse))
stackoverflow_survey_questions <- read_csv("./data/stackoverflow_survey_questions.csv")
stackoverflow_survey_single_response <- read_csv("./data/stackoverflow_survey_single_response.csv")
qname_levels_single_response_crosswalk <- read_csv("./data/qname_levels_single_response_crosswalk.csv")
dplyr
The tidyverse examples in the following will make use of dplyr functions that are verbs that signal an action (e.g., group_by(), glimpse(), filter())
Their structure is:
The result is a new data frame (tibble).
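The general pattern of a dplyr verb (data frame first, then what to do with it using unquoted column names, with a new data frame as the result) can be sketched with a small made-up tibble; `survey` and its columns are hypothetical stand-ins for the course data:

```r
library(dplyr)

# Hypothetical toy data standing in for a survey data set
survey <- tibble(
  id  = 1:4,
  age = c(1, 3, 1, 2),
  ai  = c(5, NA, 2, 4)
)

# verb(data, what to do, using unquoted column names) -> new tibble
young <- filter(survey, age == 1)

# The same call written with a pipe; the input object is left unchanged
young_piped <- survey %>% filter(age == 1)
```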
Getting a first good look at your data. The function glimpse() prints a data frame/tibble in a way that represents columns as rows and rows as columns and also provides some additional information about the data frame and its columns.
stackoverflow_survey_single_response %>%
  glimpse()
## Rows: 65,437
## Columns: 28
## $ response_id           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ main_branch           <dbl> 1, 1, 1, 2, 1, 4, 3, 2, 4, 1, 5, 1, 1, 5…
## $ age                   <dbl> 8, 3, 4, 1, 1, 8, 3, 1, 4, 3, 3, 4, 3, 3…
## $ remote_work           <dbl> 3, 3, 3, NA, NA, NA, 3, NA, 2, 3, 3, 2, …
## $ ed_level              <dbl> 4, 2, 3, 7, 6, 4, 5, 6, 5, 3, 2, 5, 2, 2…
## $ years_code            <dbl> NA, 20, 37, 4, 9, 10, 7, 1, 20, 15, 20, …
## $ years_code_pro        <dbl> NA, 17, 27, NA, NA, NA, 7, NA, NA, 11, N…
## $ dev_type              <dbl> NA, 16, 10, 16, 16, 33, 1, 33, 1, 16, 28…
## $ org_size              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ purchase_influence    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ buildvs_buy           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ country               <chr> "United States of America", "United King…
## $ currency              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ comp_total            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ so_visit_freq         <dbl> NA, 5, 5, 3, 5, 5, 3, 4, 5, 2, 2, 5, 5, …
## $ so_account            <dbl> NA, 3, 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 3, …
## $ so_part_freq          <dbl> NA, 6, 6, NA, 6, 6, 3, NA, 6, 5, 5, 6, 2…
## $ so_comm               <dbl> NA, 5, 5, 3, 5, 5, 5, 2, 5, 6, 5, 5, 5, …
## $ ai_select             <dbl> 3, 1, 1, 3, 1, 3, 1, 3, 1, 3, 3, 1, 1, 3…
## $ ai_sent               <dbl> 5, NA, NA, 5, NA, 1, NA, 2, NA, 2, 1, NA…
## $ ai_acc                <dbl> NA, NA, NA, 5, NA, 5, NA, 4, NA, 3, 4, N…
## $ ai_complex            <dbl> NA, NA, NA, 1, NA, 2, NA, 1, NA, 1, 3, N…
## $ ai_threat             <dbl> NA, NA, NA, 2, NA, 2, NA, 3, NA, 1, 2, N…
## $ survey_length         <dbl> NA, NA, 1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1,…
## $ survey_ease           <dbl> NA, NA, 2, 2, 2, 2, 3, 1, 3, 2, 2, 3, 2,…
## $ converted_comp_yearly <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ r_used                <dbl> NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ r_want_to_use         <dbl> NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
We might want to reduce our data frame (or create a new one) to only include a subset of specific variables. E.g., select only the variables that measure attitudes towards AI (ai_) from our full data set. There are two options with base R:
Option 1
tuesdata_ai <- stackoverflow_survey_single_response[, c("ai_select", "ai_sent", "ai_acc", "ai_complex", "ai_threat")]
# When subsetting with [], the first value refers to rows, the second to columns
# [, c("var1", "var2", ...)] means we want to select all rows but only some specific columns.
Option 2
tuesdata_ai <- subset(stackoverflow_survey_single_response,
                      TRUE,
                      select = c(ai_select, ai_sent, ai_acc, ai_complex, ai_threat))
# The 2nd argument refers to the rows.
# Setting it to TRUE includes all rows in the subset.
You can also select variables based on their numeric index.
tuesdata_ai <- stackoverflow_survey_single_response[, 19:23]
names(tuesdata_ai)
## [1] "ai_select" "ai_sent" "ai_acc" "ai_complex" "ai_threat"
In the tidyverse, we can create a subset of variables with the dplyr verb select().
tuesdata_ai <- stackoverflow_survey_single_response %>%
  dplyr::select(ai_select, ai_sent, ai_acc, ai_complex, ai_threat)
head(tuesdata_ai)
## # A tibble: 6 × 5
##   ai_select ai_sent ai_acc ai_complex ai_threat
##       <dbl>   <dbl>  <dbl>      <dbl>     <dbl>
## 1         3       5     NA         NA        NA
## 2         1      NA     NA         NA        NA
## 3         1      NA     NA         NA        NA
## 4         3       5      5          1         2
## 5         1      NA     NA         NA        NA
## 6         3       1      5          2         2
There is also a shorthand notation for selecting a set of consecutive columns with select().
tuesdata_ai <- stackoverflow_survey_single_response %>%
  dplyr::select(ai_select:ai_threat)
head(tuesdata_ai)
## # A tibble: 6 × 5
##   ai_select ai_sent ai_acc ai_complex ai_threat
##       <dbl>   <dbl>  <dbl>      <dbl>     <dbl>
## 1         3       5     NA         NA        NA
## 2         1      NA     NA         NA        NA
## 3         1      NA     NA         NA        NA
## 4         3       5      5          1         2
## 5         1      NA     NA         NA        NA
## 6         3       1      5          2         2
As in base R, you can also use the numeric index of variables in combination with select() from dplyr.
tuesdata_ai <- stackoverflow_survey_single_response %>%
  dplyr::select(19:23)
names(tuesdata_ai)
## [1] "ai_select" "ai_sent" "ai_acc" "ai_complex" "ai_threat"
If you just want to exclude one or a few columns/variables, it is easier to unselect those than to select all others. Again, there are two ways to do this with base R.
Option 1
tuesdata_cut <- stackoverflow_survey_single_response[!(names(stackoverflow_survey_single_response) %in% c("dev_type", "purchase_influence", "remote_work"))]
# The ! operator means "not" (i.e., it negates a condition)
# The %in% operator means "is included in" (in this case the following character vector)
dim(tuesdata_cut)
## [1] 65437 25
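One further base R route, sketched here on a small made-up data frame (the object `df` is hypothetical, and this is only one plausible candidate for a second option), is subset() with a negative selection:

```r
# Hypothetical toy data frame
df <- data.frame(
  id          = 1:3,
  dev_type    = c(16, 10, 33),
  remote_work = c(3, NA, 2),
  age         = c(8, 3, 4)
)

# subset() accepts a negative selection of unquoted column names
df_cut <- subset(df, select = -c(dev_type, remote_work))
names(df_cut)
```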
You can also use select() from dplyr to exclude one or more columns/variables.
tuesdata_cut <- stackoverflow_survey_single_response %>%
  dplyr::select(-c(dev_type, purchase_influence, remote_work))
dim(tuesdata_cut)
## [1] 65437 25
dplyr offers several helper functions for selecting variables. For a full list of those, you can check the documentation for the select() function or ?select().
tuesdata_ai <- stackoverflow_survey_single_response %>%
  dplyr::select(starts_with("ai"))
tuesdata_freq <- stackoverflow_survey_single_response %>%
  dplyr::select(ends_with("freq"))
glimpse(tuesdata_freq)
## Rows: 65,437
## Columns: 2
## $ so_visit_freq <dbl> NA, 5, 5, 3, 5, 5, 3, 4, 5, 2, 2, 5, 5, 3, 3, 1,…
## $ so_part_freq  <dbl> NA, 6, 6, NA, 6, 6, 3, NA, 6, 5, 5, 6, 2, 2, 2, …
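A few more of these selection helpers, sketched on a made-up tibble that only borrows some column names from the survey data (the object `survey` and its values are hypothetical):

```r
library(dplyr)

survey <- tibble(
  response_id   = 1:3,
  so_visit_freq = c(5, 3, 4),
  so_part_freq  = c(6, NA, 2),
  ai_sent       = c(5, NA, 1),
  years_code    = c(20, 37, 4)
)

# contains(): names that include a literal string anywhere
sel_contains <- survey %>% select(contains("_freq"))

# matches(): names that match a regular expression
sel_matches <- survey %>% select(matches("^(so|ai)_"))

# all_of(): names stored in a character vector (errors if one is missing)
wanted <- c("response_id", "years_code")
sel_all_of <- survey %>% select(all_of(wanted))
```

all_of() is handy when the variable names come from elsewhere, e.g., from a codebook.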
Another particularly useful selection helper is where(), which selects only variables of a specific type.
tuesdata_num <- stackoverflow_survey_single_response %>%
  dplyr::select(where(is.numeric)) %>%
  print()
## # A tibble: 65,437 × 26
##    response_id main_branch   age remote_work ed_level years_code
##          <dbl>       <dbl> <dbl>       <dbl>    <dbl>      <dbl>
##  1           1           1     8           3        4         NA
##  2           2           1     3           3        2         20
##  3           3           1     4           3        3         37
##  4           4           2     1          NA        7          4
##  5           5           1     1          NA        6          9
##  6           6           4     8          NA        4         10
##  7           7           3     3           3        5          7
##  8           8           2     1          NA        6          1
##  9           9           4     4           2        5         20
## 10          10           1     3           3        3         15
## # ℹ 65,427 more rows
## # ℹ 20 more variables: years_code_pro <dbl>, dev_type <dbl>,
## #   org_size <dbl>, purchase_influence <dbl>, buildvs_buy <dbl>,
## #   comp_total <dbl>, so_visit_freq <dbl>, so_account <dbl>,
## #   so_part_freq <dbl>, so_comm <dbl>, ai_select <dbl>, ai_sent <dbl>,
## #   ai_acc <dbl>, ai_complex <dbl>, ai_threat <dbl>,
## #   survey_length <dbl>, survey_ease <dbl>, …
One thing that we need to know - and might want to change - are the names of the variables in the dataset.
names(stackoverflow_survey_single_response)
##  [1] "response_id"           "main_branch"
##  [3] "age"                   "remote_work"
##  [5] "ed_level"              "years_code"
##  [7] "years_code_pro"        "dev_type"
##  [9] "org_size"              "purchase_influence"
## [11] "buildvs_buy"           "country"
## [13] "currency"              "comp_total"
## [15] "so_visit_freq"         "so_account"
## [17] "so_part_freq"          "so_comm"
## [19] "ai_select"             "ai_sent"
## [21] "ai_acc"                "ai_complex"
## [23] "ai_threat"             "survey_length"
## [25] "survey_ease"           "converted_comp_yearly"
## [27] "r_used"                "r_want_to_use"
It is good practice to use consistent naming conventions. Since R is case-sensitive, we might want to only use lowercase letters.
As spaces in variable names can cause problems, we could, e.g., decide to use 🐍 snake_case (🐫 camelCase is a common alternative).
Renaming columns/variables in dplyr with rename().
tuesdata_rn <- stackoverflow_survey_single_response %>%
  dplyr::rename(ai_workflow = ai_sent, # new_name = old_name
                comm_member = so_comm,
                post_freq = so_part_freq)
names(tuesdata_rn)
##  [1] "response_id"           "main_branch"
##  [3] "age"                   "remote_work"
##  [5] "ed_level"              "years_code"
##  [7] "years_code_pro"        "dev_type"
##  [9] "org_size"              "purchase_influence"
## [11] "buildvs_buy"           "country"
## [13] "currency"              "comp_total"
## [15] "so_visit_freq"         "so_account"
## [17] "post_freq"             "comm_member"
## [19] "ai_select"             "ai_workflow"
## [21] "ai_acc"                "ai_complex"
## [23] "ai_threat"             "survey_length"
## [25] "survey_ease"           "converted_comp_yearly"
## [27] "r_used"                "r_want_to_use"
For some more advanced renaming options, you can use the dplyr function rename_with().
Note: The janitor package contains the function clean_names() that takes a data frame and creates column names that "are unique and consist only of the _ character, numbers, and letters" (from the help file for this function), with the default being 🐍 snake_case (but support for many other types of cases).
stackoverflow_survey_single_response %>%
  dplyr::rename_with(toupper) %>%
  names()
##  [1] "RESPONSE_ID"           "MAIN_BRANCH"
##  [3] "AGE"                   "REMOTE_WORK"
##  [5] "ED_LEVEL"              "YEARS_CODE"
##  [7] "YEARS_CODE_PRO"        "DEV_TYPE"
##  [9] "ORG_SIZE"              "PURCHASE_INFLUENCE"
## [11] "BUILDVS_BUY"           "COUNTRY"
## [13] "CURRENCY"              "COMP_TOTAL"
## [15] "SO_VISIT_FREQ"         "SO_ACCOUNT"
## [17] "SO_PART_FREQ"          "SO_COMM"
## [19] "AI_SELECT"             "AI_SENT"
## [21] "AI_ACC"                "AI_COMPLEX"
## [23] "AI_THREAT"             "SURVEY_LENGTH"
## [25] "SURVEY_EASE"           "CONVERTED_COMP_YEARLY"
## [27] "R_USED"                "R_WANT_TO_USE"
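rename_with() also takes a `.cols` argument, so the renaming function can be restricted to a subset of columns. A minimal sketch on a made-up tibble (the object `survey` is hypothetical):

```r
library(dplyr)

survey <- tibble(
  response_id = 1:2,
  ai_sent     = c(5, 1),
  ai_acc      = c(4, 3)
)

# Apply toupper() only to columns whose names start with "ai_"
renamed <- survey %>%
  rename_with(toupper, .cols = starts_with("ai_"))
names(renamed)
```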
We can use rename_with() in combination with gsub() to remove (or change) prefixes in variable names.
stackoverflow_survey_single_response %>%
  dplyr::select(ai_select:ai_threat) %>%
  dplyr::rename_with(~ gsub("ai", "ai_attid", .x, fixed = TRUE)) %>%
  names()
## [1] "ai_attid_select"  "ai_attid_sent"    "ai_attid_acc"
## [4] "ai_attid_complex" "ai_attid_threat"
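When the goal is to remove a prefix, anchoring the pattern to the start of the name is a safer variant, because an unanchored match also replaces the string in the middle of other names. A sketch with a made-up tibble (the column `email` is hypothetical and only there to show the difference):

```r
library(dplyr)

survey <- tibble(
  ai_sent = c(5, 1),
  ai_acc  = c(4, 3),
  email   = c("a", "b")
)

# sub() with "^ai_" only touches names that start with the prefix
no_prefix <- survey %>%
  rename_with(~ sub("^ai_", "", .x))

# An unanchored fixed match also hits the "ai" inside "email"
unanchored <- survey %>%
  rename_with(~ gsub("ai", "X", .x, fixed = TRUE))
```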
A nice thing about the dplyr verb select() is that you can use it to select and rename variables in one step.
tuesdata_ai <- stackoverflow_survey_single_response %>%
  dplyr::select(ai_workflow = ai_sent, # new_name = old_name
                comm_member = so_comm,
                post_freq = so_part_freq)
head(tuesdata_ai)
## # A tibble: 6 × 3
##   ai_workflow comm_member post_freq
##         <dbl>       <dbl>     <dbl>
## 1           5          NA        NA
## 2          NA           5         6
## 3          NA           5         6
## 4           5           3        NA
## 5          NA           5         6
## 6           1           5         6
Filter rows/observations dependent on one or more conditions.
To filter rows/observations you can use...
... and combine comparisons with
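As a rough sketch of how comparison and logical operators behave inside filter(), here is a made-up tibble with a few missing values (the object `survey` and its columns are hypothetical):

```r
library(dplyr)

survey <- tibble(
  id   = 1:5,
  age  = c(1, 3, 1, NA, 2),
  freq = c(5, 2, 3, 4, NA)
)

# Comparison operators: ==, !=, <, <=, >, >=, %in%
age_in <- survey %>% filter(age %in% c(1, 2))

# Logical operators: & (and), | (or), ! (not)
and_rows <- survey %>% filter(age == 1 & freq > 3)
or_rows  <- survey %>% filter(age == 1 | freq > 3)
not_rows <- survey %>% filter(!(age == 1))
```

filter() keeps only rows for which the condition is TRUE, so rows where the condition evaluates to NA are dropped.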
Similar to selecting columns/variables, there are two options for filtering rows/observations with base R.
Option 1
tuesdata_age <- stackoverflow_survey_single_response[which(stackoverflow_survey_single_response$age == 1), ] # 18-24
dim(tuesdata_age)
## [1] 14098 28
Option 2
tuesdata_age <- subset(stackoverflow_survey_single_response, age == 1)
dim(tuesdata_age)
## [1] 14098 28
The dplyr solution for filtering rows/observations is the verb filter().
tuesdata_age <- stackoverflow_survey_single_response %>%
  dplyr::filter(age == 1)
dim(tuesdata_age)
## [1] 14098 28
tuesdata_filter <- stackoverflow_survey_single_response %>%
  dplyr::filter(org_size > 1,
                so_visit_freq > 2,
                main_branch != 1)
dim(tuesdata_filter)
## [1] 1398 28
dplyr::filter - multiple conditions
By default, multiple conditions in filter() are combined with & (and). You can, however, also specify multiple conditions differently.
or (cases for which at least one of the conditions is true)
tuesdata_developer <- stackoverflow_survey_single_response %>%
  dplyr::filter(main_branch == 1 | # developer
                  age > 1)
dim(tuesdata_developer)
## [1] 60371 28
dplyr::filter - multiple conditions
xor (cases for which only one of the two conditions is true)
tuesdata_developer_or_age <- stackoverflow_survey_single_response %>%
  dplyr::filter(xor(main_branch == 1, age > 1))
dim(tuesdata_developer_or_age)
## [1] 19196 28
Similar to select(), there are some helper functions for filter() for advanced filtering of rows. For example, you can...
tuesdata_frequent_user <- stackoverflow_survey_single_response %>%
  dplyr::filter(dplyr::between(so_visit_freq, 2, 3))
dim(tuesdata_frequent_user)
## [1] 33847 28
Note: The range specified in between() is inclusive (on both sides).
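The inclusive range can be checked against the equivalent pair of comparisons; the tibble below is made up for illustration:

```r
library(dplyr)

survey <- tibble(
  id            = 1:5,
  so_visit_freq = c(1, 2, 3, 4, NA)
)

# between(x, 2, 3) is shorthand for x >= 2 & x <= 3
with_between <- survey %>% filter(between(so_visit_freq, 2, 3))
with_compare <- survey %>% filter(so_visit_freq >= 2 & so_visit_freq <= 3)
```

Both versions keep the rows with the boundary values 2 and 3 and drop the row with the missing value.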
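tuesdata_high_engagement <- stackoverflow_survey_single_response %>%
  # keep rows in which all variables whose names start with "so" are >= 5
  dplyr::filter(dplyr::if_all(dplyr::starts_with("so"), ~ . >= 5))
dim(tuesdata_high_engagement)
Caution: Check the prefix carefully. If it matches no columns (e.g., "s0" with a zero instead of "so"), if_all() is TRUE for every row and filter() returns the full data set:
## [1] 65437 28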
Note: The helper function if_any() can be used to specify that at least one of the variables needs to match a certain criterion.
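The difference between the two helpers can be sketched with a small made-up tibble (the object `survey` and its values are hypothetical):

```r
library(dplyr)

survey <- tibble(
  id            = 1:4,
  so_visit_freq = c(5, 2, 5, 1),
  so_part_freq  = c(6, 1, 2, 3)
)

# if_all(): every selected variable must meet the criterion
all_high <- survey %>%
  filter(if_all(starts_with("so_"), ~ .x >= 5))

# if_any(): at least one selected variable must meet the criterion
any_high <- survey %>%
  filter(if_any(starts_with("so_"), ~ .x >= 5))
```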
The tidyverse approach for combining the selection of columns and the filtering of rows is chaining these steps together in a pipe. The order of the steps matters whenever one step depends on a column that another step removes: here, filter() has to come before select(), because select() drops so_part_freq.
tuesdata_freq_ai <- stackoverflow_survey_single_response %>%
  dplyr::filter(so_part_freq == 1) %>%
  dplyr::select(ai_select:ai_threat)
dim(tuesdata_freq_ai)
## [1] 6277 5
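One case worth checking when chaining select() and filter(): the filter condition has to refer to a column that still exists at that point in the pipe. A minimal sketch on a made-up tibble (the object `survey` is hypothetical):

```r
library(dplyr)

survey <- tibble(
  so_part_freq = c(1, 6, 1),
  ai_sent      = c(5, NA, 2),
  ai_acc       = c(4, 3, NA)
)

# Works: so_part_freq still exists when filter() runs
ok <- survey %>%
  filter(so_part_freq == 1) %>%
  select(starts_with("ai_"))

# Fails: select() has already dropped so_part_freq
failed <- tryCatch(
  survey %>%
    select(starts_with("ai_")) %>%
    filter(so_part_freq == 1),
  error = function(e) "error"
)
```

If the filter variable is also among the selected columns, both orders give the same result.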
The dplyr verb for changing the order of rows in a data set is arrange(), and you can use it in the same ways as the base R equivalent: Sorting by a single variable in ascending order, ...
stackoverflow_survey_single_response %>%
  dplyr::arrange(age) %>%
  dplyr::select(19:23) %>%
  glimpse()
## Rows: 65,437
## Columns: 5
## $ ai_select  <dbl> 3, 1, 3, 2, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 1, 1, …
## $ ai_sent    <dbl> 5, NA, 2, 1, NA, 1, 1, 1, 1, 1, 4, NA, 1, 5, 2, NA,…
## $ ai_acc     <dbl> 5, NA, 4, NA, NA, 5, 4, 4, 5, 4, 3, NA, 4, 5, 5, NA…
## $ ai_complex <dbl> 1, NA, 1, NA, NA, 2, 4, 2, 3, 4, 1, NA, 2, 2, 1, NA…
## $ ai_threat  <dbl> 2, NA, 3, 1, NA, 2, 2, 3, 2, 2, 2, NA, 1, 2, 2, NA,…
... sorting by a single variable in descending order, ...
stackoverflow_survey_single_response %>%
  dplyr::arrange(desc(age)) %>%
  dplyr::select(19:23) %>%
  glimpse()
## Rows: 65,437
## Columns: 5
## $ ai_select  <dbl> 3, 3, 3, 3, 3, 1, 3, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, …
## $ ai_sent    <dbl> 5, 1, 1, 1, 2, NA, 2, 1, NA, NA, NA, NA, NA, 2, NA,…
## $ ai_acc     <dbl> NA, 5, 4, 4, 3, NA, 1, 1, NA, NA, NA, NA, NA, 3, NA…
## $ ai_complex <dbl> NA, 2, 3, 1, 2, NA, 4, 4, NA, NA, NA, NA, NA, 1, NA…
## $ ai_threat  <dbl> NA, 2, 2, 2, 2, NA, 2, 3, NA, NA, NA, NA, NA, 3, NA…
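arrange() also accepts several sorting variables at once. A short sketch on a made-up tibble (the object `survey` is hypothetical), together with a base R counterpart using order():

```r
library(dplyr)

survey <- tibble(
  id         = 1:4,
  age        = c(2, 1, 2, 1),
  years_code = c(10, 4, 20, 7)
)

# Sort by age (ascending), then by years_code (descending) within each age value
sorted <- survey %>%
  arrange(age, desc(years_code))

# Base R counterpart using order()
sorted_base <- survey[order(survey$age, -survey$years_code), ]
```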
The process of re-shaping, re-formatting, and re-arranging raw data for analysis
Steps when working with tabular data in the social & behavioral sciences (e.g., from surveys) include:
Steps when working with tabular data in the social & behavioral sciences (e.g., from surveys) include:
The (in)famous 80/20-rule: 80% wrangling, 20% analysis (of course, this ratio relates to the time required for writing the code, not the computing time).
tidyverse
The
tidyverse
is a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy (Rickert, 2017).
tidyverse
Tidyverse
syntax is designed to increase
R
novices as it can facilitate self-efficacy (see Robinson, 2017)tidyverse
tidyverse
is not R
as in base R
some routines are like using a whole different language, which...
R
Often, tidyverse
functions are under heavy development
To learn more about the tidyverse
lifecycle you can watch this talk by Hadley Wickham or read the corresponding documentation
Base R
vs. tidyverse
Similar to other fierce academic debates over, e.g., R
vs. Python
or Frequentism vs. Bayesianism, people have argued for and against using/teaching the tidyverse
.
But what's unites both:
Source: https://bit.ly/3PmcL4t
focus on differences between base R
and the tidyverse
our main focus will be on the use of packages (and functions) from the tidyverse
and how they can be used to clean and transform your data.
Of course, it is possible to combine base R
and tidyverse
code. However, in the long run, you should try to aim for consistency.
tidyverse
🚀Install all tidyverse
packages (for the full list of tidyverse
packages see https://www.tidyverse.org/packages/)
install.packages("tidyverse")
Load core tidyverse
packages (NB: To save time and reduce namespace conflicts you can also load tidyverse
packages individually)
library(tidyverse) ##load the tidyverse package
tidyverse
vocabulary 101While there is much more to the tidyverse
than this, three important concepts that you need to be familiar with, if you want to use it, are:
Tidy data
Tibbles
Pipes
(We already discussed tibbles in the session on Data Import & Export, so we will focus on tidy data and pipes here.)
The 3 rules of tidy data:
Each variable is in a separate column.
Each observation is in a separate row.
Each value is in a separate cell.
Source: https://r4ds.had.co.nz/tidy-data.html
Note: In the tidyverse
terminology 'tidy data' usually also means data in long format (where applicable).
Source: https://github.com/gadenbuie/tidyexplain#tidy-data
Note: The functions pivot_wider()
and pivot_longer()
from the tidyr
package are easy-to-use options from changing data from long to wide format and vice versa.
Usually, in R
we apply functions as follows:
f(x)
In the logic of pipes this function is written as:
x %>% f(.)
Here, object x
is piped into function f
, becoming (by default) its first argument (but by using . it can also be fed into other arguments).
Usually, in R
we apply functions as follows:
f(x)
In the logic of pipes this function is written as:
x %>% f(.)
Here, object x
is piped into function f
, becoming (by default) its first argument (but by using . it can also be fed into other arguments).
We can use pipes with more than one function:
x %>% f_1() %>% f_2() %>% f_3()
More about pipes: https://r4ds.had.co.nz/pipes.html
(((Onions))) vs. Pipes
The %>%
used in the tidyverse
is part of the magrittr
package to pass data to another function.
RStudio offers a keyboard shortcut for inserting %>%
: Ctrl + Shift + M (Windows & Linux)/Cmd + Shift + M (Mac)
We will use data from the Stack Overflow Annual Developer Survey 2024.
Remember: to code along/ for the exercises the tuesdata data file should be in a sub-folder called data
in the same folder, as the other materials for this course.
The tuesdata is already tidy.
If you collect data yourself, the raw data may be untidy
, e.g.:
If you need to make your data tidy or change it from wide to long format or vice versa, the tidyr
package from the tidyverse
is a good option.
There is a function in R
that tells you how to cite it or any of the packages you have used (for this please see sessionInfo()
.
citation()
## To cite R in publications use:## ## R Core Team (2023). _R: A Language and Environment for## Statistical Computing_. R Foundation for Statistical## Computing, Vienna, Austria. <https://www.R-project.org/>.## ## Ein BibTeX-Eintrag für LaTeX-Benutzer ist## ## @Manual{,## title = {R: A Language and Environment for Statistical Computing},## author = {{R Core Team}},## organization = {R Foundation for Statistical Computing},## address = {Vienna, Austria},## year = {2023},## url = {https://www.R-project.org/},## }## ## We have invested a lot of time and effort in creating R, please## cite it when using it for data analysis. See also## 'citation("pkgname")' for citing R packages.
It is always advisable to consult the codebook (if there is one) before starting to work with a data set.
Side note: If you want to (semi-)automatically generate a codebook for your own dataset, there are several options in R
:
the codebook
package which includes an RStudio-Addin and also offers a web app
the makeCodebook()
function from the dataReporter
package (see this blog post for a short tutorial of the initial dataMaid package
)
The first step is loading the data into R
.
## install.packages("tidytuesdayR")library(tidytuesdayR)tuesdata <- tidytuesdayR::tt_load('2024-09-03')qname_levels_single_response_crosswalk <- tuesdata$qname_levels_single_response_crosswalkstackoverflow_survey_questions <- tuesdata$stackoverflow_survey_questionsstackoverflow_survey_single_response <- tuesdata$stackoverflow_survey_single_response
library(tidytuesdayR)stackoverflow_survey_questions <- read_csv("./data/stackoverflow_survey_questions.csv")stackoverflow_survey_single_response <- read_csv("./data/stackoverflow_survey_single_response.csv")qname_levels_single_response_crosswalk <- read_csv("./data/qname_levels_single_response_crosswalk.csv")
dplyr
The tidyverse
examples in the following will make use of dplyr
functions that are verbs that signal an action (e.g., group_by()
, glimpse()
, filter()
)
Their structure is:
The result is a new data frame (tibble).
Getting a first good look at your data. The function glimpse()
prints a data frame/tibble in a way that represents columns as rows and rows as columns and also provides some additional information about the data frame and its columns.
stackoverflow_survey_single_response %>% glimpse()
↪️
## Rows: 65,437## Columns: 28## $ response_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…## $ main_branch <dbl> 1, 1, 1, 2, 1, 4, 3, 2, 4, 1, 5, 1, 1, 5…## $ age <dbl> 8, 3, 4, 1, 1, 8, 3, 1, 4, 3, 3, 4, 3, 3…## $ remote_work <dbl> 3, 3, 3, NA, NA, NA, 3, NA, 2, 3, 3, 2, …## $ ed_level <dbl> 4, 2, 3, 7, 6, 4, 5, 6, 5, 3, 2, 5, 2, 2…## $ years_code <dbl> NA, 20, 37, 4, 9, 10, 7, 1, 20, 15, 20, …## $ years_code_pro <dbl> NA, 17, 27, NA, NA, NA, 7, NA, NA, 11, N…## $ dev_type <dbl> NA, 16, 10, 16, 16, 33, 1, 33, 1, 16, 28…## $ org_size <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …## $ purchase_influence <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …## $ buildvs_buy <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …## $ country <chr> "United States of America", "United King…## $ currency <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …## $ comp_total <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …## $ so_visit_freq <dbl> NA, 5, 5, 3, 5, 5, 3, 4, 5, 2, 2, 5, 5, …## $ so_account <dbl> NA, 3, 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 3, …## $ so_part_freq <dbl> NA, 6, 6, NA, 6, 6, 3, NA, 6, 5, 5, 6, 2…## $ so_comm <dbl> NA, 5, 5, 3, 5, 5, 5, 2, 5, 6, 5, 5, 5, …## $ ai_select <dbl> 3, 1, 1, 3, 1, 3, 1, 3, 1, 3, 3, 1, 1, 3…## $ ai_sent <dbl> 5, NA, NA, 5, NA, 1, NA, 2, NA, 2, 1, NA…## $ ai_acc <dbl> NA, NA, NA, 5, NA, 5, NA, 4, NA, 3, 4, N…## $ ai_complex <dbl> NA, NA, NA, 1, NA, 2, NA, 1, NA, 1, 3, N…## $ ai_threat <dbl> NA, NA, NA, 2, NA, 2, NA, 3, NA, 1, 2, N…## $ survey_length <dbl> NA, NA, 1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1,…## $ survey_ease <dbl> NA, NA, 2, 2, 2, 2, 3, 1, 3, 2, 2, 3, 2,…## $ converted_comp_yearly <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …## $ r_used <dbl> NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …## $ r_want_to_use <dbl> NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
We might want to reduce our data frame (or create a new one) to only include a subset of specific variables. E.g., select only the variables that measure attitudes towards AI (ai_
) from our full data set. There are two options with base R
:
Option 1
tuesdata_ai <- stackoverflow_survey_single_response [, c("ai_select", "ai_sent", "ai_acc", "ai_complex", "ai_threat")]# When subsetting with [], the first value refers to rows, the second to columns# [, c("var1", "var2", ...)] means we want to select all rows but only some specific columns.
Option 2
tuesdata_ai <- subset(stackoverflow_survey_single_response, TRUE, select = c(ai_select, ai_sent, ai_acc, ai_complex, ai_threat))# The 2nd argument refers to the rows.# Setting it to TRUE includes all rows in the subset.
You can also select variables based on their numeric index.
tuesdata_ai <- stackoverflow_survey_single_response[, 19:23]names(tuesdata_ai)
## [1] "ai_select" "ai_sent" "ai_acc" "ai_complex" "ai_threat"
In tidyverse
, we can create a subset of variables with the dplyr
verb select()
.
tuesdata_ai <- stackoverflow_survey_single_response %>% dplyr::select(ai_select, ai_sent, ai_acc, ai_complex, ai_threat)head(tuesdata_ai)
## # A tibble: 6 × 5## ai_select ai_sent ai_acc ai_complex ai_threat## <dbl> <dbl> <dbl> <dbl> <dbl>## 1 3 5 NA NA NA## 2 1 NA NA NA NA## 3 1 NA NA NA NA## 4 3 5 5 1 2## 5 1 NA NA NA NA## 6 3 1 5 2 2
There also is a shorthand notation for selecting a set of consecutive columns with select()
.
tuesdata_ai <- stackoverflow_survey_single_response %>% dplyr::select(ai_select:ai_threat)head(tuesdata_ai)
## # A tibble: 6 × 5## ai_select ai_sent ai_acc ai_complex ai_threat## <dbl> <dbl> <dbl> <dbl> <dbl>## 1 3 5 NA NA NA## 2 1 NA NA NA NA## 3 1 NA NA NA NA## 4 3 5 5 1 2## 5 1 NA NA NA NA## 6 3 1 5 2 2
Same as for base R
, you can also use the numeric index of variables in combination with select()
from dplyr
.
tuesdata_ai <- stackoverflow_survey_single_response %>% dplyr::select(19:23)names(tuesdata_ai)
## [1] "ai_select" "ai_sent" "ai_acc" "ai_complex" "ai_threat"
If you just want to exclude one or a few columns/variables, it is easier to unselect those than to select all others. Again, there's two ways to do this with base R
.
Option 1
tuesdata_cut <- stackoverflow_survey_single_response [!(names(stackoverflow_survey_single_response ) %in% c("dev_type", "purchase_influence", "remote_work"))]# The ! operator means "not" (i.e., it negates a condition)# The %in% operator means "is included in" (in this case the following character vector)dim(tuesdata_cut)
## [1] 65437 25
You can also use select()
from dplyr
to exclude one or more columns/variables.
tuesdata_cut<- stackoverflow_survey_single_response %>% dplyr::select(-c(dev_type, purchase_influence, remote_work))dim(tuesdata_cut)
## [1] 65437 25
dplyr
offers several helper functions for selecting variables. For a full list of those, you can check the documentation for the select()
function or ?select()
.
tuesdata_ai <- stackoverflow_survey_single_response %>% dplyr::select(starts_with("ai"))tuesdata_freq <-stackoverflow_survey_single_response %>% dplyr::select(ends_with("freq"))glimpse(tuesdata_freq)
## Rows: 65,437## Columns: 2## $ so_visit_freq <dbl> NA, 5, 5, 3, 5, 5, 3, 4, 5, 2, 2, 5, 5, 3, 3, 1,…## $ so_part_freq <dbl> NA, 6, 6, NA, 6, 6, 3, NA, 6, 5, 5, 6, 2, 2, 2, …
Another particularly useful selection helper is where()
to select only a specific type of variables.
tuesdata_num <- stackoverflow_survey_single_response %>%
  dplyr::select(where(is.numeric)) %>%
  print()

## # A tibble: 65,437 × 26
##    response_id main_branch   age remote_work ed_level years_code
##          <dbl>       <dbl> <dbl>       <dbl>    <dbl>      <dbl>
##  1           1           1     8           3        4         NA
##  2           2           1     3           3        2         20
##  3           3           1     4           3        3         37
##  4           4           2     1          NA        7          4
##  5           5           1     1          NA        6          9
##  6           6           4     8          NA        4         10
##  7           7           3     3           3        5          7
##  8           8           2     1          NA        6          1
##  9           9           4     4           2        5         20
## 10          10           1     3           3        3         15
## # ℹ 65,427 more rows
## # ℹ 20 more variables: years_code_pro <dbl>, dev_type <dbl>,
## #   org_size <dbl>, purchase_influence <dbl>, buildvs_buy <dbl>,
## #   comp_total <dbl>, so_visit_freq <dbl>, so_account <dbl>,
## #   so_part_freq <dbl>, so_comm <dbl>, ai_select <dbl>, ai_sent <dbl>,
## #   ai_acc <dbl>, ai_complex <dbl>, ai_threat <dbl>,
## #   survey_length <dbl>, survey_ease <dbl>, …
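where() accepts any function that returns TRUE or FALSE for a column, and it can be combined with other helpers. A minimal sketch on a made-up tibble (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- tibble(
  id      = 1:2,
  country = c("DE", "FR"),
  ai_sent = c(5, 1),
  ai_acc  = c(4, NA)
)

# Only character columns
sel_chr <- toy %>% dplyr::select(where(is.character))

# Numeric columns whose names also start with "ai"
sel_num_ai <- toy %>% dplyr::select(where(is.numeric) & starts_with("ai"))

names(sel_chr)
names(sel_num_ai)
```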
One thing that we need to know - and might want to change - are the names of the variables in the dataset.
names(stackoverflow_survey_single_response)
##  [1] "response_id"           "main_branch"
##  [3] "age"                   "remote_work"
##  [5] "ed_level"              "years_code"
##  [7] "years_code_pro"        "dev_type"
##  [9] "org_size"              "purchase_influence"
## [11] "buildvs_buy"           "country"
## [13] "currency"              "comp_total"
## [15] "so_visit_freq"         "so_account"
## [17] "so_part_freq"          "so_comm"
## [19] "ai_select"             "ai_sent"
## [21] "ai_acc"                "ai_complex"
## [23] "ai_threat"             "survey_length"
## [25] "survey_ease"           "converted_comp_yearly"
## [27] "r_used"                "r_want_to_use"
It is good practice to use consistent naming conventions. Since R is case-sensitive, we might want to only use lowercase letters.
As spaces in variable names can cause problems, we could, e.g., decide to use 🐍 snake_case (🐫 camelCase is a common alternative).
In dplyr, columns/variables are renamed with rename().
tuesdata_rn <- stackoverflow_survey_single_response %>%
  dplyr::rename(ai_workflow = ai_sent, # new_name = old_name
                comm_member = so_comm,
                post_freq = so_part_freq)

names(tuesdata_rn)

##  [1] "response_id"           "main_branch"
##  [3] "age"                   "remote_work"
##  [5] "ed_level"              "years_code"
##  [7] "years_code_pro"        "dev_type"
##  [9] "org_size"              "purchase_influence"
## [11] "buildvs_buy"           "country"
## [13] "currency"              "comp_total"
## [15] "so_visit_freq"         "so_account"
## [17] "post_freq"             "comm_member"
## [19] "ai_select"             "ai_workflow"
## [21] "ai_acc"                "ai_complex"
## [23] "ai_threat"             "survey_length"
## [25] "survey_ease"           "converted_comp_yearly"
## [27] "r_used"                "r_want_to_use"
For some more advanced renaming options, you can use the dplyr function rename_with().
Note: The janitor package contains the function clean_names() that takes a data frame and creates column names that "are unique and consist only of the _ character, numbers, and letters" (from the help file for this function), with the default being 🐍 snake_case (but with support for many other types of cases).
stackoverflow_survey_single_response %>%
  dplyr::rename_with(toupper) %>%
  names()

##  [1] "RESPONSE_ID"           "MAIN_BRANCH"
##  [3] "AGE"                   "REMOTE_WORK"
##  [5] "ED_LEVEL"              "YEARS_CODE"
##  [7] "YEARS_CODE_PRO"        "DEV_TYPE"
##  [9] "ORG_SIZE"              "PURCHASE_INFLUENCE"
## [11] "BUILDVS_BUY"           "COUNTRY"
## [13] "CURRENCY"              "COMP_TOTAL"
## [15] "SO_VISIT_FREQ"         "SO_ACCOUNT"
## [17] "SO_PART_FREQ"          "SO_COMM"
## [19] "AI_SELECT"             "AI_SENT"
## [21] "AI_ACC"                "AI_COMPLEX"
## [23] "AI_THREAT"             "SURVEY_LENGTH"
## [25] "SURVEY_EASE"           "CONVERTED_COMP_YEARLY"
## [27] "R_USED"                "R_WANT_TO_USE"
We can use rename_with() in combination with gsub() to remove (or change) prefixes in variable names.
stackoverflow_survey_single_response %>%
  dplyr::select(ai_select:ai_threat) %>%
  dplyr::rename_with(~ gsub("ai", "ai_attid", .x, fixed = TRUE)) %>%
  names()

## [1] "ai_attid_select"  "ai_attid_sent"    "ai_attid_acc"
## [4] "ai_attid_complex" "ai_attid_threat"
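The gsub() call above works because only the ai_ columns were selected beforehand. When renaming within the full data set, a plain "ai" pattern would also hit names such as main_branch. One possible way around this, sketched here on a made-up tibble (`toy` is hypothetical), is to anchor the pattern with ^ and to restrict rename_with() to the relevant columns via its .cols argument.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- tibble(
  main_branch = c(1, 2),
  ai_select   = c(3, 1),
  ai_sent     = c(5, NA)
)

# Unanchored, fixed pattern: also changes "main_branch"
unanchored <- gsub("ai", "ai_attid", names(toy), fixed = TRUE)
unanchored

# Anchored pattern, applied only to columns starting with "ai_"
toy_renamed <- toy %>%
  dplyr::rename_with(~ sub("^ai_", "ai_attid_", .x),
                     .cols = starts_with("ai_"))
names(toy_renamed)
```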
A nice thing about the dplyr verb select() is that you can use it to select and rename variables in one step.
tuesdata_ai <- stackoverflow_survey_single_response %>%
  dplyr::select(ai_workflow = ai_sent, # new_name = old_name
                comm_member = so_comm,
                post_freq = so_part_freq)

head(tuesdata_ai)

## # A tibble: 6 × 3
##   ai_workflow comm_member post_freq
##         <dbl>       <dbl>     <dbl>
## 1           5          NA        NA
## 2          NA           5         6
## 3          NA           5         6
## 4           5           3        NA
## 5          NA           5         6
## 6           1           5         6
Filter rows/observations dependent on one or more conditions.
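To filter rows/observations you can use comparison operators (such as ==, !=, <, <=, >, >=) and combine comparisons with logical operators (such as & for and, | for or, ! for not, and xor()).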
Similar to selecting columns/variables, there are two options for filtering rows/observations with base R.
Option 1
tuesdata_age <- stackoverflow_survey_single_response[which(stackoverflow_survey_single_response$age == 1), ] # 18-24

dim(tuesdata_age)
## [1] 14098 28
Option 2
tuesdata_age <- subset(stackoverflow_survey_single_response, age == 1)

dim(tuesdata_age)
## [1] 14098 28
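A detail that is easy to miss in Option 1 is the role of which(). The following sketch on a made-up data frame (`toy` is hypothetical) shows what happens with missing values: a plain logical index returns an all-NA row for every NA in the condition, whereas which() and subset() drop those cases.

```r
# Hypothetical toy data, only for illustration (base R only)
toy <- data.frame(id = 1:4, age = c(1, NA, 2, 1))

# Plain logical index: the NA in age produces an all-NA row
direct <- toy[toy$age == 1, ]
nrow(direct)

# which() returns only the positions where the condition is TRUE
with_which <- toy[which(toy$age == 1), ]
nrow(with_which)

# subset() also drops rows for which the condition is NA
with_subset <- subset(toy, age == 1)
nrow(with_subset)
```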
The dplyr solution for filtering rows/observations is the verb filter().
tuesdata_age <- stackoverflow_survey_single_response %>%
  dplyr::filter(age == 1)

dim(tuesdata_age)
## [1] 14098 28
tuesdata_filter <- stackoverflow_survey_single_response %>%
  dplyr::filter(org_size > 1,
                so_visit_freq > 2,
                main_branch != 1)

dim(tuesdata_filter)
## [1] 1398 28
dplyr::filter - multiple conditions
By default, multiple conditions in filter() are combined with & (and). You can, however, also specify multiple conditions differently.
or (cases for which at least one of the conditions is true)
tuesdata_developer <- stackoverflow_survey_single_response %>%
  dplyr::filter(main_branch == 1 | # developer
                  age > 1)

dim(tuesdata_developer)
## [1] 60371 28
xor (cases for which exactly one of the two conditions is true)
tuesdata_developer_or_age <- stackoverflow_survey_single_response %>%
  dplyr::filter(xor(main_branch == 1, age > 1))

dim(tuesdata_developer_or_age)
## [1] 19196 28
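The difference between and, or, and xor is perhaps easiest to see on a tiny made-up tibble that contains every combination of the two conditions once (`toy` is hypothetical).

```r
library(dplyr)

# Hypothetical toy data: one row per combination of the two conditions
toy <- tibble(
  main_branch = c(1, 1, 2, 2),
  age         = c(1, 3, 1, 3)
)

# and: both conditions are TRUE
n_and <- toy %>% dplyr::filter(main_branch == 1, age > 1) %>% nrow()

# or: at least one condition is TRUE
n_or <- toy %>% dplyr::filter(main_branch == 1 | age > 1) %>% nrow()

# xor: exactly one condition is TRUE
n_xor <- toy %>% dplyr::filter(xor(main_branch == 1, age > 1)) %>% nrow()

c(and = n_and, or = n_or, xor = n_xor)
```

In this sketch, the number of "or" cases equals the number of "and" cases plus the number of "xor" cases.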
Similar to select(), there are some helper functions for filter() for advanced filtering of rows. For example, you can...
tuesdata_frequent_user <- stackoverflow_survey_single_response %>%
  dplyr::filter(dplyr::between(so_visit_freq, 2, 3))

dim(tuesdata_frequent_user)
## [1] 33847 28
Note: The range specified in between() is inclusive (on both sides).
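The inclusive boundaries can be checked with a short sketch on a made-up tibble (`toy` is hypothetical), comparing between() with the equivalent pair of comparisons.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- tibble(so_visit_freq = c(1, 2, 3, 4, NA))

# between(x, 2, 3) keeps values with x >= 2 and x <= 3
with_between <- toy %>%
  dplyr::filter(dplyr::between(so_visit_freq, 2, 3))

# The same filter written as two comparisons
with_comparisons <- toy %>%
  dplyr::filter(so_visit_freq >= 2, so_visit_freq <= 3)

with_between$so_visit_freq
```

Rows with a missing value in so_visit_freq are dropped in both versions, as filter() only keeps rows for which the condition is TRUE.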
tuesdata_high_engagement <- stackoverflow_survey_single_response %>%
  # keep rows in which all variables whose names start with "so" are >= 5
  dplyr::filter(dplyr::if_all(dplyr::starts_with("so"), ~ . >= 5))

dim(tuesdata_high_engagement)
Note: The helper function if_any() can be used to specify that at least one of the variables needs to match a certain criterion.
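To illustrate the difference between the two helpers, here is a minimal sketch on a made-up tibble (`toy` is hypothetical) that applies the same criterion with if_all() and with if_any().

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- tibble(
  id            = 1:3,
  so_visit_freq = c(5, 5, 2),
  so_part_freq  = c(6, 3, 1)
)

# if_all(): every selected variable must meet the criterion
all_high <- toy %>%
  dplyr::filter(dplyr::if_all(dplyr::starts_with("so"), ~ .x >= 5))

# if_any(): at least one selected variable must meet the criterion
any_high <- toy %>%
  dplyr::filter(dplyr::if_any(dplyr::starts_with("so"), ~ .x >= 5))

all_high$id
any_high$id
```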
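The tidyverse approach for combining the selection of columns and the filtering of rows is chaining these steps together in a pipe. The order of the steps matters whenever one step needs a column that another step drops: in the following example, filter() has to come before select() because so_part_freq is not among the selected columns.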
tuesdata_freq_ai <- stackoverflow_survey_single_response %>%
  dplyr::filter(so_part_freq == 1) %>%
  dplyr::select(ai_select:ai_threat)

dim(tuesdata_freq_ai)
## [1] 6277 5
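One caveat when chaining filter() and select(): each step only sees the columns that the previous step returned. The sketch below, on a made-up tibble (`toy` is hypothetical), runs the steps in both orders and catches the error that the reversed order produces.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- tibble(
  so_part_freq = c(1, 2, 1),
  ai_select    = c(3, 1, 3),
  ai_sent      = c(5, NA, 1)
)

# filter() first: so_part_freq is still available
ok <- toy %>%
  dplyr::filter(so_part_freq == 1) %>%
  dplyr::select(ai_select:ai_sent)
dim(ok)

# select() first: so_part_freq has been dropped, so filter() fails;
# tryCatch() stores the error message instead of stopping the script
reversed <- tryCatch(
  toy %>%
    dplyr::select(ai_select:ai_sent) %>%
    dplyr::filter(so_part_freq == 1),
  error = function(e) conditionMessage(e)
)
is.character(reversed)
```

If the filter variable is part of the selection, both orders give the same result.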
The dplyr verb for changing the order of rows in a data set is arrange() and you can use it in the same ways as the base R equivalent: Sorting by a single variable in ascending order, ...
stackoverflow_survey_single_response %>%
  dplyr::arrange(age) %>%
  dplyr::select(19:23) %>%
  glimpse()

## Rows: 65,437
## Columns: 5
## $ ai_select  <dbl> 3, 1, 3, 2, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 1, 1, …
## $ ai_sent    <dbl> 5, NA, 2, 1, NA, 1, 1, 1, 1, 1, 4, NA, 1, 5, 2, NA,…
## $ ai_acc     <dbl> 5, NA, 4, NA, NA, 5, 4, 4, 5, 4, 3, NA, 4, 5, 5, NA…
## $ ai_complex <dbl> 1, NA, 1, NA, NA, 2, 4, 2, 3, 4, 1, NA, 2, 2, 1, NA…
## $ ai_threat  <dbl> 2, NA, 3, 1, NA, 2, 2, 3, 2, 2, 2, NA, 1, 2, 2, NA,…
... sorting by a single variable in descending order, ...
stackoverflow_survey_single_response %>%
  dplyr::arrange(desc(age)) %>%
  dplyr::select(19:23) %>%
  glimpse()

## Rows: 65,437
## Columns: 5
## $ ai_select  <dbl> 3, 3, 3, 3, 3, 1, 3, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, …
## $ ai_sent    <dbl> 5, 1, 1, 1, 2, NA, 2, 1, NA, NA, NA, NA, NA, 2, NA,…
## $ ai_acc     <dbl> NA, 5, 4, 4, 3, NA, 1, 1, NA, NA, NA, NA, NA, 3, NA…
## $ ai_complex <dbl> NA, 2, 3, 1, 2, NA, 4, 4, NA, NA, NA, NA, NA, 1, NA…
## $ ai_threat  <dbl> NA, 2, 2, 2, 2, NA, 2, 3, NA, NA, NA, NA, NA, 3, NA…