class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Exploratory Data Analysis
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-11-07
]

---

layout: true

---

## Exploring EDA

In this session, we will focus on the following:

- Compute summary statistics for numeric variables
- Produce counts, proportions, and crosstabs for categorical variables
- Use `group_by()` + `summarize()` and `across()` for group summaries
- Run simple correlation analyses and interpret them
- Create concise EDA pipelines and diagnostic plots

---

## Disclaimer: A flood 🌊 of packages 📦

In the previous sessions, we focused on a small set of packages (mostly from `base R` and the `tidyverse`) and took a deep dive 🥽 exploring their functionalities.

By comparison, in this session, we will browse through a wider range of packages.

---

## Wrangling pipeline: Load packages & prepare data

.small[

``` r
library(tidyverse) # dplyr, ggplot2, tidyr, etc.
library(janitor)   # tabyl, adorn_*
library(corrr)     # neat correlation tibbles
library(broom)     # tidy test results

# prepare mtcars: convert some numeric columns to factors and relabel
data("mtcars")

df <- mtcars %>%
  rownames_to_column(var = "model") %>%
  mutate(
    cyl  = factor(cyl, levels = c(4, 6, 8), labels = c("4cyl", "6cyl", "8cyl")),
    gear = factor(gear),
    am   = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
  ) %>%
  as_tibble()

glimpse(df)
```

```
## Rows: 32
## Columns: 12
## $ model <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "…
## $ mpg   <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.…
## $ cyl   <fct> 6cyl, 6cyl, 4cyl, 6cyl, 8cyl, 6cyl, 8cyl, 4cyl, 4cyl, 6cyl, 6cy…
## $ disp  <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, …
## $ hp    <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 1…
## $ drat  <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.9…
## $ wt    <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, …
## $ qsec  <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, …
## $ vs    <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, …
## $ am    <fct> manual, manual, manual, automatic, automatic, automatic, automa…
## $ gear  <fct> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, …
## $ carb  <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, …
```

``` r
df %>%
  summarize(
    mean_mpg  = mean(mpg),
    sd_mpg    = sd(mpg),
    median_hp = median(hp),
    iqr_wt    = IQR(wt),
    min_qsec  = min(qsec),
    max_qsec  = max(qsec)
  ) %>%
  round(2)
```

```
## # A tibble: 1 × 6
##   mean_mpg sd_mpg median_hp iqr_wt min_qsec max_qsec
##      <dbl>  <dbl>     <dbl>  <dbl>    <dbl>    <dbl>
## 1     20.1   6.03       123   1.03     14.5     22.9
```

]

---

## `dplyr::group_by()`

The `dplyr` function `group_by()` creates data frames (tibbles) that are grouped by one or more variables.

``` r
df %>%
  group_by(cyl) %>%
  summarize(
    n        = n(),
    mean_mpg = mean(mpg),
    sd_mpg   = sd(mpg),
    mean_hp  = mean(hp)
  ) %>%
  ungroup()
```

.right[↪️]

---

class: middle

```
## # A tibble: 3 × 5
##   cyl       n mean_mpg sd_mpg mean_hp
##   <fct> <int>    <dbl>  <dbl>   <dbl>
## 1 4cyl     11     26.7   4.51    82.6
## 2 6cyl      7     19.7   1.45   122. 
## 3 8cyl     14     15.1   2.56   209. 
```

---

## Grouping and ungrouping

Keep in mind that `group_by()` changes the class of a data frame. To avoid (accidentally) continuing to perform operations groupwise, it is necessary to use the .highlight[`ungroup()`] function (typically at the end of a pipe).

---

## `dplyr::across()`

Multiple summaries for many columns.

``` r
df %>%
  select(cyl, mpg, hp, wt, qsec) %>%
  group_by(cyl) %>%
  summarize(
    across(mpg:qsec, list(mean = ~mean(.x), sd = ~sd(.x)), .names = "{col}_{fn}")
  ) %>%
  ungroup()
```

.right[↪️]

---

class: middle

```
## # A tibble: 3 × 9
##   cyl   mpg_mean mpg_sd hp_mean hp_sd wt_mean wt_sd qsec_mean qsec_sd
##   <fct>    <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>     <dbl>   <dbl>
## 1 4cyl      26.7   4.51    82.6  20.9    2.29 0.570      19.1    1.68
## 2 6cyl      19.7   1.45   122.   24.3    3.12 0.356      18.0    1.71
## 3 8cyl      15.1   2.56   209.   51.0    4.00 0.759      16.8    1.20
```

---

## Frequencies: `table()`

A simple way of looking at frequencies (e.g., for categorical variables) that we have seen before is the `base R` function `table()`.

``` r
table(df$cyl)
```

```
## 
## 4cyl 6cyl 8cyl 
##   11    7   14
```

If you also want to include `NA` in the frequency counts, you need to specify the argument `useNA = "always"`.

``` r
table(df$cyl, useNA = "always")
```

```
## 
## 4cyl 6cyl 8cyl <NA> 
##   11    7   14    0
```

---

## Proportions with the `tidyverse`

``` r
# tidyverse approach
df %>%
  count(cyl) %>%
  mutate(prop = n / sum(n))
```

```
## # A tibble: 3 × 3
##   cyl       n  prop
##   <fct> <int> <dbl>
## 1 4cyl     11 0.344
## 2 6cyl      7 0.219
## 3 8cyl     14 0.438
```

---

## Frequencies and proportions with `janitor::tabyl()`

The [`janitor` package](https://github.com/sfirke/janitor) provides the [`tabyl()` function](https://sfirke.github.io/janitor/articles/tabyls.html) for creating frequency and proportion tables:

``` r
library(janitor)

df %>%
  tabyl(cyl) %>%
  adorn_pct_formatting(digits = 1)
```

```
##   cyl  n percent
##  4cyl 11   34.4%
##  6cyl  7   21.9%
##  8cyl 14   43.8%
```

---

## Tables in `R`

There are typically two types of outputs you can produce with `R` for further use in reports or publications: tables and plots.

As summary statistics, frequencies, and proportions are often presented in tables, we will briefly discuss how to create table output in `R`.

As with almost everything in `R`, there are many options (read: packages) for creating tables.
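A minimal option that is likely already installed is `knitr::kable()` (the `knitr` package also renders these slides), which turns any data frame into a presentable markdown table. A small sketch — the `stats_tbl` object here is a hypothetical example built from `mtcars`, not part of the pipeline above:

``` r
library(knitr)

# hypothetical summary table for two mtcars variables
stats_tbl <- data.frame(
  variable = c("mpg", "hp"),
  mean     = c(mean(mtcars$mpg), mean(mtcars$hp)),
  sd       = c(sd(mtcars$mpg), sd(mtcars$hp))
)

# kable() renders the data frame as a markdown/text table
kable(stats_tbl, digits = 2)
```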
We will show examples from a few popular options:

- [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html)
- [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/) + [`flextable`](https://davidgohel.github.io/flextable/index.html)

---

## Summary tables with `stargazer`

If, for example, we want to create a summary statistics table as text output (e.g., for printing in the `R` console), we can use `stargazer`.

**NB**: As the main `stargazer()` function does not work with tibbles, we need to convert our data to a regular data frame.

``` r
library(stargazer)

df %>%
  select(mpg, cyl, gear) %>%
  as.data.frame() %>%
  stargazer(type = "text", digits = 2, title = "Descriptive statistics")
```

.right[↪️]

---

class: middle

```
## 
## Descriptive statistics
## =======================================
## Statistic N  Mean St. Dev.  Min   Max  
## ---------------------------------------
## mpg       32 20.09   6.03  10.40 33.90 
## ---------------------------------------
```

---

## Relationships between variables

In the following, we will briefly discuss some options for two methods of exploring relationships between variables:

- crosstabulation (for categorical variables)
- correlations (for numeric and/or binary variables)

---

## Crosstabs

You can use the `base R` functions `table()` and `prop.table()`:

``` r
# base R contingency table
tab <- table(df$cyl, df$am)
tab
```

```
##       
##        automatic manual
##   4cyl         3      8
##   6cyl         4      3
##   8cyl        12      2
```

``` r
# row percentages
round(prop.table(tab, margin = 1) * 100, 1)
```

```
##       
##        automatic manual
##   4cyl      27.3   72.7
##   6cyl      57.1   42.9
##   8cyl      85.7   14.3
```

---

## Crosstabs with the `janitor` package

The `tabyl()` function from the `janitor` package provides quite a few options for crosstabs.
.small[

``` r
library(janitor)

df %>%
  tabyl(cyl, am) %>%
  adorn_totals("both") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns(position = "front")
```

```
##    cyl  automatic     manual        Total
##   4cyl  3 (27.3%)  8 (72.7%)  11 (100.0%)
##   6cyl  4 (57.1%)  3 (42.9%)   7 (100.0%)
##   8cyl 12 (85.7%)  2 (14.3%)  14 (100.0%)
##  Total 19 (59.4%) 13 (40.6%)  32 (100.0%)
```

]

---

## Other options for crosstabulation in `R`

As with most things in `R`, there are many options for creating crosstabs.

Some alternatives to the ones we presented before include the `CrossTable()` and `crosstab()` functions from the [`descr` package](https://cran.r-project.org/web/packages/descr/index.html) or the `ctable()` function from the [`summarytools` package](https://github.com/dcomtois/summarytools).

Another very versatile option for creating crosstabs as well as output based on those (e.g., in combination with the `flextable` package we have briefly used before) is the [`crosstable` package](https://danchaltiel.github.io/crosstable/).

---

## Chi-Square for independence (categorical)

We can use the `base R` function `chisq.test()` on a contingency table to test whether two categorical variables are independent.

``` r
#--Are cyl and am independent? ------
# contingency table for cyl vs am
tbl <- df %>%
  count(cyl, am) %>%
  pivot_wider(names_from = am, values_from = n, values_fill = 0)

chisq_res <- chisq.test(as.matrix(tbl %>% select(-cyl)))
chisq_res
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  as.matrix(tbl %>% select(-cyl))
## X-squared = 8.7407, df = 2, p-value = 0.01265
```

---

``` r
tidy(chisq_res) # broom::tidy for neat display
```

```
## # A tibble: 1 × 4
##   statistic p.value parameter method                    
##       <dbl>   <dbl>     <int> <chr>                     
## 1      8.74  0.0126         2 Pearson's Chi-squared test
```

---

## Correlations with `base R`

The `base R` function `cor()` computes the correlation coefficient(s) between two or more variables.
This function can be used to calculate *Pearson's r*, *Kendall's tau*, and *Spearman's rho*. We also need to specify how we want to deal with missing values (e.g., use pairwise complete observations).

.small[

``` r
nums <- df %>% select(mpg, hp, wt, qsec, disp)

# base R correlation matrix (pairwise complete)
cor(nums, use = "pairwise.complete.obs", method = "pearson")
```

```
##             mpg         hp         wt       qsec       disp
## mpg   1.0000000 -0.7761684 -0.8676594  0.4186840 -0.8475514
## hp   -0.7761684  1.0000000  0.6587479 -0.7082234  0.7909486
## wt   -0.8676594  0.6587479  1.0000000 -0.1747159  0.8879799
## qsec  0.4186840 -0.7082234 -0.1747159  1.0000000 -0.4336979
## disp -0.8475514  0.7909486  0.8879799 -0.4336979  1.0000000
```

]

---

``` r
# corrr style (tidy)
nums %>%
  correlate() %>%
  stretch() %>% # long form
  arrange(desc(abs(r)))
```

```
## # A tibble: 25 × 3
##    x     y          r
##    <chr> <chr>  <dbl>
##  1 wt    disp   0.888
##  2 disp  wt     0.888
##  3 mpg   wt    -0.868
##  4 wt    mpg   -0.868
##  5 mpg   disp  -0.848
##  6 disp  mpg   -0.848
##  7 hp    disp   0.791
##  8 disp  hp     0.791
##  9 mpg   hp    -0.776
## 10 hp    mpg   -0.776
## # ℹ 15 more rows
```

---

## The `corrr` package

The [`corrr` package](https://corrr.tidymodels.org/) is part of the [`tidymodels` suite of packages](https://www.tidymodels.org/) and provides various functions for displaying correlations. The main function is `correlate()`, which produces a `tibble` as output.

.small[

``` r
library(corrr)

correlate(nums)
```

```
## # A tibble: 5 × 6
##   term     mpg     hp     wt   qsec   disp
##   <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 mpg   NA     -0.776 -0.868  0.419 -0.848
## 2 hp    -0.776 NA      0.659 -0.708  0.791
## 3 wt    -0.868  0.659 NA     -0.175  0.888
## 4 qsec   0.419 -0.708 -0.175 NA     -0.434
## 5 disp  -0.848  0.791  0.888 -0.434 NA    
```

]

---

## The `corrr` package

The `corrr` package provides several functions for tweaking/optimizing the output of the `correlate()` function.
Here's one example:

.small[

``` r
nums %>%
  correlate() %>%
  shave() %>%
  fashion()
```

```
##   term  mpg   hp   wt qsec disp
## 1  mpg                         
## 2   hp -.78                    
## 3   wt -.87  .66               
## 4 qsec  .42 -.71 -.17          
## 5 disp -.85  .79  .89 -.43     
```

]

---

## Correlation diagnostics

``` r
# scatter with smoothing
ggplot(df, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
```

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

``` r
# pairs plot quick overview
pairs(select(df, mpg, hp, wt, qsec), main = "Pairs for selected vars")
```

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/unnamed-chunk-7-2.png" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation

While correlation coefficients are useful for exploring relationships between variables, they can also be misleading. For example, if we run a correlation analysis and encounter a (Pearson's) correlation coefficient close to 0, we often imagine a relationship like the one pictured below.

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-1-1.png" width="45%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation

This data set has **the same correlation coefficient (Pearson's r of -0.06)** as the one on the previous slide:

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-2-1.png" width="50%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation 🦖

So does this one...

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-3-1.png" width="60%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation

We could go on...
The previous three examples all come from the [`datasauRus` package](https://jumpingrivers.github.io/datasauRus/), which is essentially an extension of [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) and includes 13 data sets with the same (Pearson) correlation between x and y.

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-4-1.png" width="50%" style="display: block; margin: auto;" />

---

## Trust no singular value!

Importantly, the x and y variables in the `datasaurus_dozen` data sets also all have the same means and standard deviations.

``` r
library(datasauRus) # provides the datasaurus_dozen data set

datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    corr   = cor(x, y, method = "pearson")
  )
```

.right[↪️]

---

class: middle

.small[

```
## # A tibble: 13 × 6
##    dataset    mean_x mean_y  sd_x  sd_y    corr
##    <chr>       <dbl>  <dbl> <dbl> <dbl>   <dbl>
##  1 away         54.3   47.8  16.8  26.9 -0.0641
##  2 bullseye     54.3   47.8  16.8  26.9 -0.0686
##  3 circle       54.3   47.8  16.8  26.9 -0.0683
##  4 dino         54.3   47.8  16.8  26.9 -0.0645
##  5 dots         54.3   47.8  16.8  26.9 -0.0603
##  6 h_lines      54.3   47.8  16.8  26.9 -0.0617
##  7 high_lines   54.3   47.8  16.8  26.9 -0.0685
##  8 slant_down   54.3   47.8  16.8  26.9 -0.0690
##  9 slant_up     54.3   47.8  16.8  26.9 -0.0686
## 10 star         54.3   47.8  16.8  26.9 -0.0630
## 11 v_lines      54.3   47.8  16.8  26.9 -0.0694
## 12 wide_lines   54.3   47.8  16.8  26.9 -0.0666
## 13 x_shape      54.3   47.8  16.8  26.9 -0.0656
```

]

---

## Plot your data!

The message from the `datasaurus_dozen` examples should be clear. Relying only on singular values that summarize the location or spread of a single variable or the association of two variables is not a good idea.

To avoid reducing a ~~mountain to a molehill~~ dinosaur to a lack of correlation, it is important to plot your data to explore:

- univariate distributions
- grouped univariate distributions (if you want to compare groups)
- bivariate distributions

For that reason (and because it is fun and `R` has a lot to offer in that regard), we will dive 🥽 into data visualization in the upcoming sessions.

---

## Other packages for EDA

Besides the ones we have covered in this session and `summarytools`, there are still plenty of other packages that provide interesting EDA functionalities. Some of these additional/alternative options are:

- [`inspectdf`](https://alastairrushworth.github.io/inspectdf/)
- [`skimr`](https://docs.ropensci.org/skimr/)
- [`descriptr`](https://descriptr.rsquaredacademy.com/index.html)
- [`DataExplorer`](https://boxuancui.github.io/DataExplorer/)
- [`explore`](https://github.com/rolkra/explore)
- [`dataReporter`](https://github.com/ekstroem/dataReporter)

---

## Automated EDA reports

Some of the EDA packages provide functions for generating automated EDA reports, for example:

- the `dfSummary()` function from the `summarytools` package
- the `skim()` function from the [`skimr` package](https://docs.ropensci.org/skimr/)
- the `make_report()` function from the [`dataReporter` package](https://github.com/ekstroem/dataReporter)
- the `report()` function from the [`explore` package](https://github.com/rolkra/explore)

The `explore()` function from the `explore` package also allows for interactive data exploration.
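As a quick illustration (a minimal sketch, assuming `skimr` is installed): `skim()` returns its summary as a data frame, so you can post-process it like any other data. The column names `skim_variable` and `numeric.mean` follow `skimr`'s default output:

``` r
library(skimr)

# one row per variable, with type-appropriate summary columns
sk <- skim(mtcars)

# the result behaves like a data frame, e.g., for selecting columns
sk[, c("skim_variable", "numeric.mean")]
```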
---

# Extracurricular activities

Check out the appendix slides on exploring missing data and outliers and/or watch the [*YouTube* tutorial *EDA with the tidyverse: Exploring Variation*](https://www.youtube.com/watch?v=26hbyVb00xs) by Mauro Lepore for even more EDA inspiration.