class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Exploratory Data Analysis
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-11-07
]

---

layout: true

---

## Exploring EDA

In this session, we will focus on the following:

- Compute summary statistics for numeric variables
- Produce counts, proportions, and crosstabs for categorical variables
- Use `group_by()` + `summarize()` and `across()` for group summaries
- Run simple correlation analyses and interpret them
- Create concise EDA pipelines and diagnostic plots

---

## Disclaimer: A flood 🌊 of packages 📦

In the previous sessions, we focused on a small set of packages (mostly from `base R` and the `tidyverse`) and took a deep dive 🥽 exploring their functionalities.

By comparison, in this session, we will browse through a wider range of packages.

---

## Wrangling pipeline: Load packages & prepare data

.small[

``` r
library(tidyverse) # dplyr, ggplot2, tidyr, etc.
library(janitor)   # tabyl, adorn_*
library(corrr)     # neat correlation tibbles
library(broom)     # tidy test results

# prepare mtcars: convert some numeric columns to factors and relabel
data("mtcars")

df <- mtcars %>%
  rownames_to_column(var = "model") %>%
  mutate(
    cyl  = factor(cyl, levels = c(4, 6, 8), labels = c("4cyl", "6cyl", "8cyl")),
    gear = factor(gear),
    am   = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
  ) %>%
  as_tibble()

glimpse(df)
```

```
## Rows: 32
## Columns: 12
## $ model <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "…
## $ mpg   <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.…
## $ cyl   <fct> 6cyl, 6cyl, 4cyl, 6cyl, 8cyl, 6cyl, 8cyl, 4cyl, 4cyl, 6cyl, 6cy…
## $ disp  <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, …
## $ hp    <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 1…
## $ drat  <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.9…
## $ wt    <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, …
## $ qsec  <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, …
## $ vs    <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, …
## $ am    <fct> manual, manual, manual, automatic, automatic, automatic, automa…
## $ gear  <fct> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, …
## $ carb  <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, …
```

``` r
df %>%
  summarize(
    mean_mpg  = mean(mpg),
    sd_mpg    = sd(mpg),
    median_hp = median(hp),
    iqr_wt    = IQR(wt),
    min_qsec  = min(qsec),
    max_qsec  = max(qsec)
  ) %>%
  round(2)
```

```
## # A tibble: 1 × 6
##   mean_mpg sd_mpg median_hp iqr_wt min_qsec max_qsec
##      <dbl>  <dbl>     <dbl>  <dbl>    <dbl>    <dbl>
## 1     20.1   6.03       123   1.03     14.5     22.9
```

]

---

## `dplyr::group_by()`

The `dplyr` function `group_by()` creates data frames (tibbles) that are grouped by one or more variables.

``` r
df %>%
  group_by(cyl) %>%
  summarize(
    n        = n(),
    mean_mpg = mean(mpg),
    sd_mpg   = sd(mpg),
    mean_hp  = mean(hp)
  ) %>%
  ungroup()
```

.right[↪️]

---

class: middle

```
## # A tibble: 3 × 5
##   cyl       n mean_mpg sd_mpg mean_hp
##   <fct> <int>    <dbl>  <dbl>   <dbl>
## 1 4cyl     11     26.7   4.51    82.6
## 2 6cyl      7     19.7   1.45   122. 
## 3 8cyl     14     15.1   2.56   209. 
```

---

## Grouping and ungrouping

Keep in mind that `group_by()` changes the class of a data frame. To avoid (accidentally) continuing to perform operations groupwise, it is necessary to use the .highlight[`ungroup()`] function (typically at the end of a pipe).

---

## `dplyr::across()`

Multiple summaries for many columns.

``` r
df %>%
  select(cyl, mpg, hp, wt, qsec) %>%
  group_by(cyl) %>%
  summarize(
    across(mpg:qsec, list(mean = ~mean(.x), sd = ~sd(.x)), .names = "{col}_{fn}")
  ) %>%
  ungroup()
```

.right[↪️]

---

class: middle

```
## # A tibble: 3 × 9
##   cyl   mpg_mean mpg_sd hp_mean hp_sd wt_mean wt_sd qsec_mean qsec_sd
##   <fct>    <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>     <dbl>   <dbl>
## 1 4cyl      26.7   4.51    82.6  20.9    2.29 0.570      19.1    1.68
## 2 6cyl      19.7   1.45   122.   24.3    3.12 0.356      18.0    1.71
## 3 8cyl      15.1   2.56   209.   51.0    4.00 0.759      16.8    1.20
```

---

## Frequencies: `table()`

A simple way of looking at frequencies (e.g., for categorical variables) that we have seen before is the `base R` function `table()`.

``` r
table(df$cyl)
```

```
## 
## 4cyl 6cyl 8cyl 
##   11    7   14
```

If you also want to include `NA` in the frequency counts, you need to specify the argument `useNA = "always"`.

``` r
table(df$cyl, useNA = "always")
```

```
## 
## 4cyl 6cyl 8cyl <NA> 
##   11    7   14    0
```

---

## Proportions with the `tidyverse`

``` r
# tidyverse approach
df %>%
  count(cyl) %>%
  mutate(prop = n / sum(n))
```

```
## # A tibble: 3 × 3
##   cyl       n  prop
##   <fct> <int> <dbl>
## 1 4cyl     11 0.344
## 2 6cyl      7 0.219
## 3 8cyl     14 0.438
```

---

## Frequencies and proportions with `janitor::tabyl()`

The [`janitor` package](https://github.com/sfirke/janitor) provides the [`tabyl()` function](https://sfirke.github.io/janitor/articles/tabyls.html) for creating frequency and proportion tables:

``` r
library(janitor)

df %>%
  tabyl(cyl) %>%
  adorn_pct_formatting(digits = 1)
```

```
##   cyl  n percent
##  4cyl 11   34.4%
##  6cyl  7   21.9%
##  8cyl 14   43.8%
```

---

## Tables in `R`

There are typically two types of outputs you can produce with `R` for further use in reports or publications: tables and plots.

As summary statistics, frequencies, and proportions are often presented in tables, we will briefly discuss how to create table output in `R`.

As with almost everything in `R`, there are many options (read: packages) for creating tables.
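A minimal option that is likely already installed is `knitr::kable()` (the `knitr` package also renders these slides), which turns any data frame into a presentable markdown table. A small sketch — the `stats_tbl` object here is a hypothetical example built from `mtcars`, not part of the pipeline above:

``` r
library(knitr)

# hypothetical summary table for two mtcars variables
stats_tbl <- data.frame(
  variable = c("mpg", "hp"),
  mean     = c(mean(mtcars$mpg), mean(mtcars$hp)),
  sd       = c(sd(mtcars$mpg), sd(mtcars$hp))
)

# kable() renders the data frame as a markdown/text table
kable(stats_tbl, digits = 2)
```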
We will show examples from a few popular options:

- [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html)
- [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/) + [`flextable`](https://davidgohel.github.io/flextable/index.html)

---

## Summary tables with `stargazer`

If, for example, we want to create a summary statistics table as text output (e.g., for printing in the `R` console), we can use `stargazer`.

**NB**: As the main `stargazer()` function does not work with tibbles, we need to convert our data to a regular data frame.

``` r
library(stargazer)

df %>%
  select(mpg, cyl, gear) %>%
  as.data.frame() %>%
  stargazer(type = "text", digits = 2, title = "Descriptive statistics")
```

.right[↪️]

---

class: middle

```
## 
## Descriptive statistics
## =======================================
## Statistic N  Mean St. Dev.  Min   Max  
## ---------------------------------------
## mpg       32 20.09   6.03  10.40 33.90 
## ---------------------------------------
```

---

## Relationships between variables

In the following, we will briefly discuss some options for two methods of exploring relationships between variables:

- crosstabulation (for categorical variables)
- correlations (for numeric and/or binary variables)

---

## Crosstabs

You can use the `base R` functions `table()` and `prop.table()`:

``` r
# base R contingency table
tab <- table(df$cyl, df$am)
tab
```

```
##       
##        automatic manual
##   4cyl         3      8
##   6cyl         4      3
##   8cyl        12      2
```

``` r
# row percentages
round(prop.table(tab, margin = 1) * 100, 1)
```

```
##       
##        automatic manual
##   4cyl      27.3   72.7
##   6cyl      57.1   42.9
##   8cyl      85.7   14.3
```

---

## Crosstabs with the `janitor` package

The `tabyl()` function from the `janitor` package provides quite a few options for crosstabs.
.small[

``` r
library(janitor)

df %>%
  tabyl(cyl, am) %>%
  adorn_totals("both") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns(position = "front")
```

```
##    cyl  automatic     manual        Total
##   4cyl  3 (27.3%)  8 (72.7%)  11 (100.0%)
##   6cyl  4 (57.1%)  3 (42.9%)   7 (100.0%)
##   8cyl 12 (85.7%)  2 (14.3%)  14 (100.0%)
##  Total 19 (59.4%) 13 (40.6%)  32 (100.0%)
```

]

---

## Other options for crosstabulation in `R`

As with most things in `R`, there are many options for creating crosstabs.

Some alternatives to the ones we presented before include the `CrossTable()` and `crosstab()` functions from the [`descr` package](https://cran.r-project.org/web/packages/descr/index.html) or the `ctable()` function from the [`summarytools` package](https://github.com/dcomtois/summarytools).

Another very versatile option for creating crosstabs as well as output based on those (e.g., in combination with the `flextable` package we have briefly used before) is the [`crosstable` package](https://danchaltiel.github.io/crosstable/).

---

## Chi-Square for independence (categorical)

We can use the `base R` function `chisq.test()` on a contingency table to test whether two categorical variables are independent.

``` r
#--Are cyl and am independent? ------
# contingency table for cyl vs am
tbl <- df %>%
  count(cyl, am) %>%
  pivot_wider(names_from = am, values_from = n, values_fill = 0)

chisq_res <- chisq.test(as.matrix(tbl %>% select(-cyl)))
chisq_res
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  as.matrix(tbl %>% select(-cyl))
## X-squared = 8.7407, df = 2, p-value = 0.01265
```

---

``` r
tidy(chisq_res) # broom::tidy for neat display
```

```
## # A tibble: 1 × 4
##   statistic p.value parameter method                    
##       <dbl>   <dbl>     <int> <chr>                     
## 1      8.74  0.0126         2 Pearson's Chi-squared test
```

---

## Correlations with `base R`

The `base R` function `cor()` computes the correlation coefficient(s) between two or more variables.
This function can be used to calculate *Pearson's r*, *Kendall's tau*, and *Spearman's rho*. We also need to specify how we want to deal with missing values (e.g., use pairwise complete observations).

.small[

``` r
nums <- df %>% select(mpg, hp, wt, qsec, disp)

# base R correlation matrix (pairwise complete)
cor(nums, use = "pairwise.complete.obs", method = "pearson")
```

```
##             mpg         hp         wt       qsec       disp
## mpg   1.0000000 -0.7761684 -0.8676594  0.4186840 -0.8475514
## hp   -0.7761684  1.0000000  0.6587479 -0.7082234  0.7909486
## wt   -0.8676594  0.6587479  1.0000000 -0.1747159  0.8879799
## qsec  0.4186840 -0.7082234 -0.1747159  1.0000000 -0.4336979
## disp -0.8475514  0.7909486  0.8879799 -0.4336979  1.0000000
```

]

---

``` r
# corrr style (tidy)
nums %>%
  correlate() %>%
  stretch() %>% # long form
  arrange(desc(abs(r)))
```

```
## # A tibble: 25 × 3
##    x     y          r
##    <chr> <chr>  <dbl>
##  1 wt    disp   0.888
##  2 disp  wt     0.888
##  3 mpg   wt    -0.868
##  4 wt    mpg   -0.868
##  5 mpg   disp  -0.848
##  6 disp  mpg   -0.848
##  7 hp    disp   0.791
##  8 disp  hp     0.791
##  9 mpg   hp    -0.776
## 10 hp    mpg   -0.776
## # ℹ 15 more rows
```

---

## The `corrr` package

The [`corrr` package](https://corrr.tidymodels.org/) is part of the [`tidymodels` suite of packages](https://www.tidymodels.org/) and provides various functions for displaying correlations. The main function is `correlate()`, which produces a `tibble` as output.

.small[

``` r
library(corrr)

correlate(nums)
```

```
## # A tibble: 5 × 6
##   term     mpg     hp     wt   qsec   disp
##   <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 mpg   NA     -0.776 -0.868  0.419 -0.848
## 2 hp    -0.776 NA      0.659 -0.708  0.791
## 3 wt    -0.868  0.659 NA     -0.175  0.888
## 4 qsec   0.419 -0.708 -0.175 NA     -0.434
## 5 disp  -0.848  0.791  0.888 -0.434 NA    
```

]

---

## The `corrr` package

The `corrr` package provides several functions for tweaking/optimizing the output of the `correlate()` function.
Here's one example:

.small[

``` r
nums %>%
  correlate() %>%
  shave() %>%
  fashion()
```

```
##   term  mpg   hp   wt qsec disp
## 1  mpg                         
## 2   hp -.78                    
## 3   wt -.87  .66               
## 4 qsec  .42 -.71 -.17          
## 5 disp -.85  .79  .89 -.43     
```

]

---

## Correlation diagnostics

``` r
# scatter with smoothing
ggplot(df, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
```

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

``` r
# pairs plot quick overview
pairs(select(df, mpg, hp, wt, qsec), main = "Pairs for selected vars")
```

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/unnamed-chunk-7-2.png" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation

While correlation coefficients are useful for exploring relationships between variables, they can also be misleading. For example, if we run a correlation analysis and encounter a (Pearson's) correlation coefficient close to 0, we often imagine a relationship like the one pictured below.

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-1-1.png" width="45%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation

This data set has **the same correlation coefficient (Pearson's r of -0.06)** as the one on the previous slide:

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-2-1.png" width="50%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation 🦖

So does this one...

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-3-1.png" width="60%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation

We could go on...
The previous three examples all come from the [`datasauRus` package](https://jumpingrivers.github.io/datasauRus/), which is essentially an extension of [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) and includes 13 data sets with the same (Pearson) correlation between x and y.

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-4-1.png" width="50%" style="display: block; margin: auto;" />

---

## Trust no singular value!

Importantly, the x and y variables in the `datasaurus_dozen` data sets also all have the same means and standard deviations.

``` r
library(datasauRus) # provides the datasaurus_dozen data set

datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    corr   = cor(x, y, method = "pearson")
  )
```

.right[↪️]

---

class: middle

.small[

```
## # A tibble: 13 × 6
##    dataset    mean_x mean_y  sd_x  sd_y    corr
##    <chr>       <dbl>  <dbl> <dbl> <dbl>   <dbl>
##  1 away         54.3   47.8  16.8  26.9 -0.0641
##  2 bullseye     54.3   47.8  16.8  26.9 -0.0686
##  3 circle       54.3   47.8  16.8  26.9 -0.0683
##  4 dino         54.3   47.8  16.8  26.9 -0.0645
##  5 dots         54.3   47.8  16.8  26.9 -0.0603
##  6 h_lines      54.3   47.8  16.8  26.9 -0.0617
##  7 high_lines   54.3   47.8  16.8  26.9 -0.0685
##  8 slant_down   54.3   47.8  16.8  26.9 -0.0690
##  9 slant_up     54.3   47.8  16.8  26.9 -0.0686
## 10 star         54.3   47.8  16.8  26.9 -0.0630
## 11 v_lines      54.3   47.8  16.8  26.9 -0.0694
## 12 wide_lines   54.3   47.8  16.8  26.9 -0.0666
## 13 x_shape      54.3   47.8  16.8  26.9 -0.0656
```

]

---

## Plot your data!

The message from the `datasaurus_dozen` examples should be clear. Relying only on singular values that summarize the location or spread of a single variable or the association of two variables is not a good idea.

To avoid reducing a ~~mountain to a molehill~~ dinosaur to a lack of correlation, it is important to plot your data to explore:

- univariate distributions
- grouped univariate distributions (if you want to compare groups)
- bivariate distributions

For that reason (and because it is fun and `R` has a lot to offer in that regard), we will dive 🥽 into data visualization in the upcoming sessions.

---

## Other packages for EDA

Besides the ones we have covered in this session and `summarytools`, there are still plenty of other packages that provide interesting EDA functionalities. Some of these additional/alternative options are:

- [`inspectdf`](https://alastairrushworth.github.io/inspectdf/)
- [`skimr`](https://docs.ropensci.org/skimr/)
- [`descriptr`](https://descriptr.rsquaredacademy.com/index.html)
- [`DataExplorer`](https://boxuancui.github.io/DataExplorer/)
- [`explore`](https://github.com/rolkra/explore)
- [`dataReporter`](https://github.com/ekstroem/dataReporter)

---

## Automated EDA reports

Some of the EDA packages provide functions for generating automated EDA reports, for example:

- the `dfSummary()` function from the `summarytools` package
- the `skim()` function from the [`skimr` package](https://docs.ropensci.org/skimr/)
- the `make_report()` function from the [`dataReporter` package](https://github.com/ekstroem/dataReporter)
- the `report()` function from the [`explore` package](https://github.com/rolkra/explore)

The `explore()` function from the `explore` package also allows for interactive data exploration.
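As a quick illustration (a minimal sketch, assuming `skimr` is installed): `skim()` returns its summary as a data frame, so you can post-process it like any other data. The column names `skim_variable` and `numeric.mean` follow `skimr`'s default output:

``` r
library(skimr)

# one row per variable, with type-appropriate summary columns
sk <- skim(mtcars)

# the result behaves like a data frame, e.g., for selecting columns
sk[, c("skim_variable", "numeric.mean")]
```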
---

# Extracurricular activities

Check out the appendix slides on exploring missing data and outliers and/or watch the [*YouTube* tutorial *EDA with the tidyverse: Exploring Variation*](https://www.youtube.com/watch?v=26hbyVb00xs) by Mauro Lepore for even more EDA inspiration.