class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Exploratory Data Analysis
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-05-24
]

---

layout: true

---

## Exploring EDA

In this session, we will focus on the following:

- summary statistics (for numeric variables)
- frequencies & proportions (for categorical variables)
- cross-tabulations & correlations

---

## Disclaimer: A flood 🌊 of packages 📦

In the previous sessions, we have tried to focus on a small set of packages (mostly from `base R` and the `tidyverse`) and took a deep dive 🥽 exploring their functionalities.

By comparison, in this session, we will browse through a wider variety of packages.

---

## Wrangling pipeline: Import data, select, & rename variables

.small[
``` r
library(sjlabelled)
library(tidyverse)
library(haven)
library(dplyr)

stackoverflow_survey_questions <- read_csv("./data/stackoverflow_survey_questions.csv")
stackoverflow_survey_single_response <- read_csv("./data/stackoverflow_survey_single_response.csv")
qname_levels_single_response_crosswalk <- read_csv("./data/qname_levels_single_response_crosswalk.csv")
```

``` r
tuesdata_eda <- stackoverflow_survey_single_response %>%
  select(age,
         main_branch,
         ai_select,
         ai_sent,
         ai_acc,
         ai_complex,
         ai_threat,
         partici_freq = so_part_freq)
```
]

---

## Wrangling pipeline: Change variable types & recode values

.small[
``` r
tuesdata_eda <- tuesdata_eda %>%
  mutate(skills_code = recode(main_branch,
                              `1` = 5,
                              `5` = 4,
                              `3` = 3,
                              `4` = 2,
                              `2` = 1),
         ai_threat = recode_factor(ai_threat,
                                   `1` = "Unsure",
                                   `2` = "No",
                                   `3` = "yes"),
         agec = recode_factor(age,
                              `1` = "18-24 years",
                              `2` = "25-34 years",
                              `3` = "35-44 years",
                              `4` = "45-54 years",
                              `5` = "55-64 years",
                              `6` = "65 years or older",
                              .ordered = TRUE),
         ai_acc = recode_factor(ai_acc,
                                `1` = "Highly distrust",
                                `4` = "Somewhat distrust",
                                `5` = "Somewhat trust",
                                `2` = "Highly trust",
                                `3` = "Neither trust nor distrust"))
```
]

---

## Summary statistics

To make sense of quantitative data, we can reduce their information to a few meaningful values.

--

.center[
~ **That's a simple definition of summary statistics** ~]

--

As such, we can use summarizing functions of

- location (e.g., the mean),
- spread (e.g., standard deviation),
- the shape of the distribution (e.g., skewness), and
- relations between variables (e.g., correlation coefficients)

---

## Summary statistics: `summary()`

A quick and easy way to check some summary statistics for your data set is the `base R` function `summary()`, which can be applied to individual variables...

``` r
summary(tuesdata_eda$ai_acc)
```

```
##            Highly distrust          Somewhat distrust 
##                       2932                       8395 
##             Somewhat trust               Highly trust 
##                      15039                       1016 
## Neither trust nor distrust                       NA's 
##                       9920                      28135
```

as well as whole data frames:

``` r
summary(tuesdata_eda)
```

.right[↪️]

---

class: middle

.small[
```
##       age         main_branch      ai_select        ai_sent     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.000   Median :1.000   Median :3.000   Median :2.000  
##  Mean   :2.629   Mean   :1.503   Mean   :2.375   Mean   :2.386  
##  3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :8.000   Max.   :5.000   Max.   :3.000   Max.   :6.000  
##                                  NA's   :4530    NA's   :19564  
##                        ai_acc        ai_complex   
##  Highly distrust           : 2932   Min.   :1.000  
##  Somewhat distrust         : 8395   1st Qu.:1.000  
##  Somewhat trust            :15039   Median :2.000  
##  Highly trust              : 1016   Mean   :2.232  
##  Neither trust nor distrust: 9920   3rd Qu.:3.000  
##  NA's                      :28135   Max.   :5.000  
##                                     NA's   :28416  
##   ai_threat      partici_freq    skills_code   
##  Unsure: 8878   Min.   :1.000   Min.   :1.000  
##  No    :30423   1st Qu.:4.000   1st Qu.:5.000  
##  yes   : 5388   Median :5.000   Median :5.000  
##  NA's  :20748   Mean   :4.016   Mean   :4.388  
##                 3rd Qu.:5.000   3rd Qu.:5.000  
##                 Max.   :6.000   Max.   :5.000  
##                 NA's   :20200                  
##                agec      
##  18-24 years      :14098  
##  25-34 years      :23911  
##  35-44 years      :14942  
##  45-54 years      : 6249  
##  55-64 years      : 2575  
##  65 years or older:  772  
##  NA's             : 2890  
```
]
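---

## Summary statistics: `quantile()`

If you need specific quantiles beyond the fixed set that `summary()` prints, you can use the `base R` function `quantile()`, which accepts arbitrary probabilities. A minimal sketch (using the participation frequency variable from our wrangled data set; remember to drop missing values via `na.rm = TRUE`):

``` r
# Selected percentiles for participation frequency
quantile(tuesdata_eda$partici_freq,
         probs = c(0.1, 0.25, 0.5, 0.75, 0.9),
         na.rm = TRUE)
```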
---

## Summary statistics with the `datawizard` package 🧙

The [`datawizard` package](https://easystats.github.io/datawizard/) provides a function for summary statistics.

``` r
library(datawizard)

tuesdata_eda %>%
  select(where(is.numeric)) %>%
  describe_distribution(quartiles = TRUE)
```

.right[↪️]

---

class: middle

.small[
```
## Variable     | Mean |   SD | IQR |        Range |  Quartiles
## ------------------------------------------------------------
## age          | 2.63 | 1.58 |   1 | [1.00, 8.00] | 2.00, 3.00
## main_branch  | 1.50 | 1.02 |   0 | [1.00, 5.00] | 1.00, 1.00
## ai_select    | 2.37 | 0.85 |   1 | [1.00, 3.00] | 2.00, 3.00
## ai_sent      | 2.39 | 1.68 |   3 | [1.00, 6.00] | 1.00, 4.00
## ai_complex   | 2.23 | 1.11 |   2 | [1.00, 5.00] | 1.00, 3.00
## partici_freq | 4.02 | 1.43 |   1 | [1.00, 6.00] | 4.00, 5.00
## skills_code  | 4.39 | 1.21 |   0 | [1.00, 5.00] | 5.00, 5.00
## 
## Variable     | Skewness | Kurtosis |     n | n_Missing
## ------------------------------------------------------
## age          |     1.69 |     3.22 | 65437 |         0
## main_branch  |     1.93 |     2.62 | 65437 |         0
## ai_select    |    -0.80 |    -1.14 | 60907 |      4530
## ai_sent      |     0.77 |    -1.11 | 45873 |     19564
## ai_complex   |     0.62 |    -0.47 | 37021 |     28416
## partici_freq |    -1.25 |     0.11 | 45237 |     20200
## skills_code  |    -1.76 |     1.71 | 65437 |         0
```
]

---

## Summary statistics with `dplyr`

`dplyr` provides a helpful function for creating summary statistics: `summarize()`

`summarize()` can be used to create summary statistics for variables using [vectorized](https://win-vector.com/2019/01/03/what-does-it-mean-to-write-vectorized-code-in-r/) functions like...

- `mean()`
- `sd()`
- `min()`
- `max()`
- etc.

---

## Summary statistics with `dplyr`

While creating summary statistics using `summarize()` from `dplyr` requires writing more code, it is the most flexible option.

Another nice benefit of `summarize()` is that it produces a .highlight[`tibble`] which can be used for further analyses or for creating plots or tables.

---

## `dplyr::summarize()`

.small[
``` r
tuesdata_eda %>%
  summarize(
    mean_partici_freq = mean(partici_freq, na.rm = TRUE),
    sd_partici_freq = sd(partici_freq, na.rm = TRUE),
    var_partici_freq = var(partici_freq, na.rm = TRUE),
    min_partici_freq = min(partici_freq, na.rm = TRUE),
    max_partici_freq = max(partici_freq, na.rm = TRUE)
  ) %>%
  round(2) # round to 2 decimal places
```

```
## # A tibble: 1 × 5
##   mean_partici_freq sd_partici_freq var_partici_freq
##               <dbl>           <dbl>            <dbl>
## 1              4.02            1.43             2.04
## # ℹ 2 more variables: min_partici_freq <dbl>,
## #   max_partici_freq <dbl>
```
]

---

## `dplyr::group_by()`

The `dplyr` function `group_by()` creates data frames (tibbles) that are grouped by one or more variables.

``` r
tuesdata_eda %>%
  filter(!is.na(skills_code)) %>% # only use cases where skills_code is not missing
  group_by(skills_code) %>%
  summarize(
    mean_partici_freq = mean(partici_freq, na.rm = TRUE),
    sd_partici_freq = sd(partici_freq, na.rm = TRUE),
    var_partici_freq = var(partici_freq, na.rm = TRUE),
    min_partici_freq = min(partici_freq, na.rm = TRUE),
    max_partici_freq = max(partici_freq, na.rm = TRUE)
  ) %>%
  ungroup()
```

.right[↪️]

---

class: middle

```
## # A tibble: 5 × 6
##   skills_code mean_partici_freq sd_partici_freq var_partici_freq
##         <dbl>             <dbl>           <dbl>            <dbl>
## 1           1              4.04            1.14             1.29
## 2           2              4.01            1.36             1.84
## 3           3              3.98            1.43             2.06
## 4           4              4.17            1.37             1.87
## 5           5              4.01            1.45             2.10
## # ℹ 2 more variables: min_partici_freq <dbl>,
## #   max_partici_freq <dbl>
```
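---

## `dplyr::group_by()`

`group_by()` also accepts several grouping variables at once, in which case `summarize()` returns one row per combination of groups. A minimal sketch (assuming we want the mean participation frequency per age group and AI threat perception):

``` r
tuesdata_eda %>%
  filter(!is.na(agec), !is.na(ai_threat)) %>%
  group_by(agec, ai_threat) %>%
  summarize(mean_partici_freq = mean(partici_freq, na.rm = TRUE),
            .groups = "drop") # drop the grouping after summarizing
```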
---

## Grouping and ungrouping

Keep in mind that `group_by()` changes the class of a data frame. To avoid (accidentally) continuing to perform operations groupwise, it is necessary to use the .highlight[`ungroup()`] function (typically at the end of a pipe).

``` r
class(tuesdata_eda)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

``` r
tuesdata_eda %>%
  group_by(skills_code) %>%
  class()
```

```
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
```

---

## `dplyr::across()`

To produce grouped summary statistics for multiple variables, we can use the `dplyr` function `across()`.

*Note*: We only use cases without missing data for any of the variables here (= listwise deletion).

``` r
tuesdata_eda %>%
  #-- Select only the 'skills_code' column and all columns that start with "ai"
  select(skills_code, starts_with("ai"), agec) %>%
  #-- Remove the columns 'ai_threat' and 'ai_acc' from the selection
  select(-ai_threat, -ai_acc) %>%
  #-- Drop rows that contain any NA (missing) values
  drop_na() %>%
  #-- Group the data by the 'agec' (age category) column
  group_by(agec) %>%
  #-- For each group, summarize the AI-related columns
  summarize(
    across(
      starts_with("ai"), # Apply functions to all columns starting with "ai"
      list(
        mean = mean, # Calculate mean
        sd = sd # Calculate standard deviation
      ),
      .names = "{col}_{fn}" # Name new columns like "ai_select_mean", "ai_select_sd", etc.
    )
  )
```

.right[↪️]

---

class: middle

```
## # A tibble: 6 × 7
##   agec         ai_select_mean ai_select_sd ai_sent_mean ai_sent_sd
##   <ord>                 <dbl>        <dbl>        <dbl>      <dbl>
## 1 18-24 years               3            0         2.33       1.70
## 2 25-34 years               3            0         2.42       1.74
## 3 35-44 years               3            0         2.51       1.76
## 4 45-54 years               3            0         2.50       1.77
## 5 55-64 years               3            0         2.50       1.77
## 6 65 years or…              3            0         2.69       1.81
## # ℹ 2 more variables: ai_complex_mean <dbl>, ai_complex_sd <dbl>
```

---

## Frequencies: `table()`

A simple way of looking at frequencies (e.g., for categorical variables) that we have already seen before is the `base R` function `table()`.

``` r
table(tuesdata_eda$agec)
```

```
## 
##       18-24 years       25-34 years       35-44 years 
##             14098             23911             14942 
##       45-54 years       55-64 years 65 years or older 
##              6249              2575               772
```

If you also want to include `NA` in the frequency counts, you need to specify the argument `useNA = "always"`.

``` r
table(tuesdata_eda$agec, useNA = "always")
```

```
## 
##       18-24 years       25-34 years       35-44 years 
##             14098             23911             14942 
##       45-54 years       55-64 years 65 years or older 
##              6249              2575               772 
##              <NA> 
##              2890
```
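---

## Frequencies: `table()`

A small convenience worth knowing: the output of `table()` can be passed to `sort()` to order the categories by frequency. A minimal sketch:

``` r
# Categories ordered from most to least frequent
sort(table(tuesdata_eda$agec), decreasing = TRUE)
```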
---

## Proportions with `prop.table()`

If you want proportions instead of raw counts, you can use the `base R` function `prop.table()`.

**NB**: You need to apply this function to an output produced by `table()`.

.small[
``` r
prop.table(table(tuesdata_eda$ai_acc))
```

```
## 
##            Highly distrust          Somewhat distrust 
##                 0.07860168                 0.22505496 
##             Somewhat trust               Highly trust 
##                 0.40316873                 0.02723715 
## Neither trust nor distrust 
##                 0.26593748
```

``` r
prop.table(table(tuesdata_eda$ai_acc, useNA = "always"))
```

```
## 
##            Highly distrust          Somewhat distrust 
##                 0.04480646                 0.12829133 
##             Somewhat trust               Highly trust 
##                 0.22982411                 0.01552638 
## Neither trust nor distrust                       <NA> 
##                 0.15159619                 0.42995553
```
]

---

## Proportions with `prop.table()`

If you want fewer decimal places in the output, you can wrap the `prop.table()` function in a `round()` call.

``` r
round(prop.table(table(tuesdata_eda$ai_acc, useNA = "always")), 3) # rounded to 3 decimal places
```

```
## 
##            Highly distrust          Somewhat distrust 
##                      0.045                      0.128 
##             Somewhat trust               Highly trust 
##                      0.230                      0.016 
## Neither trust nor distrust                       <NA> 
##                      0.152                      0.430
```

Or if you want percentages...

``` r
round((prop.table(table(tuesdata_eda$ai_acc, useNA = "always")) * 100), 2)
```

```
## 
##            Highly distrust          Somewhat distrust 
##                       4.48                      12.83 
##             Somewhat trust               Highly trust 
##                      22.98                       1.55 
## Neither trust nor distrust                       <NA> 
##                      15.16                      43.00
```

---

## Frequencies and proportions with `janitor::tabyl()`

The [`janitor` package](https://github.com/sfirke/janitor) that we briefly mentioned in the first session on data wrangling also provides the [`tabyl()` function](https://sfirke.github.io/janitor/articles/tabyls.html) for creating frequency and proportion tables:

``` r
library(janitor)

threat_stats <- tuesdata_eda %>%
  tabyl(ai_threat) %>%
  adorn_pct_formatting(digits = 2, affix_sign = TRUE)

threat_stats
```

```
##  ai_threat     n percent valid_percent
##     Unsure  8878  13.57%        19.87%
##         No 30423  46.49%        68.08%
##        yes  5388   8.23%        12.06%
##       <NA> 20748  31.71%             -
```

---

## Frequencies and proportions with `janitor::tabyl()`

A nice thing about `tabyl()` is that it produces a (special type of) data frame which we can use for plotting or creating tables.

``` r
class(threat_stats)
```

```
## [1] "tabyl"      "data.frame"
```

---

## Frequencies and proportions with `dplyr`

We can also use `group_by()` and `summarize()` to get frequencies and proportions for variables in our data set.

``` r
tuesdata_eda %>%
  filter(!is.na(ai_threat)) %>%
  group_by(ai_threat) %>%
  summarize(n = n()) %>%
  mutate(proportion = n/sum(n)) %>%
  ungroup()
```

```
## # A tibble: 3 × 3
##   ai_threat     n proportion
##   <fct>     <int>      <dbl>
## 1 Unsure     8878      0.199
## 2 No        30423      0.681
## 3 yes        5388      0.121
```

---

## Frequencies and proportions with `dplyr`

Instead of using `group_by()` and `summarize()` to get frequency counts, we can also use .highlight[`count()`] from `dplyr` as a shorthand.

``` r
tuesdata_eda %>%
  filter(!is.na(ai_threat)) %>%
  count(ai_threat) %>%
  mutate(proportion = n/sum(n)) %>%
  ungroup()
```

```
## # A tibble: 3 × 3
##   ai_threat     n proportion
##   <fct>     <int>      <dbl>
## 1 Unsure     8878      0.199
## 2 No        30423      0.681
## 3 yes        5388      0.121
```

---

## Tables in `R`

There are typically two types of outputs you can produce with `R` for further use in reports or publications: tables and plots.

As summary statistics, frequencies, and proportions are often presented in tables, we will briefly discuss how to create tables as output in `R` in the following.

As with almost everything in `R`, there are many options (read: packages) for creating tables.

We will show examples from three popular options:

- [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html)
- [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/) + [`flextable`](https://davidgohel.github.io/flextable/index.html)
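---

## Summary tables with `gtsummary`

As a first taste of the options listed above, here is a minimal sketch with `gtsummary` (assuming the package is installed). Its `tbl_summary()` function picks sensible defaults per variable type: counts and percentages for categorical variables, medians and interquartile ranges for continuous ones.

``` r
library(gtsummary)

tuesdata_eda %>%
  select(agec, ai_threat, partici_freq) %>%
  tbl_summary() # returns a formatted summary table
```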
---

## Summary tables with `stargazer`

While there is an ever-growing list of `R` packages for creating tables with many new (promising) contenders (such as [`flextable`](https://davidgohel.github.io/flextable/index.html) or [`gt`](https://gt.rstudio.com/index.html)), `stargazer` is an established and widely-used tool for creating ASCII (plain text), `LaTeX`, and `HTML` table output.

---

## Summary tables with `stargazer`

If we want to create a summary statistics table as text output (e.g., for printing in the `R` console), we can use `stargazer` for that.

**NB**: As the main `stargazer()` function does not work with tibbles, we need to convert our data to a regular data frame.

``` r
library(stargazer)

tuesdata_eda %>%
  select(ai_threat, partici_freq, ai_acc) %>%
  as.data.frame() %>%
  stargazer(type = "text",
            digits = 2,
            title = "Descriptive statistics")
```

.right[↪️]

---

class: middle

```
## 
## Descriptive statistics
## =========================================
## Statistic      N    Mean St. Dev. Min Max
## -----------------------------------------
## partici_freq 45,237 4.02   1.43    1   6 
## -----------------------------------------
```

*Note*: `stargazer` only includes numeric variables in its summary statistics tables, which is why the two factor variables we selected do not appear in the output.

---

## Summary tables with `stargazer`

`stargazer` also supports `HTML` and `LaTeX` as output formats. We can export the output for further use (e.g., with our `LaTeX` editor of choice).

``` r
# We create a directory for output files first
dir.create("./output")

tuesdata_eda %>%
  select(ai_threat, partici_freq, ai_acc) %>%
  as.data.frame() %>%
  stargazer(type = "latex",
            digits = 2,
            out = "./output/stargazer_summary.tex",
            title = "Descriptive statistics")
```

.small[
*Note*: See the help file for the `stargazer()` function (via `?stargazer`) for further customization options.
]

---

## Relationships between variables

In the following, we will briefly discuss some options for two methods of exploring relationships between variables:

- crosstabulation (for categorical variables)
- correlations (for numeric and/or binary variables)

---

## Crosstabs

You can use the `base R` functions `table()` and `prop.table()`

``` r
table(tuesdata_eda$ai_complex, tuesdata_eda$agec) # rows, columns
```

```
##    
##     18-24 years 25-34 years 35-44 years 45-54 years 55-64 years
##   1        2688        4537        2632         945         258
##   2        3068        4669        2455         993         309
##   3        1680        2903        1762         728         288
##   4        1077        1756         986         311          96
##   5         278         423         241         137          49
##    
##     65 years or older
##   1                50
##   2                86
##   3                66
##   4                11
##   5                22
```

``` r
round(prop.table(table(tuesdata_eda$ai_complex, tuesdata_eda$agec))*100, 2)
```

```
##    
##     18-24 years 25-34 years 35-44 years 45-54 years 55-64 years
##   1        7.57       12.78        7.41        2.66        0.73
##   2        8.64       13.15        6.91        2.80        0.87
##   3        4.73        8.18        4.96        2.05        0.81
##   4        3.03        4.95        2.78        0.88        0.27
##   5        0.78        1.19        0.68        0.39        0.14
##    
##     65 years or older
##   1              0.14
##   2              0.24
##   3              0.19
##   4              0.03
##   5              0.06
```
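---

## Crosstabs

By default, `prop.table()` computes proportions over all cells of the table. Its `margin` argument switches to row-wise (`margin = 1`) or column-wise (`margin = 2`) proportions instead. A minimal sketch:

``` r
# Row percentages: how responses to ai_complex
# distribute across the age groups
round(prop.table(table(tuesdata_eda$ai_complex, tuesdata_eda$agec),
                 margin = 1) * 100, 2)
```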
---

## Crosstabs with `dplyr`

We can also use functions from `dplyr` to create crosstabs including frequencies.

``` r
tuesdata_eda %>%
  #-- Filter out rows where 'ai_complex' or 'agec' is missing (i.e., keep only complete cases)
  filter(!is.na(ai_complex), !is.na(agec)) %>%
  #-- Count how many times each combination of 'agec' (age category) and 'ai_complex' appears
  count(agec, ai_complex) %>%
  #-- Reshape the data: make each unique value of 'ai_complex' a separate column
  pivot_wider(
    names_from = ai_complex, # Column values in 'ai_complex' become new column names
    values_from = n # Use counts (`n`) as the values in the new columns
  )
```

```
## # A tibble: 6 × 6
##   agec                `1`   `2`   `3`   `4`   `5`
##   <ord>             <int> <int> <int> <int> <int>
## 1 18-24 years        2688  3068  1680  1077   278
## 2 25-34 years        4537  4669  2903  1756   423
## 3 35-44 years        2632  2455  1762   986   241
## 4 45-54 years         945   993   728   311   137
## 5 55-64 years         258   309   288    96    49
## 6 65 years or older    50    86    66    11    22
```

---

## Crosstabs with `dplyr`

We can also create crosstabs including percentages:

``` r
tuesdata_eda %>%
  filter(!is.na(ai_complex), !is.na(agec)) %>%
  count(agec, ai_complex) %>%
  mutate(proportion = n/sum(n)*100) %>%
  select(-n) %>%
  pivot_wider(names_from = ai_complex,
              values_from = proportion)
```

```
## # A tibble: 6 × 6
##   agec                 `1`    `2`    `3`    `4`    `5`
##   <ord>              <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 18-24 years        7.57   8.64   4.73  3.03   0.783 
## 2 25-34 years       12.8   13.2    8.18  4.95   1.19  
## 3 35-44 years        7.41   6.91   4.96  2.78   0.679 
## 4 45-54 years        2.66   2.80   2.05  0.876  0.386 
## 5 55-64 years        0.727  0.870  0.811 0.270  0.138 
## 6 65 years or older  0.141  0.242  0.186 0.0310 0.0620
```

---

## Crosstabs with the `janitor` package

The `tabyl()` function from the `janitor` package provides quite a few options for crosstabs.

.small[
``` r
library(janitor)

tuesdata_eda %>%
  filter(!is.na(ai_complex), !is.na(agec)) %>%
  tabyl(agec, ai_complex) %>%
  adorn_totals(where = c("row","col")) %>%
  adorn_percentages(denominator = "row") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_ns(position = "front")
```

```
##               agec               1               2              3
##        18-24 years  2,688 (30.58%)  3,068 (34.90%) 1,680 (19.11%)
##        25-34 years  4,537 (31.75%)  4,669 (32.68%) 2,903 (20.32%)
##        35-44 years  2,632 (32.59%)  2,455 (30.40%) 1,762 (21.82%)
##        45-54 years    945 (30.35%)    993 (31.89%)   728 (23.38%)
##        55-64 years    258 (25.80%)    309 (30.90%)   288 (28.80%)
##  65 years or older     50 (21.28%)     86 (36.60%)    66 (28.09%)
##              Total 11,110 (31.29%) 11,580 (32.62%) 7,427 (20.92%)
##               4            5            Total
##  1,077 (12.25%)  278 (3.16%)  8,791 (100.00%)
##  1,756 (12.29%)  423 (2.96%) 14,288 (100.00%)
##    986 (12.21%)  241 (2.98%)  8,076 (100.00%)
##    311 (9.99%)   137 (4.40%)  3,114 (100.00%)
##     96 (9.60%)    49 (4.90%)  1,000 (100.00%)
##     11 (4.68%)    22 (9.36%)    235 (100.00%)
##  4,237 (11.93%) 1,150 (3.24%) 35,504 (100.00%)
```
]

---

## Other options for crosstabulation in `R`

As with most things in `R`, there are many options for creating crosstabs.

Some alternatives to the ones we presented before include the `CrossTable()` and `crosstab()` functions from the [`descr` package](https://cran.r-project.org/web/packages/descr/index.html) or the `ctable()` function from the [`summarytools` package](https://github.com/dcomtois/summarytools).

Another very versatile option for creating crosstabs as well as output based on those (e.g., in combination with the `flextable` package we have briefly mentioned before) is the [`crosstable` package](https://danchaltiel.github.io/crosstable/).

---

## Chi-Square Test

We can use the `summary()` function from `base R` in combination with `table()` to perform a chi-square test.

``` r
#-- Are age and AI acceptance independent? (Does AI acceptance differ across age groups?)
summary(table(tuesdata_eda$agec, tuesdata_eda$ai_acc))
```

```
## Number of cases in table: 35766 
## Number of factors: 2 
## Test for independence of all factors:
## 	Chisq = 155.91, df = 20, p-value = 4.615e-23
```
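---

## Chi-Square Test

Alternatively, the `base R` function `chisq.test()` can be applied directly to a `table()` object and returns the same Pearson chi-square test as a standard `htest` object. A minimal sketch:

``` r
# Same test as above, via chisq.test()
chisq.test(table(tuesdata_eda$agec, tuesdata_eda$ai_acc))
```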
---

## Chi-Square Test

Most of the other packages with crosstab functions that we mentioned before can also be used for chi-square tests, for example, the `janitor` package.

``` r
tuesdata_eda %>%
  filter(!is.na(ai_acc), !is.na(agec)) %>%
  tabyl(agec, ai_acc) %>%
  chisq.test()
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  .
## X-squared = 155.91, df = 20, p-value < 2.2e-16
```

---

## Correlations with `base R`

The `base R` function `cor()` computes the correlation coefficient(s) between two or more variables. This function can be used to calculate *Pearson's r*, *Kendall's tau*, and *Spearman's rho*.

We also need to specify how we want to deal with missing values (e.g., use pairwise complete observations).

For example, let's look at the correlations between coding skills, age, and participation frequency:

.small[
``` r
skills <- tuesdata_eda %>%
  select(skills_code, age, partici_freq)

cor(skills, use = "pairwise.complete.obs", method = "pearson")
```

```
##                skills_code         age  partici_freq
## skills_code   1.0000000000 -0.18168441 -0.0002540199
## age          -0.1816844148  1.00000000 -0.0366443756
## partici_freq -0.0002540199 -0.03664438  1.0000000000
```
]

---

## The `corrr` package

The [`corrr` package](https://corrr.tidymodels.org/) is part of the [`tidymodels` suite of packages](https://www.tidymodels.org/) and provides various functions for displaying correlations.

The main function is `correlate()`, which produces a `tibble` as output.

.small[
``` r
library(corrr)

correlate(skills)
```

```
## # A tibble: 3 × 4
##   term         skills_code      age partici_freq
##   <chr>              <dbl>    <dbl>        <dbl>
## 1 skills_code    NA         -0.182     -0.000254
## 2 age            -0.182     NA         -0.0366  
## 3 partici_freq   -0.000254  -0.0366    NA
```
]

---

## The `corrr` package

The `corrr` package provides several functions for tweaking/optimizing the output of the `correlate()` function. Here's one example:

.small[
``` r
skills %>%
  correlate() %>%
  shave() %>%
  fashion()
```

```
##           term skills_code  age partici_freq
## 1  skills_code                              
## 2          age        -.18                  
## 3 partici_freq        -.00 -.04
```
]

---

## The `correlation` package

The [`correlation` package](https://easystats.github.io/correlation/) is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) (also called the `Easyverse`). It provides a much wider range of correlation types than the `base R` function `cor()` and the `correlate()` function from the `corrr` package (e.g., biserial and tetrachoric correlations for factors).

Its core function is `correlation()`.

``` r
library(correlation)

correlation(skills)
```

.right[↪️]

---

class: middle

```
## # Correlation Matrix (pearson-method)
## 
## Parameter1  |   Parameter2 |         r |         95% CI |      t
## ----------------------------------------------------------------
## skills_code |          age |     -0.18 | [-0.19, -0.17] | -47.26
## skills_code | partici_freq | -2.54e-04 | [-0.01,  0.01] |  -0.05
## age         | partici_freq |     -0.04 | [-0.05, -0.03] |  -7.80
## 
## Parameter1  |    df |         p
## -------------------------------
## skills_code | 65435 | < .001***
## skills_code | 45235 | 0.957    
## age         | 45235 | < .001***
## 
## p-value adjustment method: Holm (1979)
## Observations: 45237-65437
```
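---

## The `correlation` package

Like `cor()`, the `correlation()` function accepts alternative coefficients via its `method` argument. A minimal sketch (assuming we want rank-based correlations, which can be more appropriate for ordinal items like ours):

``` r
# Spearman's rho instead of the default Pearson's r
correlation(skills, method = "spearman")
```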
---

## Guilty by ~~association~~ correlation

While correlation coefficients are useful for exploring relationships between variables, they can also be misleading.

For example, when we encounter a (Pearson's) correlation coefficient close to 0, we often imagine a relationship like the one pictured below.

<img src="3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-1-1.png" width="45%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation

This data set has **the same correlation coefficient (Pearson's r of -0.06)** as the one on the previous slide:

<img src="3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-2-1.png" width="50%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation 🦖

So does this one...

<img src="3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-3-1.png" width="60%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation

We could go on...

The previous three examples all come from the [`datasauRus` package](https://jumpingrivers.github.io/datasauRus/), which is essentially an extension of [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) and includes 13 data sets with the same (Pearson) correlation between x and y.

<img src="3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-4-1.png" width="50%" style="display: block; margin: auto;" />

---

## Trust no singular value!

Importantly, the x- and y-variables in the `datasaurus_dozen` data set also all have the same means and standard deviations.

``` r
library(datasauRus)

datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    corr = cor(x, y, method = "pearson")
  )
```

.right[↪️]

---

class: middle

.small[
```
## # A tibble: 13 × 6
##    dataset    mean_x mean_y  sd_x  sd_y    corr
##    <chr>       <dbl>  <dbl> <dbl> <dbl>   <dbl>
##  1 away         54.3   47.8  16.8  26.9 -0.0641
##  2 bullseye     54.3   47.8  16.8  26.9 -0.0686
##  3 circle       54.3   47.8  16.8  26.9 -0.0683
##  4 dino         54.3   47.8  16.8  26.9 -0.0645
##  5 dots         54.3   47.8  16.8  26.9 -0.0603
##  6 h_lines      54.3   47.8  16.8  26.9 -0.0617
##  7 high_lines   54.3   47.8  16.8  26.9 -0.0685
##  8 slant_down   54.3   47.8  16.8  26.9 -0.0690
##  9 slant_up     54.3   47.8  16.8  26.9 -0.0686
## 10 star         54.3   47.8  16.8  26.9 -0.0630
## 11 v_lines      54.3   47.8  16.8  26.9 -0.0694
## 12 wide_lines   54.3   47.8  16.8  26.9 -0.0666
## 13 x_shape      54.3   47.8  16.8  26.9 -0.0656
```
]
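---

## Trust no singular value!

To see how different these 13 data sets really are, a quick plot is enough. Below is a minimal sketch with `ggplot2` (loaded as part of the `tidyverse`; data visualization is covered in depth in the next session):

``` r
datasaurus_dozen %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  facet_wrap(~ dataset, ncol = 4) # one panel per data set
```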
---

## Plot your data!

The message from the `datasaurus_dozen` examples should be clear. Relying only on singular values that summarize the location or spread of a single variable or the association of two variables is not a good idea.

To avoid reducing a ~~mountain to a molehill~~ dinosaur to a lack of correlation, it is important to plot your data to explore:

- univariate distributions
- grouped univariate distributions (if you want to compare groups)
- bivariate distributions

For that reason (and because it is fun and `R` has a lot to offer in that regard), we will dive 🥽 right into data visualization in the next session...

---

## Other packages for EDA

Besides the packages we have covered in this session and `summarytools`, there are still plenty of others that provide interesting EDA functionalities.

Some of these additional/alternative options are:

- [`inspectdf`](https://alastairrushworth.github.io/inspectdf/)
- [`skimr`](https://docs.ropensci.org/skimr/)
- [`descriptr`](https://descriptr.rsquaredacademy.com/index.html)
- [`DataExplorer`](https://boxuancui.github.io/DataExplorer/)
- [`explore`](https://github.com/rolkra/explore)
- [`dataReporter`](https://github.com/ekstroem/dataReporter)

---

## Automated EDA reports

Some of the EDA packages provide functions for generating automated EDA reports, for example:

- the `dfSummary()` function from the `summarytools` package
- the `skim()` function from the [`skimr` package](https://docs.ropensci.org/skimr/)
- the `make_report()` function from the [`dataReporter` package](https://github.com/ekstroem/dataReporter)
- the `report()` function from the [`explore` package](https://github.com/rolkra/explore)

The `explore()` function from the `explore` package also allows for interactive data exploration.

---

# Extracurricular activities

Check out the appendix slides on exploring missing data and outliers and/or watch the [*YouTube* tutorial *EDA with the tidyverse: Exploring Variation*](https://www.youtube.com/watch?v=26hbyVb00xs) by Mauro Lepore for even more EDA inspiration.