class: center, middle, inverse, title-slide .title[ # Data Literacy: Introduction to R ] .subtitle[ ## Exploratory Data Analysis ] .author[ ### Veronika Batzdorfer ] .date[ ### 2024-11-21 ] --- layout: true --- ## Exploring EDA As stated before, exploratory data analysis can take many shapes and forms. In this session, we will focus on the following: - summary statistics (for numeric variables) - frequencies & proportions (for categorical variables) - cross-tabulations & correlations *Note*: In this session, we will focus on numeric summaries and descriptions of our data. Notably, data visualization typically also is a big part of EDA. We will cover that in the following session on data visualization with `R`. --- ## Disclaimer: A flood 🌊 of packages 📦 In the previous sessions, we have tried to focus on a small set of packages (mostly from `base R` and the `tidyverse`) and took a deep dive 🥽 exploring their functionalities. By comparison, in this session, we will browse through more different packages. As we have said before, there typically is more than one option (and package) for doing things in `R`. Given the high number of packages that have been developed for EDA, this topic is a very suitable example for demonstrating this. --- ## Repetition: Data wrangling pipeline As a repetition and reminder, we will quickly go some of the wrangling steps we discussed before for variables listed on the previous slide in the following. Of course, it is possible to do the whole wrangling in one pipe. However, to check if everything worked it is advisable to break up the pipe into smaller chunks (a nice tool for checking and debugging pipes that also provides an *RStudio* addin is the package [`ViewPipeSteps`](https://github.com/daranzolin/ViewPipeSteps)). Also, splitting up the wrangling pipe steps allows us to show them on the slides. --- ## Wrangling pipeline: Import data, select, & rename variables .small[ ```r library(sjlabelled) library(tidyverse) library(haven) library(dplyr) stackoverflow_survey_questions <- read_csv("./data/stackoverflow_survey_questions.csv") stackoverflow_survey_single_response <- read_csv("./data/stackoverflow_survey_single_response.csv") qname_levels_single_response_crosswalk <- read_csv("./data/qname_levels_single_response_crosswalk.csv") ``` ```r tuesdata_eda <- stackoverflow_survey_single_response %>% select(age, main_branch, ai_select, ai_sent, ai_acc, ai_complex, ai_threat, partici_freq = so_part_freq ) ``` ] --- ## Wrangling pipeline: Change variable types & recode values .small[ ```r tuesdata_eda <- tuesdata_eda %>% mutate(skills_code = recode(main_branch, `1` = 5, `5` = 4, `3` = 3, `4` = 2, `2` = 1), ai_threat = recode_factor(ai_threat, `1` = "Unsure", `2` = "No", `3` = "yes"), agec = recode_factor(age, `1`= "18-24 years", `2`= "25-34 years", `3` = "35-44 years", `4` = "45-54 years", `5` = "55-64 years", `6` = "65 years or older", .ordered = TRUE), ai_acc = recode_factor(ai_acc, `1`= "Highly distrust", `4`= "Somewhat distrust", `5` = "Somewhat trust", `2` = "Highly trust", `3` = "Neither trust nor distrust" ) ) ``` ] --- ## Explore your data: First look 👀 To get a first impression of the data set, we can use some of the functions we discussed in the sessions on *Data Import & Export* and *Data Wrangling Basics*, such as `dim()`, `head()`, or `str()` from `base R`, `glimpse()` from `dplyr`, or `View()` in *RStudio*. While looking at the the full data set can give us a general understanding of the data and their format and also show if (and how) we may need to wrangle them (further), it is difficult to make sense of the data just by looking at it. --- ## Summary statistics To make sense of quantitative data we can reduce their information to unique values. -- .center[ ~ **That's a simple definition of summary statistics** ~] -- As such, we can use summarizing functions of - location (e.g., the mean), - spread (e.g., standard deviation), - the shape of the distribution (e.g., skewness), and - relations between variables (e.g., correlation coefficients) --- ## Summary statistics: `summary()` A quick and easy way to check some summary statistics for your data set is the `base R` function `summary()` which can be applied to individual variables... ```r summary(tuesdata_eda$ai_acc) ``` ``` ## Highly distrust Somewhat distrust ## 2932 8395 ## Somewhat trust Highly trust ## 15039 1016 ## Neither trust nor distrust NA's ## 9920 28135 ``` as well as whole data frames: ```r summary(tuesdata_eda) ``` .right[↪️] --- class: middle .small[ ``` ## age main_branch ai_select ai_sent ## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 ## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 ## Median :2.000 Median :1.000 Median :3.000 Median :2.000 ## Mean :2.629 Mean :1.503 Mean :2.375 Mean :2.386 ## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:4.000 ## Max. :8.000 Max. :5.000 Max. :3.000 Max. :6.000 ## NA's :4530 NA's :19564 ## ai_acc ai_complex ai_threat ## Highly distrust : 2932 Min. :1.000 Unsure: 8878 ## Somewhat distrust : 8395 1st Qu.:1.000 No :30423 ## Somewhat trust :15039 Median :2.000 yes : 5388 ## Highly trust : 1016 Mean :2.232 NA's :20748 ## Neither trust nor distrust: 9920 3rd Qu.:3.000 ## NA's :28135 Max. :5.000 ## NA's :28416 ## partici_freq skills_code agec ## Min. :1.000 Min. :1.000 18-24 years :14098 ## 1st Qu.:4.000 1st Qu.:5.000 25-34 years :23911 ## Median :5.000 Median :5.000 35-44 years :14942 ## Mean :4.016 Mean :4.388 45-54 years : 6249 ## 3rd Qu.:5.000 3rd Qu.:5.000 55-64 years : 2575 ## Max. :6.000 Max. :5.000 65 years or older: 772 ## NA's :20200 NA's : 2890 ``` ] --- ## Summary statistics with the `datawizard` package 🧙 The [`datawizard` package](https://easystats.github.io/datawizard/) which we introduced as an interesting alternative or addition for data wrangling before also provides a function for summary statistics. This function also offers several options for customizing the output. ```r library(datawizard) tuesdata_eda %>% select(where(is.numeric)) %>% describe_distribution(quartiles = TRUE) ``` .right[↪️] --- class: middle .small[ ``` ## Variable | Mean | SD | IQR | Range | Quartiles | Skewness | Kurtosis | n | n_Missing ## ------------------------------------------------------------------------------------------------------ ## age | 2.63 | 1.58 | 1 | [1.00, 8.00] | 2.00, 3.00 | 1.69 | 3.22 | 65437 | 0 ## main_branch | 1.50 | 1.02 | 0 | [1.00, 5.00] | 1.00, 1.00 | 1.93 | 2.62 | 65437 | 0 ## ai_select | 2.37 | 0.85 | 1 | [1.00, 3.00] | 2.00, 3.00 | -0.80 | -1.14 | 60907 | 4530 ## ai_sent | 2.39 | 1.68 | 3 | [1.00, 6.00] | 1.00, 4.00 | 0.77 | -1.11 | 45873 | 19564 ## ai_complex | 2.23 | 1.11 | 2 | [1.00, 5.00] | 1.00, 3.00 | 0.62 | -0.47 | 37021 | 28416 ## partici_freq | 4.02 | 1.43 | 1 | [1.00, 6.00] | 4.00, 5.00 | -1.25 | 0.11 | 45237 | 20200 ## skills_code | 4.39 | 1.21 | 0 | [1.00, 5.00] | 5.00, 5.00 | -1.76 | 1.71 | 65437 | 0 ``` ] --- ## Summary statistics with `dplyr` `dplyr` provides a helpful function for creating summary statistics: `summarize()` `summarize()` is a [vectorized](https://win-vector.com/2019/01/03/what-does-it-mean-to-write-vectorized-code-in-r/) function that can be used to create summary statistics for variables using functions like... - `mean()` - `sd()` - `min()` - `max()` - etc. --- ## Summary statistics with `dplyr` While creating summary statistics using `summarize()` from `dplyr()` requires writing more code, it is the most flexible option. Another nice benefit of `summarize()` is that it produces a `tibble` which can be used for further analyses or for creating plots or tables. --- ## `dplyr::summarize()` .small[ ```r tuesdata_eda %>% summarize( mean_partici_freq = mean(partici_freq, na.rm = TRUE), sd_partici_freq = sd(partici_freq, na.rm = TRUE), var_partici_freq = var(partici_freq, na.rm = TRUE), min_partici_freq = min(partici_freq, na.rm = TRUE), max_partici_freq = max(partici_freq, na.rm = TRUE) ) %>% round(2) # round to 2 decimal places ``` ``` ## # A tibble: 1 × 5 ## mean_partici_freq sd_partici_freq var_partici_freq min_partici_freq ## <dbl> <dbl> <dbl> <dbl> ## 1 4.02 1.43 2.04 1 ## # ℹ 1 more variable: max_partici_freq <dbl> ``` ] --- ## `dplyr::group_by()` The `dplyr` function `group_by()` creates data frames (tibbles) that are grouped by one or more variables. This can, e.g., be used to produce grouped summary statistics. ```r tuesdata_eda %>% filter(!is.na(skills_code)) %>% # only use cases where skills_code is not missing group_by(skills_code) %>% summarize( mean_partici_freq = mean(partici_freq, na.rm = TRUE), sd_partici_freq = sd(partici_freq, na.rm = TRUE), var_partici_freq = var(partici_freq, na.rm = TRUE), min_partici_freq = min(partici_freq, na.rm = TRUE), max_partici_freq = max(partici_freq, na.rm = TRUE) ) %>% ungroup() ``` .right[↪️] --- class: middle ``` ## # A tibble: 5 × 6 ## skills_code mean_partici_freq sd_partici_freq var_partici_freq ## <dbl> <dbl> <dbl> <dbl> ## 1 1 4.04 1.14 1.29 ## 2 2 4.01 1.36 1.84 ## 3 3 3.98 1.43 2.06 ## 4 4 4.17 1.37 1.87 ## 5 5 4.01 1.45 2.10 ## # ℹ 2 more variables: min_partici_freq <dbl>, max_partici_freq <dbl> ``` --- ## Grouping and ungrouping Keep in mind that `group_by()` changes the class of a dataframe. To avoid (accidentally) continuing to perform operations groupwise, it is necessary to use the `ungroup()` function (typically at the end of a pipe). In that way, this function is similar to the *Split by* option in *SPSS*. ```r class(tuesdata_eda) ``` ``` ## [1] "tbl_df" "tbl" "data.frame" ``` ```r tuesdata_eda %>% group_by(skills_code) %>% class() ``` ``` ## [1] "grouped_df" "tbl_df" "tbl" "data.frame" ``` --- ## `dplyr::across()` To produce grouped summary statistics for multiple variables we can use the `dplyr` function `across()` which we already saw in action in the second data wrangling session. *Note*: We only use cases without missing data for any of the variables here (= listwise deletion). ```r tuesdata_eda %>% select(skills_code, starts_with("ai")) %>% select(-ai_threat, -ai_acc) %>% drop_na() %>% group_by(skills_code) %>% summarize(across(starts_with("ai"), list(mean = mean, sd = sd), .names = "{col}_{fn}")) ``` .right[↪️] --- class: middle ``` ## # A tibble: 5 × 7 ## skills_code ai_select_mean ai_select_sd ai_sent_mean ai_sent_sd ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 3 0 2.42 1.74 ## 2 2 3 0 2.30 1.64 ## 3 3 3 0 2.45 1.76 ## 4 4 3 0 2.59 1.82 ## 5 5 3 0 2.42 1.74 ## # ℹ 2 more variables: ai_complex_mean <dbl>, ai_complex_sd <dbl> ``` --- ## Frequencies: `table()` A simple way of looking at frequencies (e.g., for categorical variables) that we have already seen before is the `base R` function `table()`. ```r table(tuesdata_eda$agec) ``` ``` ## ## 18-24 years 25-34 years 35-44 years 45-54 years ## 14098 23911 14942 6249 ## 55-64 years 65 years or older ## 2575 772 ``` If you also want to include `NA` in the frequency counts, you need to specify the argument `useNA = "always"`. ```r table(tuesdata_eda$agec, useNA = "always") ``` ``` ## ## 18-24 years 25-34 years 35-44 years 45-54 years ## 14098 23911 14942 6249 ## 55-64 years 65 years or older <NA> ## 2575 772 2890 ``` --- ## Proportions with `prop.table()` If you want proportions instead of raw counts, you can use the `base R` function `prop.table()`. **NB**: You need to apply this function to an output produced by `table()`. .small[ ```r prop.table(table(tuesdata_eda$ai_acc)) ``` ``` ## ## Highly distrust Somewhat distrust ## 0.07860168 0.22505496 ## Somewhat trust Highly trust ## 0.40316873 0.02723715 ## Neither trust nor distrust ## 0.26593748 ``` ```r prop.table(table(tuesdata_eda$ai_acc , useNA = "always")) ``` ``` ## ## Highly distrust Somewhat distrust ## 0.04480646 0.12829133 ## Somewhat trust Highly trust ## 0.22982411 0.01552638 ## Neither trust nor distrust <NA> ## 0.15159619 0.42995553 ``` ] --- ## Proportions with `prop.table()` If you want fewer decimals places in the output, you can wrap the the `prop.table()` function in a `round()` call. ```r round(prop.table(table(tuesdata_eda$ai_acc, useNA = "always")), 3) # rounded to 3 decimal places ``` ``` ## ## Highly distrust Somewhat distrust ## 0.045 0.128 ## Somewhat trust Highly trust ## 0.230 0.016 ## Neither trust nor distrust <NA> ## 0.152 0.430 ``` Or if you want percentages... ```r round((prop.table(table(tuesdata_eda$ai_acc, useNA = "always")) * 100), 2) ``` ``` ## ## Highly distrust Somewhat distrust ## 4.48 12.83 ## Somewhat trust Highly trust ## 22.98 1.55 ## Neither trust nor distrust <NA> ## 15.16 43.00 ``` --- ## Frequencies and proportions with `janitor::tabyl()` The [`janitor` package](https://github.com/sfirke/janitor) that we briefly mentioned in the first session on data wrangling also provides the [`tabyl()` function](https://sfirke.github.io/janitor/articles/tabyls.html) for creating frequency and proportion tables: .small[ ```r library(janitor) threat_stats <- tuesdata_eda %>% tabyl(ai_threat) %>% adorn_pct_formatting(digits = 2, affix_sign = TRUE) threat_stats ``` ``` ## ai_threat n percent valid_percent ## Unsure 8878 13.57% 19.87% ## No 30423 46.49% 68.08% ## yes 5388 8.23% 12.06% ## <NA> 20748 31.71% - ``` --- ## Frequencies and proportions with `janitor::tabyl()` A nice thing about `tabyl()` is that is produces a (special type of) data frame which we can, e.g., use for plotting or creating tables (which we will discuss later on). ```r class(threat_stats) ``` ``` ## [1] "tabyl" "data.frame" ``` --- ## Frequencies and proportions with `dplyr` We can also use `group_by()` and `summarize()` to get frequencies and proportions for variables in our data set. ```r tuesdata_eda %>% filter(!is.na(ai_threat)) %>% group_by(ai_threat) %>% summarize(n = n()) %>% mutate(proportion = n/sum(n)) %>% ungroup() ``` ``` ## # A tibble: 3 × 3 ## ai_threat n proportion ## <fct> <int> <dbl> ## 1 Unsure 8878 0.199 ## 2 No 30423 0.681 ## 3 yes 5388 0.121 ``` --- ## Frequencies and proportions with `dplyr` Instead of using `group_by` and `summarize()` to get frequency counts, we can also use `count()` from `dplyr` as a shorthand. ```r tuesdata_eda %>% filter(!is.na(ai_threat)) %>% count(ai_threat) %>% mutate(proportion = n/sum(n)) %>% ungroup() ``` ``` ## # A tibble: 3 × 3 ## ai_threat n proportion ## <fct> <int> <dbl> ## 1 Unsure 8878 0.199 ## 2 No 30423 0.681 ## 3 yes 5388 0.121 ``` --- ## Tables in `R` There are typically two types of outputs you can produce with `R` for further use in reports or publications: tables and plots. We will cover the creation of plots in the following session(s) on data visualization. However, as summary statistics, frequencies, and proportions are often presented in tables, we will briefly discuss how to create tables as output in `R` in the following. As with almost everything in `R`, there are many options (read: packages) for creating tables. We will show examples from three popular options: - [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html) - [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/) + [`flextable`](https://davidgohel.github.io/flextable/index.html) --- ## Summary tables with `stargazer` While there is an ever-growing list of `R` packages for creating tables with many new (promising) contenders (such as [`flextable`](https://davidgohel.github.io/flextable/index.html) or [`gt`](https://gt.rstudio.com/index.html)), `stargazer` is an established and widely-used tool for creating ACSII (plain text), `LaTeX`, and `HTML` table output. --- ## Summary tables with `stargazer` If we, e.g., want to create a summary statistics table as text output (e.g., for printing in the `R` console), we can use `stargazer` for that. **NB**: As the main `stargazer()` function does not work with tibbles, we need to convert our data to a regular data frame. ```r library(stargazer) tuesdata_eda %>% select(ai_threat, partici_freq, ai_acc) %>% as.data.frame() %>% stargazer(type = "text", digits = 2, title="Descriptive statistics") ``` .right[↪️] --- class: middle ``` ## ## Descriptive statistics ## ========================================= ## Statistic N Mean St. Dev. Min Max ## ----------------------------------------- ## partici_freq 45,237 4.02 1.43 1 6 ## ----------------------------------------- ``` --- ## Summary tables with `stargazer` `stargazer` also supports `HTML` and `LaTeX` as output formats. We can export the output for further use (e.g., with our `LaTeX` editor of choice). ```r # We create a directory for output files first dir.create("./output") tuesdata_eda %>% select(ai_threat, partici_freq, ai_acc) %>% as.data.frame() %>% stargazer(type = "latex", digits = 2, out = "./output/stargazer_summary.tex", title="Descriptive statistics") ``` .small[ *Note*: If you look at the help file for the `stargazer()` function (via `?stargazer`), you will see that it provides a lot of customization options. ] --- ## Relationships between variables In addition to checking summary statistics for individual variables, another thing that you quite possibly also want to look at as part of your EDA are the relationships between (specific) variables in your data set. There are many ways to do so and the appropriate choice of methods, of course, depends on the types of variables you want to explore. In the following, we will briefly discuss some options for two methods of exploring relationships between variables: - crosstabulation (for categorical variables) - correlations (for numeric and/or binary variables) --- ## Crosstabs You can also use the `base R` functions `table()` and `prop.table()` we have used before for creating univariate frequency and proportion tables to generate crosstabs. ```r table(tuesdata_eda$ai_complex, tuesdata_eda$agec) # rows, columns ``` ``` ## ## 18-24 years 25-34 years 35-44 years 45-54 years 55-64 years ## 1 2688 4537 2632 945 258 ## 2 3068 4669 2455 993 309 ## 3 1680 2903 1762 728 288 ## 4 1077 1756 986 311 96 ## 5 278 423 241 137 49 ## ## 65 years or older ## 1 50 ## 2 86 ## 3 66 ## 4 11 ## 5 22 ``` ```r round(prop.table(table(tuesdata_eda$ai_complex, tuesdata_eda$agec))*100, 2) ``` ``` ## ## 18-24 years 25-34 years 35-44 years 45-54 years 55-64 years ## 1 7.57 12.78 7.41 2.66 0.73 ## 2 8.64 13.15 6.91 2.80 0.87 ## 3 4.73 8.18 4.96 2.05 0.81 ## 4 3.03 4.95 2.78 0.88 0.27 ## 5 0.78 1.19 0.68 0.39 0.14 ## ## 65 years or older ## 1 0.14 ## 2 0.24 ## 3 0.19 ## 4 0.03 ## 5 0.06 ``` --- ## Crosstabs with `dplyr` We can use also functions from `dplyr` to create crosstabs including frequencies. ```r tuesdata_eda %>% filter(!is.na(ai_complex), !is.na(agec)) %>% count(agec, ai_complex) %>% pivot_wider(names_from = ai_complex, values_from = n) ``` ``` ## # A tibble: 6 × 6 ## agec `1` `2` `3` `4` `5` ## <ord> <int> <int> <int> <int> <int> ## 1 18-24 years 2688 3068 1680 1077 278 ## 2 25-34 years 4537 4669 2903 1756 423 ## 3 35-44 years 2632 2455 1762 986 241 ## 4 45-54 years 945 993 728 311 137 ## 5 55-64 years 258 309 288 96 49 ## 6 65 years or older 50 86 66 11 22 ``` .small[ *Note*: We have only briefly mentioned the function in the first session on data wrangling (when introducing the concept of tidy data), but - as you might guess from its name - the function `pivot_wider()`, which is part of the [`tidyr` package](https://tidyr.tidyverse.org/), changes the format of a data set from long to wide. ] --- ## Crosstabs with `dplyr` We can also use functions from `dplyr` in a similar fashion as we have done for a single variable to create crosstabs including percentages. ```r tuesdata_eda %>% filter(!is.na(ai_complex), !is.na(agec)) %>% count(agec, ai_complex) %>% mutate(proportion = n/sum(n)*100) %>% select(-n) %>% pivot_wider(names_from = ai_complex, values_from = proportion) ``` ``` ## # A tibble: 6 × 6 ## agec `1` `2` `3` `4` `5` ## <ord> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 18-24 years 7.57 8.64 4.73 3.03 0.783 ## 2 25-34 years 12.8 13.2 8.18 4.95 1.19 ## 3 35-44 years 7.41 6.91 4.96 2.78 0.679 ## 4 45-54 years 2.66 2.80 2.05 0.876 0.386 ## 5 55-64 years 0.727 0.870 0.811 0.270 0.138 ## 6 65 years or older 0.141 0.242 0.186 0.0310 0.0620 ``` --- ## Crosstabs with the `janitor` package The `tabyl()` function from the `janitor` package provides quite a few options for crosstabs. We will only show one extended example here, but you can learn more in the [`tabyl` vignette](https://cran.r-project.org/web/packages/janitor/vignettes/tabyls.html). .small[ ```r library(janitor) tuesdata_eda %>% filter(!is.na(ai_complex), !is.na(agec)) %>% tabyl(agec, ai_complex) %>% adorn_totals(where = c("row","col")) %>% adorn_percentages(denominator = "row") %>% adorn_pct_formatting(digits = 2) %>% adorn_ns(position = "front") ``` ``` ## agec 1 2 3 ## 18-24 years 2,688 (30.58%) 3,068 (34.90%) 1,680 (19.11%) ## 25-34 years 4,537 (31.75%) 4,669 (32.68%) 2,903 (20.32%) ## 35-44 years 2,632 (32.59%) 2,455 (30.40%) 1,762 (21.82%) ## 45-54 years 945 (30.35%) 993 (31.89%) 728 (23.38%) ## 55-64 years 258 (25.80%) 309 (30.90%) 288 (28.80%) ## 65 years or older 50 (21.28%) 86 (36.60%) 66 (28.09%) ## Total 11,110 (31.29%) 11,580 (32.62%) 7,427 (20.92%) ## 4 5 Total ## 1,077 (12.25%) 278 (3.16%) 8,791 (100.00%) ## 1,756 (12.29%) 423 (2.96%) 14,288 (100.00%) ## 986 (12.21%) 241 (2.98%) 8,076 (100.00%) ## 311 (9.99%) 137 (4.40%) 3,114 (100.00%) ## 96 (9.60%) 49 (4.90%) 1,000 (100.00%) ## 11 (4.68%) 22 (9.36%) 235 (100.00%) ## 4,237 (11.93%) 1,150 (3.24%) 35,504 (100.00%) ``` ] --- ## Other options for crosstabulation in `R` As with most things in `R`, there are many options for creating crosstabs. Some alternatives to the ones we presented before include the `CrossTable()` and `crosstab()` functions from the [`descr` package](https://cran.r-project.org/web/packages/descr/index.html) or the `ctable()` function from the [`summarytools` package](https://github.com/dcomtois/summarytools). Another quite very versatile option for creating crosstabs as well as output based on those (e.g., in combination with the `flextable` package we have briefly used before) is the [`crosstable` package](https://danchaltiel.github.io/crosstable/) --- ## Chi-Square Test We can use the `summary()` function from `base R` in combination with `table()` to perform a chi-square test. ```r summary(table(tuesdata_eda $agec, tuesdata_eda$ai_acc )) ``` ``` ## Number of cases in table: 35766 ## Number of factors: 2 ## Test for independence of all factors: ## Chisq = 155.91, df = 20, p-value = 4.615e-23 ``` --- ## Chi-Square Test Most of the other packages that include functions for crosstabs that we mentioned before can also be used for chi-square tests: For example, the `janitor` package. ```r tuesdata_eda %>% filter(!is.na(ai_acc), !is.na(agec)) %>% tabyl(agec, ai_acc) %>% chisq.test() ``` ``` ## ## Pearson's Chi-squared test ## ## data: . ## X-squared = 155.91, df = 20, p-value < 2.2e-16 ``` --- ## Correlations Again, as with the crosstabs examples, there are many different options for calculating and displaying correlations in `R`. In addition to the `base R` functions, we will look at two packages in this part: [`corrr`](https://corrr.tidymodels.org/) and [`correlation`](https://github.com/easystats/correlation). *Note*: While we will not cover that in this session, `corrr` and `correlation` also offer some nice options for plotting correlations. --- ## Correlations with `base R` The `base R` function `cor()` computes the correlation coefficient(s) between two or more variables. This function can be used to calculate *Pearson's r*, *Kendall's tau*, and *Spearman's rho*. We also need to specify how we want to deal with missing values (e.g., use pairwise complete observations). For example, let's look at the correlations between the trust variables in our data set: .small[ ```r skills <- tuesdata_eda %>% select(skills_code,age,partici_freq) cor(skills, use = "pairwise.complete.obs", method = "pearson") ``` ``` ## skills_code age partici_freq ## skills_code 1.0000000000 -0.18168441 -0.0002540199 ## age -0.1816844148 1.00000000 -0.0366443756 ## partici_freq -0.0002540199 -0.03664438 1.0000000000 ``` ] --- ## The `corrr` package The [`corrr` package](https://corrr.tidymodels.org/) is part of the [`tidymodels` suite of packages](https://www.tidymodels.org/) and provides various functions for displaying correlations. The main function is `correlate()` which produces a `tibble` as output. .small[ ```r library(corrr) correlate(skills) ``` ``` ## # A tibble: 3 × 4 ## term skills_code age partici_freq ## <chr> <dbl> <dbl> <dbl> ## 1 skills_code NA -0.182 -0.000254 ## 2 age -0.182 NA -0.0366 ## 3 partici_freq -0.000254 -0.0366 NA ``` ] --- ## The `corrr` package The `corrr` package provides several functions for tweaking/optimizing the output of the `correlate()` function. Here's one example: .small[ ```r skills %>% correlate() %>% shave() %>% fashion() ``` ``` ## term skills_code age partici_freq ## 1 skills_code ## 2 age -.18 ## 3 partici_freq -.00 -.04 ``` ] --- ## The `correlation` package The [`correlation` package](https://easystats.github.io/correlation/) is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) (also called the `Easyverse`). It provides a much wider range of correlation types than the `base R` function `cor()` and the `correlate()` function from the `corrr` package (e.g., biserial and tetrachoric correlations for factors). Its core is the `correlation()` function. ```r library(correlation) correlation(skills) ``` .right[↪️] --- class: middle ``` ## # Correlation Matrix (pearson-method) ## ## Parameter1 | Parameter2 | r | 95% CI | t | df | p ## ------------------------------------------------------------------------------------ ## skills_code | age | -0.18 | [-0.19, -0.17] | -47.26 | 65435 | < .001*** ## skills_code | partici_freq | -2.54e-04 | [-0.01, 0.01] | -0.05 | 45235 | 0.957 ## age | partici_freq | -0.04 | [-0.05, -0.03] | -7.80 | 45235 | < .001*** ## ## p-value adjustment method: Holm (1979) ## Observations: 45237-65437 ``` --- ## Guilty by ~~association~~ correlation While correlation coefficients are useful for exploring relationships between variables, they can also be misleading. For example, if we do correlation analysis and encounter a (Pearson's) correlation coefficient close to 0, we often think of relationships as pictured below. <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-1-1.png" width="45%" style="display: block; margin: auto;" /> --- ## Guilty by ~~association~~ correlation This data set has **the same correlation coefficient (Pearson's r of -0.06)** as the one on the previous slide: <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-2-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Guilty by ~~association~~ correlation 🦖 So does this one... <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-3-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Guilty by ~~association~~ correlation We could go on... The previous three examples all come from the [`datasauRus` package](https://jumpingrivers.github.io/datasauRus/) which essentially is an extension of [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) and includes 13 data sets with the same (Pearson) correlation between x and y. <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-4-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Trust no singular value! Importantly, the x- and y-variables in these `datasaurus_dozen` data set also all have the same means and standard deviations. ```r datasaurus_dozen %>% group_by(dataset) %>% summarize( mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y), corr = cor(x, y, method = "pearson") ) ``` .right[↪️] --- class: middle .small[ ``` ## # A tibble: 13 × 6 ## dataset mean_x mean_y sd_x sd_y corr ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 away 54.3 47.8 16.8 26.9 -0.0641 ## 2 bullseye 54.3 47.8 16.8 26.9 -0.0686 ## 3 circle 54.3 47.8 16.8 26.9 -0.0683 ## 4 dino 54.3 47.8 16.8 26.9 -0.0645 ## 5 dots 54.3 47.8 16.8 26.9 -0.0603 ## 6 h_lines 54.3 47.8 16.8 26.9 -0.0617 ## 7 high_lines 54.3 47.8 16.8 26.9 -0.0685 ## 8 slant_down 54.3 47.8 16.8 26.9 -0.0690 ## 9 slant_up 54.3 47.8 16.8 26.9 -0.0686 ## 10 star 54.3 47.8 16.8 26.9 -0.0630 ## 11 v_lines 54.3 47.8 16.8 26.9 -0.0694 ## 12 wide_lines 54.3 47.8 16.8 26.9 -0.0666 ## 13 x_shape 54.3 47.8 16.8 26.9 -0.0656 ``` ] --- ## Plot your data! The message from the `datasaurus_dozen` examples should be clear. Relying only on singular values that summarize the location or spread of a single variable or the association of two variables is not a good idea. To avoid reducing a ~~mountain to a molehill~~ dinosaur to a lack of correlation, it is important to plot your data to explore: - univariate distributions - grouped univariate distributions (if you want to compare groups) - bivariate distributions For that reason (and because it is fun and `R` has a lot to offer in that regard), we will dive 🥽 right into data visualization in the next session... --- ## Other packages for EDA The [`summarytools` package](https://github.com/dcomtois/summarytools) provides quite a lot of functionalities for EDA, provides an exhaustive [introduction](https://htmlpreview.github.io/?https://github.com/dcomtois/summarytools/blob/master/doc/introduction.html), and also [works nicely with `R Markdown`](https://htmlpreview.github.io/?https://github.com/dcomtois/summarytools/blob/master/doc/rmarkdown.html) (which we will cover on Thursday). A (minor) downside of this package is that you need to install additional software to use it on *MacOS* and *Linux* (it only works out-of-the-box on *Windows*). --- ## Other packages for EDA Besides the ones we have covered in this session and `summarytools`, there still are plenty of others that also provide some interesting functionalities. Some of these additional/alternative options are: - [`inspectdf`](https://alastairrushworth.github.io/inspectdf/) - [`skimr`](https://docs.ropensci.org/skimr/) - [`descriptr`](https://descriptr.rsquaredacademy.com/index.html) - [`DataExplorer`](https://boxuancui.github.io/DataExplorer/) - [`explore`](https://github.com/rolkra/explore) - [`dataReporter`](https://github.com/ekstroem/dataReporter) --- ## Automated EDA reports Some of the EDA packages provide functions for generating automated EDA reports, for example: - the `dfSummary()` function from the `summarytools` package - the `skim()` function from the [`skimr` package](https://docs.ropensci.org/skimr/) - the `make_report()` function from the [`dataReporter` package](https://github.com/ekstroem/dataReporter) - the `report()` function from the [`explore` package](https://github.com/rolkra/explore) The function `explore`() from the `explore` package also allows for interactive data exploration. --- # Extracurricular activities Check out the appendix slides on exploring missing data and outliers and/or watch the [*YouTube* tutorial *EDA with the tidyverse: Exploring Variation*](https://www.youtube.com/watch?v=26hbyVb00xs) by Mauro Lepore for even more EDA inspiration.