class: center, middle, inverse, title-slide .title[ # Data Literacy: Introduction to R ] .subtitle[ ## Visualization & Statistical Principles ] .author[ ### Veronika Batzdorfer ] .date[ ### 2026-05-22 ] --- layout: true --- ## We'll start rather basic <img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/trump.jpg?raw=true" width="85%" style="display: block; margin: auto;" /> .footnote[https://twitter.com/katjaberlin/status/1290667772779913218] --- ## Content of the visualization sessions .pull-left[ **`Base R` visualization** - Standard plotting procedures in R - very short ] .pull-right[ **`tidyverse`/`ggplot2` visualization** - Modern interface to graphics - grammar of graphics ] --- ## Visualization in the Research Cycle Data visualization is not just the final step—it's integral to the entire research process: .pull-left[ **Exploration Phase** - Detect outliers and anomalies - Identify patterns and relationships - Spot data quality issues - Formulate hypotheses ] .pull-right[ **Communication Phase** - Present findings clearly - Support arguments with evidence - Enable reproducibility - Facilitate peer review ] .center[ <small>Source: R for Data Science (Wickham & Grolemund)</small> ] --- ### Data for this session ``` r library(palmerpenguins) library(tidyverse) library(ggbeeswarm) df <- penguins %>% drop_na() ``` --- ## Graphics in `R` The `graphics` package is already included. 
.pull-left[ ``` r #-- Adjust plot margins-- # Create summary data first spec_counts <- df %>% count(species) %>% arrange(desc(n)) par(mar = c(5, 4, 4, 2)) barplot( height = spec_counts$n, names.arg = spec_counts$species, las = 2, main = "Penguin Species Count", ylab = "Number of Records", col = c("darkorange", "purple", "cyan4"), cex.names = 0.9 ) ``` ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> ] --- ## Let's start from the beginning The most basic function to plot in R is `plot()`. .pull-left[ ``` r plot(df$bill_length_mm, df$flipper_length_mm, main = "Bill vs. Flipper Length", xlab = "Bill Length (mm)", ylab = "Flipper Length (mm)", col = "darkblue", pch = 19) ``` ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] --- ## Where to go from here with `base R` graphics? .pull-left[ Using similar procedures, we can add more and more stuff to our plot or edit its elements: - regression lines - legends - annotations - colors - etc. ] .pull-right[ We can also create different *plot types*, such as - histograms - barplots - boxplots - densities - pie charts - etc. ] --- ## Example: A simple boxplot .pull-left[ ``` r boxplot( df$body_mass_g ~ df$year ) ``` ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] --- ## The `par()` and `dev.off()` functions for plots .highlight[`par()`] stands for graphical parameters and is called before the actual plotting function. It prepares the graphics device in `R`. The most commonly used options are for "telling" the device that 2, 3, 4, or `x` plots have to be printed. 
We can, e.g., use `mfrow` to specify how many rows (the first value in the vector) and columns (the second value in the vector) of plots we want on the device.

``` r
par(mfrow = c(2, 2))
```

One caveat of using this function is that the setting stays active for the current graphics device, so we have to close the device before generating another independent plot.

``` r
dev.off()
```

---

## Exporting with *RStudio*

<img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/saveGraphic.PNG?raw=true" style="display: block; margin: auto;" />

---

## Saving plots via a command

Alternatively, you can export plots with commands such as `png()`, `pdf()`, or `jpeg()`. To do so, wrap the plot call between one of those functions and a `dev.off()` call.

``` r
png("Plot.png")
hist(df$body_mass_g)
dev.off()
```

``` r
pdf("Plot.pdf")
hist(df$body_mass_g)
dev.off()
```

``` r
jpeg("Plot.jpeg")
hist(df$body_mass_g)
dev.off()
```

---

## What is `ggplot2`?

`ggplot2` is another `R` package for creating plots and is part of the `tidyverse`. It uses the *grammar of graphics*. Some things to note about `ggplot2`:

- it is well-suited for multi-dimensional data
- it expects data (frames) as input
- components of the plot are added as layers

``` r
plot_call + layer_1 + layer_2 + ...
+ layer_n ``` --- ## Barplots as in `base R` .tinyish[ .pull-left[ ``` r ggplot(df, aes(x = fct_infreq(species))) + geom_bar(fill = c("darkorange", "purple", "cyan4")) + labs( title = "Penguin Species Count", y = "Number of Records", x = NULL ) ``` ] ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> ] --- ## Boxplots as in `base R` .tinyish[ .pull-left[ ``` r ggplot(df, aes(x = factor(year), y = body_mass_g)) + geom_boxplot() ``` ] ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] --- ## Components of a plot According to Wickham (2010, 8)* a layered plot consists of the following components: <span class="footnote"> <small><small><span class="red bold">*</span> http://dx.doi.org/10.1198/jcgs.2009.07098</small></small> </span> - data and aesthetic mappings, - geometric objects, - scales, - and facet specification ``` r plot_call + data + aesthetics + geometries + scales + facets ``` --- ## Data requirements You can use one single data frame to create a plot in `ggplot2`. This creates a smooth workflow from data wrangling to the final presentation of the results. <br> <img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/data-science_man.png?raw=true" width="65%" style="display: block; margin: auto;" /> <small><small>Source: http://r4ds.had.co.nz</small></small> --- ## Why the long format? 
🐴 `ggplot2` prefers data in long format (**NB**: of course, only if this is possible and makes sense for the data set at hand) .pull-left[ - every element we aim to plot is an observation - no thinking required how a specific variable relates to an observation - most importantly, the long format is more parsimonious - it requires less memory and less disk space ] .pull-right[ <img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/long.png?raw=true" width="40%" style="display: block; margin: auto;" /> <small><small>Source: https://github.com/gadenbuie/tidyexplain#tidy-data</small></small> ] --- ## Before we start The architecture of building plots in `ggplot` is similar to standard `R` graphics. There is an initial plotting call, and subsequently, more stuff is added to the plot. However, in `base R`, it is sometimes tricky to find out how to add (or remove) certain plot elements. For example, think of removing the axis ticks in the scatter plot. We will systematically explore which elements are used in `ggplot` in this session. --- ## Creating your own plot Three components are important: - Plot initiation and data input - aesthetics definition - so-called geoms --- ## Grammar of graphics: Initiation .pull-left[ `ggplot()` is the most basic command to create a plot: ``` r ggplot() ``` ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> ] **But it doesn't show anything...** --- ## What now? Data input! .pull-left[ ``` r ggplot(data = df ) ``` ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> ] **Still nothing there...** --- ## `aes`thetics! .pull-left[ `ggplot` requires information about the variables to plot. 
``` r ggplot(data = df, aes(x = bill_length_mm, y = flipper_length_mm)) ``` ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> ] **That's a little bit better, right?** --- ## `geom`s! .pull-left[ Finally, `ggplot` needs information *how* to plot the variables. ``` r ggplot(data = df, aes(x = bill_length_mm, y = flipper_length_mm)) + geom_point(aes(color = species)) ``` ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> ] **A scatter plot!** --- ## Add a fancy `geom` .pull-left[ We can also add more than one `geom`. ``` r ggplot(data = df, aes(x = bill_length_mm, y = flipper_length_mm)) + geom_point(aes(color = species))+ geom_smooth(method = "lm", se = FALSE) ``` ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> ] **A regression line!** (without confidence intervals; the regression behind this operation is run automatically) --- ## Going further: adding group `aes`thetics .pull-left[ We can add different colors for different groups in our data. ``` r ggplot(data = df, aes(x = bill_length_mm, y = flipper_length_mm, group = species)) + geom_point(aes(color = species))+ geom_smooth(method = "lm", se = FALSE) ``` ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> ] --- ## Manipulating group `aes`thetics .pull-left[ We can also change the colors that are used in the plot. 
``` r
ggplot(data = df,
       aes(x = bill_length_mm, y = flipper_length_mm,
           group = species, color = species)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", se = FALSE)
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
]

The legend is drawn automatically. That's handy!

---

## Using another color palette

.pull-left[
``` r
ggplot(data = df,
       aes(x = bill_length_mm, y = flipper_length_mm,
           group = species, color = species)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Dark2")
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />
]

---

## Difference between `color` and `fill`

Notably, there are two components of the plot or `geom` associated with colors: `color` and `fill`.

Generally, .highlight[`color`] refers to the geometry borders, such as a line. .highlight[`fill`] refers to a geometry area, such as a polygon.

Remember this distinction when choosing between `scale_color_brewer` and `scale_fill_brewer` in your plots.

---

## Colors and `theme`s

One particular strength of `ggplot2` lies in its immense theming capabilities. The package has some built-in theme functions that make theming a plot fairly easy, and extension packages add more, e.g.,

- `theme_bw()`
- `theme_void()`
- `theme_apa()` (not built in; provided by the `papaja` package)
- etc.
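To see what switching themes does in practice, here is a minimal sketch. It assumes the `palmerpenguins` data used in this session; the object names `p_base`, `p_bw`, and `p_void` are made up for illustration.

``` r
library(ggplot2)
library(palmerpenguins)

# Drop incomplete rows, as in the data preparation for this session
df <- na.omit(penguins)

# One base plot stored as an object (illustrative name)
p_base <- ggplot(df, aes(x = bill_length_mm, y = flipper_length_mm,
                         color = species)) +
  geom_point()

# The same plot with two different built-in themes
p_bw   <- p_base + theme_bw()
p_void <- p_base + theme_void()

p_bw
```

Because a theme is just another layer added with `+`, the data and geoms stay untouched and only the non-data elements (background, grid, axes) change.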
See: https://ggplot2.tidyverse.org/reference/ggtheme.html

---

## Alternative to being too colorful: facets

.pull-left[
``` r
ggplot(data = df,
       aes(x = bill_length_mm, y = flipper_length_mm,
           group = species, color = species)) +
  geom_smooth(color = "black", method = "lm", se = FALSE) +
  facet_wrap(~species, ncol = 3, nrow = 2) +
  papaja::theme_apa()
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />
]

---

## The `theme()` function in general

The most direct interface for manipulating your theme is the `theme()` function. Here you can change the appearance of:

- axis labels
- captions and titles
- legend
- grid layout
- the wrapping strips
- ...

---

## Example: changing the grid layout

.pull-left[
``` r
ggplot(data = df,
       aes(x = bill_length_mm, y = flipper_length_mm,
           group = species, color = species)) +
  geom_smooth(color = "black", method = "lm", se = FALSE) +
  facet_wrap(~species, ncol = 3, nrow = 2) +
  theme_bw() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "white")
  )
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />
]

---

## Example: changing axis labels

.pull-left[
``` r
ggplot(data = df,
       aes(x = bill_length_mm, y = flipper_length_mm,
           group = species, color = species)) +
  geom_smooth(color = "black", method = "lm", se = FALSE) +
  facet_wrap(~species, ncol = 3, nrow = 2) +
  theme_bw() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "white")
  ) +
  ylab("Flipper length [mm]") +
  xlab("Bill length [mm]")
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />
]

---

## A note on plotting options
.pull-left[
Working with combined aesthetics and different data inputs can become challenging.

In particular, plotting similar aesthetics that interfere with the automatic procedures can create conflicts. Some 'favorites' include:

- multiple legends
- various color scales for similar `geoms`
]

.pull-right[
<img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/800px-The_Scream.jpg?raw=true" style="display: block; margin: auto;" />
]

.right[
<small><small>Source: https://de.wikipedia.org/wiki/Der_Schrei#/media/File:The_Scream.jpg</small></small>
]

---

## `ggplot` plots are 'simple' objects

In contrast to standard `R` plots, `ggplot2` outputs are ordinary objects like any other object in `R` (internally, they are list-like). So we do not have to record a plot from a graphics device to re-use it later. We can store it in a variable and keep adding layers to it.

``` r
my_fancy_plot <- ggplot(df, aes(x = bill_length_mm, y = flipper_length_mm, color = species)) +
  geom_point()

my_fancy_plot <- my_fancy_plot + geom_smooth()
```

Additionally, there is no need to call `dev.off()`.

---

## It makes combining plots easy

By now, there are many packages that help to combine `ggplot2` plots fairly easily. For example, the [`cowplot` package](https://cran.r-project.org/web/packages/cowplot/index.html) provides a really flexible framework.

Yet, fiddling with this package can become quite complicated. A very easy-to-use package for combining `ggplot`s is the [`patchwork` package](https://cran.r-project.org/web/packages/patchwork/index.html).
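To give an impression of the `cowplot` route mentioned above, here is a minimal sketch. It assumes that `cowplot` and `palmerpenguins` are installed; the plot object names are made up for illustration.

``` r
library(ggplot2)
library(palmerpenguins)

# Drop incomplete rows, as in the data preparation for this session
df <- na.omit(penguins)

# Two independent plots (illustrative names)
p_scatter <- ggplot(df, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point()
p_mass <- ggplot(df, aes(x = body_mass_g)) +
  geom_histogram(bins = 30)

# plot_grid() arranges the plots in a grid and can add panel labels
combined <- cowplot::plot_grid(p_scatter, p_mass, ncol = 2, labels = c("A", "B"))

combined
```

Calling the function as `cowplot::plot_grid()` avoids attaching the whole package; the result is again a plot object that can be printed or saved.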
---

## Plotting side by side in one row

.pull-left[
``` r
library(patchwork)

p_hist <- ggplot(df, aes(x = body_mass_g)) +
  geom_histogram()

p_box <- ggplot(df, aes(y = body_mass_g)) +
  geom_boxplot()

p_density <- ggplot(df, aes(x = body_mass_g)) +
  geom_density()

(p_hist | p_box | p_density) +
  plot_layout(ncol = 3) +
  plot_annotation(title = "Body Mass: Three Views")
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

---

## Plotting in two rows

.pull-left[
``` r
p_hist / p_box
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]

---

## There's more

You can also annotate plots with titles, subtitles, captions, and tags. You can nest plots and introduce more complex layouts. Check out the [`patchwork` repository on *GitHub*](https://github.com/thomasp85/patchwork).

---

## Exporting ggplot graphics

Exporting `ggplot2` graphics is fairly easy with the `ggsave()` function. It detects the file format from the file extension. You can also define the plot height, width, and dpi, which is particularly useful to produce high-quality graphics for publications.

``` r
ggsave("nice_plot.png", p_hist, dpi = 300)
```

Or:

``` r
ggsave("nice_plot.tiff", p_hist, dpi = 300)
```

---

## Visual exploratory data analysis

.pull-left[
``` r
library(visdat)

vis_miss(df[, 1:5])
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

---

## Common ggplot2 Errors & How to Fix Them

Mistakes are learning opportunities! Let's explore common errors and their solutions.

---

## Mistake 1 ?

.pull-left[
``` r
ggplot(df, aes(x = bill_length_mm, y = flipper_length_mm, color = "blue")) + # Creates a legend!
  geom_point()
```
]

.pull-right[
``` r
ggplot(df, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point(color = "blue")
```
]

---

## Mistake 1: Result

<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/mistake1_demo_blind-1.png" style="display: block; margin: auto;" />

---

## Mistake 1: Aesthetics in Wrong Place

**The Error:** Putting data-dependent aesthetics outside `aes()` or static values inside `aes()`

.pull-left[
``` r
# WRONG: color inside aes() but as static string
ggplot(df, aes(x = bill_length_mm, y = flipper_length_mm, color = "blue")) + # Creates a legend!
  geom_point()
```
]

.pull-right[
``` r
# CORRECT: static color outside aes()
ggplot(df, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point(color = "blue")
```
]

**What happens:** `"blue"` is treated as data, creating one group called "blue". The points are then drawn in the default color of the color scale, not in blue.

---

## Mistake 1: Result

<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/mistake1_demo-1.png" style="display: block; margin: auto;" />

---

## Mistake 2: Forgetting + at Line End

**The Error:** Putting `+` at the beginning of a line instead of the end

.pull-left[
``` r
# WRONG: + at the start of the line
ggplot(df, aes(x = species, y = body_mass_g))
  + geom_boxplot()
  + theme_bw()
```
]

.pull-right[
``` r
# CORRECT: + at the end of the line
ggplot(df, aes(x = species, y = body_mass_g)) +
  geom_boxplot() +
  theme_bw()
```
]

**What happens:** R treats the first line as a complete expression. The next line then starts with a unary `+`, which is not a valid way to add a layer and raises an error.

---

## Mistake 3: Overlooking Removed Missing Values

**The Error:** Not noticing that `ggplot2` drops rows with missing values (it only issues a warning)

``` r
df_with_na <- penguins  # keep NAs

# This produces a warning but still plots!
ggplot(df_with_na, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point()
```

<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/mistake3_demo-1.png" width="40%" style="display: block; margin: auto;" />

**The Warning:** `Removed 2 rows containing missing values`

**Lesson:** Always check for missing values first!

---

## Mistake 4 ?

.pull-left[
``` r
ggplot(df, aes(x = factor(year), y = body_mass_g)) +
  geom_boxplot()
```
]

.pull-right[
``` r
ggplot(df, aes(x = year, y = body_mass_g)) +
  geom_boxplot()
```
]

---

## Mistake 4: Wrong Data Type

**The Error:** Treating continuous data as categorical or vice versa

.pull-left[
``` r
# CORRECT: convert to factor
ggplot(df, aes(x = factor(year), y = body_mass_g)) +
  geom_boxplot()
```
]

.pull-right[
``` r
# WRONG: year treated as continuous
ggplot(df, aes(x = year, y = body_mass_g)) +
  geom_boxplot()
# Warning: Continuous x aesthetic (did you forget aes(group = ...)?)
# Result: all years collapse into a single box
```
]

---

## Mistake 5: Overplotting

**The Error:** Too many points overlapping, hiding the true distribution

<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/mistake5_demo-1.png" style="display: block; margin: auto;" />

---

## Statistical Principles in Visualization

---

## Penguin Example

Let's create a synthetic example with penguin data patterns:

``` r
# Create synthetic data
set.seed(42)

# Three islands with different characteristics:
# island means rise together, but within an island mass falls with bill length
island_data <- bind_rows(
  tibble(
    island = "Torgersen",
    bill_length = rnorm(50, 35, 3),
    body_mass_base = rnorm(50, 3000, 200)
  ) %>%
    mutate(body_mass = body_mass_base - 20 * (bill_length - 35)),
  tibble(
    island = "Biscoe",
    bill_length = rnorm(50, 45, 3),
    body_mass_base = rnorm(50, 5000, 300)
  ) %>%
    mutate(body_mass = body_mass_base - 30 * (bill_length - 45)),
  tibble(
    island = "Dream",
    bill_length = rnorm(50, 40, 3),
    body_mass_base = rnorm(50, 4000, 250)
  ) %>%
    mutate(body_mass = body_mass_base - 25 * (bill_length - 40))
) %>%
  select(island, bill_length, body_mass)
```

---

## The 
Aggregate View

.pull-left[
``` r
# AGGREGATE: Looks like positive correlation!
ggplot(island_data, aes(x = bill_length, y = body_mass)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Aggregate View: Positive Correlation",
       subtitle = "Bill length vs Body Mass (All Islands Combined)")
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />
]

**Conclusion:** Longer bills = heavier penguins?

---

## The Stratified View

.pull-left[
``` r
# STRATIFIED: Negative correlation within each island!
ggplot(island_data, aes(x = bill_length, y = body_mass, color = island)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(title = "Stratified View: Negative Correlation", subtitle = "Within each island, longer bills = lighter penguins!") ``` ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> ] **Reality:** Within each island, longer bills = lighter penguins! The aggregate is confounded by island differences. --- ## Simpson's Paradox: When Aggregation Misleads **Definition:** A trend appears in different groups of data but disappears or reverses when combined. **Why it matters:** Visualization choices can reveal or hide this paradox! --- ## Simpson's Paradox: Side by Side <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/simpson_compare-1.png" style="display: block; margin: auto;" /> --- ## Exercise: Simpson's Paradox Hands-On **Task (15 min):** 1. Load the `mpg` dataset from ggplot2: `data(mpg)` 2. Create a scatter plot of `hwy` (highway mpg) vs `displ` (engine displacement) 3. Add a regression line (what's the trend?) 4. Now color by `class` (vehicle class) and add separate regression lines 5. **Question:** Does the overall trend match the within-class trends? 
**Hint:** Use `geom_smooth(method = "lm", se = FALSE)` --- ## Exercise: Simpson's Paradox Solution .pull-left[ ``` r data(mpg) # Step 1: Aggregate view p_mpg_agg <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") + labs(title = "Aggregate: Negative Trend", subtitle = "Larger engines = lower highway MPG") # Step 2: Stratified by class p_mpg_strat <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(title = "By Vehicle Class", subtitle = "Trend varies by class") p_mpg_agg + p_mpg_strat ``` ] .pull-right[ <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/simpson_exercise_solution-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ### Principle 1 ``` r # Create data showing effect of treatment treatment_data <- tibble( group = c("Control", "Treatment", "Control", "Treatment"), time = c("Before", "Before", "After", "After"), score = c(50, 52, 55, 70) ) p_treated <- ggplot(filter(treatment_data, time == "After"), aes(x = group, y = score, fill = group)) + geom_col() + labs(title = "Treatment") + theme_bw() p_treated ``` <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/baseline_demo_blind-1.png" width="50%" style="display: block; margin: auto;" /> --- ### Principle 1: The Importance of Baselines **Concept:** Always show the baseline or reference point for comparison. 
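The same comparison can be made numerically. A minimal sketch, assuming the `treatment_data` values from the previous slide; the column name `change` is introduced here for illustration only.

``` r
library(tidyverse)

# Same values as in treatment_data on the previous slide
treatment_data <- tibble(
  group = c("Control", "Treatment", "Control", "Treatment"),
  time  = c("Before", "Before", "After", "After"),
  score = c(50, 52, 55, 70)
)

# One row per group, with the baseline next to the follow-up score
baseline_change <- treatment_data %>%
  pivot_wider(names_from = time, values_from = score) %>%
  mutate(change = After - Before)

baseline_change
```

Looking only at the "After" scores hides that the control group also improved; the difference from baseline makes both changes visible.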
``` r
# GOOD: Showing change from baseline
# fct_relevel() puts "Before" first; a character x-axis is sorted alphabetically
p_good <- ggplot(treatment_data,
                 aes(x = fct_relevel(time, "Before"), y = score,
                     group = group, color = group)) +
  geom_line(linewidth = 1.5) +  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_point(size = 3) +
  labs(title = "GOOD: Change from baseline",
       subtitle = "Now we see both groups improved!",
       x = "time") +
  theme_bw()

p_treated + p_good
```

<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/baseline_demo-1.png" width="60%" style="display: block; margin: auto;" />

---

### Principle 2

``` r
set.seed(123)

uncertainty_data <- tibble(
  group = rep(c("A", "B", "C"), each = 30),
  value = c(rnorm(30, 50, 10), rnorm(30, 55, 15), rnorm(30, 60, 5))
)

p_means <- uncertainty_data %>%
  group_by(group) %>%
  summarise(mean_val = mean(value)) %>%
  ggplot(aes(x = group, y = mean_val, fill = group)) +
  geom_col() +
  labs(title = "What do we see?") +
  theme_bw()

p_means
```

<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/uncertainty_demo_blind-1.png" width="60%" style="display: block; margin: auto;" />

---

### Principle 2: Visualizing Uncertainty

**Concept:** Point estimates without intervals can be misleading.
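The interval behind such a plot can also be computed by hand. A minimal sketch, assuming a t-based 95% confidence interval per group, which is the kind of interval `mean_cl_normal` reports; the column names `n_obs`, `se`, `lower`, and `upper` are chosen here for illustration.

``` r
library(tidyverse)

# Same simulated data as on the previous slide
set.seed(123)
uncertainty_data <- tibble(
  group = rep(c("A", "B", "C"), each = 30),
  value = c(rnorm(30, 50, 10), rnorm(30, 55, 15), rnorm(30, 60, 5))
)

ci_summary <- uncertainty_data %>%
  group_by(group) %>%
  summarise(
    n_obs    = n(),
    mean_val = mean(value),
    se       = sd(value) / sqrt(n_obs),  # standard error of the mean
    .groups  = "drop"
  ) %>%
  mutate(
    t_crit = qt(0.975, df = n_obs - 1),  # two-sided 95% critical value
    lower  = mean_val - t_crit * se,
    upper  = mean_val + t_crit * se
  )

ci_summary
```

Groups with the same mean can have very different interval widths, because the width depends on both the spread of the values and the number of observations.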
``` r
# GOOD: With error bars
# mean_cl_normal needs the Hmisc package to be installed
p_error <- ggplot(uncertainty_data, aes(x = group, y = value, fill = group)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
  labs(title = "Means with 95% CI",
       subtitle = "Overlapping intervals: differences are uncertain") +
  theme_bw()

p_means + p_error
```

<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/uncertainty_demo-1.png" width="60%" style="display: block; margin: auto;" />

---

### Principle 3

``` r
trend_data <- tibble(
  year = 2010:2020,
  value = c(95, 94, 93, 94, 95, 96, 95, 94, 93, 92, 91)
)

p_trun <- ggplot(trend_data, aes(x = year, y = value)) +
  geom_line(linewidth = 1.5, color = "red") +
  geom_point(size = 3) +
  coord_cartesian(ylim = c(85, 100)) +
  theme_bw()

p_trun
```

<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/axis_manipulation_0-1.png" width="60%" style="display: block; margin: auto;" />

---

### Principle 3: Cherry-Picking & Axis Manipulation

**Concept:** How axis limits can change the story

``` r
# MANIPULATED: p_trun (previous slide) truncates the y-axis and exaggerates the change
# HONEST: Full y-axis
p_honest <- ggplot(trend_data, aes(x = year, y = value)) +
  geom_line(linewidth = 1.5, color = "blue") +
  geom_point(size = 3) +
  coord_cartesian(ylim = c(0, 100)) +  # Full range
  labs(title = "HONEST: y-axis 0-100",
       subtitle = "Actually a modest 4% change") +
  theme_bw()

p_trun + p_honest
```

<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/axis_manipulation-1.png" width="60%" style="display: block; margin: auto;" />

---

### Fix the Mistakes

**Task (10 min):** The following code contains 5 common ggplot2 mistakes. Fix them!

``` r
# BROKEN CODE - fix the mistakes!
library(ggplot2) ggplot(penguins, aes(x = species, y = body_mass_g)) + geom_violin() + aes(fill = "blue") + scale_fill_manual(values = c("red", "green", "blue")) + theme_minimal + labs(title = "Penguin Mass by Species") ``` **Hints:** - Check where the `+` signs are - Is `fill` being used correctly? - Did you call `theme_minimal()` as a function? - Are the manual fill values appropriate? --- ## Solution ``` r ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) + geom_violin() + scale_fill_brewer(palette = "Set2") + theme_minimal() + labs(title = "Penguin Mass by Species") ``` <img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/exercise3_solution-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Summary: Key Takeaways 1. **Visualization is integral to the research cycle** - not just the end product 2. **Learn from mistakes** - common errors teach us how ggplot2 thinks 3. **Simpson's Paradox** - always consider confounding variables and stratify 4. **Show uncertainty** - error bars, confidence intervals, distributions 5. **Avoid manipulation** - honest axes, full context, no cherry-picking 6. **The Datasaurus Dozen** - never trust summary statistics alone --- ## Some additional resources - [ggplot2 - Elegant Graphics for Data Analysis](https://www.springer.com/gp/book/9783319242750) by Hadley Wickham - [Chapter 3](https://r4ds.had.co.nz/data-visualisation.html) in *R for Data Science* - [Fundamentals of Data Visualization](https://serialmentor.com/dataviz/) by Claus O. Wilke - [Data Visualization - A Practical Introduction](https://press.princeton.edu/titles/13826.html) by Kieran Healy - [data-to-viz](https://www.data-to-viz.com/) - [R Graph Gallery](https://www.r-graph-gallery.com/) - [BBC Visual and Data Journalism cookbook for R graphics](https://bbc.github.io/rcookbook/#how_to_create_bbc_style_graphics) - [List of `ggplot2` extensions](https://exts.ggplot2.tidyverse.org/)