
Data Literacy: Introduction to R

Data Visualization - Part 1

Veronika Batzdorfer

2025-05-24

1 / 53

Content of the visualization sessions

Base R visualization

  • Standard plotting procedures in R
  • very short

tidyverse/ggplot2 visualization

  • Modern interface to graphics
  • grammar of graphics

There's more that we won't cover:

3 / 53
4 / 53

Data for this session

amount_topcountries <- stackoverflow_survey_single_response %>%
  filter(!is.na(country)) %>%
  group_by(country) %>%
  summarise(Amount = n()) %>%
  arrange(desc(Amount)) %>%
  head(20) %>% # Keep the 20 most frequent countries
  mutate(country = reorder(country, Amount)) %>% # Order factor levels by count
  ungroup()
amount_topcountries %>% head()
## # A tibble: 6 × 2
## country Amount
## <fct> <int>
## 1 United States of America 11095
## 2 Germany 4947
## 3 India 4231
## 4 United Kingdom of Great Britain and Northern Ireland 3224
## 5 Ukraine 2672
## 6 France 2110
5 / 53

Graphics in R

The graphics package is already included.

#-- Adjust plot margins--
par(mar = c(10, 4, 4, 2))
barplot(
height = amount_topcountries$Amount,
names.arg = amount_topcountries$country,
las = 2, # Rotate labels for readability
main = "Top 20 Countries by Count",
ylab = "Number of Records",
cex.names = 0.9, # Resize labels
border = NA # Remove border
)
#-- Reset margins after the plot
par(mar = c(5, 4, 4, 2))

6 / 53

Let's start from the beginning

The most basic function to plot in R is plot().

options(scipen = 999) # avoid scientific notation (e.g., 1e+05) in axis labels
plot_df <- stackoverflow_survey_single_response %>%
  filter(!is.na(converted_comp_yearly),
         !is.na(years_code))
plot(plot_df$years_code,
     plot_df$converted_comp_yearly)

7 / 53

Adding to the plot: titles & labels

plot(
jitter(df$years_coding, 2),
jitter(df$participation_frequency, 2),
pch = 1,
main =
"Relationship between Experience and Participation Frequency on SO",
xlab =
"Years of Experience",
ylab =
"SO Participation Frequency"
)

8 / 53

Where to go from here with base R graphics?

Using similar procedures, we can add more and more stuff to our plot or edit its elements:

  • regression lines
  • legends
  • annotations
  • colors
  • etc.

We can also create different plot types, such as

  • histograms
  • barplots
  • boxplots
  • densities
  • pie charts
  • etc.
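
As a minimal sketch of how such elements are added in base R, the following uses the built-in mtcars dataset (an assumption for illustration; it is not the survey data used in these slides). Each call after plot() draws onto the plot that is already open.

```r
# Illustrative data: the built-in mtcars dataset (not the survey data)
manual <- mtcars$am == 1

plot(
  mtcars$wt, mtcars$mpg,
  col  = ifelse(manual, "darkorange", "steelblue"), # colors by group
  pch  = 19,
  main = "Fuel Economy by Weight",
  xlab = "Weight (1000 lbs)",
  ylab = "Miles per Gallon"
)

# Regression line: fit the model first, then draw it onto the existing plot
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, lwd = 2)

# Legend and a text annotation
legend("topright", legend = c("Automatic", "Manual"),
       col = c("steelblue", "darkorange"), pch = 19)
text(x = 4.5, y = 30, labels = "Heavier cars use more fuel")
```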
9 / 53

Example: A simple boxplot

boxplot(
df$years_coding ~
df$participation_frequency
)

10 / 53

The par() and dev.off() functions for plots

par() sets graphical parameters and is called before the actual plotting function. It configures the graphics device in R. Its most common use is to tell the device how many plots to arrange on one page.

We can, e.g., use mfrow for specifying how many rows (the first value in the vector) and columns (the second value in the vector) we aim to plot.

par(mfrow = c(2, 2))

One caveat of using this function is that the layout stays active for all subsequent plots. Before generating another independent plot, we have to reset it, for example by turning off the device.

dev.off()
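
An alternative that keeps the device open, sketched here with the built-in mtcars dataset (an assumption for illustration): par() returns the previous values of the parameters it changes, so they can be stored and restored afterwards.

```r
# Switch to a 1 x 2 layout and remember the previous settings
old_par <- par(mfrow = c(1, 2))

hist(mtcars$mpg, main = "Histogram", xlab = "Miles per Gallon")
boxplot(mtcars$mpg, main = "Boxplot", ylab = "Miles per Gallon")

# Restore the previous layout without closing the device
par(old_par)
```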
11 / 53

Exporting plots

It's nice that R provides such pleasant plotting opportunities. However, to include them in our papers, we need to export them. As said in the beginning, numerous export formats are available in R.

12 / 53

Exporting with RStudio

13 / 53

Saving plots via a command

Alternatively, you can export plots with commands such as png(), pdf(), or jpeg(). To do so, wrap the plot call between one of those functions and a dev.off() call.

png("Plot.png")
hist(df$years_coding)
dev.off()
pdf("Plot.pdf")
hist(df$years_coding)
dev.off()
jpeg("Plot.jpeg")
hist(df$years_coding)
dev.off()
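
The device functions also accept size arguments. The sketch below uses pdf(), where width and height are given in inches, together with a temporary file and the built-in mtcars dataset (both are assumptions for illustration). png() and jpeg() take width, height, units, and res in a similar way.

```r
# Write to a temporary file so the example leaves nothing behind in the project
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 7, height = 5) # size in inches
hist(mtcars$mpg, main = "Miles per Gallon", xlab = "mpg")
dev.off() # the file is only complete once the device is closed

file.exists(out_file)
```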
14 / 53

What is ggplot2?

ggplot2 is another R package for creating plots and is part of the tidyverse.

It uses the grammar of graphics. Some things to note about ggplot2:

  • it is well-suited for multi-dimensional data
  • it expects data (frames) as input
  • components of the plot are added as layers
plot_call +
layer_1 +
layer_2 +
... +
layer_n
15 / 53

Barplots as in base R

ggplot(df , aes(x = age_group)) +
geom_bar()

18 / 53

Boxplots as in base R

ggplot(
df ,
aes(
x = as.factor(main_branch),
y = years_coding )
) +
geom_boxplot()

19 / 53

Components of a plot

According to Wickham (2010, 8)* a layered plot consists of the following components:

* http://dx.doi.org/10.1198/jcgs.2009.07098

  • data and aesthetic mappings,
  • geometric objects,
  • scales,
  • and facet specification
plot_call +
data +
aesthetics +
geometries +
scales +
facets
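
The schema above could look as follows in actual code. This is a minimal sketch using the built-in mtcars dataset (an assumption for illustration), with one line per component.

```r
library(ggplot2)

component_plot <-
  ggplot(data = mtcars) +                          # plot call and data
  aes(x = wt, y = mpg) +                           # aesthetic mappings
  geom_point() +                                   # geometric object
  scale_x_continuous(name = "Weight (1000 lbs)") + # scale
  facet_wrap(~cyl)                                 # facet specification

component_plot
```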
20 / 53

Data requirements

You can use one single data frame to create a plot in ggplot2. This creates a smooth workflow from data wrangling to the final presentation of the results.


Source: http://r4ds.had.co.nz

21 / 53

Why the long format? 🐴

ggplot2 prefers data in long format (NB: of course, only if this is possible and makes sense for the data set at hand)

  • every element we aim to plot is an observation
  • there is no need to work out how a specific variable relates to an observation
  • most importantly, the long format is more parsimonious
  • it requires less memory and less disk space
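
To illustrate the difference, here is a small sketch that reshapes a hypothetical wide data frame (wide_df and its columns are made up for this example) into long format with tidyr::pivot_longer() and then plots it.

```r
library(ggplot2)
library(tidyr)

# Hypothetical wide data: one row per person, one column per measurement year
wide_df <- data.frame(
  id         = 1:3,
  score_2023 = c(10, 12, 9),
  score_2024 = c(11, 15, 8)
)

# Long format: one row per person and year
long_df <- pivot_longer(
  wide_df,
  cols         = c(score_2023, score_2024),
  names_to     = "year",
  names_prefix = "score_",
  values_to    = "score"
)

# Year and score are now ordinary variables that can be mapped to aesthetics
long_plot <- ggplot(long_df, aes(x = year, y = score, group = id, color = factor(id))) +
  geom_line() +
  geom_point()

long_plot
```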
22 / 53

Before we start

The architecture of building plots in ggplot is similar to standard R graphics. There is an initial plotting call, and subsequently, more stuff is added to the plot.

However, in base R, it is sometimes tricky to find out how to add (or remove) certain plot elements. For example, think of removing the axis ticks in the scatter plot.

We will systematically explore which elements are used in ggplot in this session.
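
For the axis tick example, a rough comparison might look like this (mtcars is used as stand-in data, which is an assumption). Note that in base R, xaxt = "n" and yaxt = "n" suppress the whole axis, including its labels, whereas the ggplot2 theme setting removes only the tick marks.

```r
library(ggplot2)

# Base R: suppress both axes (ticks and tick labels) when drawing
plot(mtcars$wt, mtcars$mpg, xaxt = "n", yaxt = "n")

# ggplot2: switch off only the tick marks through the theme
no_ticks <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme(axis.ticks = element_blank())

no_ticks
```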

23 / 53

Creating your own plot

We do not want to give a lecture on the theory behind data visualization (if you want that, we suggest having a look at the excellent book Fundamentals of Data Visualization by Claus O. Wilke).

Creating plots is all about practice... and 'borrowing' code from others.

Three components are important:

  • Plot initiation and data input
  • aesthetics definition
  • so-called geoms
24 / 53

Plot initiation

Now, let's start from the beginning and have a closer look at the grammar of graphics.

ggplot() is the most basic command to create a plot:

ggplot()

But it doesn't show anything...

25 / 53

What now? Data input!

ggplot(data = df )

Still nothing there...

26 / 53

aesthetics!

ggplot requires information about the variables to plot.

ggplot(data = df ) +
aes(x = years_coding, y = yearly_compensation)

That's a little bit better, right?

27 / 53

geoms!

Finally, ggplot needs information on how to plot the variables.

ggplot(data = df ) +
aes(x = years_coding, y = yearly_compensation) +
geom_point()

A scatter plot!

28 / 53

Add a fancy geom

We can also add more than one geom.

ggplot(data = df) +
aes(x = years_coding, y = yearly_compensation) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE)

A regression line! (without confidence intervals; the regression behind this operation is run automatically)
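
To make the "automatic" regression visible, the sketch below fits the same model explicitly with lm() and overlays it as a dashed line. It uses the built-in mtcars dataset (an assumption for illustration); with the default formula y ~ x, both lines should coincide.

```r
library(ggplot2)

# The model that geom_smooth(method = "lm") fits with formula y ~ x
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

lm_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Overlay the explicitly fitted line for comparison
  geom_abline(
    intercept = coef(fit)[["(Intercept)"]],
    slope     = coef(fit)[["wt"]],
    linetype  = "dashed",
    color     = "red"
  )

lm_plot
```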

29 / 53

Going further: adding group aesthetics

We can fit separate lines for different groups in our data.

df %>%
filter(!is.na(ai_threat)) %>%
ggplot(aes(
x = years_coding,
y = participation_frequency,
group = main_branch
)) +
geom_smooth(method = "lm", se = FALSE)

30 / 53

Manipulating group aesthetics

We can also change the colors that are used in the plot.

df %>%
filter(!is.na(ai_threat)) %>%
ggplot(
aes(
x = years_coding,
y = participation_frequency,
group = main_branch,
color = main_branch)) +
geom_smooth(method = "lm", se = FALSE)

The legend is drawn automatically, that's handy!

31 / 53

Using another color palette

df %>%
filter(!is.na(ai_threat)) %>%
ggplot(
aes(
x = years_coding,
y = participation_frequency,
group = main_branch,
color = main_branch)) +
geom_smooth(method = "lm", se = FALSE) +
scale_color_brewer(
palette = "Dark2"
)

32 / 53

Difference between color and fill

Notably, there are two components of the plot or geom associated with colors: color and fill.

Generally, color refers to the geometry borders, such as a line. fill refers to a geometry area, such as a polygon.

Keep this in mind when choosing between scale_color_brewer and scale_fill_brewer in your plots.
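
A minimal sketch of the difference, using the built-in mtcars dataset (an assumption for illustration): the bar area is controlled by fill and its scale, while color only affects the outline.

```r
library(ggplot2)

fill_vs_color <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar(color = "black") +            # color: bar outline (fixed value)
  scale_fill_brewer(palette = "Dark2") + # fill: bar area (mapped to cyl)
  labs(x = "Cylinders", fill = "Cylinders")

fill_vs_color
```

Using scale_color_brewer() here instead would have no visible effect, because no variable is mapped to color.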

33 / 53

Colors and themes

One particular strength of ggplot2 lies in its immense theming capabilities. The package has some built-in theme functions that make theming a plot fairly easy, e.g.,

  • theme_bw()
  • theme_apa() (provided by the papaja package, not by ggplot2 itself)
  • theme_void()
  • etc.

See: https://ggplot2.tidyverse.org/reference/ggtheme.html

34 / 53

Alternative to being too colorful: facets

df %>%
filter(!is.na(ai_threat)) %>%
ggplot(
aes(
x = years_coding,
y = participation_frequency)) +
geom_smooth(
color = "black",
method = "lm",
se = FALSE
) +
facet_wrap(~main_branch, ncol = 3, nrow=2) +
papaja::theme_apa()

35 / 53

The theme() function in general

The most direct interface for manipulating your theme is the theme() function. Here you can change the appearance of:

  • axis labels
  • captions and titles
  • legend
  • grid layout
  • the wrapping strips
  • ...
36 / 53

Example: changing the grid layout & strip background

df %>%
filter(!is.na(ai_threat)) %>%
ggplot(
aes(
x = years_coding,
y = participation_frequency)) +
geom_smooth(
color = "black",
method = "lm",
se = FALSE
) +
facet_wrap(~main_branch, ncol = 3, nrow = 2) +
theme_bw()+
theme(
panel.grid.major =
element_blank(),
panel.grid.minor =
element_blank(),
strip.background =
element_rect(fill = "white")
)

37 / 53

Example: changing axis labels

df %>%
filter(!is.na(ai_threat)) %>%
ggplot(
aes(
x = years_coding,
y = participation_frequency)) +
geom_smooth(
color = "black",
method = "lm",
se = FALSE
) +
facet_wrap(~main_branch, ncol = 3, nrow = 2) +
theme_bw()+
theme(
panel.grid.major =
element_blank(),
panel.grid.minor =
element_blank(),
strip.background =
element_rect(fill = "white")
) +
ylab("Participation Frequency") +
xlab("Years of Coding Experience")

38 / 53

A note on plotting options

Working with combined aesthetics and different data inputs can become challenging.

Particularly, plotting similar aesthetics which interfere with the automatic procedures can create conflicts.

Some 'favorites' include:

  • Multiple legends
  • and various color scales for similar geoms

39 / 53

ggplot plots are 'simple' objects

In contrast to standard R plots, ggplot2 outputs are standard objects like any other object in R (they are lists). So there is no graphics device involved from which we have to record our plot to re-use it later. We can just use it directly.

my_fancy_plot <-
ggplot(data = df,
aes(
x = years_coding,
y = participation_frequency
)
) +
geom_point()
my_fancy_plot <-
my_fancy_plot +
geom_smooth()

Additionally, there is also no need to call dev.off()
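
A small sketch of what this means in practice, using the built-in mtcars dataset (an assumption for illustration): the plot can be stored, inspected, and extended like any other object, and it is only drawn when printed.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
length(base_plot$layers) # number of layers in the stored plot

# Adding a geom returns a new object; base_plot itself is unchanged
extended_plot <- base_plot + geom_smooth(method = "lm", formula = y ~ x)
length(extended_plot$layers)

# Nothing is drawn until the object is printed
print(extended_plot)
```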

40 / 53

It makes combining plots easy

By now, there are many packages that help to combine ggplot2 plots fairly easily. For example, the cowplot package provides a really flexible framework.

Yet, fiddling with this package can become quite complicated. A very easy-to-use package for combining ggplots is the patchwork package.

41 / 53

Plotting side by side in one row

library(patchwork)
my_barplot <-
ggplot(
df ,
aes(x = years_coding)
) +
geom_bar()
my_boxplot <-
ggplot(
df ,
aes(y = years_pro_coding)
) +
geom_boxplot()
my_barplot | my_boxplot

42 / 53

Plotting in two rows (stacked)

my_barplot / my_boxplot

43 / 53

There's more

You can also annotate plots with titles, subtitles, captions, and tags.

You can nest plots and introduce more complex layouts.

If you're interested in this, you should check out the patchwork repository on GitHub as everything is really well-documented there.
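
As a brief sketch of nesting and annotation with patchwork (the three plots and the mtcars data are assumptions for illustration): parentheses group plots, | places them side by side, / stacks them, and plot_annotation() adds a common title and panel tags.

```r
library(ggplot2)
library(patchwork)

p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p_bar     <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()
p_box     <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

# Two plots in the top row, one plot spanning the bottom row
combined <- (p_scatter | p_bar) / p_box +
  plot_annotation(title = "Three views of mtcars", tag_levels = "A")

combined
```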

44 / 53

Exporting ggplot graphics

Exporting ggplot2 graphics is fairly easy with the ggsave() function. It automatically detects the file format from the file extension. You can also define the plot height, width, and dpi, which is particularly useful for producing high-quality graphics for publications.

nice_plot <-
ggplot(
df ,
aes(x = years_coding)
) +
geom_bar()
ggsave("nice_plot.png", nice_plot, dpi = 300)

Or:

ggsave("nice_plot.tiff", nice_plot, dpi = 300)
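
A sketch of setting the size explicitly, using a temporary file and the built-in mtcars dataset (both are assumptions for illustration). The width, height, and units arguments fix the physical size, and dpi then determines the pixel dimensions of raster formats such as PNG.

```r
library(ggplot2)

size_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Temporary file so the example leaves nothing behind in the project
out_file <- tempfile(fileext = ".png")

ggsave(out_file, plot = size_plot,
       width = 16, height = 10, units = "cm", dpi = 300)

file.exists(out_file)
```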
45 / 53

Visual exploratory data analysis

In the session on Exploratory Data Analysis (EDA), we said that visualization should be part of EDA. We can use ggplot2 for this, but there are also many packages that offer helpful visualization functions. We will look at two of those in the following: visdat (for visualizing missing data patterns) and GGally (for visualizing correlations). Many of these packages build on ggplot2, so their output can be further customized or extended using ggplot2 or its extension packages.
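
For the correlation part, a minimal sketch with GGally could look like this. It uses ggcorr() on a few numeric columns of the built-in mtcars dataset (the data and column choice are assumptions for illustration).

```r
library(GGally)

# Correlation matrix plot for a handful of numeric variables
corr_plot <- ggcorr(mtcars[, c("mpg", "hp", "wt", "qsec")], label = TRUE)

corr_plot
```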

46 / 53

Plotting the structure of missing data

library(visdat)
vis_miss(df[, 18:23])

47 / 53

Fancier barplots: Relative frequencies

library(scales)
df %>%
ggplot(
aes(
x = age_group,
fill = age_group # Fill bars with color based on age group
)
) +
geom_bar( # Create a bar plot
aes(
y = after_stat(count / sum(count)) # Compute relative frequencies (proportions); after_stat() replaces the deprecated ..count.. notation
)
) +
scale_y_continuous(
labels = percent # Format y-axis labels as percentages
) +
ylab("Relative Frequencies")+
theme_classic() +
theme(legend.position = "none") # Hide the legend

48 / 53

Fancier barplots: Percentages & formatting

df %>%
filter(!is.na(ai_complexity)) %>%
ggplot(aes(x = ai_complexity,
fill = ai_complexity)) +
geom_bar(aes(y = after_stat(count / sum(count)))) +
scale_y_continuous(labels = percent,
expand = expansion(mult = c(0, 0.1))) +
ylab("Relative Frequencies") +
xlab("")+
theme_classic()+
theme(legend.position = "none")
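
The relative frequencies can also be computed before plotting, which makes them easy to inspect. The sketch below does this with dplyr::count() and geom_col() on the built-in mtcars dataset (the data and the names cyl_share and share are assumptions for illustration).

```r
library(dplyr)
library(ggplot2)
library(scales)

# Compute the share of each category up front
cyl_share <- mtcars %>%
  count(cyl) %>%
  mutate(share = n / sum(n))

# geom_col() plots the precomputed values as they are
share_plot <- ggplot(cyl_share, aes(x = factor(cyl), y = share, fill = factor(cyl))) +
  geom_col() +
  scale_y_continuous(labels = percent,
                     expand = expansion(mult = c(0, 0.1))) +
  ylab("Relative Frequencies") +
  xlab("Cylinders") +
  theme_classic() +
  theme(legend.position = "none")

share_plot
```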

49 / 53

Do those who participate consider themselves members?

survey <- stackoverflow_survey_single_response %>%
mutate(
so_comm = recode(as.character(so_comm),
`1` = "Neutral",
`2` = "No, not at all",
`3` = "No, not really",
`4` = "Not sure",
`5` = "Yes, definitely",
`6` = "Yes, somewhat"),
so_part_freq = recode(as.character(so_part_freq),
`1` = "A few times per month or weekly",
`2` = "A few times per week",
`3` = "Daily or almost daily",
`4` = "I have never participated in Q&A on Stack Overflow",
`5` = "Less than once per month or monthly",
`6` = "Multiple times per day")
)
50 / 53
library(scales)
library(ggrepel)
survey %>%
select(country, so_comm, so_part_freq) %>%
filter(!is.na(country), !is.na(so_comm), !is.na(so_part_freq)) %>%
group_by(country) %>%
summarise(
ConsiderMember = mean(so_comm %in% c("Yes, definitely", "Yes, somewhat"), na.rm = TRUE),
# %in% returns TRUE/FALSE for each respondent; the mean of that logical vector is the share of TRUE values within each country
particip = mean(so_part_freq %in% c(
"Multiple times per day",
"Daily or almost daily",
"A few times per week",
"A few times per month or weekly"), na.rm = TRUE),
n = n()
) %>%
filter(n > 700) %>%
ggplot(aes(particip, ConsiderMember, label = country, col = ConsiderMember)) +
geom_text_repel(size = 3, point.padding = 0.25) +
geom_point(aes(size = n), alpha = 1) +
scale_y_continuous(labels = percent_format()) +
scale_x_continuous(labels = percent_format()) +
scale_size_continuous(labels = comma_format()) +
scale_color_gradientn(colors = viridis::viridis(50))+
theme_minimal() +
theme(legend.position = "none") +
labs(
x = "% who participated at least weekly",
y = "% who consider themselves part of the SO community",
title = "Community Membership by Country and Stack Overflow Participation"
)

51 / 53

52 / 53