class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Data Visualization - Part 1
]
.author[
### Veronika Batzdorfer
]
.date[
### 2024-11-22
]

---
layout: true

---

## We'll start rather basic

<img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/trump.jpg?raw=true" width="85%" style="display: block; margin: auto;" />

.footnote[https://twitter.com/katjaberlin/status/1290667772779913218]

---

## Content of the visualization sessions

.pull-left[
**`Base R` visualization**
- Standard plotting procedures in R
- very short
]

.pull-right[
**`tidyverse`/`ggplot2` visualization**
- Modern interface to graphics
- grammar of graphics
]

There's more that we won't cover:
- [`lattice`](https://cran.r-project.org/web/packages/lattice/index.html) plots, for example

---

### Data for this session

```r
library(tidyverse)
library(dplyr)

stackoverflow_survey_questions <- read_csv("./data/stackoverflow_survey_questions.csv")
stackoverflow_survey_single_response <- read_csv("./data/stackoverflow_survey_single_response.csv")
qname_levels_single_response_crosswalk <- read_csv("./data/qname_levels_single_response_crosswalk.csv")

df <- stackoverflow_survey_single_response %>%
  rename(
    response_id = response_id,
    main_branch = main_branch,
    age_group = age,
    remote_work = remote_work,
    education_level = ed_level,
    years_coding = years_code,
    years_pro_coding = years_code_pro,
    dev_type = dev_type,
    org_size = org_size,
    purchase_influence = purchase_influence,
    build_vs_buy = buildvs_buy,
    visit_frequency = so_visit_freq,
    has_account = so_account,
    participation_frequency = so_part_freq,
    community_belief = so_comm,
    ai_usage = ai_select,
    ai_sentiment = ai_sent,
    ai_acc = ai_acc,
    ai_complexity = ai_complex,
    ai_threat = ai_threat,
    survey_length = survey_length,
    survey_ease = survey_ease,
    yearly_compensation = converted_comp_yearly,
    r_used = r_used,
    r_want_to_use = r_want_to_use
  )
```

---

```r
# Recode main_branch
df <- df %>%
  mutate(
    main_branch = case_when(
      main_branch == 1 ~ "Developer by profession",
      main_branch == 2 ~ "Learning to code",
      main_branch == 3 ~ "Not primarily a developer",
      main_branch == 4 ~ "Hobbyist",
      main_branch == 5 ~ "Former developer",
      TRUE ~ as.character(main_branch)
    )
  )

# Recode age_group
df <- df %>%
  mutate(
    age_group = case_when(
      age_group == 1 ~ "18-24",
      age_group == 2 ~ "25-34",
      age_group == 3 ~ "35-44",
      age_group == 4 ~ "45-54",
      age_group == 5 ~ "55-64",
      age_group == 6 ~ "65+",
      age_group == 7 ~ "Prefer not to say",
      age_group == 8 ~ "Under 18",
      TRUE ~ as.character(age_group)
    )
  )

# Recode remote_work
df <- df %>%
  mutate(
    remote_work = case_when(
      remote_work == 1 ~ "Hybrid",
      remote_work == 2 ~ "In-person",
      remote_work == 3 ~ "Remote",
      TRUE ~ as.character(remote_work)
    )
  )

# Recode education_level
df <- df %>%
  mutate(
    education_level = case_when(
      education_level == 1 ~ "Associate degree",
      education_level == 2 ~ "Bachelor’s degree",
      education_level == 3 ~ "Master’s degree",
      education_level == 4 ~ "Primary/elementary school",
      education_level == 5 ~ "Professional degree",
      education_level == 6 ~ "Secondary school",
      education_level == 7 ~ "Some college",
      education_level == 8 ~ "Other",
      TRUE ~ as.character(education_level)
    )
  )

# Recode so_visit_freq
df <- df %>%
  mutate(
    visit_frequency = case_when(
      visit_frequency == 1 ~ "A few times per month or weekly",
      visit_frequency == 2 ~ "A few times per week",
      visit_frequency == 3 ~ "Daily or almost daily",
      visit_frequency == 4 ~ "Less than once per month or monthly",
      visit_frequency == 5 ~ "Multiple times per day",
      TRUE ~ as.character(visit_frequency)
    )
  )

df <- df %>%
  mutate(
    ai_sentiment = case_when(
      ai_sentiment == 1 ~ "Favorable",
      ai_sentiment == 2 ~ "Indifferent",
      ai_sentiment == 3 ~ "Unfavorable",
      ai_sentiment == 4 ~ "Unsure",
      ai_sentiment == 5 ~ "Very favorable",
      ai_sentiment == 6 ~ "Very unfavorable",
      TRUE ~ as.character(ai_sentiment)
    )
  )

# Recode ai_threat
df <- df %>%
  mutate(
    ai_threat = case_when(
      ai_threat == 1 ~ "I'm not sure",
      ai_threat == 2 ~ "No",
      ai_threat == 3 ~ "Yes",
      TRUE ~ as.character(ai_threat)
    )
  )

# Recode ai_complex
df <- df %>%
  mutate(
    ai_complexity = case_when(
      ai_complexity == 1 ~ "Bad at complex tasks",
      ai_complexity == 2 ~ "Good",
      ai_complexity == 3 ~ "Neither ",
      ai_complexity == 4 ~ "Very poor at complex tasks",
      ai_complexity == 5 ~ "Very well at complex tasks",
      TRUE ~ as.character(ai_complexity)
    )
  )

# Recode ai_acc
df <- df %>%
  mutate(
    ai_acc = case_when(
      ai_acc == 1 ~ "Highly distrust",
      ai_acc == 2 ~ "Highly trust",
      ai_acc == 3 ~ "Neither trust nor distrust",
      ai_acc == 4 ~ "Somewhat distrust",
      ai_acc == 5 ~ "Somewhat trust",
      TRUE ~ as.character(ai_acc)
    )
  )
```

---

## Graphics in `R`

The `graphics` package is already included for that.

.pull-left[
```r
main_branch_counts <- table(df$main_branch)

# Adjust plot margins to ensure labels fit
par(mar = c(8, 4, 4, 2)) # Increase bottom margin

# Create the bar plot with smaller font size
barplot(
  main_branch_counts,
  main = "Distribution of Main Branch (Skill Level)",
  xlab = "",
  ylab = "Count",
  col = "darkgreen",
  border = "black",
  las = 2,
  cex.names = 0.8,
  cex.main = 0.9,
  cex.lab = 0.9,
  cex.axis = 0.8
)

# Reset margins after the plot
par(mar = c(5, 4, 4, 2))
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]

---

## Ok, but let's start from the beginning

The most basic function to plot in R is `plot()`.
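As a quick illustration on built-in data (using the `cars` data set is an assumption made here; it is not part of the survey data), `plot()` draws different things depending on what it receives. A minimal sketch:

```r
# Built-in example data: 50 observations of speed and stopping distance
data(cars)

# One numeric vector: values are plotted against their index (1, 2, 3, ...)
plot(cars$dist)

# Two numeric vectors: a scatter plot of the second against the first
plot(cars$speed, cars$dist)
```

The same logic applies to the survey data on this slide: a single column gives an index plot, two columns give a scatter plot.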
.pull-left[
```r
options(scipen = 999)
plot(df$yearly_compensation)
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---

#### Adding to the plot: titles & labels

.pull-left[
```r
plot(
  jitter(df$years_coding, 2),
  jitter(df$participation_frequency, 2),
  pch = 1,
  main = "Relationship Experience and Participation Frequency on SO",
  xlab = "Year of Experience",
  ylab = "SO Participation Frequency"
)
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

---

## Where to go from here with `base R` graphics?

.pull-left[
Using similar procedures, we can add more and more stuff to our plot or edit its elements:
- regression lines
- legends
- annotations
- colors
- etc.
]

.pull-right[
We can also create different *plot types*, such as
- histograms
- barplots
- boxplots
- densities
- pie charts
- etc.
]

---

## Example: A simple boxplot

.pull-left[
```r
boxplot(
  df$years_coding ~
    df$participation_frequency
)
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

---

## The `par()` and `dev.off()` functions for plots

`par()` sets graphical parameters and is called before the actual plotting function. It prepares the graphics device in `R`. One of its most common uses is "telling" the device that 2, 3, 4, or `x` plots have to be drawn on the same page.

We can, e.g., use `mfrow` to specify the number of rows (the first value in the vector) and columns (the second value in the vector) of the plot grid.

```r
par(mfrow = c(2, 2))
```

One caveat of using this function is that the setting stays active: we have to actively turn off the device before generating another independent plot.
```r
dev.off()
```

---

## Exporting plots

It's nice that `R` provides such pleasant plotting opportunities. However, to include the plots in our papers, we need to export them. As said in the beginning, numerous export formats are available in `R`.

---

## Exporting with *RStudio*

<img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/saveGraphic.PNG?raw=true" style="display: block; margin: auto;" />

---

## Saving plots via a command

Alternatively, you can export plots with commands such as `png()`, `pdf()`, or `jpeg()`. To do so, wrap the plot call between one of those functions and a `dev.off()` call.

```r
png("Plot.png")
hist(df$years_coding)
dev.off()
```

```r
pdf("Plot.pdf")
hist(df$years_coding)
dev.off()
```

```r
jpeg("Plot.jpeg")
hist(df$years_coding)
dev.off()
```

---

## A personal note on `base R` plotting

To be honest: I do not use the other `base R` plotting functions that often. The syntax is sometimes cumbersome with all the `par()` or `dev.off()` calls, and manipulating parameters simply feels somewhat "outdated".

In the following, we will turn towards more modern techniques using the `ggplot2` package. Yet, we still believe that it is worthwhile to become comfortable with `base R` plotting since `ggplot2` may sometimes be "too much" for quick and simple data exploration.

**As so often, in the end, it's also a matter of taste.**

---

## What is `ggplot2`?

`ggplot2` is another `R` package for creating plots and is part of the `tidyverse`. It uses the *grammar of graphics*. Some things to note about `ggplot2`:
- it is well-suited for multi-dimensional data
- it expects data (frames) as input
- components of the plot are added as layers

```r
plot_call +
  layer_1 +
  layer_2 +
  ... +
  layer_n
```

---

## `ggplot2` examples

.pull-left[
<img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/143_radar_chart_multi_indiv_2.png?raw=true" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/21_ggplot2_ddensity_plot.png?raw=true" style="display: block; margin: auto;" />
]

<small><small>Sources: https://www.r-graph-gallery.com/wp-content/uploads/2016/05/143_radar_chart_multi_indiv_2.png and https://www.r-graph-gallery.com/wp-content/uploads/2015/09/21_ggplot2_ddensity_plot.png</small></small>

---

## `ggplot2` examples

.pull-left[
<img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/51_scatterplot_linear_model_with_CI_ggplot2.png?raw=true" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/328_Hexbin_map_USA_4.png?raw=true" style="display: block; margin: auto;" />
]

<small><small>Sources: https://www.r-graph-gallery.com/wp-content/uploads/2015/11/51_scatterplot_linear_model_with_CI_ggplot2-300x300.png and https://www.r-graph-gallery.com/wp-content/uploads/2017/12/328_Hexbin_map_USA_4-300x200.png</small></small>

---

## Barplots as in `base R`

.tinyish[
.pull-left[
```r
ggplot(df, aes(x = age_group)) +
  geom_bar()
```
]
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

---

## Boxplots as in `base R`

.tinyish[
.pull-left[
```r
ggplot(
  df,
  aes(
    x = as.factor(main_branch),
    y = years_coding
  )
) +
  geom_boxplot()
```
]
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]

---

## Components of a plot

According to Wickham (2010, 8)* a layered plot consists of the following components:

<span class="footnote"> <small><small><span class="red bold">*</span> http://dx.doi.org/10.1198/jcgs.2009.07098</small></small> </span>

- data and aesthetic mappings,
- geometric objects,
- scales,
- and facet specification

```r
plot_call +
  data +
  aesthetics +
  geometries +
  scales +
  facets
```

---

## Data requirements

You can use one single data frame to create a plot in `ggplot2`. This creates a smooth workflow from data wrangling to the final presentation of the results.

<br>

<img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/data-science_man.png?raw=true" width="65%" style="display: block; margin: auto;" />

<small><small>Source: http://r4ds.had.co.nz</small></small>

---

## Why the long format? 🐴

`ggplot2` prefers data in long format (**NB**: of course, only if this is possible and makes sense for the data set at hand).

.pull-left[
We may want to get used to it as this format has some benefits:
- every element we aim to plot is an observation
- there is no need to think about how a specific variable relates to an observation
- most importantly, the long format is more parsimonious
- it requires less memory and less disk space
]

.pull-right[
<img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/long.png?raw=true" width="40%" style="display: block; margin: auto;" />

<small><small>Source: https://github.com/gadenbuie/tidyexplain#tidy-data</small></small>
]

---

## Before we start

The architecture of building plots in `ggplot` is similar to standard `R` graphics. There is an initial plotting call, and subsequently, more elements are added to the plot.

However, in `base R`, it is sometimes tricky to find out how to add (or remove) certain plot elements. For example, think of removing the axis ticks in the scatter plot.

We will systematically explore which elements are used in `ggplot` in this session.
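To make the axis-tick example concrete, here is a minimal sketch of how ticks can be removed in `ggplot2`. It uses the built-in `mtcars` data as a stand-in for the survey data (an assumption made for illustration), and the object name `p_no_ticks` is hypothetical.

```r
library(ggplot2)

# mtcars ships with R; it only stands in for the survey data here
p_no_ticks <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # theme() controls non-data elements; element_blank() removes one entirely
  theme(axis.ticks = element_blank())

p_no_ticks
```

The point of the sketch is that every plot element has a name in `theme()`, so removing one is a matter of setting it to `element_blank()`.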
---

## Creating your own plot

We do not want to give a lecture on the theory behind data visualization (if you want that, we suggest having a look at the excellent book [*Fundamentals of Data Visualization*](https://serialmentor.com/dataviz/) by Claus O. Wilke).

Creating plots is all about practice... and 'borrowing' code from others.

Three components are important:
- plot initiation and data input
- aesthetics definition
- so-called geoms

---

## Plot initiation

Now, let's start from the beginning and have a closer look at the *grammar of graphics*.

.pull-left[
`ggplot()` is the most basic command to create a plot:

```r
ggplot()
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
]

**But it doesn't show anything...**

---

## What now? Data input!

.pull-left[
```r
ggplot(data = df)
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]

**Still nothing there...**

---

## `aes`thetics!

.pull-left[
`ggplot` requires information about the variables to plot.

```r
ggplot(data = df) +
  aes(x = years_coding, y = yearly_compensation)
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]

**That's a little bit better, right?**

---

## `geom`s!

.pull-left[
Finally, `ggplot` needs information on *how* to plot the variables.

```r
ggplot(data = df) +
  aes(x = years_coding, y = yearly_compensation) +
  geom_point()
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]

**A scatter plot!**

---

## Add a fancy `geom`

.pull-left[
We can also add more than one `geom`.
```r
ggplot(data = df) +
  aes(x = years_coding, y = yearly_compensation) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE)
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]

**A regression line!** (without confidence intervals; the regression behind this operation is run automatically)

---

## Going further: adding group `aes`thetics

.pull-left[
We can draw a separate regression line for each group in our data.

```r
df %>%
  filter(!is.na(ai_threat)) %>%
  ggplot(aes(
    x = years_coding,
    y = participation_frequency,
    group = ai_threat
  )) +
  geom_smooth(method = "lm", se = FALSE)
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
]

---

## Manipulating group `aes`thetics

.pull-left[
We can also map the groups to colors so that the lines can be told apart.

```r
df %>%
  filter(!is.na(ai_threat)) %>%
  ggplot(
    aes(
      x = years_coding,
      y = participation_frequency,
      group = ai_threat,
      color = ai_threat
    )
  ) +
  geom_smooth(method = "lm", se = FALSE)
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />
]

The legend is drawn automatically. That's handy!

---

## Using another color palette

.pull-left[
```r
df %>%
  filter(!is.na(ai_threat)) %>%
  ggplot(
    aes(
      x = years_coding,
      y = participation_frequency,
      group = ai_threat,
      color = ai_threat
    )
  ) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(
    palette = "Dark2"
  )
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />
]

---

## Difference between `color` and `fill`

Notably, there are two components of the plot or `geom` associated with colors: `color` and `fill`.
Generally, `color` refers to the geometry borders, such as a line. `fill` refers to a geometry area, such as a polygon.

Keep this difference in mind when you use `scale_color_brewer` or `scale_fill_brewer` in your plots. Manipulating these colors and their corresponding legends in an elaborate plot can get a little tricky.

---

## Colors and `theme`s

One particular strength of `ggplot2` lies in its immense theming capabilities.

The package has some built-in theme functions that make theming a plot fairly easy, e.g.,
- `theme_bw()`
- `theme_dark()`
- `theme_void()`
- etc.

See: https://ggplot2.tidyverse.org/reference/ggtheme.html

---

## Alternative to being too colorful: facets

.pull-left[
```r
df %>%
  filter(!is.na(ai_threat)) %>%
  ggplot(
    aes(
      x = years_coding,
      y = participation_frequency
    )
  ) +
  geom_smooth(
    color = "black",
    method = "lm",
    se = FALSE
  ) +
  facet_wrap(~ai_threat, ncol = 3) +
  theme_light()
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />
]

---

## The `theme()` function in general

The most direct interface for manipulating your theme is the `theme()` function. Here you can change the appearance of:
- axis labels
- captions and titles
- legend
- grid layout
- the wrapping strips
- ...
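A minimal sketch of how several of these elements can be set in one `theme()` call. It assumes the built-in `mtcars` data instead of the survey data, and the object name `p_themed` is hypothetical.

```r
library(ggplot2)

# mtcars ships with R; it only stands in for the survey data here
p_themed <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Weight and fuel economy", color = "Cylinders") +
  theme_bw() +
  theme(
    axis.title = element_text(size = 12),      # axis labels
    plot.title = element_text(face = "bold"),  # title
    legend.position = "bottom",                # legend placement
    panel.grid.minor = element_blank()         # grid layout
  )

p_themed
```

Note that `theme()` is added after `theme_bw()`: a complete theme replaces earlier settings, so individual tweaks should come last.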
---

## Example: changing the grid layout & strip background

.pull-left[
```r
df %>%
  filter(!is.na(ai_threat)) %>%
  ggplot(
    aes(
      x = years_coding,
      y = participation_frequency
    )
  ) +
  geom_smooth(
    color = "black",
    method = "lm",
    se = FALSE
  ) +
  facet_wrap(~ai_threat, ncol = 3) +
  theme_bw() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "white")
  )
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />
]

---

## Example: changing axis labels

.pull-left[
```r
df %>%
  filter(!is.na(ai_threat)) %>%
  ggplot(
    aes(
      x = years_coding,
      y = participation_frequency
    )
  ) +
  geom_smooth(
    color = "black",
    method = "lm",
    se = FALSE
  ) +
  facet_wrap(~ai_threat, ncol = 3) +
  theme_bw() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "white")
  ) +
  ylab("Participation Frequency") +
  xlab("Year of Coding Experience")
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />
]

---

## A note on plotting options

.pull-left[
Working with combined aesthetics and different data inputs can become challenging.

In particular, mapping similar aesthetics in ways that interfere with the automatic procedures can create conflicts.
Some 'favorites' include:
- Multiple legends
- and various color scales for similar `geoms`
]

.pull-right[
<img src="data:image/png;base64,#https://github.com/jobreu/r-intro-gesis-2021/blob/main/content/img/800px-The_Scream.jpg?raw=true" style="display: block; margin: auto;" />
]

.right[
<small><small>Source: https://de.wikipedia.org/wiki/Der_Schrei#/media/File:The_Scream.jpg</small></small>
]

---

## `ggplot` plots are 'simple' objects

In contrast to standard `R` plots, `ggplot2` outputs are standard objects like any other object in `R` (they are lists). So there is no graphics device involved from which we have to record our plot to re-use it later. We can just use the object directly.

```r
my_fancy_plot <-
  ggplot(
    data = df,
    aes(
      x = years_coding,
      y = participation_frequency
    )
  ) +
  geom_point()

my_fancy_plot <- my_fancy_plot + geom_smooth()
```

Additionally, there is no need to call `dev.off()`.

---

## It makes combining plots easy

By now, there are many packages that make it fairly easy to combine `ggplot2` plots. For example, the [`cowplot` package](https://cran.r-project.org/web/packages/cowplot/index.html) provides a really flexible framework.

Yet, fiddling with this package can become quite complicated.

A very easy-to-use package for combining `ggplot`s is the [`patchwork` package](https://cran.r-project.org/web/packages/patchwork/index.html).
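For comparison, a minimal sketch of combining two plots with `cowplot`. It assumes that `cowplot` is installed and uses the built-in `mtcars` data; the object names are hypothetical.

```r
library(ggplot2)
library(cowplot)

# Two independent plots on built-in data (a stand-in for the survey data)
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
p_bars <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# plot_grid() arranges the plots in a grid and can label each panel
combined <- plot_grid(p_scatter, p_bars, labels = c("A", "B"), ncol = 2)

combined
```

The explicit function call gives fine-grained control over the layout, whereas `patchwork` expresses the same arrangement with operators.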
---

## Plotting side by side in one row

.pull-left[
```r
library(patchwork)

my_barplot <-
  ggplot(
    df,
    aes(x = years_coding)
  ) +
  geom_bar()

my_boxplot <-
  ggplot(
    df,
    aes(y = years_pro_coding)
  ) +
  geom_boxplot()

my_barplot | my_boxplot
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]

---

## Plotting in two rows

.pull-left[
```r
my_barplot / my_boxplot
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]

---

## There's more

You can also annotate plots with titles, subtitles, captions, and tags.

You can nest plots and introduce more complex layouts.

If you're interested in this, you should check out the [`patchwork` repository on *GitHub*](https://github.com/thomasp85/patchwork) as everything is really well-documented there.

---

## Exporting ggplot graphics

Exporting `ggplot2` graphics is fairly easy with the `ggsave()` function. It automatically detects the file format from the file extension. You can also define the plot height, width, and dpi, which is particularly useful for producing high-quality graphics for publications.

```r
nice_plot <-
  ggplot(
    df,
    aes(x = years_coding)
  ) +
  geom_bar()

ggsave("nice_plot.png", nice_plot, dpi = 300)
```

Or:

```r
ggsave("nice_plot.tiff", nice_plot, dpi = 300)
```

---

## Visual exploratory data analysis

In the session on *Exploratory Data Analysis* (EDA), we have said that visualization should be part of EDA. We can use `ggplot2` for this, but there are also many packages out there that offer helpful visualization functions. In the following, we will look at two of those: `visdat` (for visualizing missing data patterns) and `GGally` (for visualizing correlations).

Many of these packages build on `ggplot2`, and their output can, hence, be further customized or extended using `ggplot2` or its extension packages.
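As a minimal sketch of such customization, the output of `visdat::vis_miss()` can be extended with regular `ggplot2` layers. The example assumes that `visdat` is installed and uses the built-in `airquality` data rather than the survey data; the object name `p_missing` is hypothetical.

```r
library(ggplot2)
library(visdat)

# airquality ships with R and has missing values in Ozone and Solar.R
p_missing <- vis_miss(airquality) +
  # Regular ggplot2 layers can be added to the returned plot object
  labs(title = "Missing values in airquality") +
  theme(plot.title = element_text(face = "bold"))

p_missing
```

This works because `vis_miss()` returns a `ggplot` object, so titles, themes, and scales can be changed in the usual way.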
---

## Plotting the structure of missing data

.pull-left[
```r
library(visdat)

vis_miss(df[, 18:23])
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />
]

---

## Fancier barplots: Relative frequencies

.pull-left[
```r
library(scales)

df %>%
  ggplot(
    aes(
      x = age_group,
      fill = age_group
    )
  ) +
  geom_bar(
    aes(
      # after_stat() replaces the deprecated ..count.. notation
      y = after_stat(count / sum(count))
    )
  ) +
  scale_y_continuous(
    labels = percent
  ) +
  ylab("Relative Frequencies") +
  theme_classic() +
  theme(legend.position = "none")
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />
]

---

## Fancier barplots: Percentages & formatting

.pull-left[
```r
df %>%
  filter(!is.na(ai_complexity)) %>%
  ggplot(aes(x = ai_complexity, fill = ai_complexity)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = percent, expand = expansion(mult = c(0, 0.1))) +
  ylab("Relative Frequencies") +
  xlab("") +
  theme_classic() +
  theme(legend.position = "none")
```
]

.pull-right[
<img src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />
]

---

## Correlation plots

.pull-left[
```r
library(GGally)

# Remove missing values
df_clean <- df %>%
  filter(!is.na(ai_sentiment) & !is.na(ai_acc) & !is.na(ai_complexity) & !is.na(ai_threat))

# to factors (values that are not listed in `levels` become NA)
df_clean <- df_clean %>%
  mutate(
    ai_sentiment = factor(ai_sentiment, levels = c("Very unfavorable", "Somewhat unfavorable", "Neutral", "Somewhat favorable", "Very favorable")),
    ai_acc = factor(ai_acc, levels = c("Very distrustful", "Somewhat distrustful", "Neutral", "Somewhat trust", "Very trust")),
    ai_complexity = factor(ai_complexity, levels = c("Bad at complex tasks", "Neutral", "Good at complex tasks")),
    ai_threat = factor(ai_threat, levels = c("No", "Yes"))
  )

cor_matrix <- df_clean %>%
mutate_all(factor) %>% mutate_all(~as.integer(factor(.))) %>% cor(method = "spearman", use = "complete.obs") # Complete case analysis print(cor_matrix) ggcorr(cor_matrix, label = TRUE, label_size = 2) ``` ] .pull-right[ ``` ## response_id main_branch age_group remote_work ## response_id 1.000000000 0.164323319 -0.148279931 0.045541077 ## main_branch 0.164323319 1.000000000 -0.082842878 -0.081439566 ## age_group -0.148279931 -0.082842878 1.000000000 0.134707320 ## remote_work 0.045541077 -0.081439566 0.134707320 1.000000000 ## education_level -0.063259472 0.042774246 0.044058911 -0.002923373 ## years_coding -0.057852605 -0.028833301 0.697523531 0.069588683 ## years_pro_coding -0.073903343 -0.103291289 0.791243278 0.174225412 ## dev_type 0.018511718 0.078342432 0.012291304 -0.096248425 ## org_size -0.042346948 0.093091663 -0.029415088 0.098455694 ## purchase_influence -0.046379997 0.059371083 -0.035070729 -0.027534405 ## build_vs_buy 0.051007494 -0.031386031 -0.013012664 0.023620009 ## country -0.097186361 -0.076478901 0.135455735 0.103262618 ## currency -0.053975491 -0.072631419 0.058415678 0.155750503 ## comp_total 0.029753830 -0.055667145 0.081770863 0.046390647 ## visit_frequency 0.014274239 0.016976935 -0.129076794 0.048497872 ## has_account NA NA NA NA ## participation_frequency 0.034169039 -0.006159126 0.001479647 0.009595035 ## community_belief -0.045390953 -0.141106771 0.069556561 0.041645185 ## ai_usage NA NA NA NA ## ai_sentiment 0.102007178 0.016621799 0.021355449 0.007902663 ## ai_acc NA NA NA NA ## ai_complexity NA NA NA NA ## ai_threat 0.064963455 0.004569483 -0.040697208 0.137603407 ## survey_length 0.079190004 -0.062120609 -0.040178396 0.058054343 ## survey_ease -0.021310080 0.026167318 -0.003773888 -0.066514850 ## yearly_compensation 0.022360855 -0.037501771 0.430395995 0.063041050 ## r_used 0.034817385 -0.037505365 0.088003502 -0.010698923 ## r_want_to_use -0.002606607 -0.033469491 0.108917316 -0.046601549 ## education_level years_coding 
years_pro_coding ## response_id -0.063259472 -0.057852605 -0.073903343 ## main_branch 0.042774246 -0.028833301 -0.103291289 ## age_group 0.044058911 0.697523531 0.791243278 ## remote_work -0.002923373 0.069588683 0.174225412 ## education_level 1.000000000 0.108325079 0.036936340 ## years_coding 0.108325079 1.000000000 0.877682312 ## years_pro_coding 0.036936340 0.877682312 1.000000000 ## dev_type 0.064007000 0.059515867 0.083546204 ## org_size 0.124333724 -0.065635299 -0.036543985 ## purchase_influence -0.015699942 -0.167274964 -0.122405400 ## build_vs_buy 0.020076005 0.012249343 0.011391147 ## country -0.153027285 0.075065867 0.100023733 ## currency -0.183735555 -0.012142301 0.017882032 ## comp_total -0.093786063 0.044297143 0.112663841 ## visit_frequency 0.037175223 -0.107707615 -0.130732523 ## has_account NA NA NA ## participation_frequency 0.028908681 0.035955272 0.050614769 ## community_belief -0.016438711 0.046208533 0.065453844 ## ai_usage NA NA NA ## ai_sentiment -0.097540100 -0.061711654 0.048786156 ## ai_acc NA NA NA ## ai_complexity NA NA NA ## ai_threat -0.137655167 -0.099787155 -0.136182635 ## survey_length 0.039336578 -0.008812184 -0.020357660 ## survey_ease -0.049303289 -0.021059869 0.024911413 ## yearly_compensation -0.039976803 0.449317464 0.469264679 ## r_used 0.035780570 0.058272674 0.002107932 ## r_want_to_use 0.015218577 0.108337652 0.043369851 ## dev_type org_size purchase_influence ## response_id 0.018511718 -0.042346948 -0.046379997 ## main_branch 0.078342432 0.093091663 0.059371083 ## age_group 0.012291304 -0.029415088 -0.035070729 ## remote_work -0.096248425 0.098455694 -0.027534405 ## education_level 0.064007000 0.124333724 -0.015699942 ## years_coding 0.059515867 -0.065635299 -0.167274964 ## years_pro_coding 0.083546204 -0.036543985 -0.122405400 ## dev_type 1.000000000 0.083498975 -0.064158718 ## org_size 0.083498975 1.000000000 -0.098442456 ## purchase_influence -0.064158718 -0.098442456 1.000000000 ## build_vs_buy -0.042264261 
-0.014514943 0.006545222
## country                 -0.068286874 -0.186177751  0.054106541
## currency                 0.014175689 -0.028133024  0.092433319
## comp_total               0.004035800  0.041898075  0.134824831
## visit_frequency         -0.053486393 -0.092671855 -0.017782554
## has_account                       NA           NA           NA
## participation_frequency -0.045339854 -0.042455110 -0.050442897
## community_belief         0.008340008 -0.045408900 -0.016103889
## ai_usage                          NA           NA           NA
## ai_sentiment             0.066028207 -0.005249052  0.084059666
## ai_acc                            NA           NA           NA
## ai_complexity                     NA           NA           NA
## ai_threat               -0.072714868 -0.002764507  0.056734516
## survey_length           -0.104552281  0.032902715 -0.142042800
## survey_ease             -0.032040297  0.008952926 -0.044444639
## yearly_compensation      0.104943751 -0.150360307  0.002308850
## r_used                   0.095639273  0.053534615  0.141820049
## r_want_to_use            0.017423375  0.127361747  0.118096004
##                         build_vs_buy     country    currency  comp_total
## response_id              0.051007494 -0.09718636 -0.05397549  0.02975383
## main_branch             -0.031386031 -0.07647890 -0.07263142 -0.05566714
## age_group               -0.013012664  0.13545574  0.05841568  0.08177086
## remote_work              0.023620009  0.10326262  0.15575050  0.04639065
## education_level          0.020076005 -0.15302729 -0.18373556 -0.09378606
## years_coding             0.012249343  0.07506587 -0.01214230  0.04429714
## years_pro_coding         0.011391147  0.10002373  0.01788203  0.11266384
## dev_type                -0.042264261 -0.06828687  0.01417569  0.00403580
## org_size                -0.014514943 -0.18617775 -0.02813302  0.04189808
## purchase_influence       0.006545222  0.05410654  0.09243332  0.13482483
## build_vs_buy             1.000000000  0.04980502  0.12757616 -0.09365334
## country                  0.049805022  1.00000000  0.75526817  0.18215559
## currency                 0.127576160  0.75526817  1.00000000  0.20566388
## comp_total              -0.093653342  0.18215559  0.20566388  1.00000000
## visit_frequency          0.015756942 -0.03891360 -0.06879638  0.03649558
## has_account                       NA          NA          NA          NA
## participation_frequency  0.104131761 -0.03962814 -0.07997838 -0.01809675
## community_belief         0.019564032  0.05563624  0.01271872  0.02057449
## ai_usage                          NA          NA          NA          NA
## ai_sentiment            -0.052978270 -0.08627177 -0.08013472  0.09373389
## ai_acc                            NA          NA          NA          NA
## ai_complexity                     NA          NA          NA          NA
## ai_threat               -0.031555778 -0.05812340 -0.05380360  0.01392942
## survey_length            0.137700454 -0.12127842 -0.05721348  0.08199237
## survey_ease             -0.039207320  0.06493793 -0.01033303  0.04194963
## yearly_compensation     -0.035137365  0.44317477  0.30792209  0.22566212
## r_used                  -0.024704927  0.17059583  0.10417857 -0.01822880
## r_want_to_use           -0.054227233  0.13318225  0.09510533 -0.04770680
##                         visit_frequency has_account participation_frequency
## response_id                  0.01427424          NA            0.0341690390
## main_branch                  0.01697693          NA           -0.0061591258
## age_group                   -0.12907679          NA            0.0014796465
## remote_work                  0.04849787          NA            0.0095950348
## education_level              0.03717522          NA            0.0289086811
## years_coding                -0.10770762          NA            0.0359552719
## years_pro_coding            -0.13073252          NA            0.0506147687
## dev_type                    -0.05348639          NA           -0.0453398539
## org_size                    -0.09267185          NA           -0.0424551095
## purchase_influence          -0.01778255          NA           -0.0504428971
## build_vs_buy                 0.01575694          NA            0.1041317611
## country                     -0.03891360          NA           -0.0396281380
## currency                    -0.06879638          NA           -0.0799783755
## comp_total                   0.03649558          NA           -0.0180967507
## visit_frequency              1.00000000          NA           -0.1352814075
## has_account                          NA           1                      NA
## participation_frequency     -0.13528141          NA            1.0000000000
## community_belief             0.19937032          NA           -0.0365081788
## ai_usage                             NA          NA                      NA
## ai_sentiment                 0.03433023          NA           -0.0531274802
## ai_acc                               NA          NA                      NA
## ai_complexity                        NA          NA                      NA
## ai_threat                    0.03672628          NA            0.0244884595
## survey_length               -0.01405976          NA           -0.1038068403
## survey_ease                  0.10317334          NA           -0.0811395153
## yearly_compensation         -0.15609534          NA            0.0835903849
## r_used                      -0.07746261          NA           -0.1261998129
## r_want_to_use               -0.06912702          NA            0.0005877856
##                         community_belief ai_usage ai_sentiment ai_acc
## response_id                 -0.045390953       NA  0.102007178     NA
## main_branch                 -0.141106771       NA  0.016621799     NA
## age_group                    0.069556561       NA  0.021355449     NA
## remote_work                  0.041645185       NA  0.007902663     NA
## education_level             -0.016438711       NA -0.097540100     NA
## years_coding                 0.046208533       NA -0.061711654     NA
## years_pro_coding             0.065453844       NA  0.048786156     NA
## dev_type                     0.008340008       NA  0.066028207     NA
## org_size                    -0.045408900       NA -0.005249052     NA
## purchase_influence          -0.016103889       NA  0.084059666     NA
## build_vs_buy                 0.019564032       NA -0.052978270     NA
## country                      0.055636240       NA -0.086271766     NA
## currency                     0.012718716       NA -0.080134718     NA
## comp_total                   0.020574490       NA  0.093733891     NA
## visit_frequency              0.199370317       NA  0.034330233     NA
## has_account                           NA       NA           NA     NA
## participation_frequency     -0.036508179       NA -0.053127480     NA
## community_belief             1.000000000       NA  0.073157593     NA
## ai_usage                              NA        1           NA     NA
## ai_sentiment                 0.073157593       NA  1.000000000     NA
## ai_acc                                NA       NA           NA      1
## ai_complexity                         NA       NA           NA     NA
## ai_threat                   -0.006273226       NA  0.027834784     NA
## survey_length               -0.052292011       NA -0.138985632     NA
## survey_ease                  0.040790072       NA  0.042868121     NA
## yearly_compensation          0.001123081       NA  0.053335425     NA
## r_used                       0.070398575       NA  0.010118369     NA
## r_want_to_use                0.036014985       NA  0.009029552     NA
##                         ai_complexity    ai_threat survey_length  survey_ease
## response_id                        NA  0.064963455   0.079190004 -0.021310080
## main_branch                        NA  0.004569483  -0.062120609  0.026167318
## age_group                          NA -0.040697208  -0.040178396 -0.003773888
## remote_work                        NA  0.137603407   0.058054343 -0.066514850
## education_level                    NA -0.137655167   0.039336578 -0.049303289
## years_coding                       NA -0.099787155  -0.008812184 -0.021059869
## years_pro_coding                   NA -0.136182635  -0.020357660  0.024911413
## dev_type                           NA -0.072714868  -0.104552281 -0.032040297
## org_size                           NA -0.002764507   0.032902715  0.008952926
## purchase_influence                 NA  0.056734516  -0.142042800 -0.044444639
## build_vs_buy                       NA -0.031555778   0.137700454 -0.039207320
## country                            NA -0.058123399  -0.121278424  0.064937932
## currency                           NA -0.053803597  -0.057213481 -0.010333035
## comp_total                         NA  0.013929415   0.081992369  0.041949629
## visit_frequency                    NA  0.036726280  -0.014059759  0.103173339
## has_account                        NA           NA            NA           NA
## participation_frequency            NA  0.024488460  -0.103806840 -0.081139515
## community_belief                   NA -0.006273226  -0.052292011  0.040790072
## ai_usage                           NA           NA            NA           NA
## ai_sentiment                       NA  0.027834784  -0.138985632  0.042868121
## ai_acc                             NA           NA            NA           NA
## ai_complexity                       1           NA            NA           NA
## ai_threat                          NA  1.000000000   0.010598349 -0.070979820
## survey_length                      NA  0.010598349   1.000000000 -0.024218036
## survey_ease                        NA -0.070979820  -0.024218036  1.000000000
## yearly_compensation                NA -0.086783071  -0.143368048 -0.055407127
## r_used                             NA -0.062806303  -0.047464792  0.020435381
## r_want_to_use                      NA -0.056047848   0.041600826  0.109722172
##                         yearly_compensation       r_used r_want_to_use
## response_id                     0.022360855  0.034817385 -0.0026066073
## main_branch                    -0.037501771 -0.037505365 -0.0334694907
## age_group                       0.430395995  0.088003502  0.1089173162
## remote_work                     0.063041050 -0.010698923 -0.0466015488
## education_level                -0.039976803  0.035780570  0.0152185771
## years_coding                    0.449317464  0.058272674  0.1083376524
## years_pro_coding                0.469264679  0.002107932  0.0433698511
## dev_type                        0.104943751  0.095639273  0.0174233746
## org_size                       -0.150360307  0.053534615  0.1273617474
## purchase_influence              0.002308850  0.141820049  0.1180960038
## build_vs_buy                   -0.035137365 -0.024704927 -0.0542272331
## country                         0.443174771  0.170595831  0.1331822527
## currency                        0.307922092  0.104178571  0.0951053306
## comp_total                      0.225662120 -0.018228798 -0.0477067964
## visit_frequency                -0.156095343 -0.077462610 -0.0691270205
## has_account                              NA           NA            NA
## participation_frequency         0.083590385 -0.126199813  0.0005877856
## community_belief                0.001123081  0.070398575  0.0360149848
## ai_usage                                 NA           NA            NA
## ai_sentiment                    0.053335425  0.010118369  0.0090295524
## ai_acc                                   NA           NA            NA
## ai_complexity                            NA           NA            NA
## ai_threat                      -0.086783071 -0.062806303 -0.0560478479
## survey_length                  -0.143368048 -0.047464792  0.0416008264
## survey_ease                    -0.055407127  0.020435381  0.1097221719
## yearly_compensation             1.000000000  0.124785798  0.0693379926
## r_used                          0.124785798  1.000000000  0.6642005029
## r_want_to_use                   0.069337993  0.664200503  1.0000000000
```

<img 
src="data:image/png;base64,#3_2_Data_Visualization_Part_1_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />
]

---

## Some additional resources

- [ggplot2 - Elegant Graphics for Data Analysis](https://www.springer.com/gp/book/9783319242750) by Hadley Wickham
- [Chapter 3](https://r4ds.had.co.nz/data-visualisation.html) in *R for Data Science*
- [Fundamentals of Data Visualization](https://serialmentor.com/dataviz/) by Claus O. Wilke
- [Data Visualization - A Practical Introduction](https://press.princeton.edu/titles/13826.html) by Kieran Healy
- [data-to-viz](https://www.data-to-viz.com/)
- [R Graph Gallery](https://www.r-graph-gallery.com/)
- [BBC Visual and Data Journalism cookbook for R graphics](https://bbc.github.io/rcookbook/#how_to_create_bbc_style_graphics)
- [List of `ggplot2` extensions](https://exts.ggplot2.tidyverse.org/)