class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Confirmatory Data Analysis
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-11-21
]

---

layout: true

---

## Content of this session

.pull-left[
**What we will cover**

- Using (more or less) simple regression models
  - OLS, GLM, and the like
- How to re-use the results of these models
- How to feed these results into tables
]

.pull-right[
**What we won't cover**

- Theory (and history) of hypothesis testing
- Crazy complex models with elaborated estimators
  - e.g., no multilevel models
  - also no clustered standard errors
- Bayesian statistics
]

---

## Last sessions

- We learned how to import/export data
- We learned how to wrangle data: create subsets and new variables, recode, merge, and pivot

---

## Recap Distribution


``` r
# Plot first variable
hist(penguins_clean$bill_length_mm,
     col = rgb(0, 0, 1, 0.25),
     breaks = 10,
     xlab = 'Bill length [mm]',
     main = 'Distribution')
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Descriptive Statistics

### Select continuous variables for summary


``` r
library(modelsummary)

df.s <- penguins_clean %>%
  select(bill_length_mm, body_mass_g, flipper_length_mm)

datasummary_skim(df.s, output = 'flextable')
```

|                   | Unique | Missing Pct. |   Mean |    SD |    Min | Median |    Max |
|-------------------|-------:|-------------:|-------:|------:|-------:|-------:|-------:|
| bill_length_mm    |    163 |            0 |   44.0 |   5.5 |   32.1 |   44.5 |   59.6 |
| body_mass_g       |     93 |            0 | 4207.1 | 805.2 | 2700.0 | 4050.0 | 6300.0 |
| flipper_length_mm |     54 |            0 |  201.0 |  14.0 |  172.0 |  197.0 |  231.0 |
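Since the next slides look at bivariate relationships, it can also help to glance at the pairwise correlations first. A minimal sketch; the `datasummary_correlation()` call assumes your installed `modelsummary` version provides it:

``` r
# Quick look at pairwise associations before modeling
cor(df.s, use = "pairwise.complete.obs")

# Or as a formatted table with the same package
datasummary_correlation(df.s)
```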
---

## Visualize Bivariate Relationships


``` r
library(ggplot2)

# Scatter plot with trend line ----
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  labs(title = "Body Mass by Flipper Length",
       subtitle = "Clear positive linear relationship",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)")
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

Tip: `facet_wrap()` can show separate relationships by group!
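For example, a minimal sketch of the faceted version of the plot above (one panel per species):

``` r
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ species) +
  labs(x = "Flipper Length (mm)", y = "Body Mass (g)")
```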
---

## Simple Linear Regression: Our First Model


``` r
# Estimate the model
penguin_lm_simple <- lm(
  body_mass_g ~ flipper_length_mm,
  data = penguins_clean
)

# View coefficients
penguin_lm_simple
```

```
## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins_clean)
## 
## Coefficients:
##       (Intercept)  flipper_length_mm  
##          -5872.09              50.15
```

---

## Interpreting Model Output: What Does It Mean?


``` r
summary(penguin_lm_simple)
```

```
## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1057.33  -259.79   -12.24   242.97  1293.89 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5872.09     310.29  -18.93   <2e-16 ***
## flipper_length_mm    50.15       1.54   32.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 393.3 on 331 degrees of freedom
## Multiple R-squared:  0.7621, Adjusted R-squared:  0.7614 
## F-statistic: 1060 on 1 and 331 DF,  p-value: < 2.2e-16
```

---

## Interpreting Coefficients with `parameters`


``` r
library(parameters)

model_parameters(penguin_lm_simple)
```

```
## Parameter         | Coefficient |     SE |               95% CI | t(331) |      p
## ---------------------------------------------------------------------------------
## (Intercept)       |    -5872.09 | 310.29 | [-6482.47, -5261.71] | -18.92 | < .001
## flipper length mm |       50.15 |   1.54 | [   47.12,    53.18] |  32.56 | < .001
```

---

## Formula Syntax in `R`


``` r
y ~ x1 + x2 + x3
```

*Note*: Formulas represent a specific object class in `R`.


``` r
class(y ~ x1 + x2 + x3)
```

```
## [1] "formula"
```

---

## Denoting the left-hand side with `~`

In `R`, stating what your dependent variable is works very similarly to common mathematical notation:

`$$y \sim N(\theta, \epsilon)$$`

It states that a specific relationship is actually _estimated_, but, fortunately, we don't have to specify the errors here.


``` r
y ~ x1 + x2 + x3
```

Yet, sometimes it may be a good idea to at least explicitly specify the intercept, as here:


``` r
y ~ 1 + x1 + x2 + x3
```

---

## Intercept

We can also estimate models without an intercept:


``` r
y ~ x1 + x2 + x3 - 1
```

Or intercept-only models as well:


``` r
y ~ 1
```

---

## Adding predictors with `+`

You can add as many predictors/covariates as you want with the simple <span style="color:orange">`+`</span> operator. See:


``` r
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 +
  x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 +
  x21 + x22 + x23 + x24
```

There's also a shortcut for using all (remaining) variables in a data set as predictors:


``` r
y ~ .
```

---

## Multiple Regression: Adding More Predictors


``` r
penguin_lm_multi <- lm(
  body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm + species,
  data = penguins_clean
)

# Compare with simple model
anova(penguin_lm_simple, penguin_lm_multi)
```

```
## Analysis of Variance Table
## 
## Model 1: body_mass_g ~ flipper_length_mm
## Model 2: body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm + 
##     species
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1    331 51211963                                  
## 2    327 32397671  4  18814292 47.475 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Why multiple predictors? The model comparison shows that adding the bill dimensions and species significantly improves the fit over flipper length alone.
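One quick way to see the gain in explained variance is to compare the adjusted R² of both models (a minimal base-`R` sketch):

``` r
# Gain in explained variance from the additional predictors
summary(penguin_lm_simple)$adj.r.squared
summary(penguin_lm_multi)$adj.r.squared
```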
---

## Interaction effects with `*` and `:`

We can also easily add interaction effects to a model. As this is the same as multiplying predictor variables, we also use the <span style="color:orange">**`*`**</span> sign for that.


``` r
y ~ x1 * x2
```

The code above creates a model formula that includes both the <span style="background-color: yellow">main effects</span> of `x1` and `x2`, and their interaction denoted by <span style="background-color: yellow">`x1:x2`</span>. We can even be more explicit and write that into the formula directly:


``` r
y ~ x1 + x2 + x1:x2
```

---

## Interaction Effects with Penguins

Does the effect of flipper length on body mass differ by species?


``` r
penguin_lm_interact <- lm(
  body_mass_g ~ flipper_length_mm * species,
  data = penguins_clean
)
```

---

## Interaction Effects


``` r
model_parameters(penguin_lm_interact)
```

```
## Parameter                               | Coefficient |      SE |               95% CI
## --------------------------------------------------------------------------------------
## (Intercept)                             |    -2508.09 |  892.41 | [-4263.68,  -752.49]
## flipper length mm                       |       32.69 |    4.69 | [   23.46,    41.92]
## species [Chinstrap]                     |     -529.11 | 1525.11 | [-3529.37,  2471.16]
## species [Gentoo]                        |    -4166.12 | 1431.58 | [-6982.39, -1349.84]
## flipper length mm × species [Chinstrap] |        1.88 |    7.86 | [  -13.59,    17.36]
## flipper length mm × species [Gentoo]    |       21.48 |    6.97 | [    7.77,    35.18]
## 
## Parameter                               | t(327) |      p
## ---------------------------------------------------------
## (Intercept)                             |  -2.81 | 0.005 
## flipper length mm                       |   6.97 | < .001
## species [Chinstrap]                     |  -0.35 | 0.729 
## species [Gentoo]                        |  -2.91 | 0.004 
## flipper length mm × species [Chinstrap] |   0.24 | 0.811 
## flipper length mm × species [Gentoo]    |   3.08 | 0.002
```

**Interpretation**: The flipper_length_mm × speciesGentoo interaction of about 21.5 means the flipper-length slope is roughly 21.5 g/mm steeper for Gentoos than for Adelies; the Chinstrap interaction (about 1.9 g/mm) is not statistically significant. Visualize interactions to aid interpretation!

---

## **Transforming** variables within a formula

For simple transformations this can be done as follows:


``` r
y ~ log(x)    # natural logarithm of x (use log10(x) for base 10)
y ~ scale(x)  # z-transformation of x
```

We could also change the data type of variables within a formula, e.g., by converting a numeric variable to a factor using `as.factor(x)`.

---

## Transforming variables within a formula

If you cannot use a specific function for your transformation, you have to wrap the operation in the <span style="background-color: yellow">`I()`</span> function.


``` r
y ~ x + I(x^2) # add a quadratic term of x
```

*Note*: Of course, there are also functions in `R` for transforming variables (e.g., standardizing or centering) before we use them in a formula. Besides the `base R` function `scale()`, the [`datawizard` package](https://easystats.github.io/datawizard/), e.g., provides a few functions for that.
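For example, a minimal sketch assuming the `datawizard` package is installed (`standardize()` and `center()` are its helpers for z-transforming and mean-centering; `penguins_trans` is just an illustrative name):

``` r
library(datawizard)

# Pre-compute transformed predictors instead of doing it inside the formula
penguins_trans <- penguins_clean %>%
  mutate(
    flipper_z     = standardize(flipper_length_mm),  # z-transformation
    bill_centered = center(bill_length_mm)           # mean-centering
  )
```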
---

## Where to use formulas?

The previous descriptions mainly refer to formulas used in regression models in `R`. However, formulas are also used in other hypothesis-testing methods that distinguish between dependent and independent variables, such as t-tests or ANOVA.

We will try out some of those in the following...

---

## Testing group differences in the distribution

### T-Test: Two Groups

A common method for analyzing group differences is the t-test. You can use the two-sample <span style="background-color: yellow">`t.test()`</span> function:


``` r
# two-sample t-test
# Research question: Do male and female penguins differ in body mass?
t_test_sex <- t.test(
  body_mass_g ~ sex,          # Outcome ~ Grouping variable
  data = penguins_clean,
  alternative = "two.sided"   # Two-tailed test
)
```

---

## T-Test Results


``` r
print(t_test_sex)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  body_mass_g by sex
## t = -8.5545, df = 323.9, p-value = 4.794e-16
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -840.5783 -526.2453
## sample estimates:
## mean in group female   mean in group male 
##             3862.273             4545.685
```

.small[
*Note*: By default, `R` uses [Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test) (instead of [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)), which does not assume homogeneity of variance across groups.
]

---

## Visualize T-Test

.pull-left[

``` r
t_test <- ggplot(penguins_clean,
                 aes(x = sex, y = body_mass_g, fill = sex)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  geom_jitter(width = 0.2, alpha = 0.5, show.legend = FALSE) +
  labs(
    title = "Body Mass by Sex: Significant Difference",
    subtitle = "Males are ~680g heavier than females (p < 0.001)",
    x = "Sex",
    y = "Body Mass (g)"
  ) +
  scale_fill_manual(values = c(female = "darkorange", male = "steelblue"))
```
]

.pull-right[

``` r
t_test
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

---

## Testing multiple groups: ANOVA

Research question: Do species differ in flipper length?


``` r
peng_clean <- penguins %>%
  filter(!is.na(flipper_length_mm))

# 1-way ANOVA
fit <- aov(flipper_length_mm ~ species, data = peng_clean)

summary(fit)
```

```
##              Df Sum Sq Mean Sq F value Pr(>F)    
## species       2  52473   26237   594.8 <2e-16 ***
## Residuals   339  14953      44                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

## ANOVA visualization

.pull-left[

``` r
anova_plot <- ggplot(peng_clean,
                     aes(x = species, y = flipper_length_mm, fill = species)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Flipper Length by Species",
       y = "Flipper Length (mm)",
       x = "Species") +
  theme_minimal() +
  theme(legend.position = "none")
```
]

.pull-right[

``` r
anova_plot
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]

---

## Two-way ANOVA

Species + Sex effects on body mass


``` r
peng_clean <- penguins %>%
  filter(!is.na(body_mass_g), !is.na(sex), !is.na(species))

fit2 <- aov(body_mass_g ~ species * sex, data = peng_clean)

summary(fit2)
```

```
##              Df    Sum Sq  Mean Sq F value   Pr(>F)    
## species       2 145190219 72595110 758.358  < 2e-16 ***
## sex           1  37090262 37090262 387.460  < 2e-16 ***
## species:sex   2   1676557   838278   8.757 0.000197 ***
## Residuals   327  31302628    95727                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
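The F-tests only tell us the effects are significant, not how large they are. A minimal sketch for effect sizes, assuming the `effectsize` package (part of `easystats`) is installed:

``` r
library(effectsize)

# (Partial) eta squared for the two-way ANOVA terms
eta_squared(fit2, partial = TRUE)
```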
---

## Post-hoc test (Tukey HSD)

To find out which groups differ from each other, we can compute *Tukey Honest Significant Differences*.


``` r
TukeyHSD(fit)
```

```
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = flipper_length_mm ~ species, data = peng_clean)
## 
## $species
##                       diff       lwr       upr p adj
## Chinstrap-Adelie  5.869887  3.586583  8.153191     0
## Gentoo-Adelie    27.233349 25.334376 29.132323     0
## Gentoo-Chinstrap 21.363462 19.000841 23.726084     0
```

---

## Assumptions Check: Residual Diagnostics

.pull-left[

``` r
library(broom)

diag_data <- augment(penguin_lm_multi)

p_resid <- ggplot(diag_data, aes(.fitted, .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Residuals vs Fitted",
    x = "Fitted values",
    y = "Residuals"
  ) +
  theme_minimal()

p_qq <- ggplot(diag_data, aes(sample = .std.resid)) +
  stat_qq(alpha = 0.6) +
  stat_qq_line() +
  labs(
    title = "Normal Q-Q Plot",
    x = "Theoretical Quantiles",
    y = "Standardized Residuals"
  ) +
  theme_minimal()
```
]

.pull-right[
<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

---

## Multicollinearity: VIF Check

When predictors correlate highly, coefficient estimates become unstable. Check the Variance Inflation Factor (VIF) with `car::vif()`:


``` r
library(car)

vif(penguin_lm_multi)
```

```
##                        GVIF Df GVIF^(1/(2*Df))
## flipper_length_mm  6.469817  1        2.543583
## bill_length_mm     5.234596  1        2.287924
## bill_depth_mm      4.772094  1        2.184512
## species           34.097607  2        2.416468
```

**Rule of thumb**: VIF > 5 indicates problematic multicollinearity. Here, the predictors are somewhat correlated (species overlaps strongly with the morphological measurements), but the adjusted GVIF^(1/(2*Df)) values of roughly 2.2–2.5 are still acceptable.

**Solution**: Remove one predictor, combine them, or use regularization.

---

## Non-parametric Alternative

What if our data are <span style="background-color: orange">*not normally distributed*</span>?

Check **normality** with the [Shapiro-Wilk](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) test.

.pull-left[

``` r
# Shapiro-Wilk test on residuals
shapiro.test(residuals(penguin_lm_multi))
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  residuals(penguin_lm_multi)
## W = 0.99212, p-value = 0.07464
```
]

.pull-right[

``` r
# Histogram of residuals
hist(residuals(penguin_lm_multi),
     main = "Distribution of Residuals",
     xlab = "Residuals")
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />
]

.small[
**Note**: With large samples, even minor deviations become "significant". Focus on visual diagnostics and practical significance.
]
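If you prefer a single visual overview, a minimal sketch with the `performance` package (used again later in this session; assuming it is installed):

``` r
library(performance)

# Combined panel: linearity, homogeneity of variance, normality, influential cases
check_model(penguin_lm_multi)
```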
---

## Non-parametric alternative to the T-test

If the data are not normally distributed, use the [Wilcoxon/Mann-Whitney](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) test.

Use when: the data are skewed, non-normal, or the sample size is small.

.pull-left[

``` r
# Do 2 groups differ?
# E.g.: flipper length between 2 species
# Filter only 2 species for a valid two-group test
peng_two <- penguins %>%
  filter(species %in% c("Adelie", "Gentoo"))

# Wilcoxon rank-sum test
wilcox.test(flipper_length_mm ~ species, data = peng_two)
```

```
## 
## 	Wilcoxon rank sum test with continuity correction
## 
## data:  flipper_length_mm by species
## W = 26, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
```
]

.pull-right[

``` r
# Pairwise comparison between all species
pairwise.wilcox.test(
  x = penguins$flipper_length_mm,
  g = penguins$species,
  p.adjust.method = "BH"
)
```

```
## 
## 	Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  penguins$flipper_length_mm and penguins$species 
## 
##           Adelie Chinstrap
## Chinstrap 3e-08  -        
## Gentoo    <2e-16 <2e-16   
## 
## P value adjustment method: BH
```
]

---

## Checking regression assumptions

As in the case of ANOVA, check regression assumptions. There are plenty of good tutorials online (such as [this blog post by Ian Ruginski](https://www.ianruginski.com/post/regressionassumptions/) or [this chapter](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/testing-regression-assumptions.html) in the online book [*R for Survey Analysis*](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/)).

---

## Making Predictions

.small[Models are useful for prediction! Let's predict body mass for a new penguin:]


``` r
# New observation
new_penguin <- data.frame(
  flipper_length_mm = 200,
  bill_length_mm = 45,
  bill_depth_mm = 15,
  species = "Gentoo"
)

# Point prediction
predict(penguin_lm_multi, newdata = new_penguin)
```

```
##        1 
## 4642.305
```

``` r
# With confidence interval (uncertainty in estimate)
predict(penguin_lm_multi, newdata = new_penguin, interval = "confidence")
```

```
##        fit      lwr      upr
## 1 4642.305 4526.208 4758.403
```

---

## Standardized Effect Sizes (Beta Coefficients)

To compare effect magnitudes across variables with different scales, use standardized coefficients:


``` r
library(lm.beta)

# Standardize coefficients
penguin_lm_beta <- lm.beta(penguin_lm_multi)

# Extract standardized coefficients
coef(penguin_lm_beta)["flipper_length_mm"]
```

```
## flipper_length_mm 
##          0.352066
```

``` r
coef(penguin_lm_beta)["bill_length_mm"]
```

```
## bill_length_mm 
##      0.2697496
```

.small[**Note**: A 1 SD increase in flipper length → a 0.35 SD increase in body mass, the larger of the two standardized effects shown here.]

---

## Categorical Predictors

`R` automatically handles categorical variables (factors) by creating dummy variables.


``` r
contrasts(penguins_clean$species)
```

```
##           Chinstrap Gentoo
## Adelie            0      0
## Chinstrap         1      0
## Gentoo            0      1
```

``` r
# Model matrix
head(model.matrix(penguin_lm_multi))
```

```
##   (Intercept) flipper_length_mm bill_length_mm bill_depth_mm speciesChinstrap speciesGentoo
## 1           1               181           39.1          18.7                0             0
## 2           1               186           39.5          17.4                0             0
## 3           1               195           40.3          18.0                0             0
## 4           1               193           36.7          19.3                0             0
## 5           1               190           39.3          20.6                0             0
## 6           1               181           38.9          17.8                0             0
```

.small[
- The first factor level (Adelie) is the reference level
- The other levels each get their own dummy column
]
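If you want a different baseline category, you can re-set the reference level before fitting. A minimal sketch with base `R`'s `relevel()` (`penguins_relevel` is just an illustrative name):

``` r
# Make Gentoo the reference category instead of Adelie
penguins_relevel <- penguins_clean %>%
  mutate(species = relevel(species, ref = "Gentoo"))

lm(body_mass_g ~ flipper_length_mm + species, data = penguins_relevel)
```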
---

## Generalized linear models

There are a multitude of other estimators/link functions, so-called **family objects**, e.g., binomial logistic or Poisson regression, available through the `glm()` function. See `?family` for an overview.

- <span style="background-color: orange">logistic</span> regression (`glm(y ~ x, family = 'binomial')`) for binary data,
- <span style="background-color: orange">Poisson</span> regression (`glm(y ~ x, family = 'poisson')`) for count data,
- <span style="background-color: orange">quasi-Poisson</span> regression (`glm(y ~ x, family = 'quasipoisson')`) for overdispersed count data; for a true negative binomial model, see `MASS::glm.nb()`

---

## GLM with Penguins: Logistic Regression Example

.small[Let's predict species (binary: Adelie vs. not) from measurements.]


``` r
# Binary outcome
penguins_binary <- penguins_clean %>%
  mutate(is_adelie = ifelse(species == "Adelie", 1, 0))

# Logistic regression
penguin_glm <- glm(
  is_adelie ~ flipper_length_mm + bill_length_mm + bill_depth_mm,
  family = binomial(),
  data = penguins_binary
)

# Odds ratios
exp(coef(penguin_glm))
```

```
##       (Intercept) flipper_length_mm    bill_length_mm     bill_depth_mm 
##      4.441432e+04      1.164228e+00      5.852423e-02      1.017151e+02
```

---

## Assessing model quality

The [`performance` package](https://easystats.github.io/performance/) that is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) offers some functions for assessing model quality, including different types of R². A commonly used metric for logistic regression models, e.g., is Nagelkerke's R².

---

## Model Comparison with AIC

Compare nested models using the Akaike Information Criterion – it balances fit and complexity.


``` r
# Simple model
model1 <- lm(body_mass_g ~ flipper_length_mm, data = penguins_clean)

# Complex model
model2 <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm + species,
             data = penguins_clean)

# Compare
AIC(model1, model2)
```

```
##        df      AIC
## model1  3 4928.146
## model2  6 4833.203
```

.small[Lower AIC = better model. The complex model is preferred here.]

---

## Stepwise Selection

Start with a simple model and let `step()` add (or drop) predictors within the given scope, based on AIC.


``` r
# Stepwise AIC-based selection
step <- step(model1,
             scope = ~ flipper_length_mm + bill_length_mm + species,
             trace = FALSE)

summary(step)
```

.right[↪️]

---

class: middle

.small[

```
## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + species + bill_length_mm, 
##     data = penguins_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -811.52 -227.81  -33.74  216.73 1053.26 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3864.073    533.592  -7.242 3.18e-12 ***
## flipper_length_mm    27.544      3.209   8.583 3.77e-16 ***
## speciesChinstrap   -732.417     82.063  -8.925  < 2e-16 ***
## speciesGentoo       113.254     89.206   1.270    0.205    
## bill_length_mm       60.117      7.207   8.341 2.06e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 339.6 on 328 degrees of freedom
## Multiple R-squared:  0.8243, Adjusted R-squared:  0.8222 
## F-statistic: 384.7 on 4 and 328 DF,  p-value: < 2.2e-16
```
]

---

## Comparing model performance

The `performance` package we have previously used for calculating Nagelkerke's R² for our logistic regression model also provides a handy function for comparing model performance.
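A minimal sketch of both ideas, assuming the `performance` package is installed (`r2_nagelkerke()` and `compare_performance()` are its helpers for pseudo-R² and side-by-side fit metrics):

``` r
library(performance)

# Nagelkerke's R² for the logistic regression model
r2_nagelkerke(penguin_glm)

# Fit indices (R², AIC, BIC, RMSE, ...) for several models side by side
compare_performance(model1, model2)
```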
---

## Other regression variants

While `glm()` already provides quite a few estimators and link functions, depending on the distribution of your dependent variable, you may need more specialized regression models.

A few interesting options include the [`MASS` package](https://cran.r-project.org/web/packages/MASS/index.html) for negative binomial regression, the [`pscl` package](https://cran.r-project.org/web/packages/pscl/index.html) for zero-inflated (negative) binomial regression and hurdle models, or the [`mcp` package](https://lindeloev.github.io/mcp/) for regression with multiple change points.

---

## Handling regression results

While it is (still) common practice to run regressions, search for 'significant' `p`-values, and paste the results into a table without interpreting them substantially, this may not be the best thing to do.

---

## Reporting Results

As we have seen in the session on *Exploratory Data Analysis*, there are different options for creating tables as output in `R`: [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html) and [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/)

---

## Example: Regression tables with the `stargazer` package


``` r
library(stargazer)

stargazer(
  penguin_lm_multi,
  type = "text",        # For console output; use "html" or "latex" for other formats
  title = "Regression Results",
  dep.var.labels = c("Body Mass (g)"),
  covariate.labels = c(
    "Flipper Length (mm)",
    "Bill Length (mm)",
    "Bill Depth (mm)"
  ),
  out.header = FALSE,   # Exclude header metadata if outputting to file
  digits = 3            # Number of decimal places
)
```

.right[↪️]

---

class: center, middle

.small[

```
## 
## Regression Results
## ===================================================
##                             Dependent variable:    
##                         ---------------------------
##                               Body Mass (g)        
## ---------------------------------------------------
## Flipper Length (mm)               20.226***        
##                                   (3.135)          
##                                                    
## Bill Length (mm)                  39.718***        
##                                   (7.227)          
##                                                    
## Bill Depth (mm)                  141.771***        
##                                  (19.163)          
##                                                    
## speciesChinstrap                -496.758***        
##                                  (82.469)          
##                                                    
## speciesGentoo                    965.198***        
##                                  (141.770)         
##                                                    
## Constant                       -4,282.080***       
##                                  (497.832)         
##                                                    
## ---------------------------------------------------
## Observations                        333            
## R2                                 0.849           
## Adjusted R2                        0.847           
## Residual Std. Error          314.763 (df = 327)    
## F Statistic               369.137*** (df = 5; 327) 
## ===================================================
## Note:                   *p<0.1; **p<0.05; ***p<0.01
```
]

---

## Coefficient Plots (Odds Ratios)

.small[Create a coefficient plot]

.pull-left[

``` r
library(sjPlot)

# Plot odds ratios with 95% CI
plot_model(
  penguin_glm,
  show.values = TRUE,
  value.offset = .3,
  transform = "exp",   # transforms log-odds → odds ratios
  title = "Odds Ratios for Predicting Adelie Species"
)
```
]

.pull-right[
<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" />
]

---

## Parameters coefficient plot


``` r
library(parameters)

model_parameters(penguin_glm) %>%
  plot() +
  theme_minimal() +
  theme(legend.position = "none")
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" />

---

## Raw Coefficients (Log-Odds)


``` r
plot_model(
  penguin_glm,
  show.values = TRUE,
  value.offset = .3,
  transform = NULL,   # keep log-odds
  title = "Log-Odds Coefficients (Logistic Regression)"
)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" />
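If you also want these estimates as numbers rather than a plot, a minimal sketch with `broom` (introduced in more detail below):

``` r
library(broom)

# Odds ratios with 95% confidence intervals as a tibble
tidy(penguin_glm, exponentiate = TRUE, conf.int = TRUE)
```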
---

## sjPlot: Marginal Effects Plot


``` r
plot_model(
  penguin_glm,
  type = "pred",
  terms = "flipper_length_mm",
  title = "Predicted Probability of Adelie vs. Flipper Length"
)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" />

---

## You can add multiple predictors


``` r
plot_model(
  penguin_glm,
  type = "pred",
  terms = c("flipper_length_mm", "bill_length_mm"),
  title = "Predicted Probabilities by Flipper and Bill Length"
)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" />

---

## Prediction plots vs regression coefficients

- Regression coefficients are sometimes a bit hard to read
  - they are based on the scale of the independent variable when unstandardized
  - it is ambiguous what they actually mean substantively
  - for example, is a coefficient of .2 or .05 (practically) meaningful or not?
- Predictions are a solution
  - they basically draw the regression line into the scatter plot of the Y and X variables
  - we can read off which value of Y is expected at which level of X

---

## Predictions (sjPlot)


``` r
plot_model(
  penguin_glm,
  type = "pred",
  terms = c("bill_depth_mm", "bill_length_mm"),
  title = "Predicted Probability of Adelie by Bill Depth and Length"
)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" />

---

## Customized ggplot

.pull-left[

``` r
library(broom)
library(ggplot2)
library(dplyr)

coef_df <- tidy(penguin_glm, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(term = reorder(term, estimate))

coef_p <- ggplot(coef_df, aes(x = term, y = estimate)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
                width = .15, linewidth = .8) +
  labs(title = "Logistic Regression Coefficient Plot",
       x = "Predictor",
       y = "Estimate (log-odds)") +
  coord_flip() +
  theme_minimal()
```
]

.pull-right[
<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" />
]

---

## Summary Table with sjPlot


``` r
tab_model(
  penguin_glm,
  show.ci = TRUE,
  show.se = TRUE,
  show.aic = TRUE,
  transform = "exp",
  dv.labels = "Predicting Adelie Species"
)
```

**Predicting Adelie Species**

| Predictors        | Odds Ratios | std. Error |         CI | p         |
|-------------------|------------:|-----------:|-----------:|-----------|
| (Intercept)       |    44414.32 |  704167.07 | 0.00 – Inf | 0.500     |
| flipper length mm |        1.16 |       0.11 | 0.00 – Inf | 0.106     |
| bill length mm    |        0.06 |       0.06 | 0.00 – Inf | **0.003** |
| bill depth mm     |      101.72 |     178.50 | 0.00 – Inf | **0.008** |
| Observations      |         333 |            |            |           |
| R² Tjur           |       0.971 |            |            |           |
| AIC               |      23.615 |            |            |           |
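As an alternative for the same kind of table, a sketch with the `modelsummary` package used earlier (the `exponentiate` and `statistic` arguments are assumptions about your installed version):

``` r
library(modelsummary)

# Odds-ratio table with confidence intervals
modelsummary(penguin_glm, exponentiate = TRUE, statistic = "conf.int")
```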
---

## Tidy models with `broom`

The [`broom`](https://broom.tidymodels.org/) package provides only three, but very powerful, main functions:

- `tidy()`: creates a `tibble` from your model (coefficients)
- `glance()`: information about your model as a `tibble` ('model fit')
- `augment()`: adds observation-level
information, e.g., individual fitted values to your data *Note*: With `broom` you can also create a standardized output of other types of models, such as Latent Class Analysis (LCA) (with the package [`poLCA`](https://cran.r-project.org/web/packages/poLCA/index.html)). --- ## Conflicts ``` r # See which packages conflict conflicts(detail = TRUE) ``` ``` ## $`package:broom` ## [1] "glance" "tidy" ## ## $`package:car` ## [1] "recode" "some" ## ## $`package:parameters` ## [1] "supported_models" ## ## $`package:modelsummary` ## [1] "glance" "supported_models" "tidy" ## ## $`package:webshot2` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" ## ## $`package:lubridate` ## [1] "intersect" "setdiff" "union" "Arith" "Compare" "show" ## [7] "as.difftime" "date" "intersect" "setdiff" "union" ## ## $`package:forcats` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" ## ## $`package:stringr` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" ## ## $`package:dplyr` ## [1] "%>%" "%>%" "%>%" "intersect" "recode" ## [6] "setdiff" "union" "%>%" "%>%" "all_of" ## [11] "any_of" "as_tibble" "contains" "ends_with" "everything" ## [16] "last_col" "matches" "num_range" "one_of" "starts_with" ## [21] "tibble" "tribble" "%>%" "add_row" "as_data_frame" ## [26] "as_tibble" "data_frame" "glimpse" "lst" "tibble" ## [31] "tribble" "type_sum" "as_label" "enexpr" "enexprs" ## [36] "enquo" "enquos" "ensym" "ensyms" "expr" ## [41] "quo" "quo_name" "quos" "sym" "syms" ## [46] "vars" "%>%" "group_rows" "filter" "lag" ## [51] "intersect" "setdiff" "setequal" "union" ## ## $`package:purrr` ## [1] "%>%" "%>%" "%>%" "%>%" "some" "%>%" "%>%" "%>%" ## ## $`package:tidyr` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" "all_of" ## [7] "any_of" "as_tibble" "contains" "ends_with" "everything" "last_col" ## [13] "matches" "num_range" "one_of" "starts_with" "tibble" "tribble" ## [19] "%>%" "as_tibble" "tibble" "tribble" "%>%" ## ## $`package:tibble` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" ## [6] "as_tibble" "tibble" "tribble" "%>%" "add_row" ## [11] "as_data_frame" "as_tibble" "data_frame" "glimpse" "lst" ## [16] "tibble" "tribble" "type_sum" "%>%" ## ## $`package:ggplot2` ## [1] "as_label" "enexpr" "enexprs" "enquo" "enquos" "ensym" "ensyms" "expr" ## [9] "quo" "quo_name" "quos" "sym" "syms" "vars" "Position" ## ## $`package:kableExtra` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" ## [8] "group_rows" ## ## $`package:stats` ## [1] "filter" "lag" ## ## $`package:graphics` ## [1] "plot" ## ## $`package:methods` ## [1] "Arith" "Compare" "show" "body<-" "kronecker" ## ## $`package:base` ## [1] "intersect" "setdiff" "union" "as.difftime" "body<-" "date" ## [7] "intersect" "kronecker" "plot" "Position" "setdiff" "setequal" ## [13] "union" ``` ``` r # Look specifically for tidy conflicts conflicts(detail = TRUE)$`package:broom` ``` ``` ## [1] "glance" "tidy" ``` Always load broom after packages like tidyverse to avoid conflicts --- ## `tidy()` ``` r # Load in THIS order library(tidyverse) library(palmerpenguins) library(broom) library(parameters) tidy_penguin <- tidy(penguin_lm_multi) print(tidy_penguin) ``` ``` ## # A tibble: 6 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -4282. 498. -8.60 3.33e-16 ## 2 flipper_length_mm 20.2 3.14 6.45 3.98e-10 ## 3 bill_length_mm 39.7 7.23 5.50 7.85e- 8 ## 4 bill_depth_mm 142. 19.2 7.40 1.17e-12 ## 5 speciesChinstrap -497. 82.5 -6.02 4.59e- 9 ## 6 speciesGentoo 965. 142. 
6.81  4.74e-11
```

---

## `glance()`


``` r
# glance() - model fit
glance(penguin_lm_multi)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC deviance
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>
## 1     0.849         0.847  315.      369. 4.22e-132     5 -2385. 4784. 4810. 32397671.
## # ℹ 2 more variables: df.residual <int>, nobs <int>
```

---

## `augment()`


``` r
# augment() - diagnostics
augment_penguin <- augment(penguin_lm_multi)

head(augment_penguin)
```

```
## # A tibble: 6 × 11
##   body_mass_g flipper_length_mm bill_length_mm bill_depth_mm species .fitted .resid   .hat
##         <int>             <int>          <dbl>         <dbl> <fct>     <dbl>  <dbl>  <dbl>
## 1        3750               181           39.1          18.7 Adelie    3583.   167. 0.0170
## 2        3800               186           39.5          17.4 Adelie    3516.   284. 0.0117
## 3        3250               195           40.3          18   Adelie    3815.  -565. 0.0110
## 4        3450               193           36.7          19.3 Adelie    3815.  -365. 0.0154
## 5        3650               190           39.3          20.6 Adelie    4042.  -392. 0.0247
## 6        3625               181           38.9          17.8 Adelie    3447.   178. 0.0144
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
```

---

## From Model to Diagnostics

Combine all three `broom` functions:


``` r
# 1. Get coefficients (tidy)
coef_table <- tidy(penguin_lm_multi, conf.int = TRUE)

# 2. Get model fit (glance)
fit_stats <- glance(penguin_lm_multi)

# 3. Get diagnostics (augment)
diag_data <- augment(penguin_lm_multi) %>%
  mutate(
    outlier = abs(.std.resid) > 2,
    species = penguins_clean$species # Add back categorical variable
  )

# 4. Visualize residuals by species
library(ggplot2)

ggplot(diag_data, aes(x = species, y = .resid, fill = outlier)) +
  geom_boxplot() +
  labs(title = "Residuals by Species",
       subtitle = "Outliers highlighted",
       x = "Species",
       y = "Residuals") +
  theme_minimal()
```

---

class: center, middle

.small[
<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" />
]

---

## Outlook: Other modeling options

As we said at the beginning of this session, we only looked at some of the basic confirmatory analysis methods. As you can imagine, however, `R` offers plenty of options for more complex statistical models as well. A few examples:

- Structural equation models with [`lavaan`](https://lavaan.ugent.be/)
- Multilevel/mixed-effects models with [`lme4`](https://cran.r-project.org/web/packages/lme4/index.html)
- Bayesian regression models using [`brms`](https://paul-buerkner.github.io/brms/)
- Mediation analysis using [`mediation`](https://cran.r-project.org/web/packages/mediation/vignettes/mediation.pdf)
  - response, mediator, and sensitivity analyses (`medsens()`)

---

## Extracurricular activities

As it is such a simple and useful package, we recommend exploring the `broom` package a bit further. Alex Hayes, one of the authors of the package, [gave an interesting talk about it at the *RStudio* Conference in 2019](https://rstudio.com/resources/rstudioconf-2019/solving-the-model-representation-problem-with-broom/).

---

class: center, middle

# [Exercise](https://rawcdn.githack.com/nika-akin/data-analysis-with-R/983625a7d096079e0598f2d60ca1281c757d4659/exercises/Exercise_4_1_3_Regression_Reporting.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://rawcdn.githack.com/nika-akin/data-analysis-with-R/983625a7d096079e0598f2d60ca1281c757d4659/solutions/Exercise_4_1_3_Regression_Reporting.html)