class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Confirmatory Data Analysis
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-11-21
]

---

layout: true

---

## Content of this session

.pull-left[
**What we will cover**

- Using (more or less) simple regression models
  - OLS, GLM, and the like
- How to re-use the results of these models
- How to feed these results into tables
]

.pull-right[
**What we won't cover**

- Theory (and history) of hypothesis testing
- Crazy complex models with elaborated estimators
  - e.g., no multilevel models
  - also no clustered standard errors
- Bayesian statistics
]

---

## Last sessions

- We learned how to import/export data
- We learned how to wrangle data: create subsets and new variables, recode, merge, and pivot

---

## Recap Distribution


``` r
# Plot first variable
hist(penguins_clean$bill_length_mm,
     col = rgb(0, 0, 1, 0.25),
     breaks = 10,
     xlab = 'Bill length [mm]',
     main = 'Distribution')
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Descriptive Statistics

### Select continuous variables for summary


``` r
library(modelsummary)

df.s <- penguins_clean %>%
  select(bill_length_mm, body_mass_g, flipper_length_mm)

datasummary_skim(df.s, output = 'flextable')
```

|                   | Unique | Missing Pct. |   Mean |    SD |    Min | Median |    Max |
|-------------------|-------:|-------------:|-------:|------:|-------:|-------:|-------:|
| bill_length_mm    |    163 |            0 |   44.0 |   5.5 |   32.1 |   44.5 |   59.6 |
| body_mass_g       |     93 |            0 | 4207.1 | 805.2 | 2700.0 | 4050.0 | 6300.0 |
| flipper_length_mm |     54 |            0 |  201.0 |  14.0 |  172.0 |  197.0 |  231.0 |
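Since the next slides look at bivariate relationships, it can also help to glance at the pairwise correlations first. A minimal sketch; the `datasummary_correlation()` call assumes your installed `modelsummary` version provides it:

``` r
# Quick look at pairwise associations before modeling
cor(df.s, use = "pairwise.complete.obs")

# Or as a formatted table with the same package
datasummary_correlation(df.s)
```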
---

## Visualize Bivariate Relationships


``` r
library(ggplot2)

# Scatter plot with trend line ----
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  labs(title = "Body Mass by Flipper Length",
       subtitle = "Clear positive linear relationship",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)")
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

Tip: `facet_wrap()` can show separate relationships by group!
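For example, a minimal sketch of the faceted version of the plot above (one panel per species):

``` r
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ species) +
  labs(x = "Flipper Length (mm)", y = "Body Mass (g)")
```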
---

## Simple Linear Regression: Our First Model


``` r
# Estimate the model
penguin_lm_simple <- lm(
  body_mass_g ~ flipper_length_mm,
  data = penguins_clean
)

# View coefficients
penguin_lm_simple
```

```
## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins_clean)
## 
## Coefficients:
##       (Intercept)  flipper_length_mm  
##          -5872.09              50.15
```

---

## Interpreting Model Output: What Does It Mean?


``` r
summary(penguin_lm_simple)
```

```
## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1057.33  -259.79   -12.24   242.97  1293.89 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5872.09     310.29  -18.93   <2e-16 ***
## flipper_length_mm    50.15       1.54   32.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 393.3 on 331 degrees of freedom
## Multiple R-squared:  0.7621, Adjusted R-squared:  0.7614 
## F-statistic: 1060 on 1 and 331 DF,  p-value: < 2.2e-16
```

---

## Interpreting Coefficients with `parameters`


``` r
library(parameters)

model_parameters(penguin_lm_simple)
```

```
## Parameter         | Coefficient |     SE |               95% CI | t(331) |      p
## ---------------------------------------------------------------------------------
## (Intercept)       |    -5872.09 | 310.29 | [-6482.47, -5261.71] | -18.92 | < .001
## flipper length mm |       50.15 |   1.54 | [   47.12,    53.18] |  32.56 | < .001
```

---

## Formula Syntax in `R`


``` r
y ~ x1 + x2 + x3
```

*Note*: Formulas represent a specific object class in `R`.


``` r
class(y ~ x1 + x2 + x3)
```

```
## [1] "formula"
```

---

## Denoting the left-hand side with `~`

In `R`, stating what your dependent variable is works very similarly to common mathematical notation:

`$$y \sim N(\theta, \epsilon)$$`

It states that a specific relationship is actually _estimated_, but, fortunately, we don't have to specify the errors here.


``` r
y ~ x1 + x2 + x3
```

Yet, sometimes it may be a good idea to at least explicitly specify the intercept, as here:


``` r
y ~ 1 + x1 + x2 + x3
```

---

## Intercept

We can also estimate models without an intercept:


``` r
y ~ x1 + x2 + x3 - 1
```

Or intercept-only models as well:


``` r
y ~ 1
```

---

## Adding predictors with `+`

You can add as many predictors/covariates as you want with the simple <span style="color:orange">`+`</span> operator. See:


``` r
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 +
  x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 +
  x21 + x22 + x23 + x24
```

There's also a shortcut for using all (remaining) variables in a data set as predictors:


``` r
y ~ .
```

---

## Multiple Regression: Adding More Predictors


``` r
penguin_lm_multi <- lm(
  body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm + species,
  data = penguins_clean
)

# Compare with simple model
anova(penguin_lm_simple, penguin_lm_multi)
```

```
## Analysis of Variance Table
## 
## Model 1: body_mass_g ~ flipper_length_mm
## Model 2: body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm + 
##     species
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1    331 51211963                                  
## 2    327 32397671  4  18814292 47.475 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Why multiple predictors? The model comparison shows that adding the bill dimensions and species significantly improves the fit over flipper length alone.
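One quick way to see the gain in explained variance is to compare the adjusted R² of both models (a minimal base-`R` sketch):

``` r
# Gain in explained variance from the additional predictors
summary(penguin_lm_simple)$adj.r.squared
summary(penguin_lm_multi)$adj.r.squared
```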
---

## Interaction effects with `*` and `:`

We can also easily add interaction effects to a model. As this is the same as multiplying predictor variables, we also use the <span style="color:orange">**`*`**</span> sign for that.


``` r
y ~ x1 * x2
```

The code above creates a model formula that includes both the <span style="background-color: yellow">main effects</span> of `x1` and `x2`, and their interaction denoted by <span style="background-color: yellow">`x1:x2`</span>. We can even be more explicit and write that into the formula directly:


``` r
y ~ x1 + x2 + x1:x2
```

---

## Interaction Effects with Penguins

Does the effect of flipper length on body mass differ by species?


``` r
penguin_lm_interact <- lm(
  body_mass_g ~ flipper_length_mm * species,
  data = penguins_clean
)
```

---

## Interaction Effects


``` r
model_parameters(penguin_lm_interact)
```

```
## Parameter                               | Coefficient |      SE |               95% CI
## --------------------------------------------------------------------------------------
## (Intercept)                             |    -2508.09 |  892.41 | [-4263.68,  -752.49]
## flipper length mm                       |       32.69 |    4.69 | [   23.46,    41.92]
## species [Chinstrap]                     |     -529.11 | 1525.11 | [-3529.37,  2471.16]
## species [Gentoo]                        |    -4166.12 | 1431.58 | [-6982.39, -1349.84]
## flipper length mm × species [Chinstrap] |        1.88 |    7.86 | [  -13.59,    17.36]
## flipper length mm × species [Gentoo]    |       21.48 |    6.97 | [    7.77,    35.18]
## 
## Parameter                               | t(327) |      p
## ---------------------------------------------------------
## (Intercept)                             |  -2.81 | 0.005 
## flipper length mm                       |   6.97 | < .001
## species [Chinstrap]                     |  -0.35 | 0.729 
## species [Gentoo]                        |  -2.91 | 0.004 
## flipper length mm × species [Chinstrap] |   0.24 | 0.811 
## flipper length mm × species [Gentoo]    |   3.08 | 0.002
```

**Interpretation**: The flipper_length_mm × speciesGentoo interaction of about 21.5 means the flipper-length slope is roughly 21.5 g/mm steeper for Gentoos than for Adelies; the Chinstrap interaction (about 1.9 g/mm) is not statistically significant. Visualize interactions to aid interpretation!

---

## **Transforming** variables within a formula

For simple transformations this can be done as follows:


``` r
y ~ log(x)    # natural logarithm of x (use log10(x) for base 10)
y ~ scale(x)  # z-transformation of x
```

We could also change the data type of variables within a formula, e.g., by converting a numeric variable to a factor using `as.factor(x)`.

---

## Transforming variables within a formula

If you cannot use a specific function for your transformation, you have to wrap the operation in the <span style="background-color: yellow">`I()`</span> function.


``` r
y ~ x + I(x^2) # add a quadratic term of x
```

*Note*: Of course, there are also functions in `R` for transforming variables (e.g., standardizing or centering) before we use them in a formula. Besides the `base R` function `scale()`, the [`datawizard` package](https://easystats.github.io/datawizard/), e.g., provides a few functions for that.
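For example, a minimal sketch assuming the `datawizard` package is installed (`standardize()` and `center()` are its helpers for z-transforming and mean-centering; `penguins_trans` is just an illustrative name):

``` r
library(datawizard)

# Pre-compute transformed predictors instead of doing it inside the formula
penguins_trans <- penguins_clean %>%
  mutate(
    flipper_z     = standardize(flipper_length_mm),  # z-transformation
    bill_centered = center(bill_length_mm)           # mean-centering
  )
```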
---

## Where to use formulas?

The previous descriptions mainly refer to formulas used in regression models in `R`. However, formulas are also used in other hypothesis-testing methods that distinguish between dependent and independent variables, such as t-tests or ANOVA.

We will try out some of those in the following...

---

## Testing group differences in the distribution

### T-Test: Two Groups

A common method for analyzing group differences is the t-test. You can use the two-sample <span style="background-color: yellow">`t.test()`</span> function:


``` r
# two-sample t-test
# Research question: Do male and female penguins differ in body mass?
t_test_sex <- t.test(
  body_mass_g ~ sex,          # Outcome ~ Grouping variable
  data = penguins_clean,
  alternative = "two.sided"   # Two-tailed test
)
```

---

## T-Test Results


``` r
print(t_test_sex)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  body_mass_g by sex
## t = -8.5545, df = 323.9, p-value = 4.794e-16
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -840.5783 -526.2453
## sample estimates:
## mean in group female   mean in group male 
##             3862.273             4545.685
```

.small[
*Note*: By default, `R` uses [Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test) (instead of [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)), which does not assume homogeneity of variance across groups.
]

---

## Visualize T-Test

.pull-left[

``` r
t_test <- ggplot(penguins_clean,
                 aes(x = sex, y = body_mass_g, fill = sex)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  geom_jitter(width = 0.2, alpha = 0.5, show.legend = FALSE) +
  labs(
    title = "Body Mass by Sex: Significant Difference",
    subtitle = "Males are ~680g heavier than females (p < 0.001)",
    x = "Sex",
    y = "Body Mass (g)"
  ) +
  scale_fill_manual(values = c(female = "darkorange", male = "steelblue"))
```
]

.pull-right[

``` r
t_test
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

---

## Testing multiple groups: ANOVA

Research question: Do species differ in flipper length?


``` r
peng_clean <- penguins %>%
  filter(!is.na(flipper_length_mm))

# 1-way ANOVA
fit <- aov(flipper_length_mm ~ species, data = peng_clean)

summary(fit)
```

```
##              Df Sum Sq Mean Sq F value Pr(>F)    
## species       2  52473   26237   594.8 <2e-16 ***
## Residuals   339  14953      44                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

## ANOVA visualization

.pull-left[

``` r
anova_plot <- ggplot(peng_clean,
                     aes(x = species, y = flipper_length_mm, fill = species)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Flipper Length by Species",
       y = "Flipper Length (mm)",
       x = "Species") +
  theme_minimal() +
  theme(legend.position = "none")
```
]

.pull-right[

``` r
anova_plot
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]

---

## Two-way ANOVA

Species + Sex effects on body mass


``` r
peng_clean <- penguins %>%
  filter(!is.na(body_mass_g), !is.na(sex), !is.na(species))

fit2 <- aov(body_mass_g ~ species * sex, data = peng_clean)

summary(fit2)
```

```
##              Df    Sum Sq  Mean Sq F value   Pr(>F)    
## species       2 145190219 72595110 758.358  < 2e-16 ***
## sex           1  37090262 37090262 387.460  < 2e-16 ***
## species:sex   2   1676557   838278   8.757 0.000197 ***
## Residuals   327  31302628    95727                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
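The F-tests only tell us the effects are significant, not how large they are. A minimal sketch for effect sizes, assuming the `effectsize` package (part of `easystats`) is installed:

``` r
library(effectsize)

# (Partial) eta squared for the two-way ANOVA terms
eta_squared(fit2, partial = TRUE)
```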
---

## Post-hoc test (Tukey HSD)

To find out which groups differ from each other, we can compute *Tukey Honest Significant Differences*.


``` r
TukeyHSD(fit)
```

```
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = flipper_length_mm ~ species, data = peng_clean)
## 
## $species
##                       diff       lwr       upr p adj
## Chinstrap-Adelie  5.869887  3.586583  8.153191     0
## Gentoo-Adelie    27.233349 25.334376 29.132323     0
## Gentoo-Chinstrap 21.363462 19.000841 23.726084     0
```

---

## Assumptions Check: Residual Diagnostics

.pull-left[

``` r
library(broom)

diag_data <- augment(penguin_lm_multi)

p_resid <- ggplot(diag_data, aes(.fitted, .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Residuals vs Fitted",
    x = "Fitted values",
    y = "Residuals"
  ) +
  theme_minimal()

p_qq <- ggplot(diag_data, aes(sample = .std.resid)) +
  stat_qq(alpha = 0.6) +
  stat_qq_line() +
  labs(
    title = "Normal Q-Q Plot",
    x = "Theoretical Quantiles",
    y = "Standardized Residuals"
  ) +
  theme_minimal()
```
]

.pull-right[
<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

---

## Multicollinearity: VIF Check

When predictors correlate highly, coefficient estimates become unstable. Check the Variance Inflation Factor (VIF) with `car::vif()`:


``` r
library(car)

vif(penguin_lm_multi)
```

```
##                        GVIF Df GVIF^(1/(2*Df))
## flipper_length_mm  6.469817  1        2.543583
## bill_length_mm     5.234596  1        2.287924
## bill_depth_mm      4.772094  1        2.184512
## species           34.097607  2        2.416468
```

**Rule of thumb**: VIF > 5 indicates problematic multicollinearity. Here, the predictors are somewhat correlated (species overlaps strongly with the morphological measurements), but the adjusted GVIF^(1/(2*Df)) values of roughly 2.2–2.5 are still acceptable.

**Solution**: Remove one predictor, combine them, or use regularization.

---

## Non-parametric Alternative

What if our data are <span style="background-color: orange">*not normally distributed*</span>?

Check **normality** with the [Shapiro-Wilk](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) test.

.pull-left[

``` r
# Shapiro-Wilk test on residuals
shapiro.test(residuals(penguin_lm_multi))
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  residuals(penguin_lm_multi)
## W = 0.99212, p-value = 0.07464
```
]

.pull-right[

``` r
# Histogram of residuals
hist(residuals(penguin_lm_multi),
     main = "Distribution of Residuals",
     xlab = "Residuals")
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />
]

.small[
**Note**: With large samples, even minor deviations become "significant". Focus on visual diagnostics and practical significance.
]
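If you prefer a single visual overview, a minimal sketch with the `performance` package (used again later in this session; assuming it is installed):

``` r
library(performance)

# Combined panel: linearity, homogeneity of variance, normality, influential cases
check_model(penguin_lm_multi)
```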
---

## Non-parametric alternative to the T-test

If the data are not normally distributed, use the [Wilcoxon/Mann-Whitney](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) test.

Use when: the data are skewed, non-normal, or the sample size is small.

.pull-left[

``` r
# Do 2 groups differ?
# E.g.: flipper length between 2 species
# Filter only 2 species for a valid two-group test
peng_two <- penguins %>%
  filter(species %in% c("Adelie", "Gentoo"))

# Wilcoxon rank-sum test
wilcox.test(flipper_length_mm ~ species, data = peng_two)
```

```
## 
## 	Wilcoxon rank sum test with continuity correction
## 
## data:  flipper_length_mm by species
## W = 26, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
```
]

.pull-right[

``` r
# Pairwise comparison between all species
pairwise.wilcox.test(
  x = penguins$flipper_length_mm,
  g = penguins$species,
  p.adjust.method = "BH"
)
```

```
## 
## 	Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  penguins$flipper_length_mm and penguins$species 
## 
##           Adelie Chinstrap
## Chinstrap 3e-08  -        
## Gentoo    <2e-16 <2e-16   
## 
## P value adjustment method: BH
```
]

---

## Checking regression assumptions

As in the case of ANOVA, check regression assumptions. There are plenty of good tutorials online (such as [this blog post by Ian Ruginski](https://www.ianruginski.com/post/regressionassumptions/) or [this chapter](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/testing-regression-assumptions.html) in the online book [*R for Survey Analysis*](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/)).

---

## Making Predictions

.small[Models are useful for prediction! Let's predict body mass for a new penguin:]


``` r
# New observation
new_penguin <- data.frame(
  flipper_length_mm = 200,
  bill_length_mm = 45,
  bill_depth_mm = 15,
  species = "Gentoo"
)

# Point prediction
predict(penguin_lm_multi, newdata = new_penguin)
```

```
##        1 
## 4642.305
```

``` r
# With confidence interval (uncertainty in estimate)
predict(penguin_lm_multi, newdata = new_penguin, interval = "confidence")
```

```
##        fit      lwr      upr
## 1 4642.305 4526.208 4758.403
```

---

## Standardized Effect Sizes (Beta Coefficients)

To compare effect magnitudes across variables with different scales, use standardized coefficients:


``` r
library(lm.beta)

# Standardize coefficients
penguin_lm_beta <- lm.beta(penguin_lm_multi)

# Extract standardized coefficients
coef(penguin_lm_beta)["flipper_length_mm"]
```

```
## flipper_length_mm 
##          0.352066
```

``` r
coef(penguin_lm_beta)["bill_length_mm"]
```

```
## bill_length_mm 
##      0.2697496
```

.small[**Note**: A 1 SD increase in flipper length → a 0.35 SD increase in body mass, the larger of the two standardized effects shown here.]

---

## Categorical Predictors

`R` automatically handles categorical variables (factors) by creating dummy variables.


``` r
contrasts(penguins_clean$species)
```

```
##           Chinstrap Gentoo
## Adelie            0      0
## Chinstrap         1      0
## Gentoo            0      1
```

``` r
# Model matrix
head(model.matrix(penguin_lm_multi))
```

```
##   (Intercept) flipper_length_mm bill_length_mm bill_depth_mm speciesChinstrap speciesGentoo
## 1           1               181           39.1          18.7                0             0
## 2           1               186           39.5          17.4                0             0
## 3           1               195           40.3          18.0                0             0
## 4           1               193           36.7          19.3                0             0
## 5           1               190           39.3          20.6                0             0
## 6           1               181           38.9          17.8                0             0
```

.small[
- The first factor level (Adelie) is the reference level
- The other levels each get their own dummy column
]
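If you want a different baseline category, you can re-set the reference level before fitting. A minimal sketch with base `R`'s `relevel()` (`penguins_relevel` is just an illustrative name):

``` r
# Make Gentoo the reference category instead of Adelie
penguins_relevel <- penguins_clean %>%
  mutate(species = relevel(species, ref = "Gentoo"))

lm(body_mass_g ~ flipper_length_mm + species, data = penguins_relevel)
```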
---

## Generalized linear models

There are a multitude of other estimators/link functions, so-called **family objects**, e.g., binomial logistic or Poisson regression, available through the `glm()` function. See `?family` for an overview.

- <span style="background-color: orange">logistic</span> regression (`glm(y ~ x, family = 'binomial')`) for binary data,
- <span style="background-color: orange">Poisson</span> regression (`glm(y ~ x, family = 'poisson')`) for count data,
- <span style="background-color: orange">quasi-Poisson</span> regression (`glm(y ~ x, family = 'quasipoisson')`) for overdispersed count data; for a true negative binomial model, see `MASS::glm.nb()`

---

## GLM with Penguins: Logistic Regression Example

.small[Let's predict species (binary: Adelie vs. not) from measurements.]


``` r
# Binary outcome
penguins_binary <- penguins_clean %>%
  mutate(is_adelie = ifelse(species == "Adelie", 1, 0))

# Logistic regression
penguin_glm <- glm(
  is_adelie ~ flipper_length_mm + bill_length_mm + bill_depth_mm,
  family = binomial(),
  data = penguins_binary
)

# Odds ratios
exp(coef(penguin_glm))
```

```
##       (Intercept) flipper_length_mm    bill_length_mm     bill_depth_mm 
##      4.441432e+04      1.164228e+00      5.852423e-02      1.017151e+02
```

---

## Assessing model quality

The [`performance` package](https://easystats.github.io/performance/) that is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) offers some functions for assessing model quality, including different types of R². A commonly used metric for logistic regression models, e.g., is Nagelkerke's R².

---

## Model Comparison with AIC

Compare nested models using the Akaike Information Criterion – it balances fit and complexity.


``` r
# Simple model
model1 <- lm(body_mass_g ~ flipper_length_mm, data = penguins_clean)

# Complex model
model2 <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm + species,
             data = penguins_clean)

# Compare
AIC(model1, model2)
```

```
##        df      AIC
## model1  3 4928.146
## model2  6 4833.203
```

.small[Lower AIC = better model. The complex model is preferred here.]

---

## Stepwise Selection

Start with a simple model and let `step()` add (or drop) predictors within the given scope, based on AIC.


``` r
# Stepwise AIC-based selection
step <- step(model1,
             scope = ~ flipper_length_mm + bill_length_mm + species,
             trace = FALSE)

summary(step)
```

.right[↪️]

---

class: middle

.small[

```
## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + species + bill_length_mm, 
##     data = penguins_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -811.52 -227.81  -33.74  216.73 1053.26 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3864.073    533.592  -7.242 3.18e-12 ***
## flipper_length_mm    27.544      3.209   8.583 3.77e-16 ***
## speciesChinstrap   -732.417     82.063  -8.925  < 2e-16 ***
## speciesGentoo       113.254     89.206   1.270    0.205    
## bill_length_mm       60.117      7.207   8.341 2.06e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 339.6 on 328 degrees of freedom
## Multiple R-squared:  0.8243, Adjusted R-squared:  0.8222 
## F-statistic: 384.7 on 4 and 328 DF,  p-value: < 2.2e-16
```
]

---

## Comparing model performance

The `performance` package we have previously used for calculating Nagelkerke's R² for our logistic regression model also provides a handy function for comparing model performance.
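A minimal sketch of both ideas, assuming the `performance` package is installed (`r2_nagelkerke()` and `compare_performance()` are its helpers for pseudo-R² and side-by-side fit metrics):

``` r
library(performance)

# Nagelkerke's R² for the logistic regression model
r2_nagelkerke(penguin_glm)

# Fit indices (R², AIC, BIC, RMSE, ...) for several models side by side
compare_performance(model1, model2)
```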
---

## Other regression variants

While `glm()` already provides quite a few estimators and link functions, depending on the distribution of your dependent variable, you may need more specialized regression models.

A few interesting options include the [`MASS` package](https://cran.r-project.org/web/packages/MASS/index.html) for negative binomial regression, the [`pscl` package](https://cran.r-project.org/web/packages/pscl/index.html) for zero-inflated (negative) binomial regression and hurdle models, or the [`mcp` package](https://lindeloev.github.io/mcp/) for regression with multiple change points.

---

## Handling regression results

While it is (still) common practice to run regressions, search for 'significant' `p`-values, and paste the results into a table without interpreting them substantially, this may not be the best thing to do.

---

## Reporting Results

As we have seen in the session on *Exploratory Data Analysis*, there are different options for creating tables as output in `R`: [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html) and [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/)

---

## Example: Regression tables with the `stargazer` package


``` r
library(stargazer)

stargazer(
  penguin_lm_multi,
  type = "text",        # For console output; use "html" or "latex" for other formats
  title = "Regression Results",
  dep.var.labels = c("Body Mass (g)"),
  covariate.labels = c(
    "Flipper Length (mm)",
    "Bill Length (mm)",
    "Bill Depth (mm)"
  ),
  out.header = FALSE,   # Exclude header metadata if outputting to file
  digits = 3            # Number of decimal places
)
```

.right[↪️]

---

class: center, middle

.small[

```
## 
## Regression Results
## ===================================================
##                             Dependent variable:    
##                         ---------------------------
##                               Body Mass (g)        
## ---------------------------------------------------
## Flipper Length (mm)               20.226***        
##                                   (3.135)          
##                                                    
## Bill Length (mm)                  39.718***        
##                                   (7.227)          
##                                                    
## Bill Depth (mm)                  141.771***        
##                                  (19.163)          
##                                                    
## speciesChinstrap                -496.758***        
##                                  (82.469)          
##                                                    
## speciesGentoo                    965.198***        
##                                  (141.770)         
##                                                    
## Constant                       -4,282.080***       
##                                  (497.832)         
##                                                    
## ---------------------------------------------------
## Observations                        333            
## R2                                 0.849           
## Adjusted R2                        0.847           
## Residual Std. Error          314.763 (df = 327)    
## F Statistic               369.137*** (df = 5; 327) 
## ===================================================
## Note:                   *p<0.1; **p<0.05; ***p<0.01
```
]

---

## Coefficient Plots (Odds Ratios)

.small[Create a coefficient plot]

.pull-left[

``` r
library(sjPlot)

# Plot odds ratios with 95% CI
plot_model(
  penguin_glm,
  show.values = TRUE,
  value.offset = .3,
  transform = "exp",   # transforms log-odds → odds ratios
  title = "Odds Ratios for Predicting Adelie Species"
)
```
]

.pull-right[
<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" />
]

---

## Parameters coefficient plot


``` r
library(parameters)

model_parameters(penguin_glm) %>%
  plot() +
  theme_minimal() +
  theme(legend.position = "none")
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" />

---

## Raw Coefficients (Log-Odds)


``` r
plot_model(
  penguin_glm,
  show.values = TRUE,
  value.offset = .3,
  transform = NULL,   # keep log-odds
  title = "Log-Odds Coefficients (Logistic Regression)"
)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" />
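If you also want these estimates as numbers rather than a plot, a minimal sketch with `broom` (introduced in more detail below):

``` r
library(broom)

# Odds ratios with 95% confidence intervals as a tibble
tidy(penguin_glm, exponentiate = TRUE, conf.int = TRUE)
```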
---

## sjPlot: Marginal Effects Plot


``` r
plot_model(
  penguin_glm,
  type = "pred",
  terms = "flipper_length_mm",
  title = "Predicted Probability of Adelie vs. Flipper Length"
)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" />

---

## You can add multiple predictors


``` r
plot_model(
  penguin_glm,
  type = "pred",
  terms = c("flipper_length_mm", "bill_length_mm"),
  title = "Predicted Probabilities by Flipper and Bill Length"
)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" />

---

## Prediction plots vs regression coefficients

- Regression coefficients are sometimes a bit hard to read
  - they are based on the scale of the independent variable when unstandardized
  - it is ambiguous what they actually mean substantively
  - for example, is a coefficient of .2 or .05 (practically) meaningful or not?
- Predictions are a solution
  - they basically draw the regression line into the scatter plot of the Y and X variables
  - we can read off which value of Y is expected at which level of X

---

## Predictions (sjPlot)


``` r
plot_model(
  penguin_glm,
  type = "pred",
  terms = c("bill_depth_mm", "bill_length_mm"),
  title = "Predicted Probability of Adelie by Bill Depth and Length"
)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" />

---

## Customized ggplot

.pull-left[

``` r
library(broom)
library(ggplot2)
library(dplyr)

coef_df <- tidy(penguin_glm, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(term = reorder(term, estimate))

coef_p <- ggplot(coef_df, aes(x = term, y = estimate)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
                width = .15, linewidth = .8) +
  labs(title = "Logistic Regression Coefficient Plot",
       x = "Predictor",
       y = "Estimate (log-odds)") +
  coord_flip() +
  theme_minimal()
```
]

.pull-right[
<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" />
]

---

## Summary Table with sjPlot


``` r
tab_model(
  penguin_glm,
  show.ci = TRUE,
  show.se = TRUE,
  show.aic = TRUE,
  transform = "exp",
  dv.labels = "Predicting Adelie Species"
)
```

**Predicting Adelie Species**

| Predictors        | Odds Ratios | std. Error |         CI | p         |
|-------------------|------------:|-----------:|-----------:|-----------|
| (Intercept)       |    44414.32 |  704167.07 | 0.00 – Inf | 0.500     |
| flipper length mm |        1.16 |       0.11 | 0.00 – Inf | 0.106     |
| bill length mm    |        0.06 |       0.06 | 0.00 – Inf | **0.003** |
| bill depth mm     |      101.72 |     178.50 | 0.00 – Inf | **0.008** |
| Observations      |         333 |            |            |           |
| R² Tjur           |       0.971 |            |            |           |
| AIC               |      23.615 |            |            |           |
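As an alternative for the same kind of table, a sketch with the `modelsummary` package used earlier (the `exponentiate` and `statistic` arguments are assumptions about your installed version):

``` r
library(modelsummary)

# Odds-ratio table with confidence intervals
modelsummary(penguin_glm, exponentiate = TRUE, statistic = "conf.int")
```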
---

## Tidy models with `broom`

The [`broom`](https://broom.tidymodels.org/) package provides only three, but very powerful, main functions:

- `tidy()`: creates a `tibble` from your model (coefficients)
- `glance()`: information about your model as a `tibble` ('model fit')
- `augment()`: adds observation-level
information, e.g., individual fitted values to your data *Note*: With `broom` you can also create a standardized output of other types of models, such as Latent Class Analysis (LCA) (with the package [`poLCA`](https://cran.r-project.org/web/packages/poLCA/index.html)). --- ## Conflicts ``` r # See which packages conflict conflicts(detail = TRUE) ``` ``` ## $`package:broom` ## [1] "glance" "tidy" ## ## $`package:car` ## [1] "recode" "some" ## ## $`package:parameters` ## [1] "supported_models" ## ## $`package:modelsummary` ## [1] "glance" "supported_models" "tidy" ## ## $`package:webshot2` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" ## ## $`package:lubridate` ## [1] "intersect" "setdiff" "union" "Arith" "Compare" "show" ## [7] "as.difftime" "date" "intersect" "setdiff" "union" ## ## $`package:forcats` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" ## ## $`package:stringr` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" ## ## $`package:dplyr` ## [1] "%>%" "%>%" "%>%" "intersect" "recode" ## [6] "setdiff" "union" "%>%" "%>%" "all_of" ## [11] "any_of" "as_tibble" "contains" "ends_with" "everything" ## [16] "last_col" "matches" "num_range" "one_of" "starts_with" ## [21] "tibble" "tribble" "%>%" "add_row" "as_data_frame" ## [26] "as_tibble" "data_frame" "glimpse" "lst" "tibble" ## [31] "tribble" "type_sum" "as_label" "enexpr" "enexprs" ## [36] "enquo" "enquos" "ensym" "ensyms" "expr" ## [41] "quo" "quo_name" "quos" "sym" "syms" ## [46] "vars" "%>%" "group_rows" "filter" "lag" ## [51] "intersect" "setdiff" "setequal" "union" ## ## $`package:purrr` ## [1] "%>%" "%>%" "%>%" "%>%" "some" "%>%" "%>%" "%>%" ## ## $`package:tidyr` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" "all_of" ## [7] "any_of" "as_tibble" "contains" "ends_with" "everything" "last_col" ## [13] "matches" "num_range" "one_of" "starts_with" "tibble" "tribble" ## [19] "%>%" "as_tibble" "tibble" "tribble" "%>%" ## ## $`package:tibble` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" ## [6] "as_tibble" "tibble" "tribble" "%>%" "add_row" ## [11] "as_data_frame" "as_tibble" "data_frame" "glimpse" "lst" ## [16] "tibble" "tribble" "type_sum" "%>%" ## ## $`package:ggplot2` ## [1] "as_label" "enexpr" "enexprs" "enquo" "enquos" "ensym" "ensyms" "expr" ## [9] "quo" "quo_name" "quos" "sym" "syms" "vars" "Position" ## ## $`package:kableExtra` ## [1] "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" "%>%" ## [8] "group_rows" ## ## $`package:stats` ## [1] "filter" "lag" ## ## $`package:graphics` ## [1] "plot" ## ## $`package:methods` ## [1] "Arith" "Compare" "show" "body<-" "kronecker" ## ## $`package:base` ## [1] "intersect" "setdiff" "union" "as.difftime" "body<-" "date" ## [7] "intersect" "kronecker" "plot" "Position" "setdiff" "setequal" ## [13] "union" ``` ``` r # Look specifically for tidy conflicts conflicts(detail = TRUE)$`package:broom` ``` ``` ## [1] "glance" "tidy" ``` Always load broom after packages like tidyverse to avoid conflicts --- ## `tidy()` ``` r # Load in THIS order library(tidyverse) library(palmerpenguins) library(broom) library(parameters) tidy_penguin <- tidy(penguin_lm_multi) print(tidy_penguin) ``` ``` ## # A tibble: 6 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -4282. 498. -8.60 3.33e-16 ## 2 flipper_length_mm 20.2 3.14 6.45 3.98e-10 ## 3 bill_length_mm 39.7 7.23 5.50 7.85e- 8 ## 4 bill_depth_mm 142. 19.2 7.40 1.17e-12 ## 5 speciesChinstrap -497. 82.5 -6.02 4.59e- 9 ## 6 speciesGentoo 965. 142. 
6.81  4.74e-11
```

---

## `glance()`


``` r
# glance() - model fit
glance(penguin_lm_multi)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC deviance
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>
## 1     0.849         0.847  315.      369. 4.22e-132     5 -2385. 4784. 4810. 32397671.
## # ℹ 2 more variables: df.residual <int>, nobs <int>
```

---

## `augment()`


``` r
# augment() - diagnostics
augment_penguin <- augment(penguin_lm_multi)

head(augment_penguin)
```

```
## # A tibble: 6 × 11
##   body_mass_g flipper_length_mm bill_length_mm bill_depth_mm species .fitted .resid   .hat
##         <int>             <int>          <dbl>         <dbl> <fct>     <dbl>  <dbl>  <dbl>
## 1        3750               181           39.1          18.7 Adelie    3583.   167. 0.0170
## 2        3800               186           39.5          17.4 Adelie    3516.   284. 0.0117
## 3        3250               195           40.3          18   Adelie    3815.  -565. 0.0110
## 4        3450               193           36.7          19.3 Adelie    3815.  -365. 0.0154
## 5        3650               190           39.3          20.6 Adelie    4042.  -392. 0.0247
## 6        3625               181           38.9          17.8 Adelie    3447.   178. 0.0144
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
```

---

## From Model to Diagnostics

Combine all three `broom` functions:


``` r
# 1. Get coefficients (tidy)
coef_table <- tidy(penguin_lm_multi, conf.int = TRUE)

# 2. Get model fit (glance)
fit_stats <- glance(penguin_lm_multi)

# 3. Get diagnostics (augment)
diag_data <- augment(penguin_lm_multi) %>%
  mutate(
    outlier = abs(.std.resid) > 2,
    species = penguins_clean$species # Add back categorical variable
  )

# 4. Visualize residuals by species
library(ggplot2)

ggplot(diag_data, aes(x = species, y = .resid, fill = outlier)) +
  geom_boxplot() +
  labs(title = "Residuals by Species",
       subtitle = "Outliers highlighted",
       x = "Species",
       y = "Residuals") +
  theme_minimal()
```

---

class: center, middle

.small[
<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" />
]

---

## Outlook: Other modeling options

As we said at the beginning of this session, we only looked at some of the basic confirmatory analysis methods. As you can imagine, however, `R` offers plenty of options for more complex statistical models as well. A few examples:

- Structural equation models with [`lavaan`](https://lavaan.ugent.be/)
- Multilevel/mixed-effects models with [`lme4`](https://cran.r-project.org/web/packages/lme4/index.html)
- Bayesian regression models using [`brms`](https://paul-buerkner.github.io/brms/)
- Mediation analysis using [`mediation`](https://cran.r-project.org/web/packages/mediation/vignettes/mediation.pdf)
  - response, mediator, and sensitivity analyses (`medsens()`)

---

## Extracurricular activities

As it is such a simple and useful package, we recommend exploring the `broom` package a bit further. Alex Hayes, one of the authors of the package, [gave an interesting talk about it at the *RStudio* Conference in 2019](https://rstudio.com/resources/rstudioconf-2019/solving-the-model-representation-problem-with-broom/).

---

class: center, middle

# [Exercise](https://rawcdn.githack.com/nika-akin/data-analysis-with-R/983625a7d096079e0598f2d60ca1281c757d4659/exercises/Exercise_4_1_3_Regression_Reporting.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://rawcdn.githack.com/nika-akin/data-analysis-with-R/983625a7d096079e0598f2d60ca1281c757d4659/solutions/Exercise_4_1_3_Regression_Reporting.html)