class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Confirmatory Data Analysis
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-06-23
]

---
layout: true

---

## Content of this session

.pull-left[
**What we will cover**

- Using (more or less) simple regression models
  - OLS, GLM, and the like
- How to re-use the results of these models
- How to feed these results into tables
]

.pull-right[
**What we won't cover**

- Theory (and history) of hypothesis testing
- Complex models with elaborate estimators
  - e.g., no multilevel models
  - also no clustered standard errors
- Bayesian statistics
]

---

## Last sessions

- We learned how to import/export data
- We learned how to wrangle data: create subsets and new variables, recode, merge, and pivot

---

## Recap: Distribution


``` r
# plot first variable
hist(df$ai_acc,
     col = rgb(0, 0, 1, 0.25),
     breaks = 10,
     xlab = 'Trust in AI accuracy (blue) and complexity (red)',
     main = 'Distribution of AI Metrics')

# add second variable
hist(df$ai_complexity,
     col = rgb(1, 0, 0, 0.25),
     breaks = 10,
     add = TRUE)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Descriptive Statistics


``` r
library(modelsummary)

df.s <- df %>% 
  dplyr::select(ai_complexity, ai_acc, ai_threat, visit_frequency)

datasummary_skim(df.s, output = 'flextable')
```

|                 | Unique | Missing Pct. | Mean | SD  | Min | Median | Max |
|-----------------|--------|--------------|------|-----|-----|--------|-----|
| ai_complexity   | 6      | 43           | 2.2  | 1.1 | 1.0 | 2.0    | 5.0 |
| ai_acc          | 6      | 43           | 3.8  | 1.2 | 1.0 | 4.0    | 5.0 |
| ai_threat       | 4      | 32           | 1.9  | 0.6 | 1.0 | 2.0    | 3.0 |
| visit_frequency | 6      | 9            | 2.5  | 1.3 | 1.0 | 2.0    | 5.0 |

---

## Bivariate correlations


``` r
# Subset and drop missings
dt.com <- df %>% 
  dplyr::select(ai_complexity, ai_acc, ai_threat,
                ai_sentiment, years_coding) %>% 
  drop_na()

# corrplot
library(corrplot)

corrplot.mixed(cor(dt.com),
               lower.col = "black",
               number.cex = .7,
               insig = 'blank')
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## `R` is straightforward and literate

`R` combines the best of both worlds: model formulas are straightforward to write and easy to read.


``` r
y ~ x1 + x2 + x3
```

*Note*: Formulas represent a specific object class in `R`.


``` r
class(y ~ x1 + x2 + x3)
```

```
## [1] "formula"
```

---

## Denoting the left-hand side with `~`

In `R`, stating what your dependent variable is works very much like common mathematical notation:

`$$y \sim N(\theta, \epsilon)$$`

It states that a specific relationship is actually _estimated_, but we, fortunately, don't have to specify errors here.


``` r
y ~ x1 + x2 + x3
```

Yet, sometimes it may be a good idea to at least explicitly specify the intercept, as here:


``` r
y ~ 1 + x1 + x2 + x3
```

---

## Intercept

We can also estimate models without an intercept:


``` r
y ~ x1 + x2 + x3 - 1
```

Or intercept-only models as well:


``` r
y ~ 1
```

---

## Adding predictors with `+`

You can add as many predictors/covariates as you want with the simple <span style="color:orange">`+`</span> operator. See:


``` r
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + 
  x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + 
  x21 + x22 + x23 + x24
```

There's also a shortcut for using all (remaining) variables in a data set as predictors:


``` r
y ~ .
```

---

## Interaction effects with `*` and `:`

We can also easily add interaction effects to a model. As this is the same as multiplying predictor variables, we also use the <span style="color:orange">**`*`**</span> sign for that.


``` r
y ~ x1 * x2
```

The code above creates a model formula that includes both the <span style="background-color: yellow">main effects</span> of `x1` and `x2`, and their interaction denoted by <span style="background-color: yellow">`x1:x2`</span>. We can even be more explicit and write that into the formula directly:


``` r
y ~ x1 + x2 + x1:x2
```

---

## **Transforming** variables within a formula

For simple transformations this can be done as follows:


``` r
y ~ log(x)    # computes the natural log of x (use log10(x) for base 10)
y ~ scale(x)  # z-transformation of x
```

We could also change the data type of variables within a formula, e.g., by converting a numeric variable to a factor using `as.factor(x)`.

---

## Transforming variables within a formula

If you cannot use a specific function for your transformation, you have to wrap the operation in the <span style="background-color: yellow">`I()`</span> function.


``` r
y ~ x + I(x^2) # add a quadratic term of x
```

*Note*: Of course, there are also functions in `R` for transforming variables (e.g., standardizing or centering) before we use them in a formula. Besides the `base R` function `scale()`, the [`datawizard` package](https://easystats.github.io/datawizard/), e.g., provides a few functions for that.

---

## Where to use formulas?

The previous descriptions mainly refer to formulas used in regression models in `R`. However, formulas are also used in other hypothesis testing methods that distinguish between dependent and independent variables, such as t-tests or ANOVA. We will try out some of those in the following...

---

## Testing group differences in the distribution

A common method for analyzing group differences is the t-test.
You can use the two-sample <span style="background-color: yellow">`t.test()`</span> function:


``` r
df_filtered <- df %>%
  filter(age_group >= 1, age_group <= 6) %>%
  mutate(
    age_group = case_when(
      age_group >= 1 & age_group <= 3 ~ "1–3",
      age_group >= 4 & age_group <= 6 ~ "4–6"
    )
  )

t.test(ai_threat ~ age_group, data = df_filtered)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  ai_threat by age_group
## t = -4.0674, df = 8800.9, p-value = 4.796e-05
## alternative hypothesis: true difference in means between group 1–3 and group 4–6 is not equal to 0
## 95 percent confidence interval:
##  -0.04370872 -0.01527969
## sample estimates:
## mean in group 1–3 mean in group 4–6 
##          1.918528          1.948023
```

.small[
*Note*: By default, `R` uses [Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test) (instead of [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)), which does not assume homogeneity of variance across groups.
]

---

## Test of normality

What if our data are <span style="background-color: orange">*not normally distributed*</span> in the first place, violating one of the basic assumptions of performing t-tests?

To check this: the [Shapiro-Wilk](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) test of normality.


``` r
# test the outcome variable (not the grouping variable);
# note that shapiro.test() accepts at most 5000 observations
shapiro.test(df_filtered$ai_threat)
```

---

## Wilcoxon/Mann-Whitney test

If the data are not normally distributed, the [Wilcoxon/Mann-Whitney](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) test can be a suitable alternative.


``` r
wilcox.test(ai_trust ~ age_group, data = df_filtered)
```

```
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  ai_trust by age_group
## W = 67070026, p-value = 0.004805
## alternative hypothesis: true location shift is not equal to 0
```

---

## Testing multiple groups with an ANOVA

There are situations in which we want to test differences across multiple groups. The classic method to use for this is an analysis of variance (ANOVA) and its many variants (ANCOVA, MANOVA, etc.).


``` r
# One-way ANOVA
anova <- aov(ai_trust ~ country, data = df)

anova
```

```
## Call:
##    aov(formula = ai_trust ~ country, data = df)
## 
## Terms:
##                 country Residuals
## Sum of Squares   739.49  39419.08
## Deg. of Freedom     175     35362
## 
## Residual standard error: 1.055808
## Estimated effects may be unbalanced
## 29899 observations deleted due to missingness
```

---

## Testing multiple groups with an ANOVA

To get some more detailed information, you need to use the `summary()` function (which we have already seen in the EDA session) on the resulting `anova` object.


``` r
# F-test
summary(anova)
```

```
##                Df Sum Sq Mean Sq F value Pr(>F)    
## country       175    739   4.226   3.791 <2e-16 ***
## Residuals   35362  39419   1.115                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 29899 observations deleted due to missingness
```

---

## Post-hoc test

The previous output only indicates that there are statistically significant differences between the groups. To find out which groups differ from each other we can, for example, compute *Tukey Honest Significant Differences*.


``` r
TukeyHSD(anova)
```

---

## Simple linear regressions

Estimating a linear model (OLS) is done with the <span style="background-color: orange">`lm()`</span> function.

``` r
simple_linear_model <- lm(
  ai_trust ~ main_branch + participation_frequency + ai_threat + ai_sentiment,
  data = df
)

simple_linear_model
```

.right[↪️]

---

class: middle

.small[

```
## 
## Call:
## lm(formula = ai_trust ~ main_branch + participation_frequency + 
##     ai_threat + ai_sentiment, data = df)
## 
## Coefficients:
##             (Intercept)              main_branch  participation_frequency  
##                1.914059                 0.015395                -0.003457  
##               ai_threat             ai_sentiment  
##               -0.005587                 0.505417
```
]

---

## OLS Model Summary


``` r
library(parameters)

model_parameters(simple_linear_model)
```

```
## Parameter               | Coefficient |       SE |         95% CI | t(27561) |      p
## ------------------------------------------------------------------------------------
## (Intercept)             |        1.91 |     0.02 | [ 1.88,  1.95] |   104.87 | < .001
## main branch             |        0.02 | 3.95e-03 | [ 0.01,  0.02] |     3.90 | < .001
## participation frequency |   -3.46e-03 | 2.56e-03 | [-0.01,  0.00] |    -1.35 | 0.177 
## ai threat               |   -5.59e-03 | 6.58e-03 | [-0.02,  0.01] |    -0.85 | 0.396 
## ai sentiment            |        0.51 | 2.08e-03 | [ 0.50,  0.51] |   242.96 | < .001
```

---

## Simple linear regressions

As for the ANOVA, we can get some more detailed (and nicely formatted) output using the well-known `summary()` function.


``` r
summary(simple_linear_model)
```

.right[↪️]

---

class: middle

.small[

```
## 
## Call:
## lm(formula = ai_trust ~ main_branch + participation_frequency + 
##     ai_threat + ai_sentiment, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5035 -0.4155  0.0880  0.5719  0.6026 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.914059   0.018252 104.869  < 2e-16 ***
## main_branch              0.015395   0.003946   3.902 9.58e-05 ***
## participation_frequency -0.003457   0.002562  -1.349    0.177    
## ai_threat               -0.005587   0.006578  -0.849    0.396    
## ai_sentiment             0.505417   0.002080 242.961  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6048 on 27561 degrees of freedom
##   (37871 observations deleted due to missingness)
## Multiple R-squared:  0.6821, Adjusted R-squared:  0.682 
## F-statistic: 1.478e+04 on 4 and 27561 DF,  p-value: < 2.2e-16
```
]

---

## Quick model inspection


``` r
par(mfrow = c(2, 2))

plot(simple_linear_model)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## Assumption checks

The `easystats` packages provide some nice features for inspecting model fit. Here's a simple check for normality of the residuals: a function from the `performance` package combined with plotting via the `see` package.


``` r
library(performance)
library(see)

check_normality(simple_linear_model) %>% 
  plot()
```

QQ plot


``` r
check_normality(simple_linear_model) %>% 
  plot(type = "qq")
```

Heteroscedasticity


``` r
check_heteroscedasticity(simple_linear_model) %>% 
  plot()
```

---

## Overview of model checks (*performance* package)


``` r
check_model(simple_linear_model)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---

## Checking regression assumptions

As in the case of ANOVA, you should check whether the assumptions of linear regression hold.
There are plenty of good tutorials online (such as [this blog post by Ian Ruginski](https://www.ianruginski.com/post/regressionassumptions/) or [this chapter](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/testing-regression-assumptions.html) in the online book [*R for Survey Analysis*](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/)).

---

## Why robust?

You can make your linear regression provide **robust** or clustered **standard errors**.

One convenient way is the **`fixest`** package. The syntax is very similar to `lm()`, but we specify fixed effects slightly differently, using the `|` symbol.

Alternatively, we can use the <span style="background-color: orange">`lm_robust()`</span> function from the `estimatr` package.

---

## Robust Standard Errors


``` r
library(fixest)

model.lm.1.robust <- feols(
  ai_trust ~ main_branch + participation_frequency +
    factor(age_group) + ai_threat + ai_sentiment | country, # fixed effects per country
  vcov = "hetero",  # heteroskedasticity-robust SEs
  data = df_filtered
)

summary(model.lm.1.robust)
```

```
## OLS estimation, Dep. Var.: ai_trust
## Observations: 26,303
## Fixed-effects: country: 168
## Standard-errors: Heteroskedasticity-robust 
##                           Estimate Std. Error    t value   Pr(>|t|)    
## main_branch               0.012523   0.004395   2.849558  0.0043814 ** 
## participation_frequency -0.000451   0.002647  -0.170375  0.8647166    
## factor(age_group)4–6      0.034928   0.011460   3.047884  0.0023069 ** 
## ai_threat                -0.004827   0.006482  -0.744696  0.4564620    
## ai_sentiment              0.504626   0.002183 231.205442  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.598333     Adj. R2: 0.688459
##                  Within R2: 0.684001
```
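
---

## Robust Standard Errors with `estimatr`

A minimal sketch of the `lm_robust()` alternative mentioned on the "Why robust?" slide (not evaluated here; the covariate set and the `se_type` choice are just examples, and clustered standard errors are available via the `clusters` argument):


``` r
library(estimatr)

model.lm.1.estimatr <- lm_robust(
  ai_trust ~ main_branch + participation_frequency + ai_threat + ai_sentiment,
  data = df_filtered,
  se_type = "HC3"  # heteroskedasticity-consistent standard errors
)

summary(model.lm.1.estimatr)
```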

---

## Generalized linear regressions

There are a multitude of other estimators/link functions, so-called **family objects**, e.g., binomial logistic or Poisson regression through the `glm()` function. See `?family` for an overview.

- <span style="color:orange">logistic</span> (`glm(y ~ x, family = 'binomial')`) for binary data,
- <span style="color:orange">Poisson</span> (`glm(y ~ x, family = 'poisson')`) for count data,
- <span style="color:orange">quasi-Poisson</span> (`glm(y ~ x, family = 'quasipoisson')`) for overdispersed count data (for a proper negative binomial model, see `MASS::glm.nb()`)

---

## Generalized linear regression

Let's look at the example of logistic regression.


``` r
# keep relevant predictors and complete cases
df_model <- df %>%
  filter(!is.na(r_used)) %>%
  select(
    r_used,
    main_branch,
    age_group,
    education_level,
    participation_frequency,
    ai_sentiment,
    ai_threat,
    ai_complexity,
    years_coding,
    community_belief
  ) %>%
  drop_na()

# Fit the logistic regression model
model_r_used <- glm(
  r_used ~ .,
  data = df_model,
  family = binomial()
)

# View summary
summary(model_r_used)
```

.right[↪️]

---

class: middle

.small[

```
## 
## Call:
## glm(formula = r_used ~ ., family = binomial(), data = df_model)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -3.406798   0.182386 -18.679  < 2e-16 ***
## main_branch              0.380770   0.025407  14.987  < 2e-16 ***
## age_group               -0.137409   0.024267  -5.662 1.49e-08 ***
## education_level         -0.023662   0.015825  -1.495  0.13487    
## participation_frequency -0.059690   0.019934  -2.994  0.00275 ** 
## ai_sentiment             0.036146   0.016549   2.184  0.02895 *  
## ai_threat               -0.020442   0.052982  -0.386  0.69963    
## ai_complexity            0.010953   0.026642   0.411  0.68099    
## years_coding             0.003755   0.003436   1.093  0.27441    
## community_belief         0.083241   0.015768   5.279 1.30e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9976.8  on 26864  degrees of freedom
## Residual deviance: 9728.0  on 26855  degrees of freedom
## AIC: 9748
## 
## Number of Fisher Scoring iterations: 6
```
]

---

## Assessing model quality

The [`performance` package](https://easystats.github.io/performance/) that is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) offers some functions for assessing model quality, including different types of R². A commonly used metric for logistic regression models, e.g., is Nagelkerke's R².

*Note*: The `performance` package also includes several helpful functions for model diagnostics.
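
For example, Nagelkerke's R² for the logistic model fitted above (`model_r_used`) could be computed as follows (a quick sketch, not evaluated here):


``` r
library(performance)

r2_nagelkerke(model_r_used)  # Nagelkerke's pseudo-R²
r2(model_r_used)             # picks an appropriate R² for the model class
```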

---

## Model selection

We can compare nested models directly with a likelihood-ratio (Chi-squared) test before letting `R` add or remove predictors for us.


``` r
fit_log1 <- glm(r_used ~ main_branch,
                family = binomial, data = df_model)

fit_log2 <- glm(r_used ~ main_branch + age_group + ai_sentiment,
                family = binomial, data = df_model)

anova(fit_log1, fit_log2, test = "Chisq")
```

.right[↪️]

---

class:middle

.small[

```
## Analysis of Deviance Table
## 
## Model 1: r_used ~ main_branch
## Model 2: r_used ~ main_branch + age_group + ai_sentiment
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     26863     9820.5                          
## 2     26861     9778.4  2   42.117 7.151e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

---

## Compare models step-wise using AIC

Start with a model containing several predictors and let `step()` remove them one at a time based on the AIC.


``` r
fit_log_step <- step(glm(r_used ~ main_branch + age_group + ai_complexity + ai_threat,
                         family = binomial, data = df_model, trace = 0))
```

``` r
fit_log_step
```

.right[↪️]

---

class:middle

.small[

```
## Start:  AIC=9793.24
## r_used ~ main_branch + age_group + ai_complexity + ai_threat
## 
##                 Df Deviance    AIC
## - ai_threat      1   9783.3 9791.3
## - ai_complexity  1   9783.6 9791.6
## <none>               9783.2 9793.2
## - age_group      1   9820.0 9828.0
## - main_branch    1   9966.1 9974.1
## 
## Step:  AIC=9791.27
## r_used ~ main_branch + age_group + ai_complexity
## 
##                 Df Deviance    AIC
## - ai_complexity  1   9783.7 9789.7
## <none>               9783.3 9791.3
## - age_group      1   9820.1 9826.1
## - main_branch    1   9966.2 9972.2
## 
## Step:  AIC=9789.68
## r_used ~ main_branch + age_group
## 
##               Df Deviance    AIC
## <none>             9783.7 9789.7
## - age_group    1   9820.5 9824.5
## - main_branch  1   9966.8 9970.8
```

```
## 
## Call:  glm(formula = r_used ~ main_branch + age_group, family = binomial, 
##     data = df_model, trace = 0)
## 
## Coefficients:
## (Intercept)  main_branch    age_group  
##     -3.3087       0.3658      -0.1190  
## 
## Degrees of Freedom: 26864 Total (i.e. Null);  26862 Residual
## Null Deviance:      9977 
## Residual Deviance: 9784   AIC: 9790
```
]

---

## Comparing model performance

The `performance` package we have previously used for calculating Nagelkerke's R² for our logistic regression model also provides a handy function for comparing model performance.
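
For instance, the two nested logistic models from the model selection slide could be compared like this (a brief sketch, not evaluated here):


``` r
library(performance)

# fit indices (AIC, BIC, R², RMSE, ...) side by side
compare_performance(fit_log1, fit_log2)

# rank = TRUE additionally sorts the models by overall performance
compare_performance(fit_log1, fit_log2, rank = TRUE)
```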

---

## Other regression variants

While `glm()` already provides quite a few estimators and link functions, depending on the distribution of your dependent variable, you may need more specialized regression models. A few interesting options include the [`MASS` package](https://cran.r-project.org/web/packages/MASS/index.html) for negative binomial regression, the [`pscl` package](https://cran.r-project.org/web/packages/pscl/index.html) for zero-inflated (negative) binomial regression and hurdle models, or the [`mcp` package](https://lindeloev.github.io/mcp/) for regression with multiple change points.

---

## Handling regression results

While it is (still) common practice to run regressions, search for 'significant' `p`-values, and paste the results into a table without interpreting them substantially, this may not be the best thing to do.

---

## Accessing model results in `base R`

Regression results are a specific type/class of objects in `R`. You can use the `str()` function to get an overview of the whole structure of the object (it's a list of different pieces of information).

For starters, we may want to see what the first level of this list provides by asking for the names of the included pieces of information:


``` r
names(fit_log_step)
```

```
##  [1] "coefficients"      "residuals"         "fitted.values"    
##  [4] "effects"           "R"                 "rank"             
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"    
## [16] "df.residual"       "df.null"           "y"                
## [19] "converged"         "boundary"          "model"            
## [22] "call"              "formula"           "terms"            
## [25] "data"              "offset"            "control"          
## [28] "method"            "contrasts"         "xlevels"          
## [31] "anova"
```

---

## Accessing coefficients

We can access the coefficients from our model as follows:


``` r
fit_log_step$coefficients
```

```
## (Intercept) main_branch   age_group 
##  -3.3087191   0.3657930  -0.1190442
```

---

## Accessing standard errors

`lm` and `glm` objects are a little bit cumbersome to use as the information is deeply nested within the object.
If you want to extract the standard errors, e.g., you can do so as follows:


``` r
summary(fit_log_step)$coefficients[,2]
```

```
## (Intercept) main_branch   age_group 
##  0.06563969  0.02495990  0.02042471
```

---

## Accessing confidence intervals

The standard `summary()` doesn't supply confidence intervals. We can use the `confint()` command to produce them. For example, for the logistic regression:


``` r
confint(fit_log_step)

# exponentiate to turn log-odds into odds ratios (OR);
# an OR < 1 means the predictor is associated with lower odds of the outcome
exp(confint(fit_log_step))
```

---

## Regression tables

As we have seen in the session on *Exploratory Data Analysis*, there are different options for creating tables as output in `R`: [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html) and [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/)

---

## Example: Regression tables with the `stargazer` package


``` r
library(stargazer)

stargazer(
  simple_linear_model,
  type = "text",  # For console output; use "html" or "latex" for other formats
  title = "Regression Results",
  dep.var.labels = c("AI Trust"),
  covariate.labels = c(
    "Main Branch",
    "Participation Frequency",
    "AI Threat",
    "AI Sentiment"
  ),
  out.header = FALSE,  # Exclude header metadata if outputting to file
  digits = 3           # Number of decimal places
)
```

.right[↪️]

---

class: center, middle

.ssmall[

```
## 
## Regression Results
## =====================================================
##                               Dependent variable:    
##                           ---------------------------
##                                    AI Trust          
## -----------------------------------------------------
## Main Branch                        0.015***          
##                                    (0.004)           
##                                                      
## Participation Frequency            -0.003            
##                                    (0.003)           
##                                                      
## AI Threat                          -0.006            
##                                    (0.007)           
##                                                      
## AI Sentiment                       0.505***          
##                                    (0.002)           
##                                                      
## Constant                           1.914***          
##                                    (0.018)           
##                                                      
## -----------------------------------------------------
## Observations                       27,566            
## R2                                  0.682            
## Adjusted R2                         0.682            
## Residual Std. Error          0.605 (df = 27561)      
## F Statistic              14,781.740*** (df = 4; 27561)
## =====================================================
## Note:                     *p<0.1; **p<0.05; ***p<0.01
```
]

---

## Tidy models with `broom`

The [`broom`](https://broom.tidymodels.org/) package provides only three, but very powerful, main functions:

- `tidy()`: creates a `tibble` from your model
- `glance()`: information about your model as `tibble` ('model fit')
- `augment()`: adds information, e.g., individual fitted values to your data

*Note*: With `broom` you can also create a standardized output of other types of models, such as Latent Class Analysis (LCA) (with the package [`poLCA`](https://cran.r-project.org/web/packages/poLCA/index.html)).

---

## `tidy()`


``` r
library(broom)

tidy(simple_linear_model)
```

```
## # A tibble: 5 × 5
##   term                    estimate std.error statistic   p.value
##   <chr>                      <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)              1.91      0.0183    105.    0        
## 2 main_branch              0.0154    0.00395     3.90  0.0000958
## 3 participation_frequency -0.00346   0.00256    -1.35  0.177    
## 4 ai_threat               -0.00559   0.00658    -0.849 0.396    
## 5 ai_sentiment             0.505     0.00208   243.    0
```

---

## `glance()`


``` r
glance(simple_linear_model)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.682         0.682 0.605    14782.       0     4 -25251. 50514. 50563.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

---

## `augment()`


``` r
augment(simple_linear_model)
```

```
## # A tibble: 27,566 × 12
##    .rownames ai_trust main_branch participation_frequency ai_threat ai_sentiment
##    <chr>        <dbl>       <dbl>                   <dbl>     <dbl>        <dbl>
##  1 6              3             4                       6         2            1
##  2 10             2.5           1                       5         1            2
##  3 11             2.5           5                       5         2            1
##  4 14             3.5           5                       2         1            5
##  5 17             2.5           4                       5         2            1
##  6 19             4             1                       2         2            5
##  7 21             4.5           3                       5         1            4
##  8 22             5             1                       6         2            5
##  9 23             5             1                       4         2            5
## 10 27             2             1                       2         1            1
## # ℹ 27,556 more rows
## # ℹ 6 more variables: .fitted <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, .std.resid <dbl>
```

---

## Outlook: Other modeling options

As we've said at the beginning of this session, we only looked at some of the basic confirmatory analysis methods. As you can imagine, however, `R` offers plenty of options for more complex statistical models as well. A few examples:

- Structural equation models with [`lavaan`](https://lavaan.ugent.be/)
- Multilevel/mixed-effects models with [`lme4`](https://cran.r-project.org/web/packages/lme4/index.html)
- Bayesian regression models using [`brms`](https://paul-buerkner.github.io/brms/)
- Mediation analysis using [`mediation`](https://cran.r-project.org/web/packages/mediation/vignettes/mediation.pdf)
  - response and mediator models, sensitivity analyses (`medsens()`)

---

## Extracurricular activities

As it is such a simple and useful package, we recommend exploring the `broom` package a bit further. Alex Hayes, one of the authors of the package, [gave an interesting talk about it at the *RStudio* Conference in 2019](https://rstudio.com/resources/rstudioconf-2019/solving-the-model-representation-problem-with-broom/).

---

class: center, middle

# [Exercise](https://rawcdn.githack.com/nika-akin/data-analysis-with-R/983625a7d096079e0598f2d60ca1281c757d4659/exercises/Exercise_4_1_3_Regression_Reporting.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://rawcdn.githack.com/nika-akin/data-analysis-with-R/983625a7d096079e0598f2d60ca1281c757d4659/solutions/Exercise_4_1_3_Regression_Reporting.html)

---

## Coefficient Plots (parameters)

Create a coefficient plot:


``` r
library(parameters)

model_parameters(fit_log2) %>% 
  plot()
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---

## ggplot2-based coefficient plot


``` r
model_parameters(fit_log2) %>% 
  plot() +
  scale_y_discrete(
    labels = c(  # reversed order
      "Sentiment on AI",
      "Age",
      "Professional stage"
    )
  ) +
  theme_minimal() +
  theme(legend.position = "none")
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />

---

## Prediction plots vs. regression coefficients

- Regression coefficients are sometimes a bit hard to read
  - they are based on the scale of the independent variable when unstandardized
  - it is ambiguous what they actually mean substantively
  - for example, is a coefficient of .2 or .05 (practically) meaningful or not?
- Predictions are a solution
  - they basically draw the regression line into the scatter plot of the Y and X variables
  - we can read off which value of Y is expected at a given level of X

---

## Predictions (sjPlot)


``` r
library(sjPlot)

fit_log2 %>% 
  plot_model(
    type = "pred",
    terms = "age_group"
  )
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />

---

## Customized sjPlot plots


``` r
library(sjPlot)

fit_log2 %>% 
  plot_model(
    type = "pred",
    terms = "age_group"
  ) +
  xlab("Age") +
  ylab("R environment used") +
  ggtitle("Logistic Model") +
  theme_bw()
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />

---

## Interaction effects

Does age moderate the relationship between `main_branch` and R use (`r_used`)?


``` r
fit_log2_interaction <- glm(r_used ~ main_branch:age_group + ai_sentiment,
                            family = binomial, data = df_model)

model_parameters(fit_log2_interaction)
```

```
## Parameter               | Log-Odds |       SE |         95% CI |      z |      p
## --------------------------------------------------------------------------------
## (Intercept)             |    -3.21 |     0.06 | [-3.32, -3.11] | -58.02 | < .001
## ai sentiment            |     0.04 |     0.02 | [ 0.00,  0.07] |   2.24 | 0.025 
## main branch × age group |     0.02 | 4.79e-03 | [ 0.01,  0.03] |   4.21 | < .001
```

---

## Plotting interaction effects

Interaction parameters are difficult to interpret on their own. Hence, we plot the predicted values for the interaction with `sjPlot`:


``` r
fit_log2_interaction %>% 
  plot_model(type = "int") +
  xlab("Age") +
  ylab("R used") +
  ggtitle("Interaction Model") +
  theme_bw()
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />
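
To complement the plot with numbers, the interaction term can also be reported as an odds ratio, reusing the `confint()` idea from earlier (a small sketch, not evaluated here):


``` r
# odds ratios with 95% CIs for the interaction model
exp(cbind(OR = coef(fit_log2_interaction), confint(fit_log2_interaction)))
```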