class: center, middle, inverse, title-slide

.title[
# Data Literacy: Introduction to R
]
.subtitle[
## Confirmatory Data Analysis
]
.author[
### Veronika Batzdorfer
]
.date[
### 2025-06-23
]

---
layout: true

---

## Content of this session

.pull-left[
**What we will cover**

- Using (more or less) simple regression models
  - OLS, GLM, and the like
- How to re-use the results of these models
- How to feed these results into tables
]

.pull-right[
**What we won't cover**

- Theory (and history) of hypothesis testing
- Complex models with elaborate estimators
  - e.g., no multilevel models
  - also no clustered standard errors
- Bayesian statistics
]

---

## Last sessions

- We learned how to import/export data
- We learned how to wrangle data: create subsets and new variables, recode, merge, and pivot

---

## Recap: Distribution


``` r
# plot first variable
hist(df$ai_acc,
     col = rgb(0, 0, 1, 0.25),
     breaks = 10,
     xlab = 'Trust in AI accuracy (blue) and complexity (red)',
     main = 'Distribution of AI Metrics')

# add second variable
hist(df$ai_complexity,
     col = rgb(1, 0, 0, 0.25),
     breaks = 10,
     add = TRUE)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Descriptive Statistics


``` r
library(modelsummary)

df.s <- df %>% 
  dplyr::select(ai_complexity, ai_acc, ai_threat, visit_frequency)

datasummary_skim(df.s, output = 'flextable')
```

|                 | Unique | Missing Pct. | Mean | SD  | Min | Median | Max |
|-----------------|--------|--------------|------|-----|-----|--------|-----|
| ai_complexity   | 6      | 43           | 2.2  | 1.1 | 1.0 | 2.0    | 5.0 |
| ai_acc          | 6      | 43           | 3.8  | 1.2 | 1.0 | 4.0    | 5.0 |
| ai_threat       | 4      | 32           | 1.9  | 0.6 | 1.0 | 2.0    | 3.0 |
| visit_frequency | 6      | 9            | 2.5  | 1.3 | 1.0 | 2.0    | 5.0 |

---

## Bivariate correlations


``` r
# Subset and drop missings
dt.com <- df %>% 
  dplyr::select(ai_complexity, ai_acc, ai_threat,
                ai_sentiment, years_coding) %>% 
  drop_na()

# corrplot
library(corrplot)

corrplot.mixed(cor(dt.com),
               lower.col = "black",
               number.cex = .7,
               insig = 'blank')
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## `R` is straightforward and literate

`R` combines the best of both worlds: model formulas are straightforward to write and easy to read.


``` r
y ~ x1 + x2 + x3
```

*Note*: Formulas represent a specific object class in `R`.


``` r
class(y ~ x1 + x2 + x3)
```

```
## [1] "formula"
```

---

## Denoting the left-hand side with `~`

In `R`, stating what your dependent variable is works very much like common mathematical notation:

`$$y \sim N(\theta, \epsilon)$$`

It states that a specific relationship is actually _estimated_, but we, fortunately, don't have to specify errors here.


``` r
y ~ x1 + x2 + x3
```

Yet, sometimes it may be a good idea to at least explicitly specify the intercept, as here:


``` r
y ~ 1 + x1 + x2 + x3
```

---

## Intercept

We can also estimate models without an intercept:


``` r
y ~ x1 + x2 + x3 - 1
```

Or intercept-only models as well:


``` r
y ~ 1
```

---

## Adding predictors with `+`

You can add as many predictors/covariates as you want with the simple <span style="color:orange">`+`</span> operator. See:


``` r
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + 
  x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + 
  x21 + x22 + x23 + x24
```

There's also a shortcut for using all (remaining) variables in a data set as predictors:


``` r
y ~ .
```

---

## Interaction effects with `*` and `:`

We can also easily add interaction effects to a model. As this is the same as multiplying predictor variables, we also use the <span style="color:orange">**`*`**</span> sign for that.


``` r
y ~ x1 * x2
```

The code above creates a model formula that includes both the <span style="background-color: yellow">main effects</span> of `x1` and `x2`, and their interaction denoted by <span style="background-color: yellow">`x1:x2`</span>. We can even be more explicit and write that into the formula directly:


``` r
y ~ x1 + x2 + x1:x2
```

---

## **Transforming** variables within a formula

For simple transformations this can be done as follows:


``` r
y ~ log(x)    # computes the natural log of x (use log10(x) for base 10)
y ~ scale(x)  # z-transformation of x
```

We could also change the data type of variables within a formula, e.g., by converting a numeric variable to a factor using `as.factor(x)`.

---

## Transforming variables within a formula

If you cannot use a specific function for your transformation, you have to wrap the operation in the <span style="background-color: yellow">`I()`</span> function.


``` r
y ~ x + I(x^2) # add a quadratic term of x
```

*Note*: Of course, there are also functions in `R` for transforming variables (e.g., standardizing or centering) before we use them in a formula. Besides the `base R` function `scale()`, the [`datawizard` package](https://easystats.github.io/datawizard/), e.g., provides a few functions for that.

---

## Where to use formulas?

The previous descriptions mainly refer to formulas used in regression models in `R`. However, formulas are also used in other hypothesis testing methods that distinguish between dependent and independent variables, such as t-tests or ANOVA. We will try out some of those in the following...

---

## Testing group differences in the distribution

A common method for analyzing group differences is the t-test.
You can use the two-sample <span style="background-color: yellow">`t.test()`</span> function:


``` r
df_filtered <- df %>%
  filter(age_group >= 1, age_group <= 6) %>%
  mutate(
    age_group = case_when(
      age_group >= 1 & age_group <= 3 ~ "1–3",
      age_group >= 4 & age_group <= 6 ~ "4–6"
    )
  )

t.test(ai_threat ~ age_group, data = df_filtered)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  ai_threat by age_group
## t = -4.0674, df = 8800.9, p-value = 4.796e-05
## alternative hypothesis: true difference in means between group 1–3 and group 4–6 is not equal to 0
## 95 percent confidence interval:
##  -0.04370872 -0.01527969
## sample estimates:
## mean in group 1–3 mean in group 4–6 
##          1.918528          1.948023
```

.small[
*Note*: By default, `R` uses [Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test) (instead of [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)), which does not assume homogeneity of variance across groups.
]

---

## Test of normality

What if our data are <span style="background-color: orange">*not normally distributed*</span> in the first place, violating one of the basic assumptions of performing t-tests?

To check this: the [Shapiro-Wilk](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) test of normality.


``` r
# test the outcome variable (not the grouping variable);
# note that shapiro.test() accepts at most 5000 observations
shapiro.test(df_filtered$ai_threat)
```

---

## Wilcoxon/Mann-Whitney test

If the data are not normally distributed, the [Wilcoxon/Mann-Whitney](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) test can be a suitable alternative.


``` r
wilcox.test(ai_trust ~ age_group, data = df_filtered)
```

```
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  ai_trust by age_group
## W = 67070026, p-value = 0.004805
## alternative hypothesis: true location shift is not equal to 0
```

---

## Testing multiple groups with an ANOVA

There are situations in which we want to test differences across multiple groups. The classic method to use for this is an analysis of variance (ANOVA) and its many variants (ANCOVA, MANOVA, etc.).


``` r
# One-way ANOVA
anova <- aov(ai_trust ~ country, data = df)

anova
```

```
## Call:
##    aov(formula = ai_trust ~ country, data = df)
## 
## Terms:
##                 country Residuals
## Sum of Squares   739.49  39419.08
## Deg. of Freedom     175     35362
## 
## Residual standard error: 1.055808
## Estimated effects may be unbalanced
## 29899 observations deleted due to missingness
```

---

## Testing multiple groups with an ANOVA

To get some more detailed information, you need to use the `summary()` function (which we have already seen in the EDA session) on the resulting `anova` object.


``` r
# F-test
summary(anova)
```

```
##                Df Sum Sq Mean Sq F value Pr(>F)    
## country       175    739   4.226   3.791 <2e-16 ***
## Residuals   35362  39419   1.115                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 29899 observations deleted due to missingness
```

---

## Post-hoc test

The previous output only indicates that there are statistically significant differences between the groups. To find out which groups differ from each other we can, for example, compute *Tukey Honest Significant Differences*.


``` r
TukeyHSD(anova)
```

---

## Simple linear regressions

Estimating a linear model (OLS) is done with the <span style="background-color: orange">`lm()`</span> function.

``` r
simple_linear_model <- lm(
  ai_trust ~ main_branch + participation_frequency + ai_threat + ai_sentiment,
  data = df
)

simple_linear_model
```

.right[↪️]

---

class: middle

.small[

```
## 
## Call:
## lm(formula = ai_trust ~ main_branch + participation_frequency + 
##     ai_threat + ai_sentiment, data = df)
## 
## Coefficients:
##             (Intercept)              main_branch  participation_frequency  
##                1.914059                 0.015395                -0.003457  
##               ai_threat             ai_sentiment  
##               -0.005587                 0.505417
```
]

---

## OLS Model Summary


``` r
library(parameters)

model_parameters(simple_linear_model)
```

```
## Parameter               | Coefficient |       SE |         95% CI | t(27561) |      p
## ------------------------------------------------------------------------------------
## (Intercept)             |        1.91 |     0.02 | [ 1.88,  1.95] |   104.87 | < .001
## main branch             |        0.02 | 3.95e-03 | [ 0.01,  0.02] |     3.90 | < .001
## participation frequency |   -3.46e-03 | 2.56e-03 | [-0.01,  0.00] |    -1.35 | 0.177 
## ai threat               |   -5.59e-03 | 6.58e-03 | [-0.02,  0.01] |    -0.85 | 0.396 
## ai sentiment            |        0.51 | 2.08e-03 | [ 0.50,  0.51] |   242.96 | < .001
```

---

## Simple linear regressions

As for the ANOVA, we can get some more detailed (and nicely formatted) output using the well-known `summary()` function.


``` r
summary(simple_linear_model)
```

.right[↪️]

---

class: middle

.small[

```
## 
## Call:
## lm(formula = ai_trust ~ main_branch + participation_frequency + 
##     ai_threat + ai_sentiment, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5035 -0.4155  0.0880  0.5719  0.6026 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.914059   0.018252 104.869  < 2e-16 ***
## main_branch              0.015395   0.003946   3.902 9.58e-05 ***
## participation_frequency -0.003457   0.002562  -1.349    0.177    
## ai_threat               -0.005587   0.006578  -0.849    0.396    
## ai_sentiment             0.505417   0.002080 242.961  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6048 on 27561 degrees of freedom
##   (37871 observations deleted due to missingness)
## Multiple R-squared:  0.6821, Adjusted R-squared:  0.682 
## F-statistic: 1.478e+04 on 4 and 27561 DF,  p-value: < 2.2e-16
```
]

---

## Quick model inspection


``` r
par(mfrow = c(2, 2))

plot(simple_linear_model)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## Assumption checks

The `easystats` packages provide some nice features for inspecting model fit. Here's a simple check for normality of the residuals: a function from the `performance` package combined with plotting via the `see` package.


``` r
library(performance)
library(see)

check_normality(simple_linear_model) %>% 
  plot()
```

QQ plot


``` r
check_normality(simple_linear_model) %>% 
  plot(type = "qq")
```

Heteroscedasticity


``` r
check_heteroscedasticity(simple_linear_model) %>% 
  plot()
```

---

## Overview of model checks (*performance* package)


``` r
check_model(simple_linear_model)
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---

## Checking regression assumptions

As in the case of ANOVA, you should check whether the assumptions of linear regression hold.
There are plenty of good tutorials online (such as [this blog post by Ian Ruginski](https://www.ianruginski.com/post/regressionassumptions/) or [this chapter](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/testing-regression-assumptions.html) in the online book [*R for Survey Analysis*](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/)).

---

## Why robust?

You can make your linear regression provide **robust** or clustered **standard errors**.

One convenient way is the **`fixest`** package. The syntax is very similar to `lm()`, but we specify fixed effects slightly differently, using the `|` symbol.

Alternatively, we can use the <span style="background-color: orange">`lm_robust()`</span> function from the `estimatr` package.

---

## Robust Standard Errors


``` r
library(fixest)

model.lm.1.robust <- feols(
  ai_trust ~ main_branch + participation_frequency +
    factor(age_group) + ai_threat + ai_sentiment | country, # fixed effects per country
  vcov = "hetero",  # heteroskedasticity-robust SEs
  data = df_filtered
)

summary(model.lm.1.robust)
```

```
## OLS estimation, Dep. Var.: ai_trust
## Observations: 26,303
## Fixed-effects: country: 168
## Standard-errors: Heteroskedasticity-robust 
##                           Estimate Std. Error    t value   Pr(>|t|)    
## main_branch               0.012523   0.004395   2.849558  0.0043814 ** 
## participation_frequency -0.000451   0.002647  -0.170375  0.8647166    
## factor(age_group)4–6      0.034928   0.011460   3.047884  0.0023069 ** 
## ai_threat                -0.004827   0.006482  -0.744696  0.4564620    
## ai_sentiment              0.504626   0.002183 231.205442  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.598333     Adj. R2: 0.688459
##                  Within R2: 0.684001
```
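
---

## Robust Standard Errors with `estimatr`

A minimal sketch of the `lm_robust()` alternative mentioned on the "Why robust?" slide (not evaluated here; the covariate set and the `se_type` choice are just examples, and clustered standard errors are available via the `clusters` argument):


``` r
library(estimatr)

model.lm.1.estimatr <- lm_robust(
  ai_trust ~ main_branch + participation_frequency + ai_threat + ai_sentiment,
  data = df_filtered,
  se_type = "HC3"  # heteroskedasticity-consistent standard errors
)

summary(model.lm.1.estimatr)
```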

---

## Generalized linear regressions

There are a multitude of other estimators/link functions, so-called **family objects**, e.g., binomial logistic or Poisson regression through the `glm()` function. See `?family` for an overview.

- <span style="color:orange">logistic</span> (`glm(y ~ x, family = 'binomial')`) for binary data,
- <span style="color:orange">Poisson</span> (`glm(y ~ x, family = 'poisson')`) for count data,
- <span style="color:orange">quasi-Poisson</span> (`glm(y ~ x, family = 'quasipoisson')`) for overdispersed count data (for a proper negative binomial model, see `MASS::glm.nb()`)

---

## Generalized linear regression

Let's look at the example of logistic regression.


``` r
# keep relevant predictors and complete cases
df_model <- df %>%
  filter(!is.na(r_used)) %>%
  select(
    r_used,
    main_branch,
    age_group,
    education_level,
    participation_frequency,
    ai_sentiment,
    ai_threat,
    ai_complexity,
    years_coding,
    community_belief
  ) %>%
  drop_na()

# Fit the logistic regression model
model_r_used <- glm(
  r_used ~ .,
  data = df_model,
  family = binomial()
)

# View summary
summary(model_r_used)
```

.right[↪️]

---

class: middle

.small[

```
## 
## Call:
## glm(formula = r_used ~ ., family = binomial(), data = df_model)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -3.406798   0.182386 -18.679  < 2e-16 ***
## main_branch              0.380770   0.025407  14.987  < 2e-16 ***
## age_group               -0.137409   0.024267  -5.662 1.49e-08 ***
## education_level         -0.023662   0.015825  -1.495  0.13487    
## participation_frequency -0.059690   0.019934  -2.994  0.00275 ** 
## ai_sentiment             0.036146   0.016549   2.184  0.02895 *  
## ai_threat               -0.020442   0.052982  -0.386  0.69963    
## ai_complexity            0.010953   0.026642   0.411  0.68099    
## years_coding             0.003755   0.003436   1.093  0.27441    
## community_belief         0.083241   0.015768   5.279 1.30e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9976.8  on 26864  degrees of freedom
## Residual deviance: 9728.0  on 26855  degrees of freedom
## AIC: 9748
## 
## Number of Fisher Scoring iterations: 6
```
]

---

## Assessing model quality

The [`performance` package](https://easystats.github.io/performance/) that is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) offers some functions for assessing model quality, including different types of R². A commonly used metric for logistic regression models, e.g., is Nagelkerke's R².

*Note*: The `performance` package also includes several helpful functions for model diagnostics.
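
For example, Nagelkerke's R² for the logistic model fitted above (`model_r_used`) could be computed as follows (a quick sketch, not evaluated here):


``` r
library(performance)

r2_nagelkerke(model_r_used)  # Nagelkerke's pseudo-R²
r2(model_r_used)             # picks an appropriate R² for the model class
```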

---

## Model selection

We can compare nested models directly with a likelihood-ratio (Chi-squared) test before letting `R` add or remove predictors for us.


``` r
fit_log1 <- glm(r_used ~ main_branch,
                family = binomial, data = df_model)

fit_log2 <- glm(r_used ~ main_branch + age_group + ai_sentiment,
                family = binomial, data = df_model)

anova(fit_log1, fit_log2, test = "Chisq")
```

.right[↪️]

---

class:middle

.small[

```
## Analysis of Deviance Table
## 
## Model 1: r_used ~ main_branch
## Model 2: r_used ~ main_branch + age_group + ai_sentiment
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     26863     9820.5                          
## 2     26861     9778.4  2   42.117 7.151e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

---

## Compare models step-wise using AIC

Start with a model containing several predictors and let `step()` remove them one at a time based on the AIC.


``` r
fit_log_step <- step(glm(r_used ~ main_branch + age_group + ai_complexity + ai_threat,
                         family = binomial, data = df_model, trace = 0))
```

``` r
fit_log_step
```

.right[↪️]

---

class:middle

.small[

```
## Start:  AIC=9793.24
## r_used ~ main_branch + age_group + ai_complexity + ai_threat
## 
##                 Df Deviance    AIC
## - ai_threat      1   9783.3 9791.3
## - ai_complexity  1   9783.6 9791.6
## <none>               9783.2 9793.2
## - age_group      1   9820.0 9828.0
## - main_branch    1   9966.1 9974.1
## 
## Step:  AIC=9791.27
## r_used ~ main_branch + age_group + ai_complexity
## 
##                 Df Deviance    AIC
## - ai_complexity  1   9783.7 9789.7
## <none>               9783.3 9791.3
## - age_group      1   9820.1 9826.1
## - main_branch    1   9966.2 9972.2
## 
## Step:  AIC=9789.68
## r_used ~ main_branch + age_group
## 
##               Df Deviance    AIC
## <none>             9783.7 9789.7
## - age_group    1   9820.5 9824.5
## - main_branch  1   9966.8 9970.8
```

```
## 
## Call:  glm(formula = r_used ~ main_branch + age_group, family = binomial, 
##     data = df_model, trace = 0)
## 
## Coefficients:
## (Intercept)  main_branch    age_group  
##     -3.3087       0.3658      -0.1190  
## 
## Degrees of Freedom: 26864 Total (i.e. Null);  26862 Residual
## Null Deviance:      9977 
## Residual Deviance: 9784   AIC: 9790
```
]

---

## Comparing model performance

The `performance` package we have previously used for calculating Nagelkerke's R² for our logistic regression model also provides a handy function for comparing model performance.
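
For instance, the two nested logistic models from the model selection slide could be compared like this (a brief sketch, not evaluated here):


``` r
library(performance)

# fit indices (AIC, BIC, R², RMSE, ...) side by side
compare_performance(fit_log1, fit_log2)

# rank = TRUE additionally sorts the models by overall performance
compare_performance(fit_log1, fit_log2, rank = TRUE)
```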

---

## Other regression variants

While `glm()` already provides quite a few estimators and link functions, depending on the distribution of your dependent variable, you may need more specialized regression models. A few interesting options include the [`MASS` package](https://cran.r-project.org/web/packages/MASS/index.html) for negative binomial regression, the [`pscl` package](https://cran.r-project.org/web/packages/pscl/index.html) for zero-inflated (negative) binomial regression and hurdle models, or the [`mcp` package](https://lindeloev.github.io/mcp/) for regression with multiple change points.

---

## Handling regression results

While it is (still) common practice to run regressions, search for 'significant' `p`-values, and paste the results into a table without interpreting them substantially, this may not be the best thing to do.

---

## Accessing model results in `base R`

Regression results are a specific type/class of objects in `R`. You can use the `str()` function to get an overview of the whole structure of the object (it's a list of different pieces of information).

For starters, we may want to see what the first level of this list provides by asking for the names of the included pieces of information:


``` r
names(fit_log_step)
```

```
##  [1] "coefficients"      "residuals"         "fitted.values"    
##  [4] "effects"           "R"                 "rank"             
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"    
## [16] "df.residual"       "df.null"           "y"                
## [19] "converged"         "boundary"          "model"            
## [22] "call"              "formula"           "terms"            
## [25] "data"              "offset"            "control"          
## [28] "method"            "contrasts"         "xlevels"          
## [31] "anova"
```

---

## Accessing coefficients

We can access the coefficients from our model as follows:


``` r
fit_log_step$coefficients
```

```
## (Intercept) main_branch   age_group 
##  -3.3087191   0.3657930  -0.1190442
```

---

## Accessing standard errors

`lm` and `glm` objects are a little bit cumbersome to use as the information is deeply nested within the object.
If you want to extract the standard errors, e.g., you can do so as follows:


``` r
summary(fit_log_step)$coefficients[,2]
```

```
## (Intercept) main_branch   age_group 
##  0.06563969  0.02495990  0.02042471
```

---

## Accessing confidence intervals

The standard `summary()` doesn't supply confidence intervals. We can use the `confint()` command to produce them. For example, for the logistic regression:


``` r
confint(fit_log_step)

# exponentiate to turn log-odds into odds ratios (OR);
# an OR < 1 means the predictor is associated with lower odds of the outcome
exp(confint(fit_log_step))
```

---

## Regression tables

As we have seen in the session on *Exploratory Data Analysis*, there are different options for creating tables as output in `R`: [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html) and [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/)

---

## Example: Regression tables with the `stargazer` package


``` r
library(stargazer)

stargazer(
  simple_linear_model,
  type = "text",  # For console output; use "html" or "latex" for other formats
  title = "Regression Results",
  dep.var.labels = c("AI Trust"),
  covariate.labels = c(
    "Main Branch",
    "Participation Frequency",
    "AI Threat",
    "AI Sentiment"
  ),
  out.header = FALSE,  # Exclude header metadata if outputting to file
  digits = 3           # Number of decimal places
)
```

.right[↪️]

---

class: center, middle

.ssmall[

```
## 
## Regression Results
## =====================================================
##                               Dependent variable:    
##                           ---------------------------
##                                    AI Trust          
## -----------------------------------------------------
## Main Branch                        0.015***          
##                                    (0.004)           
##                                                      
## Participation Frequency            -0.003            
##                                    (0.003)           
##                                                      
## AI Threat                          -0.006            
##                                    (0.007)           
##                                                      
## AI Sentiment                       0.505***          
##                                    (0.002)           
##                                                      
## Constant                           1.914***          
##                                    (0.018)           
##                                                      
## -----------------------------------------------------
## Observations                       27,566            
## R2                                  0.682            
## Adjusted R2                         0.682            
## Residual Std. Error          0.605 (df = 27561)      
## F Statistic              14,781.740*** (df = 4; 27561)
## =====================================================
## Note:                     *p<0.1; **p<0.05; ***p<0.01
```
]

---

## Tidy models with `broom`

The [`broom`](https://broom.tidymodels.org/) package provides only three, but very powerful, main functions:

- `tidy()`: creates a `tibble` from your model
- `glance()`: information about your model as `tibble` ('model fit')
- `augment()`: adds information, e.g., individual fitted values to your data

*Note*: With `broom` you can also create a standardized output of other types of models, such as Latent Class Analysis (LCA) (with the package [`poLCA`](https://cran.r-project.org/web/packages/poLCA/index.html)).

---

## `tidy()`


``` r
library(broom)

tidy(simple_linear_model)
```

```
## # A tibble: 5 × 5
##   term                    estimate std.error statistic   p.value
##   <chr>                      <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)              1.91      0.0183    105.    0        
## 2 main_branch              0.0154    0.00395     3.90  0.0000958
## 3 participation_frequency -0.00346   0.00256    -1.35  0.177    
## 4 ai_threat               -0.00559   0.00658    -0.849 0.396    
## 5 ai_sentiment             0.505     0.00208   243.    0
```

---

## `glance()`


``` r
glance(simple_linear_model)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.682         0.682 0.605    14782.       0     4 -25251. 50514. 50563.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

---

## `augment()`


``` r
augment(simple_linear_model)
```

```
## # A tibble: 27,566 × 12
##    .rownames ai_trust main_branch participation_frequency ai_threat ai_sentiment
##    <chr>        <dbl>       <dbl>                   <dbl>     <dbl>        <dbl>
##  1 6              3             4                       6         2            1
##  2 10             2.5           1                       5         1            2
##  3 11             2.5           5                       5         2            1
##  4 14             3.5           5                       2         1            5
##  5 17             2.5           4                       5         2            1
##  6 19             4             1                       2         2            5
##  7 21             4.5           3                       5         1            4
##  8 22             5             1                       6         2            5
##  9 23             5             1                       4         2            5
## 10 27             2             1                       2         1            1
## # ℹ 27,556 more rows
## # ℹ 6 more variables: .fitted <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, .std.resid <dbl>
```

---

## Outlook: Other modeling options

As we've said at the beginning of this session, we only looked at some of the basic confirmatory analysis methods. As you can imagine, however, `R` offers plenty of options for more complex statistical models as well. A few examples:

- Structural equation models with [`lavaan`](https://lavaan.ugent.be/)
- Multilevel/mixed-effects models with [`lme4`](https://cran.r-project.org/web/packages/lme4/index.html)
- Bayesian regression models using [`brms`](https://paul-buerkner.github.io/brms/)
- Mediation analysis using [`mediation`](https://cran.r-project.org/web/packages/mediation/vignettes/mediation.pdf)
  - response and mediator models, sensitivity analyses (`medsens()`)

---

## Extracurricular activities

As it is such a simple and useful package, we recommend exploring the `broom` package a bit further. Alex Hayes, one of the authors of the package, [gave an interesting talk about it at the *RStudio* Conference in 2019](https://rstudio.com/resources/rstudioconf-2019/solving-the-model-representation-problem-with-broom/).

---

class: center, middle

# [Exercise](https://rawcdn.githack.com/nika-akin/data-analysis-with-R/983625a7d096079e0598f2d60ca1281c757d4659/exercises/Exercise_4_1_3_Regression_Reporting.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://rawcdn.githack.com/nika-akin/data-analysis-with-R/983625a7d096079e0598f2d60ca1281c757d4659/solutions/Exercise_4_1_3_Regression_Reporting.html)

---

## Coefficient Plots (parameters)

Create a coefficient plot:


``` r
library(parameters)

model_parameters(fit_log2) %>% 
  plot()
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---

## ggplot2-based coefficient plot


``` r
model_parameters(fit_log2) %>% 
  plot() +
  scale_y_discrete(
    labels = c(  # reversed order
      "Sentiment on AI",
      "Age",
      "Professional stage"
    )
  ) +
  theme_minimal() +
  theme(legend.position = "none")
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />

---

## Prediction plots vs. regression coefficients

- Regression coefficients are sometimes a bit hard to read
  - they are based on the scale of the independent variable when unstandardized
  - it is ambiguous what they actually mean substantively
  - for example, is a coefficient of .2 or .05 (practically) meaningful or not?
- Predictions are a solution
  - they basically draw the regression line into the scatter plot of the Y and X variables
  - we can read off which value of Y is expected at a given level of X

---

## Predictions (sjPlot)


``` r
library(sjPlot)

fit_log2 %>% 
  plot_model(
    type = "pred",
    terms = "age_group"
  )
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />

---

## Customized sjPlot plots


``` r
library(sjPlot)

fit_log2 %>% 
  plot_model(
    type = "pred",
    terms = "age_group"
  ) +
  xlab("Age") +
  ylab("R environment used") +
  ggtitle("Logistic Model") +
  theme_bw()
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />

---

## Interaction effects

Does age moderate the relationship between `main_branch` and R use (`r_used`)?


``` r
fit_log2_interaction <- glm(r_used ~ main_branch:age_group + ai_sentiment,
                            family = binomial, data = df_model)

model_parameters(fit_log2_interaction)
```

```
## Parameter               | Log-Odds |       SE |         95% CI |      z |      p
## --------------------------------------------------------------------------------
## (Intercept)             |    -3.21 |     0.06 | [-3.32, -3.11] | -58.02 | < .001
## ai sentiment            |     0.04 |     0.02 | [ 0.00,  0.07] |   2.24 | 0.025 
## main branch × age group |     0.02 | 4.79e-03 | [ 0.01,  0.03] |   4.21 | < .001
```

---

## Plotting interaction effects

Interaction parameters are difficult to interpret on their own. Hence, we plot the predicted values for the interaction with `sjPlot`:


``` r
fit_log2_interaction %>% 
  plot_model(type = "int") +
  xlab("Age") +
  ylab("R used") +
  ggtitle("Interaction Model") +
  theme_bw()
```

<img src="data:image/png;base64,#4_1_Confirmatory_Data_Analysis_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />
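
To complement the plot with numbers, the interaction term can also be reported as an odds ratio, reusing the `confint()` idea from earlier (a small sketch, not evaluated here):


``` r
# odds ratios with 95% CIs for the interaction model
exp(cbind(OR = coef(fit_log2_interaction), confint(fit_log2_interaction)))
```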