Data Literacty: Introduction to R

.title[
# Data Literacty: Introduction to R
]
.subtitle[
## Confirmatory Data Analysis
]
.author[
### Veronika Batzdorfer
]
.date[
### 2024-11-22
]

---

---

## Content of this session
.pull-left[
**What we will cover**
- Using (more or less) simple regression models

- OLS, GLM, and the like

- How to re-use the results of these models

- How to feed these results into tables
]

- Crazy complex models with elaborated estimators
- e.g., no multilevel models
- also no clustered standard errors

- Bayesian statistics
]

---

### Data

```r
library(tidyverse)

stackoverflow_survey_questions <- read_csv("./data/stackoverflow_survey_questions.csv")
stackoverflow_survey_single_response <- read_csv("./data/stackoverflow_survey_single_response.csv")
qname_levels_single_response_crosswalk <- read_csv("./data/qname_levels_single_response_crosswalk.csv")

df <- stackoverflow_survey_single_response %>%
  rename(
    main_branch = main_branch,
    age_group = age,
    remote_work = remote_work,
    education_level = ed_level,
    years_coding = years_code,
    years_pro_coding = years_code_pro,
    purchase_influence = purchase_influence,
    build_vs_buy = buildvs_buy,
    visit_frequency = so_visit_freq,
    has_account = so_account,
    participation_frequency = so_part_freq,
    community_belief = so_comm,
    ai_usage = ai_select,
    ai_sentiment = ai_sent,
    ai_complexity = ai_complex,
    survey_ease = survey_ease,
    yearly_compensation = converted_comp_yearly,
  )

df <- df %>% 
  mutate(ai_trust = rowMeans(across(ai_sentiment:ai_acc, ~as.numeric(.x))))
```

---

```rpreoproce_cfa

df <- df %>%
  mutate(
    main_branch = case_when(
      main_branch == 1 ~ "Developer by profession",
      main_branch == 2 ~ "Learning to code",
      main_branch == 3 ~ "Not primarily a developer",
      main_branch == 4 ~ "Hobbyist",
      main_branch == 5 ~ "Former developer",
      TRUE ~ as.character(main_branch)
    )
  )

df <- df %>%
  mutate(
    age_group = case_when(
      age_group == 1 ~ "18-24",
      age_group == 2 ~ "25-34",
      age_group == 3 ~ "35-44",
      age_group == 4 ~ "45-54",
      age_group == 5 ~ "55-64",
      age_group == 6 ~ "65+",
      age_group == 7 ~ "Prefer not to say",
      age_group == 8 ~ "Under 18",
      TRUE ~ as.character(age_group)
    )
  )

# Recode remote_work
df <- df %>%
  mutate(
    remote_work = case_when(
      remote_work == 1 ~ "Hybrid",
      remote_work == 2 ~ "In-person",
      remote_work == 3 ~ "Remote",
      TRUE ~ as.character(remote_work)
    )
  )

# Recode education_level
df <- df %>%
  mutate(
    education_level = case_when(
      education_level == 1 ~ "Associate degree",
      education_level == 2 ~ "Bachelor’s degree",
      education_level == 3 ~ "Master’s degree",
      education_level == 4 ~ "Primary/elementary school",
      education_level == 5 ~ "Professional degree",
      education_level == 6 ~ "Secondary school",
      education_level == 7 ~ "Some college",
      education_level == 8 ~ "Other",
      TRUE ~ as.character(education_level)
    )
  )

# Recode so_visit_freq
df <- df %>%
  mutate(
    visit_frequency = case_when(
      visit_frequency == 1 ~ "A few times per month or weekly",
      visit_frequency == 2 ~ "A few times per week",
      visit_frequency == 3 ~ "Daily or almost daily",
      visit_frequency == 4 ~ "Less than once per month or monthly",
      visit_frequency == 5 ~ "Multiple times per day",
      TRUE ~ as.character(visit_frequency)
    )
  )

# Recode ai_threat
df <- df %>%
  mutate(
    ai_threat = case_when(
      ai_threat == 1 ~ "I'm not sure",
      ai_threat == 2 ~ "No",
      ai_threat == 3 ~ "Yes",
      TRUE ~ as.character(ai_threat)
    )
  )

# Recode ai_complex
df <- df %>%
  mutate(
    ai_complexity = case_when(
      ai_complexity == 1 ~ "Bad at complex tasks",
      ai_complexity == 2 ~ "Good",
      ai_complexity == 3 ~ "Neither ",
      ai_complexity == 4 ~ "Very poor at complex tasks",
      ai_complexity == 5 ~ "Very well at complex tasks",
      TRUE ~ as.character(ai_complexity)
    )
  )

df <- df %>% 
  filter(!is.na(ai_threat))

df$ai_threat <- as.factor(df$ai_threat)
df$age_group <- as.factor(df$age_group)

df <- df %>%
  mutate(ai_threat = ifelse(ai_threat %in% c("Yes", "I'm not sure"), 1, 0))

```

---

## `R` is rich in statistical procedures

Generally, if you seek to use a specific statistical method in `R`, chances are quite high that you can easily do that. As we've said before: There's ~~an app~~ a package for that. After all, `R` is a statistical programming language that was originally developed by statisticians.

---

## Formulas in statistical software

Before we start analyzing data, we should make ourselves familiar with some more terminology in `R`. As in other statistical languages, e.g., regression models require the definition of dependent and independent variables. For example, in *Stata* you would write:

```r
y x1 x2 x3
```

*SPSS* is more literate by requiring you to state what your dependent variables are with the `/DEPENDENT` parameter.

---

## `R` is straightforward and literate

`R` combines the best of two worlds: It is straightforward to write formulas and it is quite literate regarding what role a specific element of a formula plays.

```r
y ~ x1 + x2 + x3
```

*Note*: Formulas represent a specific object class in `R`.

```r
class(y ~ x1 + x2 + x3)
```

```
## [1] "formula"
```

---

## Denoting the left-hand side with `~`

In `R`, stating what your dependent variable is very similar to common mathematical notation:

`$$y \sim N(\theta, \epsilon)$$`

It states that a specific relationship is actually _estimated_, but we, fortunately, don't have to specify errors here.

```r
y ~ x1 + x2 + x3
```

Yet, sometimes it may be a good idea to at least explicitly specify the intercept as here:

```r
y ~ 1 + x1 + x2 + x3
```

---

## Intercept

We can also estimate models without an intercept:

```r
y ~ x1 + x2 + x3 - 1
```

Or intercept-only models as well:

```r
y ~ 1
```

---

## Adding predictors with `+`
You can add as many predictors/covariates as you want with the simple `+` operator. See:

```r
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 +
  x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + x22 + x23 + x24
```

There's also a shortcut for using all (remaining) variables in a data set as predictors:

```r
y ~ .
```

---

## Interaction effects with `*` and `:`

We can also easily add interaction effects to a model. As this is the same as multiplying predictor variables, we also use the `*` sign for that.

```r
y ~ x1 * x2
```

The code above creates a model formula that includes both the main effects of `x1` and `x2`, and their interaction denoted by `x1:x2`. We can even be more explicit and write that into the formula directly:

```r
y ~ x1 + x2 + x1:x2
```

---

## Transforming variables within a formula

One last point before we dive into doing some actual analysis is transforming variables. This procedure is rather common in regression analysis. It is also straightforward to do in `R`. For simple transformations this can be done as follows:

```r
y ~ log(x)   # computes the log10 for x
y ~ scale(x) # z-transformation of x
```

We could also change the data type of variables within a function, e.g., by converting a numeric variable to a factor using `as.factor(x)`.

---

## Transforming variables within a formula

If you cannot use a specific function for your tansformation, you have to wrap the operation in the `I()` function. For example:

```r
y ~ x + I(x^2) # add a quadratic term of x
```

*Note*: Of course, there are also functions in `R` for transforming variables (e.g., standardizing or centering) before we use them in a formula. Besides the `base R` function `scale()` the [`datawizard package`](https://easystats.github.io/datawizard/), e.g., provides a few functions for that.

---

## Where to use formulas?

The previous descriptions mainly refer to formulas used in regression models in `R`. However, formulas are also used in other hypothesis testing methods that distinguish between dependent and variables, such as t-tests or ANOVA. We will try out some of those in the following...

---

## Testing group differences in the distribution

A very common methods for analyzing group differences are t-tests. You can use the `t.test()` function from `base R` function to easily perform such a test.

```r
df_filtered <- df %>% filter(age_group %in% c("18-24", "35-44"))

t.test(ai_threat ~ age_group, data = df_filtered)
```

.small[
*Note*: By default, `R` uses [Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test) (instead of [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)) which does not assume homogeneity of variance across groups.
]

---

## Test of normality

What if our data are not normally distributed in the first place, thus, violating one of the basic assumptions of performing t-tests? To check this, we can use a [Shapiro-Wilk](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) test of normality.

```r
shapiro.test(df$ai_trust)
```

---

## Wilcoxon/Mann-Whitney test

If the data are not normally distributed, the [Wilcoxon/Mann-Whitney](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) test can be a suitable alternative.

```r
wilcox.test(ai_trust ~ ai_threat, data = df)
```

---

## Testing multiple groups with an ANOVA

There are situations in which we want to test differences across multiple groups. The classic method to use for this is an analysis of variance (ANOVA) and its many variants (ANCOVA, MANOVA, etc.). Again, you can easily do that in `R` using the `aov()` function.

```r
anova <- aov(ai_trust ~ age_group, data = df)

anova
```

```
## Call:
##    aov(formula = ai_trust ~ age_group, data = df)
## 
## Terms:
##                 age_group Residuals
## Sum of Squares       0.85  42052.74
## Deg. of Freedom         1     37274
## 
## Residual standard error: 1.06217
## Estimated effects may be unbalanced
## 28161 Beobachtungen als fehlend gelöscht
```

---

## Testing multiple groups with an ANOVA

To get some more detailed information, you need to use the `summary()` function we have already seen in the EDA session on the resulting `anova` object.

```r
summary(anova)
```

```
##                Df Sum Sq Mean Sq F value Pr(>F)
## age_group       1      1  0.8493   0.753  0.386
## Residuals   37274  42053  1.1282               
## 28161 Beobachtungen als fehlend gelöscht
```

---

## Post-hoc test

Unfortunately, the previous output only indicates that there are statistically significant differences in the groups. To find out which groups differ from each other we can, for example, compute *Tukey Honest Significant Differences*.

```r
TukeyHSD(anova)
```

---

## Simple linear regressions

This may not come as a surprise at this point, but regression models are also easy to perform in `R`. The function for this is `lm()`.

```r
simple_linear_model <- 
  lm(
    ai_trust ~ main_branch + participation_frequency +years_coding, 
    data = df
  )

simple_linear_model
```

---

```
## 
## Call:
## lm(formula = ai_trust ~ main_branch + participation_frequency + 
##     years_coding, data = df)
## 
## Coefficients:
##             (Intercept)              main_branch  participation_frequency  
##                3.186857                 0.011180                -0.005610  
##            years_coding  
##               -0.002464
```
]

---

## Simple linear regressions

As for the ANOVA, we can get some more detailed (and nicely formatted) output using the well-known `summary()` function.

```r
summary(simple_linear_model)
```

---

```
## 
## Call:
## lm(formula = ai_trust ~ main_branch + participation_frequency + 
##     years_coding, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2210 -0.6654 -0.1552  0.8318  2.4236 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)              3.1868565  0.0242047 131.662 < 0.0000000000000002 ***
## main_branch              0.0111800  0.0070502   1.586             0.112802    
## participation_frequency -0.0056103  0.0045652  -1.229             0.219115    
## years_coding            -0.0024636  0.0006535  -3.770             0.000163 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.073 on 27366 degrees of freedom
##   (38067 Beobachtungen als fehlend gelöscht)
## Multiple R-squared:  0.0006972,	Adjusted R-squared:  0.0005876 
## F-statistic: 6.364 on 3 and 27366 DF,  p-value: 0.0002625
```
]

---

## Checking regression assumptions

As in the case of ANOVA, we do not cover the topic of testing regression assumptions in this session. Of course, when you do some actual analyses, you should definitely do that. But if you want more information about this issue, there are plenty of good tutorials online (such as [this blog post by Ian Ruginski](https://www.ianruginski.com/post/regressionassumptions/) or [this chapter](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/testing-regression-assumptions.html) in the online book [*R for Survey Analysis*](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/)).

---

## Dummy coding of categorical predictors

As you have seen, `R` automatically converts factors in a regression model to dummy-coded variables, with the reference being the first value level. Hence, there is no need to create several variables with dummy codes and add them one by one to the regression formula.

You can inspect the contrast matrix using:

```r
contrasts(df$age_group)
```
]

]

---

## Generalized linear regression

What we have seen so far are estimates for linear OLS regression models. A standard `R` installation provides a multitude of other estimators/link functions, so-called family objects, e.g., binomial logistic or Poisson regression models through the `glm()` function. See `?family` for an overview.

Let's look at the the example of logistic regression. For this  purpose, say we are interested in the effect of the binary indicator of ai threat on all the other variables.

```r
table(df$ai_threat)
```

```
## 
##     1     2     3 
##  8878 30423  5388
```

---

## Logistic regression

```r
simple_model_logistic <- 
  glm(
    ai_threat ~ participation_frequency  + years_coding , 
    family = binomial(link = "logit"),
    data = df
  )

summary(simple_model_logistic)
```

---

]

---

## Assessing model quality

The [`performance` package](https://easystats.github.io/performance/) that is that is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) offers some functions for assessing model quality, including different types of R². A commonly used metric for logistic regression models, e.g., is Nagelkerke's R².

*Note*: The `performance` package also includes several helpful functions for model diagnostics.

---

## Changing the link function

For the fun of it, let's change the link function in our regression model from logit to probit.

```r
simple_model_probit <- 
  glm(
     ai_threat ~ participation_frequency  + years_coding , 
    family = binomial(link = "probit"),
    data = df
  )

summary(simple_model_probit)
```

---

class:middle
.small[

]

---

## Comparing models with an ANOVA

We can also compare models with some standard tools. For example, to examine competing models, such as our logistic and probit regression, we can apply an ANOVA.

```r
anova(simple_model_logistic, simple_model_probit)
```

---

## Comparing model performance

The `performance` package we have previously used for calculating Nagelkerke's R² for our logistic regression model also provides a handy function for comparing model performance.

---

## Other regression variants

While `glm()` already provides quite a few estimators and link functions, depending on the distribution of your dependent variable, you may need more specialized regression models. A few interesting options for that, e.g, include are the [`MASS` package](https://cran.r-project.org/web/packages/MASS/index.html) for negative binomial regression, the [`pscl` package](https://cran.r-project.org/web/packages/pscl/index.html) for zero-inflated (negative) binomial regression and hurdle models, or the [`mcp` package](https://lindeloev.github.io/mcp/) for regression with multiple change points.

---

## Handling regression results

While it is (still) common practice to run regressions, search for 'significant' p-values, and paste the results into a table without interpreting them substantially, this may not be the best thing to do.

---

## Accessing model results in `base R`

Regression results are a specific type/class of objects in `R`. You can use the `str()` function to get an overview of the whole structure of the object (it's a list of different information). For starters, we may want to see what the first level of this list may provide by asking for the names of the included pieces of information:

```r
names(simple_linear_model)
```

```
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"
```

---

## Accessing coefficients

We can access the coefficients from our model as follows:

```r
simple_linear_model$coefficients
```

```
##             (Intercept)             main_branch participation_frequency 
##             3.186856550             0.011179958            -0.005610275 
##            years_coding 
##            -0.002463638
```

---

## Accessing standard errors

`lm` objects are a little bit cumbersome to use as the information is deeply nested within the object. If you want to extract the standard errors, e.g., you can do so as follows:

```r
summary(simple_linear_model)$coefficients[,2]
```

```
##             (Intercept)             main_branch participation_frequency 
##             0.024204746             0.007050154             0.004565233 
##            years_coding 
##             0.000653455
```

---

## Accessing confidence intervals

The standard `summary()` doesn't supply confidence intervals. We can use the `confint()` command to produce them. For example, for the logistic regression:

```r
confint(simple_model_logistic)
```

---

## Regression tables

As we have seen in the session on *Exploratory Data Analysis*, there are different options for creating tables as output in `R`. The ones we have looked at in the EDA session, [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html) and [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/), can be used for creating regression tables. The functions from these packages can also be used for comparing multiple regression models in one table.

---

## Example: Regression tables with the `stargazer` package

```r
library(stargazer)

stargazer(
  simple_linear_model, 
  type = "text",        # For console output; use "html" or "latex" for other formats
  title = "Regression Results",
  dep.var.labels = c("AI Trust"),
  covariate.labels = c(
    "Main Branch", 
    "Participation Frequency", 
    "Years Coding"
  ),
  out.header = FALSE,    # Exclude header metadata if outputting to file
  digits = 3             # Number of decimal places
)
```

---

```
## 
## Regression Results
## ===================================================
##                             Dependent variable:    
##                         ---------------------------
##                                  AI Trust          
## ---------------------------------------------------
## Main Branch                        0.011           
##                                   (0.007)          
##                                                    
## Participation Frequency           -0.006           
##                                   (0.005)          
##                                                    
## Years Coding                     -0.002***         
##                                   (0.001)          
##                                                    
## Constant                         3.187***          
##                                   (0.024)          
##                                                    
## ---------------------------------------------------
## Observations                      27,370           
## R2                                 0.001           
## Adjusted R2                        0.001           
## Residual Std. Error         1.073 (df = 27366)     
## F Statistic              6.364*** (df = 3; 27366)  
## ===================================================
## Note:                   *p<0.1; **p<0.05; ***p<0.01
```
]

---

## tidy models with `broom`

.pull-left[
We have already entered the area of reporting statistical results. We will have a separate session on reporting  with `R Markdown` tomorrow. One thing to note at this point is that more and more developers in `R` were unsatisfied with the diverse output some of the standard regression procedures provide. The outputs may be helpful to look at, but they're usually not great for further processing. For that purpose, we need data frames/tibbles.
]

.pull-right[
<img src="data:image/png;base64,#../img/broom.png" width="70%" style="display: block; margin: auto;" />
]

---

## 3 functions of `broom`

The [`broom`](https://broom.tidymodels.org/) package provides only 3 but very powerful main functions:
- `tidy()`: creates a `tibble` from your model
- `glance()`: information about your model as `tibble` ('model fit')
- `augment()`: adds information, e.g., individual fitted values to your data

*Note*: With `broom` you can also create a standardized output of other types of models, such as Latent Class Analysis (LCA) (with the package [`poLCA`](https://cran.r-project.org/web/packages/poLCA/index.html)).

---

## `tidy()`

```r
library(broom)

tidy(simple_linear_model)
```

```
## # A tibble: 4 × 5
##   term                    estimate std.error statistic  p.value
##   <chr>                      <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              3.19     0.0242      132.   0       
## 2 main_branch              0.0112   0.00705       1.59 0.113   
## 3 participation_frequency -0.00561  0.00457      -1.23 0.219   
## 4 years_coding            -0.00246  0.000653     -3.77 0.000163
```

---

## `glance()`

```r
glance(simple_linear_model)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1  0.000697      0.000588  1.07      6.36 0.000263     3 -40757. 81524. 81565.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

---

## `augment()`

```r
augment(simple_linear_model)
```

```
## # A tibble: 27,370 × 11
##    .rownames ai_trust main_branch participation_frequency years_coding .fitted
##    <chr>        <dbl>       <dbl>                   <dbl>        <dbl>   <dbl>
##  1 6              3             4                       6           10    3.17
##  2 10             2.5           1                       5           15    3.13
##  3 11             2.5           5                       5           20    3.17
##  4 14             3.5           5                       2           20    3.18
##  5 17             2.5           4                       5            7    3.19
##  6 19             4             1                       2           12    3.16
##  7 21             4.5           3                       5            5    3.18
##  8 22             5             1                       6           36    3.08
##  9 23             5             1                       4           25    3.11
## 10 27             2             1                       2           20    3.14
## # ℹ 27,360 more rows
## # ℹ 5 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## #   .std.resid <dbl>
```

---

## Outlook: Other modeling options

As we've said at the beginning of this session, we only looked at some of the basic confirmatory analysis methods. As you can imagine, however, `R` offers plenty of options more more complex statistical models as well. A few examples:

- Structural equation models with [`lavaan`](https://lavaan.ugent.be/)

- Multilevel/mixed-effects models with [`lme4`](https://cran.r-project.org/web/packages/lme4/index.html)

- Bayesian regression models using [`brms`](https://paul-buerkner.github.io/brms/)

---

## Extracurricular activities

As it is such a simple and useful package, we recommend exploring the `broom` package a bit further. Alex Hayes, one authors of the package, [gave an interesting talk about it at the *RStudio* Conference in 2019](https://rstudio.com/resources/rstudioconf-2019/solving-the-model-representation-problem-with-broom/).