Modelling

Reporting, summarising and communicating models in R

Last session we talked a bit about how you might scrape data from the web. Collecting data is not an end in itself, but rather the first step in answering your research question. When it comes to analyzing trends, testing hypotheses, and predicting outcomes, modelling enters the picture.

In general, remember that your basic workflow for evaluating and reporting models is as follows: Import → Tidy → Transform/Visualise/Model → Communicate.

Today we will mostly deal with the “Model” and “Communicate” steps of this workflow.

In this session, we won’t focus on the statistical considerations that inform your model choice, but rather show you how to:

  • Use formulas to specify multiple models
  • Process (estimate) model results with broom
  • Summarize outputs with modelsummary
  • Communicate results through plots and tables

You will have full courses dedicated to thinking about modelling choices, such as Causal Inference and Machine Learning.


Setup

Before we start coding, we need to load several packages as well as our data. We’ll be using data from the World Health Organization on life expectancy.

# load packages 
pacman::p_load(kableExtra, # for nicely outputted HTML tables
               tidyverse, # for the suite of tidy packages
               broom, # to extract key information from statistical models into tidy data 
               modelsummary, # flexible model output tables 
               specr, # to conduct specification curve analyses
               janitor, # to clean up our data
               modelr, # tidy syntax modelling
               effectsize, # to standardize variables
               wesanderson) # color palette for ggplot2

# load data
life_expec_dat <- readr::read_csv("life_expectancy.csv") |> 
  janitor::clean_names() |> 
  dplyr::mutate(large = if_else(population > 50000000, 1, 0))  # pop size > 50M

head(life_expec_dat)
## # A tibble: 6 × 23
##   country      year status life_expectancy adult_mortality infant_deaths alcohol
##   <chr>       <dbl> <chr>            <dbl>           <dbl>         <dbl>   <dbl>
## 1 Afghanistan  2015 Devel…            65               263            62    0.01
## 2 Afghanistan  2014 Devel…            59.9             271            64    0.01
## 3 Afghanistan  2013 Devel…            59.9             268            66    0.01
## 4 Afghanistan  2012 Devel…            59.5             272            69    0.01
## 5 Afghanistan  2011 Devel…            59.2             275            71    0.01
## 6 Afghanistan  2010 Devel…            58.8             279            74    0.01
## # ℹ 16 more variables: percentage_expenditure <dbl>, hepatitis_b <dbl>,
## #   measles <dbl>, bmi <dbl>, under_five_deaths <dbl>, polio <dbl>,
## #   total_expenditure <dbl>, diphtheria <dbl>, hiv_aids <dbl>, gdp <dbl>,
## #   population <dbl>, thinness_1_19_years <dbl>, thinness_5_9_years <dbl>,
## #   income_composition_of_resources <dbl>, schooling <dbl>, large <dbl>

Using Formulas in R 👩‍🏫

R, as a programming language, was specifically designed with statistical analysis in mind. Therefore, the founding fathers of R 🙌, in their infinite wisdom, decided to create a special object class (called formula) to help with running models.


Crafting formulas

The basic structure of a formula:

y ~ x

Running a model with a formula is straightforward. Note that we don’t even have to wrap the formula in quotation marks:

lm(life_expectancy ~ gdp, data = life_expec_dat)
## 
## Call:
## lm(formula = life_expectancy ~ gdp, data = life_expec_dat)
## 
## Coefficients:
## (Intercept)          gdp  
##    6.70e+01     3.12e-04

Note: Both sides of this equation go by many names. The left-hand side, often written as y, may be called the dependent variable, response variable, or target variable, depending on context. The right-hand side, often written as X, may be referred to as the independent variables, explanatory variables, or regressors. These terms have subtle differences in meaning depending on the context, but for our purposes today it’s enough to know that they each refer to one of the two sides of a regression equation.

Syntax

A quick recap of the syntax used within formulas. The ~ (tilde) sign generally separates the dependent variable from the independent variables. The + sign is used to separate individual independent variables. Here are a few more common formula inputs that are useful to know:

y ~ x1 + x2 # Standard formula

y ~ . # Including all other columns in data as x variables.

y ~ x1 - x2 # Remove x2 from the model (most useful in forms like y ~ . - x2)

y ~ x1 * x2 # Add the interaction term between x1 and x2
y ~ x1 + x2 + x1:x2 # The same as the above

y ~ x1:x2  # Only the interaction, no main effects (rare that you would want this)

y ~ x + I(x^2) # Adds the higher-order term x^2 (I() ensures ^ is treated as arithmetic)
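A quick way to convince yourself of the interaction shorthand is the terms() function, which shows how R expands a formula. A small check, using the built-in mtcars data as a stand-in:

```r
# x1 * x2 expands to both main effects plus their interaction
f_star <- mpg ~ wt * hp
f_long <- mpg ~ wt + hp + wt:hp

attr(terms(f_star), "term.labels")
## [1] "wt"    "hp"    "wt:hp"

# both formulas therefore fit the same model
all.equal(coef(lm(f_star, data = mtcars)), coef(lm(f_long, data = mtcars)))
## [1] TRUE
```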

A More Explicit Way to Create Formulas

Formulas are generally straightforward to work with. To make things clearer and easier to automate, we can split the process into two steps:

Step 1: Create a string containing the written formula.

formula_string <- paste("life_expectancy", "~ gdp")
formula_string
## [1] "life_expectancy ~ gdp"

Step 2: Transform the string into R’s formula class. Only then will your modeling function accept it as input.

formula <- as.formula(formula_string)
formula
## life_expectancy ~ gdp

Here we can see that the formula we created above is identical to specifying it directly:

reg1 <- lm(formula, data = life_expec_dat) 
reg2 <- lm(life_expectancy ~ gdp, data = life_expec_dat)

reg1$coefficients == reg2$coefficients
## (Intercept)         gdp 
##        TRUE        TRUE

Exercise 1 Using the two-step approach, create formulas that perform the following regressions; in each case, run the model and inspect the output:

  • Regress life_expectancy on gdp, population, and alcohol

  • Regress life_expectancy on gdp, population, alcohol, and the interaction between gdp and population

  • Regress life_expectancy on gdp, the square of gdp and the log of population


Iteration

Often, the modelling process requires you to run the same specification with multiple configurations of both dependent and independent variables. Model formulas make running many similar models much easier:

Step 1: Define a function that lets you plug in different variables for x or y. The function should take the x and/or y variable(s) as a string and return a model object.

# function to include different y variables
lm_fun_iter_y <-  function(dep_var) {
  lm(as.formula(paste(dep_var, "~ gdp")), data = life_expec_dat)
}

# function to include different x variables 
lm_fun_iter_x <- function(indep_var) {
  lm(as.formula(paste("life_expectancy ~", paste(indep_var, collapse = "+"))), data = life_expec_dat)
}

Notice: Considering that your model will likely include many independent variables, it’s unlikely that you’ll run a simple bivariate regression. Therefore, we use a nested paste() call above: the inner paste() combines all independent variables contained in a character vector with +, via its collapse argument.
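For illustration, here is what the nested paste() produces for a hypothetical covariate vector:

```r
# inner paste() collapses the covariates with "+", outer paste() prepends the DV
indep_var <- c("gdp", "population", "alcohol")
paste("life_expectancy ~", paste(indep_var, collapse = "+"))
## [1] "life_expectancy ~ gdp+population+alcohol"
```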

Step 2: Use purrr::map() (refer to lab sessions 2 and 3) to iteratively apply the model to each variable contained in a character vector of variables:

# create vector of variables to iterate over
vars <- life_expec_dat |> 
  dplyr::select(-c("life_expectancy", "country")) |> 
  names()

vars
##  [1] "year"                            "status"                         
##  [3] "adult_mortality"                 "infant_deaths"                  
##  [5] "alcohol"                         "percentage_expenditure"         
##  [7] "hepatitis_b"                     "measles"                        
##  [9] "bmi"                             "under_five_deaths"              
## [11] "polio"                           "total_expenditure"              
## [13] "diphtheria"                      "hiv_aids"                       
## [15] "gdp"                             "population"                     
## [17] "thinness_1_19_years"             "thinness_5_9_years"             
## [19] "income_composition_of_resources" "schooling"                      
## [21] "large"
# run a bivariate model for each column
biv_model_out <-  vars |>
  purrr::map(lm_fun_iter_x)

biv_model_out
## [[1]]
## 
## Call:
## lm(formula = as.formula(paste("life_expectancy ~", paste(indep_var, 
##     collapse = "+"))), data = life_expec_dat)
## 
## Coefficients:
## (Intercept)         year  
##    -635.872        0.351  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = as.formula(paste("life_expectancy ~", paste(indep_var, 
##     collapse = "+"))), data = life_expec_dat)
## 
## Coefficients:
##      (Intercept)  statusDeveloping  
##             79.2             -12.1  
....

This returns output from all of the models that you’ve just run. Given that there are now quite a few models, it might be a good idea to keep the names of the independent variables that you feed into the model. You can do this with purrr::set_names().

# run a bivariate model for each column
biv_model_out_w_names <-  vars |>
  purrr::set_names() |> 
  purrr::map(lm_fun_iter_x)

biv_model_out_w_names$year # we can now easily index independent vars
## 
## Call:
## lm(formula = as.formula(paste("life_expectancy ~", paste(indep_var, 
##     collapse = "+"))), data = life_expec_dat)
## 
## Coefficients:
## (Intercept)         year  
##    -635.872        0.351

Say that you’re interested in seeing what effect gdp has on life_expectancy, but you suspect that other covariates such as alcohol consumption, health care expenditure, average body mass index and the prevalence of HIV/AIDS also have an effect.

On top of this, it’s possible that these variables might have different effects on life_expectancy depending on how they interact with each other. We can determine which independent variables are robustly associated with the dependent variable by generating models for every possible combination of independent variables and then looking into the distribution of effects across these models.

Let’s review the code introduced in the lecture:

Step 1: Find out how many possible combinations of independent and dependent variables exist:

combinations <- 
  purrr::map(1:5, function(x) {combn(1:5, x)}) |> # create all possible combinations (assume we have 5 covariates)
  purrr::map(ncol) |> # extract number of combinations
  unlist() |> # pull out of list structure
  sum() # compute sum

combinations
## [1] 31
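The same count can be verified directly with the binomial coefficient, summing over subset sizes 1 to 5:

```r
# number of non-empty subsets of 5 covariates: C(5,1) + ... + C(5,5) = 2^5 - 1
sum(choose(5, 1:5))
## [1] 31
```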

Step 2: Write a function that can run all possible combinations in one go:

combn_models <- function(depvar, covars, data) {
  
  # initialize empty list
  combn_list <- list()
  
  # generate list of covariate combinations
  for (i in seq_along(covars)) {
    combn_list[[i]] <- combn(covars, i, simplify = FALSE)
  } # what is combn?
  
  # unlist to make object easier to work with
  combn_list <-
    unlist(combn_list, recursive = FALSE) 
  
  # function to generate formulas
  gen_formula <- function(covars, depvar) {
    form <-
      as.formula(paste0(depvar, " ~ ", paste0(covars, collapse = "+")))
    form
  }
  
  # generate formulas
  formulas_list <-
    purrr::map(combn_list, gen_formula, depvar = depvar)
  
  # run models
  models_list <- purrr::map(formulas_list, lm, data = data)
  models_list
}
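To answer the “what is combn?” question in the code above: combn() enumerates all subsets of a given size. A small illustration with three hypothetical covariate names:

```r
# all size-2 subsets of three covariates, returned as a list of character vectors
combn(c("gdp", "alcohol", "bmi"), 2, simplify = FALSE)
```

This returns a list of the three pairs: (gdp, alcohol), (gdp, bmi), and (alcohol, bmi). Looping i over 1 to length(covars), as the function does, therefore produces every possible covariate combination.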

Step 3: Apply the function to our case:

depvar <- "life_expectancy"
covars <-  c("gdp", "percentage_expenditure", "alcohol", "bmi", "hiv_aids")
multiv_model_out <- combn_models(depvar = depvar, covars = covars, data = life_expec_dat)

# check whether we have the correct number of models
length(multiv_model_out)
## [1] 31

Exercise 2

Look at the multiv_model_out object that you have just created. Try to get a sense of the structure of the output. Pick three different models and inspect their output using the summary() function. Consider which model you think is “best” and why.

As you can see, the function introduced in the lecture is quite powerful, but it is by no means the only option for testing multiple model specifications. Depending on the type of analysis that you plan to perform, certain R packages enable you to run many different combinations of regression models with just a few lines of code. For example, we can perform multiple estimations with the fixest package.


Specification Curve Analysis ⤴️

In the example above, we consciously chose which models to run. Ideally, this decision should be grounded in theory and knowledge of the subject matter. Alternatively, some researchers argue that such choices are inherently arbitrary, potentially resulting in biased model specification. They propose taking a more analytical approach to model specification which they refer to as Specification Curve Analysis, or Multiverse Analysis.

We can implement this approach using the specr package. In simple terms, it works as follows:

Step 1: Specify a list of reasonable “model ingredients” for each model parameter:

specs <- specr::setup(
  data = life_expec_dat, 
  y = c("life_expectancy"), # add dependent variable 
  x = c("gdp", "log(gdp)", "I(gdp^2)"), # add independent variables 
  model = c("lm", "glm"), # specify model types
  controls = c("population", "alcohol", "bmi"), # specify controls
  subsets = as.list(distinct(life_expec_dat, status)) # specify subgroup
)

summary(specs, rows = 20)
## Setup for the Specification Curve Analysis
## -------------------------------------------
## Class:                      specr.setup -- version: 1.0.0 
## Number of specifications:   144 
## 
## Specifications:
## 
##   Independent variable:     gdp, log(gdp), I(gdp^2) 
##   Dependent variable:       life_expectancy 
##   Models:                   lm, glm 
##   Covariates:               no covariates, population, alcohol, bmi, population + alcohol, population + bmi, alcohol + bmi, population + alcohol + bmi 
##   Subsets analyses:         Developing, Developed, all 
## 
## Function used to extract parameters:
## 
##   function (x) 
## broom::tidy(x, conf.int = TRUE)
## <environment: 0x0000012eda638e80>
## 
## 
## Head of specifications table (first 20 rows):
## # A tibble: 20 × 7
##    x     y               model controls             subsets    status    formula
##    <chr> <chr>           <chr> <chr>                <chr>      <fct>     <glue> 
##  1 gdp   life_expectancy lm    no covariates        Developing Developi… life_e…
##  2 gdp   life_expectancy lm    no covariates        Developed  Developed life_e…
##  3 gdp   life_expectancy lm    no covariates        all        <NA>      life_e…
##  4 gdp   life_expectancy lm    population           Developing Developi… life_e…
##  5 gdp   life_expectancy lm    population           Developed  Developed life_e…
##  6 gdp   life_expectancy lm    population           all        <NA>      life_e…
##  7 gdp   life_expectancy lm    alcohol              Developing Developi… life_e…
##  8 gdp   life_expectancy lm    alcohol              Developed  Developed life_e…
##  9 gdp   life_expectancy lm    alcohol              all        <NA>      life_e…
## 10 gdp   life_expectancy lm    bmi                  Developing Developi… life_e…
## 11 gdp   life_expectancy lm    bmi                  Developed  Developed life_e…
## 12 gdp   life_expectancy lm    bmi                  all        <NA>      life_e…
## 13 gdp   life_expectancy lm    population + alcohol Developing Developi… life_e…
## 14 gdp   life_expectancy lm    population + alcohol Developed  Developed life_e…
## 15 gdp   life_expectancy lm    population + alcohol all        <NA>      life_e…
## 16 gdp   life_expectancy lm    population + bmi     Developing Developi… life_e…
## 17 gdp   life_expectancy lm    population + bmi     Developed  Developed life_e…
## 18 gdp   life_expectancy lm    population + bmi     all        <NA>      life_e…
## 19 gdp   life_expectancy lm    alcohol + bmi        Developing Developi… life_e…
## 20 gdp   life_expectancy lm    alcohol + bmi        Developed  Developed life_e…

Step 2: Run Specification Curve Analysis:

# run specification curve analysis
results <- specr::specr(specs)

Step 3: Inspect the specification curve to understand how robust your findings are to different analytical choices.

These plots help to illustrate where and when you had to make a choice and how many models you might alternatively have specified.

plot(specs, circular = TRUE)

More concretely you could also look at some summary statistics for your p-values across the different model specifications:

summary(results, 
        type = "curve",
        group = c("x", "controls")  # group analysis by model choices
)
## # A tibble: 24 × 9
## # Groups:   x [3]
##    x        controls     median      mad      min     max      q25     q75   obs
##    <chr>    <chr>         <dbl>    <dbl>    <dbl>   <dbl>    <dbl>   <dbl> <dbl>
##  1 I(gdp^2) alcohol     2.31e-9 2.47e- 9 6.45e-10 4.38e-9 1.06e- 9 3.86e-9  1905
##  2 I(gdp^2) alcohol + … 1.77e-9 5.33e-10 6.45e-10 2.13e-9 9.26e-10 2.04e-9  1890
##  3 I(gdp^2) bmi         2.08e-9 8.32e-11 6.37e-10 2.13e-9 9.96e-10 2.12e-9  2013
##  4 I(gdp^2) no covaria… 3.12e-9 2.17e- 9 6.41e-10 4.58e-9 1.26e- 9 4.22e-9  2037
##  5 I(gdp^2) population  3.30e-9 3.94e- 9 6.41e-10 1.15e-8 1.31e- 9 9.42e-9  1846
##  6 I(gdp^2) population… 2.09e-9 2.12e- 9 6.59e-10 9.98e-9 1.02e- 9 8.01e-9  1726
##  7 I(gdp^2) population… 1.75e-9 1.63e- 9 6.56e-10 6.57e-9 9.30e-10 5.37e-9  1711
##  8 I(gdp^2) population… 2.23e-9 2.35e- 9 6.38e-10 6.63e-9 1.03e- 9 5.53e-9  1822
##  9 gdp      alcohol     2.50e-4 1.98e- 4 6.26e- 5 3.83e-4 1.09e- 4 3.50e-4  1905
## 10 gdp      alcohol + … 1.87e-4 4.95e- 5 6.26e- 5 2.21e-4 9.38e- 5 2.12e-4  1890
## # ℹ 14 more rows

Note that the statistic mad stands for median absolute deviation.

A more intuitive way of inspecting the results might be to visualize them.

# plot entire visualization
plot(results)

Plot A denotes:

  • X-axis = ordered model specifications (from most negative effect to most positive effect)
  • Y-axis = standardized regression coefficient for the IV-DV relationship
  • Points = the standardized regression coefficient for each specific model
  • Error bars = 95% confidence intervals around the point estimate

Plot B denotes:

  • X-axis = ordered model specifications (the same as panel A)
  • Y-axis (right) = analytic decision categories
  • Y-axis (left) = specific analytic decisions within each category
  • Lines = indicate whether a specific analytic decision was true for that particular model specification

Interpretable Effect Sizes 🔎

While these plots make it easy to compare effects across coefficients, unit changes may not be comparable across variables. We can address this problem in several ways:

  • Re-scale variables to show intuitive unit changes in X (e.g., $1k instead of $1)
  • Re-scale to full scale (minimum to maximum) changes in X
  • Standardize variables to show standard deviation changes in X

Let’s try standardizing our variables to address the problem.

# copy our data
life_expec_dat_stand <- life_expec_dat

# standardize variables
life_expec_dat_stand <- life_expec_dat_stand |> 
  dplyr::mutate(gdp_squared = gdp^2,
                log_gdp = log(gdp)) |> # first, explicitly create the squared and log gdp variables
  dplyr::mutate(gdp = effectsize::standardize(gdp),
                gdp_squared = effectsize::standardize(gdp_squared),
                log_gdp = effectsize::standardize(log_gdp)) # then standardize all three

# specify reasonable model "ingredients"
specs_stand <- specr::setup(
  data = life_expec_dat_stand, 
  y = c("life_expectancy"), # add dependent variable 
  x = c("gdp", "log_gdp", "gdp_squared"), # add independent variables 
  model = c("lm", "glm"), # specify model types
  controls = c("population", "alcohol", "bmi"), # specify controls
  subsets = as.list(distinct(life_expec_dat_stand, status)) # specify subgroup
)

# run specification curve analysis
results_stand <- specr::specr(specs_stand)

# plot entire visualization
plot(results_stand) 

Note that any re-scaling operation will affect how you interpret the coefficients! For more details, check out the effectsize package.
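As an aside, if you want to avoid an extra dependency, base R’s scale() performs the same z-standardization as effectsize::standardize() in its default settings. A minimal sketch, using the built-in mtcars data as a stand-in:

```r
# z-standardize a numeric vector: subtract the mean, divide by the standard deviation
z <- as.numeric(scale(mtcars$hp))

round(mean(z), 10)  # 0 (up to floating-point error)
round(sd(z), 10)    # 1
```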

Without putting much thought into our model, it looks like gdp (depending on the specification) can have either a small positive effect or a substantial positive effect on life_expectancy. This just goes to show how carefully one has to approach the task of correctly specifying a model!

🚨 Watch Out! ⚠️

Be careful and selective when you use specification curve analysis! It is not a tool that you should apply to each and every potential paper, especially if you have not taken a course on causal inference. Specification curve analysis, when it includes variables that act as colliders or mediators, might actually legitimize an otherwise wrongly specified model!


Model Output with broom 🧹

In the lecture we saw that lists of different combinations of model specifications can be unwieldy. We are usually interested in three main aspects of our model outputs:

  1. Estimated coefficients and associated standard errors, t-statistics, p-values, confidence intervals
  2. Model summaries including goodness of fit measures, information on model convergence, number of observations used
  3. Observation-level information that arises from the estimated model, such as fitted/predicted values, residuals, estimates of influence

Extracting this information from the model output is possible but not straightforward (see Appendix for more):

str(multiv_model_out[[30]])
## List of 13
##  $ coefficients : Named num [1:5] 61.521341 0.000928 0.460674 0.165909 -0.816043
##   ..- attr(*, "names")= chr [1:5] "(Intercept)" "percentage_expenditure" "alcohol" "bmi" ...
##  $ residuals    : Named num [1:2720] 0.321 -4.698 -4.615 -4.937 -5.105 ...
##   ..- attr(*, "names")= chr [1:2720] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:2720] -3610 198 142 212 -216 ...
##   ..- attr(*, "names")= chr [1:2720] "(Intercept)" "percentage_expenditure" "alcohol" "bmi" ...
##  $ rank         : int 5
##  $ fitted.values: Named num [1:2720] 64.7 64.6 64.5 64.4 64.3 ...
##   ..- attr(*, "names")= chr [1:2720] "1" "2" "3" "4" ...
....
summary(multiv_model_out[[30]])
## 
## Call:
## .f(formula = .x[[i]], data = ..1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33.69  -3.61   0.26   3.47  21.82 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
....

Tidying Model Objects

Luckily, we don’t have to deal with this. We can use the tidyverse’s broom package, which is part of a collection of modelling packages adhering to tidyverse principles, referred to as tidymodels. Tidymodels offers a lot of cool packages that have their own tutorials, so check them out! 💪.

In today’s session, we will only focus on broom. Here are the three key broom functions that you should know:

  1. broom::tidy(): Summarizes information about model components.
broom::tidy(multiv_model_out[[30]], conf.int = TRUE, conf.level = 0.95)
## # A tibble: 5 × 7
##   term                 estimate std.error statistic   p.value conf.low conf.high
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)           6.15e+1 0.273         226.  0          6.10e+1  62.1    
## 2 percentage_expendit…  9.28e-4 0.0000602      15.4 1.94e- 51  8.10e-4   0.00105
## 3 alcohol               4.61e-1 0.0314         14.7 6.97e- 47  3.99e-1   0.522  
## 4 bmi                   1.66e-1 0.00638        26.0 3.46e-133  1.53e-1   0.178  
## 5 hiv_aids             -8.16e-1 0.0226        -36.1 2.43e-233 -8.60e-1  -0.772
  2. broom::glance(): Reports information about the entire model.
broom::glance(multiv_model_out[[30]])
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.608         0.607  5.99     1052.       0     4 -8727. 17465. 17501.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
  3. broom::augment(): Adds information about observations to a dataset.
broom::augment(multiv_model_out[[30]], se_fit = TRUE) 
## # A tibble: 2,720 × 13
##    .rownames life_expectancy percentage_expenditure alcohol   bmi hiv_aids
##    <chr>               <dbl>                  <dbl>   <dbl> <dbl>    <dbl>
##  1 1                    65                    71.3     0.01  19.1      0.1
##  2 2                    59.9                  73.5     0.01  18.6      0.1
##  3 3                    59.9                  73.2     0.01  18.1      0.1
##  4 4                    59.5                  78.2     0.01  17.6      0.1
##  5 5                    59.2                   7.10    0.01  17.2      0.1
##  6 6                    58.8                  79.7     0.01  16.7      0.1
##  7 7                    58.6                  56.8     0.01  16.2      0.1
##  8 8                    58.1                  25.9     0.03  15.7      0.1
##  9 9                    57.5                  10.9     0.02  15.2      0.1
## 10 10                   57.3                  17.2     0.03  14.7      0.1
## # ℹ 2,710 more rows
## # ℹ 7 more variables: .fitted <dbl>, .se.fit <dbl>, .resid <dbl>, .hat <dbl>,
## #   .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>

Doesn’t this look so much tidier?


Broom with Many Models

Where broom really shines is when it comes to dealing with multiple models like we have in our multiv_model_out. It allows us to easily move away from hard-to-deal-with lists while retaining each of the three important model outputs. Since our example replicates several variables in different models, we include the .id argument in purrr::map_dfr(). This adds a column to our tibble containing a model identifier. In our example, the identifier of each model corresponds to the index number of the model output.

multiv_model_out_broom <- purrr::map_dfr(multiv_model_out[29:31], # the model objects
                                         broom::tidy, # the function to iterate with
                                         .id = "model_type") # id column

multiv_model_out_broom
## # A tibble: 16 × 6
##    model_type term                     estimate  std.error statistic   p.value
##    <chr>      <chr>                       <dbl>      <dbl>     <dbl>     <dbl>
##  1 1          (Intercept)            61.5       0.292        210.    0        
##  2 1          gdp                     0.000160  0.00000927    17.2   1.13e- 62
##  3 1          alcohol                 0.392     0.0339        11.6   4.20e- 30
##  4 1          bmi                     0.168     0.00693       24.2   2.72e-115
##  5 1          hiv_aids               -0.779     0.0227       -34.4   1.69e-209
##  6 2          (Intercept)            61.5       0.273        226.    0        
##  7 2          percentage_expenditure  0.000928  0.0000602     15.4   1.94e- 51
##  8 2          alcohol                 0.461     0.0314        14.7   6.97e- 47
##  9 2          bmi                     0.166     0.00638       26.0   3.46e-133
## 10 2          hiv_aids               -0.816     0.0226       -36.1   2.43e-233
## 11 3          (Intercept)            61.5       0.293        210.    0        
## 12 3          gdp                     0.000166  0.0000219      7.59  4.69e- 14
## 13 3          percentage_expenditure -0.0000464 0.000145      -0.321 7.48e-  1
## 14 3          alcohol                 0.394     0.0344        11.5   1.27e- 29
## 15 3          bmi                     0.167     0.00695       24.1   1.26e-114
## 16 3          hiv_aids               -0.779     0.0227       -34.4   2.18e-209

Now that we have multiple model outputs in a format that we can work with, it is time to start thinking about visualizing our results. You can plot the residuals, the goodness of fit statistics etc. Similarly, you could also use broom::glance() to provide some robustness statistics or descriptive tables.
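As a sketch of this idea (built on the built-in mtcars data so the chunk stands alone), a tidied output can be turned into a coefficient plot in a few lines:

```r
library(ggplot2)

# tidy a simple model, keeping 95% confidence intervals
tidy_out <- broom::tidy(lm(mpg ~ wt + hp, data = mtcars), conf.int = TRUE)

# plot each coefficient with its confidence interval (intercept omitted)
tidy_out |>
  dplyr::filter(term != "(Intercept)") |>
  ggplot(aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  labs(x = "Estimate (95% CI)", y = NULL)
```

The same pattern applies directly to multiv_model_out_broom, with model_type mapped to colour or facets.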

The next section will focus on how to present your results!


Results with modelsummary 🥼

For this we will use the modelsummary package by Vincent Arel-Bundock. Though there are many packages to choose from when it comes to communicating your results, we highly recommend you give modelsummary a shot!

One reason that we like this package is that it combines both approaches of presenting model results, either as a table or as a coefficient plot. As we saw in the lecture, each method has its advantages and its drawbacks. Therefore, we will cover both here.


Using Regression Tables

The workhorse function in modelsummary is unsurprisingly modelsummary(). 🤔

Here we only need to input the models list as the function does all of the tidying for us.

modelsummary::modelsummary(multiv_model_out[29:31], 
                           output = "kableExtra", # can be set to multiple different output formats (see the documentation for more information)
                           escape = FALSE)
                         (1)        (2)        (3)
(Intercept)              61.459     61.521     61.451
                         (0.292)    (0.273)    (0.293)
gdp                      0.000                 0.000
                         (0.000)               (0.000)
alcohol                  0.392      0.461      0.394
                         (0.034)    (0.031)    (0.034)
bmi                      0.168      0.166      0.167
                         (0.007)    (0.006)    (0.007)
hiv_aids                 −0.779     −0.816     −0.779
                         (0.023)    (0.023)    (0.023)
percentage_expenditure              0.001      −0.000
                                    (0.000)    (0.000)
Num.Obs.                 2313       2720       2313
R2                       0.631      0.608      0.631
R2 Adj.                  0.630      0.607      0.630
AIC                      14785.8    17465.5    14787.7
BIC                      14820.3    17500.9    14828.0
Log.Lik.                 −7386.925  −8726.735  −7386.873
RMSE                     5.90       5.99       5.90

It looks pretty solid already, but can definitely be improved upon.

For example, we can set an acceptable number of digits:

modelsummary::modelsummary(multiv_model_out[[31]], 
                           output = "kableExtra",
                           fmt = "%.2f", # 2-digits and trailing 0
                           escape = FALSE) # to avoid escaping special characters in kableExtra

We can also report only the confidence intervals by getting rid of the p-values and the intercept term:

model_table <- modelsummary::modelsummary(multiv_model_out[29:31], 
                                          output = "kableExtra",
                                          fmt = "%.2f",  # 2-digits and trailing 0  
                                          estimate  = "{estimate}", 
                                          statistic = "conf.int",
                                          coef_omit = "Intercept",
                                          escape = FALSE) 
model_table
                         (1)             (2)             (3)
gdp                      0.00                            0.00
                         [0.00, 0.00]                    [0.00, 0.00]
alcohol                  0.39            0.46            0.39
                         [0.33, 0.46]    [0.40, 0.52]    [0.33, 0.46]
bmi                      0.17            0.17            0.17
                         [0.15, 0.18]    [0.15, 0.18]    [0.15, 0.18]
hiv_aids                 −0.78           −0.82           −0.78
                         [−0.82, −0.73]  [−0.86, −0.77]  [−0.82, −0.73]
percentage_expenditure                   0.00            −0.00
                                         [0.00, 0.00]    [−0.00, 0.00]
Num.Obs.                 2313            2720            2313
R2                       0.631           0.608           0.631
R2 Adj.                  0.630           0.607           0.630
AIC                      14785.8         17465.5         14787.7
BIC                      14820.3         17500.9         14828.0
Log.Lik.                 −7386.925       −8726.735       −7386.873
RMSE                     5.90            5.99            5.90

Wouldn’t it look more professional if our coefficients began with a capital letter? And how about we get rid of some of the bloated goodness-of-fit statistics?

mod_table <- modelsummary::modelsummary(multiv_model_out[29:31], 
                                        output = "kableExtra",
                                        fmt = "%.2f",  # 2-digits and trailing 0  
                                        estimate  = "{estimate}",
                                        statistic = "conf.int",
                                        coef_omit = "Intercept",
                                        coef_rename=c("gdp"="Gdp", 
                                                      "bmi"="Avg. BMI", 
                                                      "alcohol" = "Alcohol Consum.",
                                                      "hiv_aids"= "HIV cases", 
                                                      "percentage_expenditure" = "Health Expenditure (% of GDP)",
                                                      "life_expectancy" = "Life Expectancy"),
                                        gof_omit = 'DF|Deviance|Log.Lik|AIC|BIC',
                                        title = 'An Improved Regression Table',
                                        escape = FALSE)

mod_table
An Improved Regression Table
                                 (1)              (2)              (3)
 Gdp                             0.00                              0.00
                                 [0.00, 0.00]                      [0.00, 0.00]
 Alcohol Consum.                 0.39             0.46             0.39
                                 [0.33, 0.46]     [0.40, 0.52]     [0.33, 0.46]
 Avg. BMI                        0.17             0.17             0.17
                                 [0.15, 0.18]     [0.15, 0.18]     [0.15, 0.18]
 HIV cases                       −0.78            −0.82            −0.78
                                 [−0.82, −0.73]   [−0.86, −0.77]   [−0.82, −0.73]
 Health Expenditure (% of GDP)                    0.00             −0.00
                                                  [0.00, 0.00]     [−0.00, 0.00]
 Num.Obs.                        2313             2720             2313
 R2                              0.631            0.608            0.631
 R2 Adj.                         0.630            0.607            0.630
 RMSE                            5.90             5.99             5.90

When using the kableExtra package, you can even post-process your table:

mod_table |> 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), 
                full_width = FALSE, fixed_thead = TRUE) |> 
  row_spec(3, color = 'red') |>
  row_spec(5, background = 'lightblue')
An Improved Regression Table
                                 (1)              (2)              (3)
 Gdp                             0.00                              0.00
                                 [0.00, 0.00]                      [0.00, 0.00]
 Alcohol Consum.                 0.39             0.46             0.39
                                 [0.33, 0.46]     [0.40, 0.52]     [0.33, 0.46]
 Avg. BMI                        0.17             0.17             0.17
                                 [0.15, 0.18]     [0.15, 0.18]     [0.15, 0.18]
 HIV cases                       −0.78            −0.82            −0.78
                                 [−0.82, −0.73]   [−0.86, −0.77]   [−0.82, −0.73]
 Health Expenditure (% of GDP)                    0.00             −0.00
                                                  [0.00, 0.00]     [−0.00, 0.00]
 Num.Obs.                        2313             2720             2313
 R2                              0.631            0.608            0.631
 R2 Adj.                         0.630            0.607            0.630
 RMSE                            5.90             5.99             5.90

Exercise 3

Our implementation of modelsummary above is good, but it can be improved further. Using the code chunk above as a starting point, try to:

- Switch to significant digits instead of fixed decimal places so that gdp and percentage_expenditure are interpretable.
- Add a note at the bottom of the table indicating that the data come from the WHO.
- Be creative and customize the table to your liking using kableExtra options.


Using Plots 🎨

If you are looking for an alternative way to present your results, you might also consider a nice coefficient plot.

Again, modelsummary provides a very accessible function for this: modelplot(). As with the regression table, you pass in the list of models rather than the tidied output; the tidying happens under the hood.

# plot 
modelsummary::modelplot(multiv_model_out[29:31], coef_omit = "Intercept") 

Now we just need to customize the plot to our liking:

modelsummary::modelplot(multiv_model_out[29:31], coef_omit = "Intercept") +
  labs(x = 'Coefficients', 
       y = 'Term names',
       title = 'Linear regression models of "Life expectancy"',
       caption = "Comparing multiple models of life expectancy at the country level; data from the WHO") 

We will cover data visualization with R in much greater detail in the next session, so the section on coefficient plots should become clearer then!


Acknowledgements

This tutorial drew heavily on the vignette of the specr package as well as the regression section of Grant McDermott’s Data Science for Economists.

This script was drafted by Tom Arendt and Lisa Oswald, with contributions by Steve Kerr, Hiba Ahmad, Carmen Garro, Sebastian Ramirez-Ruiz, Killian Conyngham and Carol Sobral.


Appendix

More Formula info

Create a formula for a model with a large number of variables:

xnam <- paste0("x", 1:25)
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))
## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + 
##     x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + 
##     x22 + x23 + x24 + x25
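
Base R also ships reformulate(), which builds the same kind of formula directly from character vectors and spares you the paste() gymnastics:

```r
# build y ~ x1 + x2 + ... + x25 straight from character vectors
xnam <- paste0("x", 1:25)
reformulate(termlabels = xnam, response = "y")
```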

The update function

The update function lets us incrementally modify/update a formula:

#?update
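
Since ?update only opens the help page, here is a minimal sketch of what update() does (the formula and variable names are invented for illustration):

```r
f <- y ~ x1 + x2

update(f, . ~ . + x3)   # add a term:          y ~ x1 + x2 + x3
update(f, . ~ . - x2)   # drop a term:         y ~ x1
update(f, log(y) ~ .)   # change the response: log(y) ~ x1 + x2
```

The `.` is shorthand for “whatever was already on this side of the formula”.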

Working with model outputs in base R

Inspecting models in R is straightforward with the summary() function.

lm(life_expectancy ~ gdp, data = life_expec_dat) |> summary()
## 
## Call:
## lm(formula = life_expectancy ~ gdp, data = life_expec_dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30.94  -4.97   2.01   5.82  21.84 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.70e+01   1.94e-01   345.7   <2e-16 ***
## gdp         3.12e-04   1.20e-05    25.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.6 on 2483 degrees of freedom
##   (453 observations deleted due to missingness)
## Multiple R-squared:  0.213,  Adjusted R-squared:  0.213 
## F-statistic:  672 on 1 and 2483 DF,  p-value: <2e-16

But often you want to post-process estimation results. So let’s examine the output of a model function more closely.

model_out <- lm(life_expectancy ~ gdp, data = life_expec_dat)
class(model_out)
## [1] "lm"
str(model_out)
## List of 13
##  $ coefficients : Named num [1:2] 6.70e+01 3.12e-04
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "gdp"
##  $ residuals    : Named num [1:2485] -2.22 -7.33 -7.33 -7.74 -7.85 ...
##   ..- attr(*, "names")= chr [1:2485] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:2485] -3458.11 221.84 -7.22 -7.63 -7.74 ...
##   ..- attr(*, "names")= chr [1:2485] "(Intercept)" "gdp" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:2485] 67.2 67.2 67.2 67.2 67.1 ...
##   ..- attr(*, "names")= chr [1:2485] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:2485, 1:2] -49.8498 0.0201 0.0201 0.0201 0.0201 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2485] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "gdp"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.02 1.01
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 2483
##  $ na.action    : 'omit' Named int [1:453] 161 162 163 164 165 166 167 168 169 170 ...
##   ..- attr(*, "names")= chr [1:453] "161" "162" "163" "164" ...
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = life_expectancy ~ gdp, data = life_expec_dat)
##  $ terms        :Classes 'terms', 'formula'  language life_expectancy ~ gdp
##   .. ..- attr(*, "variables")= language list(life_expectancy, gdp)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "life_expectancy" "gdp"
##   .. .. .. ..$ : chr "gdp"
##   .. ..- attr(*, "term.labels")= chr "gdp"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(life_expectancy, gdp)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "life_expectancy" "gdp"
##  $ model        :'data.frame':   2485 obs. of  2 variables:
##   ..$ life_expectancy: num [1:2485] 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##   ..$ gdp            : num [1:2485] 584.3 612.7 631.7 670 63.5 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language life_expectancy ~ gdp
##   .. .. ..- attr(*, "variables")= language list(life_expectancy, gdp)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "life_expectancy" "gdp"
##   .. .. .. .. ..$ : chr "gdp"
##   .. .. ..- attr(*, "term.labels")= chr "gdp"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(life_expectancy, gdp)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "life_expectancy" "gdp"
##   ..- attr(*, "na.action")= 'omit' Named int [1:453] 161 162 163 164 165 166 167 168 169 170 ...
##   .. ..- attr(*, "names")= chr [1:453] "161" "162" "163" "164" ...
##  - attr(*, "class")= chr "lm"

Ugh, that’s a lot of information in a list structure. Let’s use some helper functions to unpack these:

coef(model_out)           # estimated coefficients
fitted.values(model_out)  # fitted values, one per observation
residuals(model_out)      # residuals, one per observation
model.matrix(model_out)   # the design matrix X
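
These pieces fit together arithmetically. A quick sanity check, run on the built-in mtcars data so the snippet works without our WHO file:

```r
m <- lm(mpg ~ wt, data = mtcars)
b <- coef(m)

# rebuild the first fitted value by hand: intercept + slope * x
by_hand <- b["(Intercept)"] + b["wt"] * mtcars$wt[1]
all.equal(unname(by_hand), unname(fitted.values(m)[1]))  # TRUE
```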

We can also dig into the summary of the model.

#str(summary(model_out))
summary(model_out)$coefficients
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.7e+01    1.9e-01     346  0.0e+00
## gdp          3.1e-04    1.2e-05      26 2.7e-131
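
Those standard errors are just the square roots of the diagonal of the coefficient variance-covariance matrix, which vcov() returns (again illustrated with mtcars so the check is self-contained):

```r
m <- lm(mpg ~ wt, data = mtcars)

se <- sqrt(diag(vcov(m)))   # standard errors from the variance-covariance matrix
all.equal(unname(se), unname(summary(m)$coefficients[, "Std. Error"]))  # TRUE
```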