Modelling
Reporting, summarising and communicating models in R
Last session we talked a bit about how you might scrape data from the web. Collecting data is not an end in itself, but rather the first step in answering your research question. When it comes to analyzing trends, testing hypotheses, and predicting outcomes, modelling enters the picture.
In general, remember that your basic workflow for evaluating and reporting models is as follows:
Today we will mostly deal with the “Model” and “Communicate” steps of this workflow.
In this session, we won’t focus on the statistical considerations that inform your model choice, but rather show you how to:
- Use formulas to specify multiple models
- Process (estimate) model results with broom
- Summarize outputs with modelsummary
- Communicate results through plots and tables
You will have full courses dedicated to thinking about modelling choices, such as Causal Inference and Machine Learning.
Setup
Before we start coding, we need to load several packages as well as our data. We’ll be using data from the World Health Organization on life expectancy.
# load packages
pacman::p_load(kableExtra, # for nicely outputted HTML tables
tidyverse, # for the suite of tidy packages
broom, # to extract key information from statistical models into tidy data
modelsummary, # flexible model output tables
specr, # to conduct specification curve analyses
janitor, # to clean up our data
modelr, # tidy syntax modelling
wesanderson) # color palette for ggplot2
# load data
life_expec_dat <- readr::read_csv("life_expectancy.csv") |>
janitor::clean_names() |>
dplyr::mutate(large = if_else(population > 50000000, 1, 0)) # pop size > 50M
head(life_expec_dat)
## # A tibble: 6 × 23
##   country      year status life_expectancy adult_mortality infant_deaths alcohol
##   <chr>       <dbl> <chr>            <dbl>           <dbl>         <dbl>   <dbl>
## 1 Afghanistan  2015 Devel…            65               263            62    0.01
## 2 Afghanistan  2014 Devel…            59.9             271            64    0.01
## 3 Afghanistan  2013 Devel…            59.9             268            66    0.01
## 4 Afghanistan  2012 Devel…            59.5             272            69    0.01
## 5 Afghanistan  2011 Devel…            59.2             275            71    0.01
## 6 Afghanistan  2010 Devel…            58.8             279            74    0.01
## # ℹ 16 more variables: percentage_expenditure <dbl>, hepatitis_b <dbl>,
## #   measles <dbl>, bmi <dbl>, under_five_deaths <dbl>, polio <dbl>,
## #   total_expenditure <dbl>, diphtheria <dbl>, hiv_aids <dbl>, gdp <dbl>,
## #   population <dbl>, thinness_1_19_years <dbl>, thinness_5_9_years <dbl>,
## #   income_composition_of_resources <dbl>, schooling <dbl>, large <dbl>
Using Formulas in R 👩🏫
R, as a programming language, was designed with statistical analysis in mind. Its creators therefore defined a special object class, called formula, to make specifying and running models easier.
Crafting formulas
The basic structure of a formula is y ~ x: the dependent variable goes on the left of the tilde and the independent variable(s) on the right.
Running a model with a formula is straightforward. Note that we don’t even have to put the formula in parentheses:
lm(life_expectancy ~ gdp, data = life_expec_dat)
##
## Call:
## lm(formula = life_expectancy ~ gdp, data = life_expec_dat)
##
## Coefficients:
## (Intercept) gdp
## 6.70e+01 3.12e-04
Note: Both sides of this equation go by many names. The left-hand side, often written as y, may be called the dependent variable, response variable, or target variable, depending on context. The right-hand side, often written as X, may be referred to as the independent variables, explanatory variables, or regressors. These terms carry subtle differences in meaning depending on the context, but for our purposes today it’s enough to know that they all refer to one of the two sides of a regression equation.
Syntax
A quick recap of the syntax used within formulas. The ~ (tilde) separates the dependent variable from the independent variables, and the + sign separates individual independent variables. Here are a few more common formula inputs that are useful to know:
y ~ x1 + x2 # Standard formula
y ~ . # Include all other columns in data as x variables
y ~ x1 - x2 # Exclude x2 from the analysis
y ~ x1 * x2 # Add the interaction term between x1 and x2 (plus main effects)
y ~ x1 + x2 + x1*x2 # The same as the above
y ~ x1:x2 # Only the interaction, no main effects (rare that you would want this)
y ~ x + I(x^2) # Add the higher-order term x^2
A More Explicit Way to Create Formulas
Formulas are generally straightforward to work with. To make things clearer and easier to automate we can split the process into two steps:
Step 1: Create a string containing the written formula.
formula <- "life_expectancy ~ gdp"
formula
## [1] "life_expectancy ~ gdp"
Step 2: Transform the string into R’s formula class. Only then will your modeling function accept it as input.
formula <- as.formula(formula)
formula
## life_expectancy ~ gdp
Here we can see that the formula we created above is identical to doing it directly.
reg1 <- lm(formula, data = life_expec_dat)
reg2 <- lm(life_expectancy ~ gdp, data = life_expec_dat)
reg1$coefficients == reg2$coefficients
## (Intercept)         gdp
##        TRUE        TRUE
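The same two-step pattern works with any dataset. Here is a minimal, self-contained sketch using the built-in mtcars data (the variable names mpg, wt, and hp come from mtcars, not from our WHO data):

```r
# step 1: write the formula as a string
f_str <- "mpg ~ wt + hp"

# step 2: convert the string to a formula object, then fit the model
f <- as.formula(f_str)
fit <- lm(f, data = mtcars)

coef(fit)  # three coefficients: intercept, wt, hp
```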
Exercise 1: Using the two-step approach, create formulas that perform the following regressions; in each case, run the model and inspect the output:
- Regress life_expectancy on gdp, population, and alcohol
- Regress life_expectancy on gdp, population, alcohol, and the interaction between gdp and population
- Regress life_expectancy on gdp, the square of gdp, and the log of population
Iteration
Often, the modelling process requires you to run the same specification with multiple configurations of both dependent and independent variables. Model formulas make running many similar models much easier:
Step 1: Define a function that lets you plug in different variables for x or y. The function should take the x and/or y variable(s) as a string and return a model object.
# function to include different y variables
lm_fun_iter_y <- function(dep_var) {
lm(as.formula(paste(dep_var, "~ gdp")), data = life_expec_dat)
}
# function to include different x variables
lm_fun_iter_x <- function(indep_var) {
lm(as.formula(paste("life_expectancy ~", paste(indep_var, collapse = "+"))), data = life_expec_dat)
}
Notice: Since your model will likely include many independent variables, it’s unlikely that you’ll only ever run simple bivariate regressions. Therefore, we use a nested paste() call above that joins all independent variables contained in a character vector with + (via the collapse argument).
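To see what the nested paste() call produces, you can run it on a small character vector by itself:

```r
# a character vector of independent variables
indep_var <- c("gdp", "population", "alcohol")

# inner paste() joins the x variables with "+"; outer paste() prepends the y variable
paste("life_expectancy ~", paste(indep_var, collapse = "+"))
## [1] "life_expectancy ~ gdp+population+alcohol"
```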
Step 2: Use purrr::map() (refer to lab
sessions 2 and 3) to iteratively apply the model to each variable
contained in a character vector of variables:
# create vector of variables to iterate over
vars <- life_expec_dat |>
dplyr::select(-c("life_expectancy", "country")) |>
names()
vars
## [1] "year" "status"
## [3] "adult_mortality" "infant_deaths"
## [5] "alcohol" "percentage_expenditure"
## [7] "hepatitis_b" "measles"
## [9] "bmi" "under_five_deaths"
## [11] "polio" "total_expenditure"
## [13] "diphtheria" "hiv_aids"
## [15] "gdp" "population"
## [17] "thinness_1_19_years" "thinness_5_9_years"
## [19] "income_composition_of_resources" "schooling"
## [21] "large"
# run a bivariate model for each column
biv_model_out <- vars |>
purrr::map(lm_fun_iter_x)
biv_model_out
## [[1]]
##
## Call:
## lm(formula = as.formula(paste("life_expectancy ~", paste(indep_var,
## collapse = "+"))), data = life_expec_dat)
##
## Coefficients:
## (Intercept) year
## -635.872 0.351
##
##
## [[2]]
##
## Call:
## lm(formula = as.formula(paste("life_expectancy ~", paste(indep_var,
## collapse = "+"))), data = life_expec_dat)
##
## Coefficients:
## (Intercept) statusDeveloping
## 79.2 -12.1
....
This returns output from all of the models that you’ve just run.
Given that there are now quite a few models, it might be a good idea to
keep the names of the independent variables that you feed into the
model. You can do this with purrr::set_names().
# run a bivariate model for each column
biv_model_out_w_names <- vars |>
purrr::set_names() |>
purrr::map(lm_fun_iter_x)
biv_model_out_w_names$year # we can now easily index independent vars
##
## Call:
## lm(formula = as.formula(paste("life_expectancy ~", paste(indep_var,
## collapse = "+"))), data = life_expec_dat)
##
## Coefficients:
## (Intercept) year
## -635.872 0.351
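The same pattern in a self-contained form, using mtcars (base R’s reformulate() builds the formula from strings; the variable names here are illustrative, not from our WHO data):

```r
library(purrr)

# fit one bivariate model per predictor, keeping the predictor names
vars <- c("wt", "hp")
fits <- vars |>
  purrr::set_names() |>
  purrr::map(\(v) lm(reformulate(v, response = "mpg"), data = mtcars))

names(fits)    # "wt" "hp" -- we can index models by predictor name
coef(fits$wt)  # intercept and slope of the wt model
```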
Say that you’re interested in seeing what effect gdp has on life_expectancy, but you suspect that other covariates such as alcohol consumption, health care expenditure, average body mass index, and the prevalence of HIV/AIDS also have an effect.
On top of this, it’s possible that these variables might have
different effects on life_expectancy depending on how they
interact with each other. We can determine which independent variables
are robustly associated with the dependent variable by generating models
for every possible combination of independent variables and then looking
into the distribution of effects across these models.
Let’s review the code introduced in the lecture:
Step 1: Find out how many possible combinations of independent and dependent variables exist:
combinations <-
purrr::map(1:5, function(x) {combn(1:5, x)}) |> # create all possible combinations (assume we have 5 covariates)
purrr::map(ncol) |> # extract number of combinations
unlist() |> # pull out of list structure
sum() # compute sum
combinations
## [1] 31
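The count can be sanity-checked with binomial coefficients directly: with 5 covariates there are choose(5, i) subsets of size i, and these sum to 2^5 - 1 (every subset except the empty one).

```r
# number of non-empty subsets of 5 covariates
sum(choose(5, 1:5))  # 5 + 10 + 10 + 5 + 1 = 31
2^5 - 1              # same answer
```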
Step 2: Write a function that can run all possible combinations in one go:
combn_models <- function(depvar, covars, data) {
# initialize empty list
combn_list <- list()
# generate list of covariate combinations
for (i in seq_along(covars)) {
combn_list[[i]] <- combn(covars, i, simplify = FALSE) # combn() generates all subsets of i covariates
}
# unlist to make object easier to work with
combn_list <-
unlist(combn_list, recursive = FALSE)
# function to generate formulas
gen_formula <- function(covars, depvar) {
form <-
as.formula(paste0(depvar, " ~ ", paste0(covars, collapse = "+")))
form
}
# generate formulas
formulas_list <-
purrr::map(combn_list, gen_formula, depvar = depvar)
# run models
models_list <- purrr::map(formulas_list, lm, data = data)
models_list
}
Step 3: Apply the function to our case:
depvar <- "life_expectancy"
covars <- c("gdp", "percentage_expenditure", "alcohol", "bmi", "hiv_aids")
multiv_model_out <- combn_models(depvar = depvar, covars = covars, data = life_expec_dat)
# check whether we have the correct number of models
length(multiv_model_out)
## [1] 31
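A scaled-down, self-contained version of the same idea, using mtcars with two covariates (so 2^2 - 1 = 3 models; variable names are illustrative):

```r
covars <- c("wt", "hp")

# all non-empty covariate subsets
combos <- unlist(
  lapply(seq_along(covars), function(i) combn(covars, i, simplify = FALSE)),
  recursive = FALSE
)

# one lm() per subset, with reformulate() building each formula
models <- lapply(combos, function(cv) lm(reformulate(cv, response = "mpg"), data = mtcars))
length(models)  # 3
```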
Exercise 2
Look at the multiv_model_out object that you have just
created. Try to get a sense of the structure of the output. Pick three
different models and inspect their output using the
summary() function. Consider which model you think is
“best” and why.
As you can see, the function introduced in the lecture is quite powerful, but it is by no means the only option for testing multiple model specifications. Depending on the type of analysis that you plan to perform, certain R packages enable you to run many different combinations of regression models with just a few lines of code. For example, we can perform multiple estimations with the fixest package.
Specification Curve Analysis ⤴️
In the example above, we consciously chose which models to run. Ideally, this decision should be grounded in theory and knowledge of the subject matter. Alternatively, some researchers argue that such choices are inherently arbitrary, potentially resulting in biased model specification. They propose taking a more analytical approach to model specification which they refer to as Specification Curve Analysis, or Multiverse Analysis.
We can implement this approach using the specr package.
In simple terms, it works as follows:
Step 1: Specify a list of reasonable “model ingredients” for each model parameter:
specs <- specr::setup(
data = life_expec_dat,
y = c("life_expectancy"), # add dependent variable
x = c("gdp", "log(gdp)", "I(gdp^2)"), # add independent variables
model = c("lm", "glm"), # specify model types
controls = c("population", "alcohol", "bmi"), # specify controls
subsets = as.list(distinct(life_expec_dat, status)) # specify subgroup
)
summary(specs, rows = 20)
## Setup for the Specification Curve Analysis
## -------------------------------------------
## Class: specr.setup -- version: 1.0.0
## Number of specifications: 144
##
## Specifications:
##
## Independent variable: gdp, log(gdp), I(gdp^2)
## Dependent variable: life_expectancy
## Models: lm, glm
## Covariates: no covariates, population, alcohol, bmi, population + alcohol, population + bmi, alcohol + bmi, population + alcohol + bmi
## Subsets analyses: Developing, Developed, all
##
## Function used to extract parameters:
##
## function (x)
## broom::tidy(x, conf.int = TRUE)
## <environment: 0x0000012eda638e80>
##
##
## Head of specifications table (first 20 rows):
## # A tibble: 20 × 7
## x y model controls subsets status formula
## <chr> <chr> <chr> <chr> <chr> <fct> <glue>
## 1 gdp life_expectancy lm no covariates Developing Developi… life_e…
## 2 gdp life_expectancy lm no covariates Developed Developed life_e…
## 3 gdp life_expectancy lm no covariates all <NA> life_e…
## 4 gdp life_expectancy lm population Developing Developi… life_e…
## 5 gdp life_expectancy lm population Developed Developed life_e…
## 6 gdp life_expectancy lm population all <NA> life_e…
## 7 gdp life_expectancy lm alcohol Developing Developi… life_e…
## 8 gdp life_expectancy lm alcohol Developed Developed life_e…
## 9 gdp life_expectancy lm alcohol all <NA> life_e…
## 10 gdp life_expectancy lm bmi Developing Developi… life_e…
## 11 gdp life_expectancy lm bmi Developed Developed life_e…
## 12 gdp life_expectancy lm bmi all <NA> life_e…
## 13 gdp life_expectancy lm population + alcohol Developing Developi… life_e…
## 14 gdp life_expectancy lm population + alcohol Developed Developed life_e…
## 15 gdp life_expectancy lm population + alcohol all <NA> life_e…
## 16 gdp life_expectancy lm population + bmi Developing Developi… life_e…
## 17 gdp life_expectancy lm population + bmi Developed Developed life_e…
## 18 gdp life_expectancy lm population + bmi all <NA> life_e…
## 19 gdp life_expectancy lm alcohol + bmi Developing Developi… life_e…
## 20 gdp life_expectancy lm alcohol + bmi Developed Developed life_e…
Step 2: Run the specification curve analysis:
results <- specr::specr(specs)
Step 3: Inspect the specification curve to understand how robust your findings are to different analytical choices.
These plots help to illustrate where and when you had to make a choice and how many models you might alternatively have specified.
More concretely you could also look at some summary statistics for your p-values across the different model specifications:
## # A tibble: 24 × 9
## # Groups: x [3]
## x controls median mad min max q25 q75 obs
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 I(gdp^2) alcohol 2.31e-9 2.47e- 9 6.45e-10 4.38e-9 1.06e- 9 3.86e-9 1905
## 2 I(gdp^2) alcohol + … 1.77e-9 5.33e-10 6.45e-10 2.13e-9 9.26e-10 2.04e-9 1890
## 3 I(gdp^2) bmi 2.08e-9 8.32e-11 6.37e-10 2.13e-9 9.96e-10 2.12e-9 2013
## 4 I(gdp^2) no covaria… 3.12e-9 2.17e- 9 6.41e-10 4.58e-9 1.26e- 9 4.22e-9 2037
## 5 I(gdp^2) population 3.30e-9 3.94e- 9 6.41e-10 1.15e-8 1.31e- 9 9.42e-9 1846
## 6 I(gdp^2) population… 2.09e-9 2.12e- 9 6.59e-10 9.98e-9 1.02e- 9 8.01e-9 1726
## 7 I(gdp^2) population… 1.75e-9 1.63e- 9 6.56e-10 6.57e-9 9.30e-10 5.37e-9 1711
## 8 I(gdp^2) population… 2.23e-9 2.35e- 9 6.38e-10 6.63e-9 1.03e- 9 5.53e-9 1822
## 9 gdp alcohol 2.50e-4 1.98e- 4 6.26e- 5 3.83e-4 1.09e- 4 3.50e-4 1905
## 10 gdp alcohol + … 1.87e-4 4.95e- 5 6.26e- 5 2.21e-4 9.38e- 5 2.12e-4 1890
## # ℹ 14 more rows
Note that the statistic mad stands for median absolute deviation.
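You can check what mad() computes on a toy vector. By default, R scales the raw median absolute deviation by a constant of roughly 1.4826 so that it is comparable to a standard deviation under normality:

```r
x <- c(1, 2, 3, 4, 100)
mad(x, constant = 1)  # raw MAD: median(|x - median(x)|) = 1
mad(x)                # default, scaled version: 1 * 1.4826
```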
A more intuitive way of inspecting the results might be to visualize them.
Plot A denotes:
- X-axis = ordered model specifications (from most negative effect to most positive effect)
- Y-axis = standardized regression coefficient for the IV-DV relationship
- Points = the standardized regression coefficient for each specific model
- Error bars = 95% confidence intervals around the point estimate
Plot B denotes:
- X-axis = ordered model specifications (the same as panel A)
- Y-axis (right) = analytic decision categories
- Y-axis (left) = specific analytic decisions within each category
- Lines = indicates whether a specific analytic decision was true for that particular model specification
Interpretable Effect Sizes 🔎
While these plots make it easy to compare effects across coefficients, unit changes may not be comparable across variables. We can address this problem in several ways:
- Re-scale variables to show intuitive unit changes in X (e.g., $1k instead of $1)
- Re-scale to full scale (minimum to maximum) changes in X
- Standardize variables to show standard deviation changes in X
Let’s try standardizing our variables to address the problem.
# copy our data
life_expec_dat_stand <- life_expec_dat
# standardize variables
life_expec_dat_stand <- life_expec_dat_stand |>
dplyr::mutate(gdp_squared = gdp^2,
log_gdp = log(gdp)) |> # In this case we explicitly create the log and squared gdp variables first.
dplyr::mutate(gdp = effectsize::standardize(gdp),
gdp_squared = effectsize::standardize(gdp_squared),
log_gdp = effectsize::standardize(log_gdp)
) # Then we standardize all three
# specify reasonable model "ingredients"
specs_stand <- specr::setup(
data = life_expec_dat_stand,
y = c("life_expectancy"), # add dependent variable
x = c("gdp", "log_gdp", "gdp_squared"), # add independent variables
model = c("lm", "glm"), # specify model types
controls = c("population", "alcohol", "bmi"), # specify controls
subsets = as.list(distinct(life_expec_dat_stand, status)) # specify subgroup
)
# run specification curve analysis
results_stand <- specr::specr(specs_stand)
# plot entire visualization
plot(results_stand)
Note that any re-scaling operation will affect how you interpret the coefficients! For more details, check out the effectsize package.
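If you want to avoid the effectsize dependency, base R’s scale() performs the same z-scoring (subtract the mean, divide by the standard deviation). A sketch on mtcars:

```r
# standardize a numeric vector: (x - mean(x)) / sd(x)
z <- as.numeric(scale(mtcars$hp))

mean(z)  # 0 (up to floating-point error)
sd(z)    # 1
```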
Without putting much thought into our model, it looks like gdp (depending on the specification) can have either a small or a substantial positive effect on life_expectancy. This just goes to show how carefully one has to approach the task of correctly specifying a model!
🚨 Watch Out! ⚠️
Be careful and selective when you use specification curve analysis! It is not a tool that you should apply to each and every potential paper, especially if you have not taken a course on causal inference. When specification curve analysis includes variables that act as colliders or mediators, it might actually legitimize an otherwise wrongly specified model!
Model Output with broom 🧹
In the lecture we saw that lists of different combinations of model specifications can be unwieldy. We are usually interested in three main aspects of our model outputs:
- Estimated coefficients and associated standard errors, t-statistics, p-values, confidence intervals
- Model summaries including goodness of fit measures, information on model convergence, number of observations used
- Observation-level information that arises from the estimated model, such as fitted/predicted values, residuals, estimates of influence
Extracting this information from the model output is possible but not straightforward (see Appendix for more):
## List of 13
## $ coefficients : Named num [1:5] 61.521341 0.000928 0.460674 0.165909 -0.816043
## ..- attr(*, "names")= chr [1:5] "(Intercept)" "percentage_expenditure" "alcohol" "bmi" ...
## $ residuals : Named num [1:2720] 0.321 -4.698 -4.615 -4.937 -5.105 ...
## ..- attr(*, "names")= chr [1:2720] "1" "2" "3" "4" ...
## $ effects : Named num [1:2720] -3610 198 142 212 -216 ...
## ..- attr(*, "names")= chr [1:2720] "(Intercept)" "percentage_expenditure" "alcohol" "bmi" ...
## $ rank : int 5
## $ fitted.values: Named num [1:2720] 64.7 64.6 64.5 64.4 64.3 ...
## ..- attr(*, "names")= chr [1:2720] "1" "2" "3" "4" ...
....
##
## Call:
## .f(formula = .x[[i]], data = ..1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.69 -3.61 0.26 3.47 21.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
....
Tidying Model Objects
Luckily, we don’t have to deal with this. We can use the tidyverse’s broom package, which is part of a collection of modeling packages that adhere to tidyverse principles, referred to as tidymodels. Tidymodels offers a lot of cool packages that have their own tutorials, so check them out! 💪
In today’s session, we will only focus on broom. Here
are the three key broom functions that you should know:
broom::tidy(): Summarizes information about model components.
## # A tibble: 5 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.15e+1 0.273 226. 0 6.10e+1 62.1
## 2 percentage_expendit… 9.28e-4 0.0000602 15.4 1.94e- 51 8.10e-4 0.00105
## 3 alcohol 4.61e-1 0.0314 14.7 6.97e- 47 3.99e-1 0.522
## 4 bmi 1.66e-1 0.00638 26.0 3.46e-133 1.53e-1 0.178
## 5 hiv_aids -8.16e-1 0.0226 -36.1 2.43e-233 -8.60e-1 -0.772
broom::glance(): Reports information about the entire model.
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.608 0.607 5.99 1052. 0 4 -8727. 17465. 17501.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::augment(): Adds information about observations to a dataset.
## # A tibble: 2,720 × 13
## .rownames life_expectancy percentage_expenditure alcohol bmi hiv_aids
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 65 71.3 0.01 19.1 0.1
## 2 2 59.9 73.5 0.01 18.6 0.1
## 3 3 59.9 73.2 0.01 18.1 0.1
## 4 4 59.5 78.2 0.01 17.6 0.1
## 5 5 59.2 7.10 0.01 17.2 0.1
## 6 6 58.8 79.7 0.01 16.7 0.1
## 7 7 58.6 56.8 0.01 16.2 0.1
## 8 8 58.1 25.9 0.03 15.7 0.1
## 9 9 57.5 10.9 0.02 15.2 0.1
## 10 10 57.3 17.2 0.03 14.7 0.1
## # ℹ 2,710 more rows
## # ℹ 7 more variables: .fitted <dbl>, .se.fit <dbl>, .resid <dbl>, .hat <dbl>,
## # .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
Doesn’t this look so much tidier?
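A compact sketch of all three functions on a self-contained example (mtcars instead of our WHO data):

```r
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)

broom::tidy(fit)           # one row per coefficient
broom::glance(fit)         # one row of model-level statistics
nrow(broom::augment(fit))  # 32: one row per observation used in the fit
```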
Broom with Many Models
Where broom really shines is when it comes to dealing
with multiple models like we have in our multiv_model_out.
It allows us to easily move away from hard-to-deal-with lists while
retaining each of the three important model outputs. Since our example
replicates several variables in different models, we include the
.id argument in purrr::map(). This adds a
column to our tibble which will contain a model
identifier. In our example, the identifier of each model
corresponds with the index number of the model output.
multiv_model_out_broom <- purrr::map_dfr(multiv_model_out[29:31], # the model objects
broom::tidy, # the function to iterate with
.id = "model_type") # id column
multiv_model_out_broom## # A tibble: 16 × 6
## model_type term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 (Intercept) 61.5 0.292 210. 0
## 2 1 gdp 0.000160 0.00000927 17.2 1.13e- 62
## 3 1 alcohol 0.392 0.0339 11.6 4.20e- 30
## 4 1 bmi 0.168 0.00693 24.2 2.72e-115
## 5 1 hiv_aids -0.779 0.0227 -34.4 1.69e-209
## 6 2 (Intercept) 61.5 0.273 226. 0
## 7 2 percentage_expenditure 0.000928 0.0000602 15.4 1.94e- 51
## 8 2 alcohol 0.461 0.0314 14.7 6.97e- 47
## 9 2 bmi 0.166 0.00638 26.0 3.46e-133
## 10 2 hiv_aids -0.816 0.0226 -36.1 2.43e-233
## 11 3 (Intercept) 61.5 0.293 210. 0
## 12 3 gdp 0.000166 0.0000219 7.59 4.69e- 14
## 13 3 percentage_expenditure -0.0000464 0.000145 -0.321 7.48e- 1
## 14 3 alcohol 0.394 0.0344 11.5 1.27e- 29
## 15 3 bmi 0.167 0.00695 24.1 1.26e-114
## 16 3 hiv_aids -0.779 0.0227 -34.4 2.18e-209
Now that we have multiple model outputs in a format that we can work
with, it is time to start thinking about visualizing our results. You
can plot the residuals, the goodness of fit statistics etc. Similarly,
you could also use broom::glance() to provide some
robustness statistics or descriptive tables.
The next section will focus on how to present your results!
Results with modelsummary 🥼
For this we will use the modelsummary package by Vincent Arel-Bundock. Though there are many packages to choose from when it comes to communicating your results, we highly recommend you give modelsummary a shot!
One reason that we like this package is that it combines both approaches of presenting model results, either as a table or as a coefficient plot. As we saw in the lecture, each method has its advantages and its drawbacks. Therefore, we will cover both here.
Using Regression Tables
The workhorse function in modelsummary is unsurprisingly
modelsummary(). 🤔
Here we only need to input the models list as the function does all of the tidying for us.
modelsummary::modelsummary(multiv_model_out[29:31],
output = "kableExtra", # can be set to multiple different output formats (see the documentation for more information)
escape = FALSE)

|  | (1) | (2) | (3) |
|---|---|---|---|
| (Intercept) | 61.459 | 61.521 | 61.451 |
| (0.292) | (0.273) | (0.293) | |
| gdp | 0.000 | 0.000 | |
| (0.000) | (0.000) | ||
| alcohol | 0.392 | 0.461 | 0.394 |
| (0.034) | (0.031) | (0.034) | |
| bmi | 0.168 | 0.166 | 0.167 |
| (0.007) | (0.006) | (0.007) | |
| hiv_aids | −0.779 | −0.816 | −0.779 |
| (0.023) | (0.023) | (0.023) | |
| percentage_expenditure | 0.001 | −0.000 | |
| (0.000) | (0.000) | ||
| Num.Obs. | 2313 | 2720 | 2313 |
| R2 | 0.631 | 0.608 | 0.631 |
| R2 Adj. | 0.630 | 0.607 | 0.630 |
| AIC | 14785.8 | 17465.5 | 14787.7 |
| BIC | 14820.3 | 17500.9 | 14828.0 |
| Log.Lik. | −7386.925 | −8726.735 | −7386.873 |
| RMSE | 5.90 | 5.99 | 5.90 |
It looks pretty solid already, but can definitely be improved upon.
For example, we can set an acceptable number of digits:
modelsummary::modelsummary(multiv_model_out[[31]],
output = "kableExtra",
fmt = "%.2f", # 2 digits and trailing 0
escape = FALSE) # to avoid escaping special characters in kableExtra

We can also report only the C.I. by getting rid of the p-values and intercept term:
model_table <- modelsummary::modelsummary(multiv_model_out[29:31],
output = "kableExtra",
fmt = "%.2f", # 2-digits and trailing 0
estimate = "{estimate}",
statistic = "conf.int",
coef_omit = "Intercept",
escape = FALSE)
model_table

|  | (1) | (2) | (3) |
|---|---|---|---|
| gdp | 0.00 | 0.00 | |
| [0.00, 0.00] | [0.00, 0.00] | ||
| alcohol | 0.39 | 0.46 | 0.39 |
| [0.33, 0.46] | [0.40, 0.52] | [0.33, 0.46] | |
| bmi | 0.17 | 0.17 | 0.17 |
| [0.15, 0.18] | [0.15, 0.18] | [0.15, 0.18] | |
| hiv_aids | −0.78 | −0.82 | −0.78 |
| [−0.82, −0.73] | [−0.86, −0.77] | [−0.82, −0.73] | |
| percentage_expenditure | 0.00 | −0.00 | |
| [0.00, 0.00] | [−0.00, 0.00] | ||
| Num.Obs. | 2313 | 2720 | 2313 |
| R2 | 0.631 | 0.608 | 0.631 |
| R2 Adj. | 0.630 | 0.607 | 0.630 |
| AIC | 14785.8 | 17465.5 | 14787.7 |
| BIC | 14820.3 | 17500.9 | 14828.0 |
| Log.Lik. | −7386.925 | −8726.735 | −7386.873 |
| RMSE | 5.90 | 5.99 | 5.90 |
Wouldn’t it look more professional if our coefficients began with a capital letter? And how about we get rid of some of the bloated goodness-of-fit statistics?
mod_table <- modelsummary::modelsummary(multiv_model_out[29:31],
output = "kableExtra",
fmt = "%.2f", # 2-digits and trailing 0
estimate = "{estimate}",
statistic = "conf.int",
coef_omit = "Intercept",
coef_rename = c("gdp" = "Gdp",
"bmi" = "Avg. BMI",
"alcohol" = "Alcohol Consum.",
"hiv_aids" = "HIV cases",
"percentage_expenditure" = "Health Expenditure (% of GDP)",
"life_expectancy" = "Life Expectancy"),
gof_omit = 'DF|Deviance|Log.Lik|AIC|BIC',
title = 'An Improved Regression Table',
escape = FALSE)
mod_table

|  | (1) | (2) | (3) |
|---|---|---|---|
| Gdp | 0.00 | 0.00 | |
| [0.00, 0.00] | [0.00, 0.00] | ||
| Alcohol Consum. | 0.39 | 0.46 | 0.39 |
| [0.33, 0.46] | [0.40, 0.52] | [0.33, 0.46] | |
| Avg. BMI | 0.17 | 0.17 | 0.17 |
| [0.15, 0.18] | [0.15, 0.18] | [0.15, 0.18] | |
| HIV cases | −0.78 | −0.82 | −0.78 |
| [−0.82, −0.73] | [−0.86, −0.77] | [−0.82, −0.73] | |
| Health Expenditure (% of GDP) | 0.00 | −0.00 | |
| [0.00, 0.00] | [−0.00, 0.00] | ||
| Num.Obs. | 2313 | 2720 | 2313 |
| R2 | 0.631 | 0.608 | 0.631 |
| R2 Adj. | 0.630 | 0.607 | 0.630 |
| RMSE | 5.90 | 5.99 | 5.90 |
When using the kableExtra package, you can even
post-process your table:
mod_table |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, fixed_thead = T) |>
row_spec(3, color = 'red') |>
row_spec(5, background = 'lightblue')

|  | (1) | (2) | (3) |
|---|---|---|---|
| Gdp | 0.00 | 0.00 | |
| [0.00, 0.00] | [0.00, 0.00] | ||
| Alcohol Consum. | 0.39 | 0.46 | 0.39 |
| [0.33, 0.46] | [0.40, 0.52] | [0.33, 0.46] | |
| Avg. BMI | 0.17 | 0.17 | 0.17 |
| [0.15, 0.18] | [0.15, 0.18] | [0.15, 0.18] | |
| HIV cases | −0.78 | −0.82 | −0.78 |
| [−0.82, −0.73] | [−0.86, −0.77] | [−0.82, −0.73] | |
| Health Expenditure (% of GDP) | 0.00 | −0.00 | |
| [0.00, 0.00] | [−0.00, 0.00] | ||
| Num.Obs. | 2313 | 2720 | 2313 |
| R2 | 0.631 | 0.608 | 0.631 |
| R2 Adj. | 0.630 | 0.607 | 0.630 |
| RMSE | 5.90 | 5.99 | 5.90 |
Exercise 3: Our implementation of modelsummary above is good, but it can be improved further. Using the code chunk above as a starting point, try to:
- Switch to using significant digits instead of fixed decimal places so that gdp and percentage_expenditure are interpretable.
- Add a note at the bottom of the table indicating that the data originated at the WHO.
- Be creative and customize the table to your liking using kableExtra options.
Using Plots 🎨
If you are looking for an alternative way to present your results, you might also consider a nice coefficient plot.
Again, modelsummary provides a very accessible function for this: modelplot(). As with the regression table, you input the model list, not the tidied output; the tidying happens under the hood.
Now we just need to customize the plot to our liking:
modelsummary::modelplot(multiv_model_out[29:31], coef_omit = "Intercept") +
labs(x = 'Coefficients',
y = 'Term names',
title = 'Linear regression models of "Life expectancy"',
caption = "Comparing multiple models of life expectancy at the country level, data originated at the WHO")

We will cover data visualization with R in much greater detail in the next session, so hopefully the section on coefficient plots will be clearer then!
Acknowledgements
This tutorial drew heavily on the vignette from the specr package, as well as the regression section in Grant McDermott’s Data Science for Economists.
This script was drafted by Tom Arendt and Lisa Oswald, with contributions by Steve Kerr, Hiba Ahmad, Carmen Garro, Sebastian Ramirez-Ruiz, Killian Conyngham and Carol Sobral.
Appendix
More Formula info
Create a formula for a model with a large number of variables:
## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
## x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
## x22 + x23 + x24 + x25
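One way to generate such a long formula programmatically is base R’s reformulate(); this sketch reproduces the output above:

```r
# build y ~ x1 + x2 + ... + x25 from a character vector of predictor names
f <- reformulate(paste0("x", 1:25), response = "y")

f                    # prints the formula shown above
length(all.vars(f))  # 26: the response plus 25 predictors
```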
Working with model outputs in base R
Inspecting models in R is straightforward with the
summary() function.
##
## Call:
## lm(formula = life_expectancy ~ gdp, data = life_expec_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.94 -4.97 2.01 5.82 21.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.70e+01 1.94e-01 345.7 <2e-16 ***
## gdp 3.12e-04 1.20e-05 25.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.6 on 2483 degrees of freedom
## (453 observations deleted due to missingness)
## Multiple R-squared: 0.213, Adjusted R-squared: 0.213
## F-statistic: 672 on 1 and 2483 DF, p-value: <2e-16
But often you want to post-process estimation results. So let’s examine the output of a model function more closely.
## [1] "lm"
## List of 13
## $ coefficients : Named num [1:2] 6.70e+01 3.12e-04
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "gdp"
## $ residuals : Named num [1:2485] -2.22 -7.33 -7.33 -7.74 -7.85 ...
## ..- attr(*, "names")= chr [1:2485] "1" "2" "3" "4" ...
## $ effects : Named num [1:2485] -3458.11 221.84 -7.22 -7.63 -7.74 ...
## ..- attr(*, "names")= chr [1:2485] "(Intercept)" "gdp" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:2485] 67.2 67.2 67.2 67.2 67.1 ...
## ..- attr(*, "names")= chr [1:2485] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:2485, 1:2] -49.8498 0.0201 0.0201 0.0201 0.0201 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:2485] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "gdp"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.02 1.01
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 2483
## $ na.action : 'omit' Named int [1:453] 161 162 163 164 165 166 167 168 169 170 ...
## ..- attr(*, "names")= chr [1:453] "161" "162" "163" "164" ...
## $ xlevels : Named list()
## $ call : language lm(formula = life_expectancy ~ gdp, data = life_expec_dat)
## $ terms :Classes 'terms', 'formula' language life_expectancy ~ gdp
## .. ..- attr(*, "variables")= language list(life_expectancy, gdp)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "life_expectancy" "gdp"
## .. .. .. ..$ : chr "gdp"
## .. ..- attr(*, "term.labels")= chr "gdp"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(life_expectancy, gdp)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "life_expectancy" "gdp"
## $ model :'data.frame': 2485 obs. of 2 variables:
## ..$ life_expectancy: num [1:2485] 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## ..$ gdp : num [1:2485] 584.3 612.7 631.7 670 63.5 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language life_expectancy ~ gdp
## .. .. ..- attr(*, "variables")= language list(life_expectancy, gdp)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "life_expectancy" "gdp"
## .. .. .. .. ..$ : chr "gdp"
## .. .. ..- attr(*, "term.labels")= chr "gdp"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(life_expectancy, gdp)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "life_expectancy" "gdp"
## ..- attr(*, "na.action")= 'omit' Named int [1:453] 161 162 163 164 165 166 167 168 169 170 ...
## .. ..- attr(*, "names")= chr [1:453] "161" "162" "163" "164" ...
## - attr(*, "class")= chr "lm"
Ugh, that’s a lot of information in a list structure. Let’s use some helper functions to unpack it. We can also dig into the summary of the model:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7e+01 1.9e-01 346 0.0e+00
## gdp 3.1e-04 1.2e-05 26 2.7e-131
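The coefficient matrix printed above can be indexed directly. A self-contained sketch with mtcars:

```r
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

coef(s)["wt", "Estimate"]  # the slope, about -5.34
coef(s)["wt", "Pr(>|t|)"]  # its p-value
s$r.squared                # goodness of fit, about 0.75
```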
Further reading
- Model formulae in R by Thomas Leeper
- formula {stats} documentation
- Multiverse analysis with multiverse
- Specification curves with specr
- For more background on multiverse analysis, see Steegen et al.
- Introductory broom package vignette
- Overview of available broom methods
- The modelsummary package creates tables and plots to summarize statistical models and data in R
- Alternative packages for model summaries, including stargazer and texreg
- Yet another alternative for data visualization for statistics: sjPlot