class: left, middle, inverse, title-slide .title[ # Model selection ] .subtitle[ ## & some other Miscelaneous ] .author[ ### Mabel Carabali ] .institute[ ### EBOH, McGill University ] .date[ ### updated: 2024-11-25 ] --- class: middle ##<span style="color:darkmagenta">What do you think?</span> .pull-left[ <img src="images/L25economist.png" width="75%" style="display: block; margin: auto;" /> [Racial discrimination in mortgage lending has declined sharply in America](https://www.economist.com/united-states/2022/11/24/racial-discrimination-in-mortgage-lending-has-declined-sharply-in-america?utm_medium=social-media.content.np&utm_source=twitter&utm_campaign=editorial-social&utm_content=discovery.content) ] -- .pull-right[ <img src="images/L25tweet.png" width="75%" style="display: block; margin: auto;" /> [Endogeneity tweet](https://twitter.com/BearCurd/status/1596590553323298816) ] --- class: middle **Expected Competencies** - Understand the difference between predictive and etiological epidemiology principles. - Knows statistical assumptions for different regression models. - Knows basic model fit or "goodness-of-fit" statistics. -- ## Objectives - To revise the principles and differences between predictive and etiological research. - To revise the use of main "goodness-of-fit" statistics. - To revise the overall framework for model parameterization and specification for prediction models. - To revise variable parameterization and selection. --- class:middle ## What is model selection? - There are two important considerations when building a regression model: - The kind of outcome data you want to model (continuous, count, binary, etc.) - **.black[The variables and interactions you include in your model]** <br> -- <br> - **.red[Model selection]** is a process that attempts to find the best model for a given purpose. <br> -- <br> - The two main purposes of regression models are: - **.red[Causal Inference:]** To estimate the effect of one or more variables while adjusting for the possible confounding effects of other variables - **.red[Prediction:]** To predict outcomes for a set of similar individuals --- class: middle ### Causal Inference and Prediction: Different Goals! - **.blue[Causal Inference]**: estimating the effect of a variable on an outcome - Usually adjusted for confounding - Want a model with good confounder selection, determined via a DAG and substantive knowledge - **.purple[Prediction]**: predicting a future outcome using a set of covariates - We want predictions that are close to the actual value, but avoid overfitting - May prioritize measurable predictors over strong predictors - **.red[The two goals do NOT require the same model selection approach!]** --- class:middle ## Prediction?? - **.red[Prediction]** aims to anticipate some future outcome using a set of covariates - Prediction models are typically built using data from an existing group of individuals, with the goal of being able to predict the outcome for future individuals. - Here, we are concerned with getting the best model that gives "optimal" predictions for future subjects. - A good predictive model doesn't necessarily tell you anything useful about how to intervene to change the outcome! - Age, sex, and prior hospitalization may be good predictors of heart failure, but we wouldn't stop admitting people to the hospital to prevent heart failure --- class: middle ## Prediction: Examples Not only .red[weather]... - Credit scores - Netflix recommendations - Cardiovascular risk scores `\(^*\)` - Kidney function estimates `\(^*\)` - Predicting mortality <br> `\(^*\)` Some, including the misstep "race correction", different form "race-adjustment". --- class: middle **Prediction Examples** .pull-left[ Framingham cardiovascular disease (10-year risk) [calculator](https://ccs.ca/frs/) <img src="images/ccs_frs.png" width="75%" style="display: block; margin: auto;" /><img src="images/ccs_frs1.png" width="75%" style="display: block; margin: auto;" /> ] -- .pull-right[ Or [AHA's CV Risk Calculator](https://static.heart.org/riskcalc/app/index.html#!/baseline-risk) <img src="images/ascvd.png" width="80%" style="display: block; margin: auto;" /><img src="images/ascvd1.png" width="80%" style="display: block; margin: auto;" /> ] --- class:middle ## Causal Inference vs. Prediction **.red[Important:]** We can't interpret coefficients from prediction models as causal parameters - Not appropriately controlled for confounding (because we aren't worried about this) - .red[This is often a pitfall that people fall into!] -- <img src="images/table2fallacy.png" width="80%" style="display: block; margin: auto;" /> --- class:inverse, center, middle # OK, but how do I build a model? --- class:middle ## What is required to build a "Good" model? **.blue[Criteria:]** - Is our .blue[research question] causal or predictive? + This will determine the "philosophy" we use to build our model - Sample size (Power and .blue[Precision]) - Choice of statistical model/.blue[distribution/link] (Poisson, logistic, linear, etc.) - Consideration of different .blue[sources of bias] (for causal inference). - What data do we have available to us? --- class: middle ### Overall framework for model building / specification: 1) Variable specification (including assessment of number and parameterization) 2) Interaction assessment (including assessment of heterogeneity)* 3) Confounding assessment (including consideration of precision)* <br> *Steps 1 and 2 can occur iteratively and depending on the research question investigated differently. --- class: middle ### When do we say our model is "good"? - Variable specification Especially when we have many predictor variables (with many possible interactions), it can be difficult to find a good model. - Which main effects do we include? - Which interactions do we include? - With 6 variables, there are 64 potential models with just main effects! --- class: middle ### When do we say our model is "good"? - Variable specification - An active research problem in statistics - Model selection procedures try to simplify this task. -- **Evaluating Model Fit** To implement a model selection procedure, we first need a criterion to compare models. The goal is to select the model with the optimal value of the criterion. To that end, we assess the: **Goodness-of-Fit Statistics and use some statistical model metrics** --- class:middle ### Evaluating Model Fit: .red[Recall] .pull-left[ **R-squared `\(R^2\)`** - `\(R^2\)` represents the proportion of variance in the outcome explained by all the predictors in the model. - **.red[Criterion:]** Choose the model with the largest `\(R^2\)` - **.red[Problem:]** `\(R^2\)` always increases with model size - If only use `\(R^2\)` means simply choosing the largest model: .red[not very useful]. - Helpful when comparing models with the same number of parameters. ] -- .pull-right[ **Adj. R-squared `\(R^2\)`** - Adjusted `\(R^2\)` represents the proportion of variance in the outcome explained by all the predictors in the model, **.red[penalized for the number of variables included in the model]**. - **.red[Criterion:]** Choose the model with the largest adjusted `\(R^2\)`. - Here, the largest model is not necessarily the best model. ] --- class:middle .pull-left[ **AIC = Akaike Information Criterion** `\(AIC = -2ln(Likelihood) + 2p\)` - **.red[The best model is the one with the smallest AIC]**. - The AIC is formed by two terms: - The likelihood: measure of fit - The penalty term: `\(2p\)`, accounts for adding more terms to the model. The first term **.red[always]** decreases as more terms are added to the model, so `\(2p\)` is needed for "balance". - AIC can be used whenever we have a likelihood, so this generalizes to many statistical models. ] -- .pull-right[ **BIC = Bayesian Information Criterion** `\(BIC = -2ln(Likelihood) + p ln(n)\)` - **.red[The best model is the one with the smallest BIC]**. - AIC and BIC are very similar - only the last term changes - BIC will always choose a model as small or smaller than the AIC (if using the same search strategy). ] --- class:middle ## Selection Strategies - Prediction - Now that we know how to evaluate model fit, we need to figure out how to find the model with the **.red[best]** fit. - **.red[Best subset:]** Search all possible models and take the one with the highest `\(R^2\)`, or lowest MSE/AIC/BIC, etc. - Such searches are typically only feasible when you have less than 30 potential predictor variables. - **.red[Stepwise (forward, backward, or both) searches:]** Useful when the number of potential predictor variables is large. --- class:middle ## Illustration with the COVID-19 in Kenya data ```r library(epib.704.data) data("covidkenya") glimpse(covidkenya) ``` ``` ## Rows: 355 ## Columns: 15 ## $ adm_sex <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Fe… ## $ sex <int> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, … ## $ adm_agemons <int> 20, 58, 5, 44, 39, 56, 15, 105, 6, 44, 77, 44, 7, 3… ## $ age_cat_new <chr> "12-23 months", "2-5 years", "<6 months", "2-5 year… ## $ covid_status <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ height <dbl> 72.60, 108.70, 67.40, 107.20, 112.15, NA, 78.50, 12… ## $ weight <dbl> 7.16, 17.40, 8.20, 15.30, 16.07, 14.40, 9.30, 24.30… ## $ hfaz <dbl> -4.35, -0.06, 0.54, 1.43, 3.56, NA, 0.01, -0.62, -3… ## $ stunting_cat <chr> "Severe stunting", "No stunting", "No stunting", "N… ## $ muac_average <dbl> 11.15, 15.50, 14.90, 14.30, 13.95, 14.55, 15.40, 16… ## $ nutritional_status <chr> "SAM", "No Malnutrition", "No Malnutrition", "No Ma… ## $ muac_cat <int> 1, 3, 3, 3, 3, 3, 3, NA, 2, 3, NA, 3, 3, 3, 3, 3, N… ## $ hh_covid_pos <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No… ## $ birth_order <chr> "First", "Second", "Third and above", "Second", "Se… ## $ breast_feeding1 <chr> "nobreastfeed", "nobreastfeed", "nobreastfeed", "no… ``` --- class:middle **Regression subset selection: Illustration with the COVID-19 in Kenya data** ```r library(leaps) #Regression subset selection package *regfit_full <- regsubsets(covid_status ~ ., data = covidkenya, method = "exhaustive") ``` ``` ## Reordering variables and trying again: ``` -- ```r reg_summary <- summary(regfit_full) summary(regfit_full) ``` ``` ## Subset selection object ## Call: regsubsets.formula(covid_status ~ ., data = covidkenya, method = "exhaustive") ## 19 Variables (and intercept) ## Forced in Forced out ## adm_sexMale FALSE FALSE ## adm_agemons FALSE FALSE ## age_cat_new12-23 months FALSE FALSE ## age_cat_new2-5 years FALSE FALSE ## age_cat_new6-11 months FALSE FALSE ## height FALSE FALSE ## weight FALSE FALSE ## hfaz FALSE FALSE ## stunting_catNo stunting FALSE FALSE ## stunting_catSevere stunting FALSE FALSE ## muac_average FALSE FALSE ## nutritional_statusNo Malnutrition FALSE FALSE ## nutritional_statusSAM FALSE FALSE ## muac_cat FALSE FALSE ## hh_covid_posYes FALSE FALSE ## birth_orderSecond FALSE FALSE ## birth_orderThird and above FALSE FALSE ## breast_feeding1nobreastfeed FALSE FALSE ## sex FALSE FALSE ## 1 subsets of each size up to 9 ## Selection Algorithm: exhaustive ## adm_sexMale sex adm_agemons age_cat_new12-23 months ## 1 ( 1 ) " " " " " " " " ## 2 ( 1 ) " " " " " " " " ## 3 ( 1 ) " " " " " " " " ## 4 ( 1 ) "*" " " " " " " ## 5 ( 1 ) " " "*" " " " " ## 6 ( 1 ) "*" " " "*" " " ## 7 ( 1 ) " " "*" "*" " " ## 8 ( 1 ) " " "*" "*" " " ## 9 ( 1 ) " " "*" "*" " " ## age_cat_new2-5 years age_cat_new6-11 months height weight hfaz ## 1 ( 1 ) " " " " " " "*" " " ## 2 ( 1 ) " " " " " " "*" " " ## 3 ( 1 ) " " " " "*" "*" " " ## 4 ( 1 ) " " " " "*" "*" " " ## 5 ( 1 ) " " " " "*" "*" " " ## 6 ( 1 ) " " " " "*" "*" " " ## 7 ( 1 ) " " " " "*" "*" " " ## 8 ( 1 ) " " " " "*" "*" " " ## 9 ( 1 ) " " " " "*" "*" " " ## stunting_catNo stunting stunting_catSevere stunting muac_average ## 1 ( 1 ) " " " " " " ## 2 ( 1 ) " " " " " " ## 3 ( 1 ) " " " " " " ## 4 ( 1 ) " " " " " " ## 5 ( 1 ) " " " " " " ## 6 ( 1 ) " " "*" " " ## 7 ( 1 ) " " "*" " " ## 8 ( 1 ) " " "*" " " ## 9 ( 1 ) " " "*" " " ## nutritional_statusNo Malnutrition nutritional_statusSAM muac_cat ## 1 ( 1 ) " " " " " " ## 2 ( 1 ) " " " " "*" ## 3 ( 1 ) " " " " "*" ## 4 ( 1 ) " " " " "*" ## 5 ( 1 ) " " "*" "*" ## 6 ( 1 ) " " " " "*" ## 7 ( 1 ) " " " " "*" ## 8 ( 1 ) " " "*" "*" ## 9 ( 1 ) "*" "*" "*" ## hh_covid_posYes birth_orderSecond birth_orderThird and above ## 1 ( 1 ) " " " " " " ## 2 ( 1 ) " " " " " " ## 3 ( 1 ) " " " " " " ## 4 ( 1 ) " " " " " " ## 5 ( 1 ) " " " " " " ## 6 ( 1 ) " " " " " " ## 7 ( 1 ) " " "*" " " ## 8 ( 1 ) " " "*" " " ## 9 ( 1 ) " " "*" " " ## breast_feeding1nobreastfeed ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ``` --- class:middle ### Illustration with the COVID-19 in Kenya data ```r par(mfrow = c(1,2)) plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l") adj_r2_max = which.max(reg_summary$adjr2) points(adj_r2_max, reg_summary$adjr2[adj_r2_max], col ="red", cex = 2, pch = 20) plot(reg_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l") bic_min = which.min(reg_summary$bic) points(bic_min, reg_summary$bic[bic_min], col = "red", cex = 2, pch = 20) ``` <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-9-1.png" width="60%" height="60%" style="display: block; margin: auto;" /> --- class:middle ```r plot(regfit_full, scale = "adjr2", main= "'adjr2' for COVID-19 in Kenya data") ``` <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-10-1.png" width="110%" height="50%" style="display: block; margin: auto;" /> --- class:middle ```r plot(regfit_full, scale = "bic", main= "'BIC' for COVID-19 in Kenya data") ``` <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-11-1.png" width="110%" height="50%" style="display: block; margin: auto;" /> --- class: middle ### Illustration of selection with the COVID-19 in Kenya data Subset selection object `\(\to\)` automatically selects the "best" sets: regsubsets.formula(covid_status ~ ., data = covidkenya, method = "exhaustive") - .blue[19 Variables/ parameter (and intercept) with 1 subsets of each size up to 9 variables] -- - Based on this, we would probably choose to include a model with .red[2-8 variables]: - Likely age, weight, hfaz (height for age), MUAC (continuous), hh_covid, depending on the choice of metrics... -- ## .red[Do you trust this approach?] .purple[Any concerns??] --- class:middle ### Even the _"smartest"_ software, program or technology will need, .red[until now], a bit of guidance... **Selecting a pool of variables to choose from, may help** ```r dat<- covidkenya %>% select(adm_sex, age_cat_new, stunting_cat, nutritional_status, hh_covid_pos, birth_order, breast_feeding1, covid_status, hfaz, muac_average) ``` -- **Then re-formulating the selection** ```r regfit_full1 <- regsubsets(covid_status ~ ., * data = dat, method = "exhaustive") ``` -- ```r reg_summary1 <- summary(regfit_full1) ``` --- class: middle ### Illustration of selection with the COVID-19 in Kenya data From the selected covariates, there are .blue[9 variables and 15 parameters] to be estimated. - The package proposes .red[8 models with 8 combinations] of the parameters/covariates. -- ``` ## Subset selection object ## Call: regsubsets.formula(covid_status ~ ., data = dat, method = "exhaustive") ## 15 Variables (and intercept) ## Forced in Forced out ## adm_sexMale FALSE FALSE ## age_cat_new>=5 years FALSE FALSE ## age_cat_new12-23 months FALSE FALSE ## age_cat_new2-5 years FALSE FALSE ## age_cat_new6-11 months FALSE FALSE ## stunting_catNo stunting FALSE FALSE ## stunting_catSevere stunting FALSE FALSE ## nutritional_statusNo Malnutrition FALSE FALSE ## nutritional_statusSAM FALSE FALSE ## hh_covid_posYes FALSE FALSE ## birth_orderSecond FALSE FALSE ## birth_orderThird and above FALSE FALSE ## breast_feeding1nobreastfeed FALSE FALSE ## hfaz FALSE FALSE ## muac_average FALSE FALSE ## 1 subsets of each size up to 8 ## Selection Algorithm: exhaustive ## adm_sexMale age_cat_new>=5 years age_cat_new12-23 months ## 1 ( 1 ) "*" " " " " ## 2 ( 1 ) "*" " " " " ## 3 ( 1 ) "*" " " " " ## 4 ( 1 ) "*" " " " " ## 5 ( 1 ) "*" " " " " ## 6 ( 1 ) "*" " " " " ## 7 ( 1 ) "*" "*" " " ## 8 ( 1 ) "*" "*" "*" ## age_cat_new2-5 years age_cat_new6-11 months stunting_catNo stunting ## 1 ( 1 ) " " " " " " ## 2 ( 1 ) " " " " " " ## 3 ( 1 ) "*" " " " " ## 4 ( 1 ) "*" " " " " ## 5 ( 1 ) "*" " " " " ## 6 ( 1 ) "*" " " " " ## 7 ( 1 ) " " " " " " ## 8 ( 1 ) " " "*" " " ## stunting_catSevere stunting nutritional_statusNo Malnutrition ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " "*" ## 3 ( 1 ) " " "*" ## 4 ( 1 ) " " "*" ## 5 ( 1 ) " " "*" ## 6 ( 1 ) " " "*" ## 7 ( 1 ) " " "*" ## 8 ( 1 ) " " "*" ## nutritional_statusSAM hh_covid_posYes birth_orderSecond ## 1 ( 1 ) " " " " " " ## 2 ( 1 ) " " " " " " ## 3 ( 1 ) " " " " " " ## 4 ( 1 ) "*" " " " " ## 5 ( 1 ) "*" "*" " " ## 6 ( 1 ) "*" "*" "*" ## 7 ( 1 ) "*" "*" " " ## 8 ( 1 ) "*" "*" "*" ## birth_orderThird and above breast_feeding1nobreastfeed hfaz ## 1 ( 1 ) " " " " " " ## 2 ( 1 ) " " " " " " ## 3 ( 1 ) " " " " " " ## 4 ( 1 ) " " " " " " ## 5 ( 1 ) " " " " " " ## 6 ( 1 ) " " " " " " ## 7 ( 1 ) "*" "*" " " ## 8 ( 1 ) " " " " " " ## muac_average ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ``` --- class: middle **Illustration of selection with the COVID-19 in Kenya data** <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-16-1.png" width="100%" height="60%" style="display: block; margin: auto;" /> --- class:middle <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-17-1.png" width="120%" height="60%" style="display: block; margin: auto;" /> --- class:middle <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-18-1.png" width="120%" style="display: block; margin: auto;" /> --- class: middle ### Other Subset Selection ```r #Backwards selection regsubsets(outcome ~ ., data = covidkenya, method = "backward") #Forwards selection regsubsets(outcome ~ ., data = covidkenya, method = "forward") #Stepwise selection regsubsets(outcome ~ ., data = covidkenya, method = "seqrep") ``` <br> **.red[Still any decision should be made with caution!]** --- class:middle ## Selection Strategies: .red[Caution!] - Different model fit metrics (e.g., `\(R^2\)`, AIC, Mallow's Cp) can show different models as the "best". <br> -- <br> - Can be a recipe for a type I error (chance significant findings) .red[recall Ioannidis paper?] - Particularly when the number of candidate predictors is large relative to the sample size. <br> -- <br> - Automated selection procedures make it easy for researchers to ignore good practices for causal inference, like .blue[choosing variables based on a DAG]. <br> -- <br> - High .purple[risk of overfitting] to your data set - Selected variables may strongly discriminate among individuals in your data set, but may have less ability to do so on other data --- class: middle ##Overall framework for model specification: 1) Variable specification: > .blue[ - What is the universe of variables I would consider? - There is a limited number available, some of the ones we would need, will not be. - We need to think this through and choose. - Leaving variables out is a strong assumption about that not having an effect ( `\(\beta = 0\)` ) ] 2) Interaction assessment (including assessment of heterogeneity) 3) Confounding assessment (including consideration of precision) --- class: middle ##Selection and Specification > _"The approach is guided by several principles, including the adherence to a hierarchically defined initial (full) model, and a backward elimination strategy._ <br> -- <br> **Corollaries** - Collinearity, correlation (other hierarchical structures) - Moderate or correct for your own Degrees of Freedom - Multiple testing and Bonferroni corrections --- class: middle ## Summary - **.red[Prediction and inference are different types of problems]**. - If the goal is causal inference, we must think about confounding (and other biases!) and model interpretation is important. -- <br> - If the goal is prediction, model interpretation can be less important, and the choice of predictor variables is driven more by the data we have available and model evaluation metrics. -- <br> - Model selection strategies can be used for both, but much more care must be taken if you choose to use them for an inference question --- class:inverse, center, middle ##“All models are wrong, but some are useful” — George Box (1919-2013) --- class: middle ### QUESTIONS? ## COMMENTS? # RECOMMENDATIONS? --- class: middle ## Resources: - Greenland et al. 2016. "Outcome modelling strategies in epidemiology: traditional methods and basic alternatives." Int. J. Epidemiol. - Steyerberg, Ewout W. 2019. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer, Cham. - Harrell, Frank E., Jr. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer Series in Statistics. Cham: Springer International Publishing. --- class:middle ## Miscelaneous --- class: middle ### Goodness-of-Fit Statistics and model metrics - **R-squared** `\(R^2\)` : % of variation in `Y` explained by `"predictors"` variables. The higher `\(R^2\)`, the better the model. - **Root Mean Squared Error (RMSE)**: the average error performed by the model in predicting the outcome for an observation. RMSE = sqrt(MSE); The `\(\uparrow\)` the RMSE, the better the model. - **Residual Standard Error (RSE):** the model sigma, a variant of RMSE adjusted # predictors in the model. The `\(\downarrow\)` RSE, the better the model. - **Mean Absolute Error (MAE)**, measures the prediction error. MAE = mean(abs(observed- predicted)). Less sensitive to outliers compared to RMSE. --- class: middle ### Goodness-of-Fit Statistics and model metrics - **AIC:** (Akaike’s Information Criteria) penalizes the inclusion of additional variables to a model. - **AICc:** is a version of AIC corrected for small sample sizes. - **BIC:** (or Bayesian information criteria) is a variant of AIC with a stronger penalty. - **Mallows Cp:** A variant of AIC developed by Colin Mallows. - **WAIC:** Widely (Watanabe) Application Criterion (Bayesian) - **PSIS:** Pareto-Smoothed Importance Sampling [(Bayesian)](https://mouse-imaging-centre.github.io/blog/post/2018-01-31_loo-intro/) - **DIC:** Deviance Information Criterion (Bayesian) --- class: middle ## Evaluating Prediction Models - The model fit in the **training data** (i.e., the data used to develop a predictive model) is not our primary interest - What we are primarily interested in is the accuracy of our model predictions **when our model is applied to new data** that was not used as part of the model development process (i.e., **test data**) - **.red[A good model fit in the training data doesn't necessarily ensure a good ability to predict using other data]**. --- class: middle ###Paramaterization, Trends, Dose-Response? **<span style="color:darkmagenta"> Recall this?</span>**
<caption class='gt_caption'><span class='gt_from_md'><strong>Table 1. Summary of Study Characteristics</strong></span></caption>
Characteristic
N = 355
1
covid_status
55 (16%)
adm_agemons
Median (Q1, Q3)
30 (15, 63)
muac_average
Median (Q1, Q3)
14.70 (13.40, 15.70)
caregiver_educ1
None
21 (5.9%)
Primary
216 (61%)
Secondary
87 (25%)
Above secondary
30 (8.5%)
nutrition
malnutrition
123 (35%)
nutrition
230 (65%)
1
Median (IQR) or Frequency (%); ‘adm_agemons’ is age in months; ‘muac_average’ is measured in cms; ‘nutrition’, and ‘muac’ are measures of nutritional status; ‘caregiver_edu’ is education level of caregiver
[Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle ###Paramaterization, Trends, Dose-Response? Let's consider Education of the caregiver and nutritional status measured by Mid Upper Arm Circumference (MUAC) in cm, as the continuous outcome. .pull-left[ <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-21-1.png" width="80%" style="display: block; margin: auto;" /> ] -- .pull-right[ ``` ## # A tibble: 5 × 2 ## caregiver_educ_n m.muac ## <dbl> <dbl> ## 1 0 15.2 ## 2 1 14.6 ## 3 2 14.7 ## 4 3 15.2 ## 5 NA 14.4 ``` ] [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle ###Paramaterization, Trends, Dose-Response? How can we assess their relationship? **Using a continuous variable <span style="color:red">assumptions?</span>** ```r muac1<- glm(muac_average ~ caregiver_educ_n, data = L25data) round(j_summ(muac1, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% t val. p ## (Intercept) 14.67 14.11 15.22 52.15 0.00 ## caregiver_educ_n -0.03 -0.39 0.33 -0.18 0.86 ``` -- **Using a categorical version of the variable <span style="color:red">assumptions?</span>** ```r muac2<- glm(muac_average ~ factor(caregiver_educ_n), data = L25data) round(j_summ(muac2, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% t val. p ## (Intercept) 15.23 14.18 16.29 28.23 0.00 ## factor(caregiver_educ_n)1 -0.65 -1.76 0.46 -1.15 0.25 ## factor(caregiver_educ_n)2 -0.79 -1.97 0.39 -1.31 0.19 ## factor(caregiver_educ_n)3 -0.24 -1.62 1.14 -0.35 0.73 ``` [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle ###Paramaterization, Trends, Dose-Response? Let's consider COVID-19 status (disease yes/no) and Education, can we assess their relationship? .pull-left[ <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-25-1.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-26-1.png" width="80%" style="display: block; margin: auto;" /> ] [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle ###Paramaterization, Trends, Dose-Response? **Dichotomous Outcome** .pull-left[ ```r muac3<- glm(covid_status ~ caregiver_educ_n, data = L25data, family= binomial(link = "logit")) round(j_summ(muac3, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% z val. p ## (Intercept) -1.44 -2.05 -0.83 -4.64 0.00 ## caregiver_educ_n -0.18 -0.60 0.24 -0.85 0.39 ``` ```r muac4<- glm(covid_status ~ factor(caregiver_educ_n), data = L25data, family= binomial(link = "logit")) round(j_summ(muac4, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% z val. p ## (Intercept) -1.16 -2.17 -0.16 -2.27 0.02 ## factor(caregiver_educ_n)1 -0.47 -1.54 0.59 -0.87 0.38 ## factor(caregiver_educ_n)2 -0.85 -2.05 0.35 -1.39 0.16 ## factor(caregiver_educ_n)3 -0.41 -1.80 0.99 -0.57 0.57 ``` ] -- .pull-right[ <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-29-1.png" width="80%" style="display: block; margin: auto;" /> ] [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle ###Paramaterization, Trends, Dose-Response? What happens when the referent group has a very small sample size? - OR estimates are all imprecise <span style="color:darkred">(since the imprecision for the referent group is propogated)</span> - One way are this is to use incremental coding of the dummy variables ```r L25data$edu1<- L25data$edu2<-L25data$edu3<-0 L25data$edu1[L25data$caregiver_educ_n>=1]<- 1 L25data$edu2[L25data$caregiver_educ_n>=2]<- 1 L25data$edu3[L25data$caregiver_educ_n>=3]<- 1 L25data$edu1[is.na(L25data$caregiver_educ_n)==T]<- NA L25data$edu2[is.na(L25data$caregiver_educ_n)==T]<- NA L25data$edu3[is.na(L25data$caregiver_educ_n)==T]<- NA ``` [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle **Paramaterization, Trends, Dose-Response?** .pull-left[ **Incremental categories*** ```r *muac5<- glm(covid_status ~ edu1+edu2+edu3, data = L25data, family= binomial(link = "logit")) round(j_summ(muac5, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% z val. p ## (Intercept) -1.16 -2.17 -0.16 -2.27 0.02 ## edu1 -0.47 -1.54 0.59 -0.87 0.38 ## edu2 -0.38 -1.13 0.38 -0.98 0.33 ## edu3 0.45 -0.72 1.61 0.75 0.45 ``` **OR what we usually do...** ```r muac6<- glm(covid_status ~ caregiver_educ1, data = L25data, family= binomial(link = "logit")) round(j_summ(muac6, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% z val. p ## (Intercept) -1.16 -2.17 -0.16 -2.27 0.02 ## caregiver_educ1Primary -0.47 -1.54 0.59 -0.87 0.38 ## caregiver_educ1Secondary -0.85 -2.05 0.35 -1.39 0.16 ## caregiver_educ1Above secondary -0.41 -1.80 0.99 -0.57 0.57 ``` ] -- .pull-right[ <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-33-1.png" width="80%" style="display: block; margin: auto;" /> ] .small[ *Incremental means, that the reference is the immediately previous category/group [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) ] --- class: middle **Splines examples!** .pull-left[ - `Y`= muac_average (MUAC in cms); - `X`= hfaz (Height for age, Z-score) <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-34-1.png" width="80%" style="display: block; margin: auto;" /><img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-34-2.png" width="80%" style="display: block; margin: auto;" /> ] -- .pull-right[ ```r require(splines) lmmod6.11a<- glm(muac_average~ breast_feeding1 + adm_sex + age_cat_new+ * ns(hfaz, df=2), data = covidkenya) #round(j_summ(lmmod6.11a, confint = T)$coeftable, 2) lmmod6.11b<- glm(muac_average~ breast_feeding1 + adm_sex + age_cat_new + * ns(hfaz, df=5), data = covidkenya) round(j_summ(lmmod6.11b, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% t val. p ## (Intercept) 12.95 9.55 16.35 7.47 0.00 ## breast_feeding1nobreastfeed -0.08 -0.74 0.59 -0.22 0.83 ## adm_sexMale 0.08 -0.34 0.49 0.36 0.72 ## age_cat_new>=5 years 3.99 3.01 4.96 8.02 0.00 ## age_cat_new12-23 months 1.58 0.66 2.50 3.38 0.00 ## age_cat_new2-5 years 2.53 1.59 3.46 5.31 0.00 ## age_cat_new6-11 months 1.24 0.26 2.22 2.48 0.01 ## ns(hfaz, df = 5)1 -0.64 -3.91 2.63 -0.38 0.70 ## ns(hfaz, df = 5)2 0.03 -3.32 3.39 0.02 0.98 ## ns(hfaz, df = 5)3 3.32 0.24 6.40 2.11 0.04 ## ns(hfaz, df = 5)4 -2.67 -9.94 4.60 -0.72 0.47 ## ns(hfaz, df = 5)5 7.34 3.22 11.46 3.49 0.00 ``` ```r #attr(terms(lmmod6.11b), "predvars") ``` ] --- class: middle **Last Splines examples!** .pull-left[ - `Y`= muac_average (MUAC in cms); - `X`= adm_agemons (Age in months) <img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-36-1.png" width="80%" style="display: block; margin: auto;" /><img src="L22_EPIB704_Model_Building_files/figure-html/unnamed-chunk-36-2.png" width="80%" style="display: block; margin: auto;" /> ] -- .pull-right[ ```r lmmod6.12a<- glm(muac_average~ breast_feeding1 + adm_sex + * ns(adm_agemons, df=5), data = covidkenya) #round(j_summ(lmmod6.12a, confint = T)$coeftable, 2) lmmod6.12b<- glm(muac_average~ breast_feeding1 + adm_sex + * poly(adm_agemons, 5, raw=T), data = covidkenya) round(j_summ(lmmod6.12b, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% t val. p ## (Intercept) 11.70 10.75 12.66 23.96 0.00 ## breast_feeding1nobreastfeed -0.29 -0.97 0.38 -0.85 0.40 ## adm_sexMale 0.13 -0.30 0.56 0.59 0.56 ## poly(adm_agemons, 5, raw = T)1 0.21 0.09 0.34 3.31 0.00 ## poly(adm_agemons, 5, raw = T)2 0.00 -0.01 0.00 -1.90 0.06 ## poly(adm_agemons, 5, raw = T)3 0.00 0.00 0.00 1.11 0.27 ## poly(adm_agemons, 5, raw = T)4 0.00 0.00 0.00 -0.52 0.61 ## poly(adm_agemons, 5, raw = T)5 0.00 0.00 0.00 0.08 0.94 ``` ```r #attr(terms(lmmod6.12b), "predvars") ``` ]