class: left, middle, inverse, title-slide .title[ # Model selection ] .subtitle[ ## & some other Miscelaneous ] .author[ ### Mabel Carabali ] .institute[ ### EBOH, McGill University ] .date[ ### updated: 2025-12-02 ] --- class: middle ##<span style="color:darkmagenta">What do you think?</span> <img src="images/alcohol_stroke.png" width="75%" style="display: block; margin: auto;" /> [Effects of Heavy Alcohol Use on Acute Intracerebral Hemorrhage and Cerebral Small Vessel Disease](https://www.neurology.org/doi/10.1212/WNL.0000000000214348) Hindsholm, M., et al.,. (2025). Neurology, 105 (11), doi: 10.1212/WNL.0000000000214348. --- class: middle **Expected Competencies** - Understand the difference between predictive and etiological epidemiology principles. - Knows statistical assumptions for different regression models. - Knows basic model fit or "goodness-of-fit" statistics. -- ## Objectives - To revise the principles and differences between predictive and etiological research. - To revise the use of main "goodness-of-fit" statistics. - To revise the overall framework for model parameterization and specification for prediction models. - To revise variable parameterization and selection. --- class:middle ## What is model selection? - There are two important considerations when building a regression model: - The kind of outcome data you want to model (continuous, count, binary, etc.) - **.black[The variables and interactions you include in your model]** <br> -- <br> - **.red[Model selection]** is a process that attempts to find the best model for a given purpose. <br> -- <br> - The two main purposes of regression models are: - **.red[Causal Inference:]** To estimate the effect of one or more variables while adjusting for the possible confounding effects of other variables. - **.red[Prediction:]** To predict outcomes for a set of similar individuals. --- class: middle ## Prediction: Examples Not only .red[weather]... - Credit scores - Netflix recommendations - Cardiovascular risk scores `\(^*\)` - Kidney function estimates `\(^*\)` - Predicting mortality <br> `\(^*\)` Some, including the misstep "race correction", different form "race-adjustment". --- class: middle **Prediction Examples** .pull-left[ Framingham cardiovascular disease (10-year risk) [calculator](https://ccs.ca/frs/) <img src="images/ccs_frs.png" width="75%" style="display: block; margin: auto;" /><img src="images/ccs_frs1.png" width="75%" style="display: block; margin: auto;" /> ] -- .pull-right[ Or [AHA's CV Risk Calculator](https://static.heart.org/riskcalc/app/index.html#!/baseline-risk) <img src="images/ascvd.png" width="80%" style="display: block; margin: auto;" /><img src="images/ascvd1.png" width="80%" style="display: block; margin: auto;" /> ] --- class:middle ## Causal Inference vs. Prediction **.red[Important:]** We can't interpret coefficients from prediction models as causal parameters - Not appropriately controlled for confounding (because we aren't worried about this) - .red[This is often a pitfall that people fall into!] -- <img src="images/table2fallacy.png" width="80%" style="display: block; margin: auto;" /> --- class:inverse, center, middle # OK, but how do I build a model? --- class:middle ## What is required to build a "Good" model? **.blue[Criteria:]** - Is our .blue[research question] causal or predictive? + This will determine the "philosophy" we use to build our model - Sample size (Power and .blue[Precision]) - Choice of statistical model/.blue[distribution/link] (Poisson, logistic, linear, etc.) - Consideration of different .blue[sources of bias] (for causal inference). - What data do we have available to us? --- class: middle ### Overall framework for model building / specification: 1) Variable specification (including assessment of number and parameterization) 2) Interaction assessment (including assessment of heterogeneity)* 3) Confounding assessment (including consideration of precision)* <br> *Steps 1 and 2 can occur iteratively and depending on the research question investigated differently. --- class: middle ### When do we say our model is "good"? - Variable specification Especially when we have many predictor variables (with many possible interactions), it can be difficult to find a good model. - Which main effects do we include? - Which interactions do we include? - With 6 variables, there are .red[64 potential models] with just main effects! --- class: middle ## What did we get from our "dream model"? <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-5-1.png" width="110%" style="display: block; margin: auto;" /> --- class: middle ## What did we get from our "dream model"? .pull-left[ **Outcome** <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-6-1.png" width="110%" style="display: block; margin: auto;" /> ] -- .pull-right[ **Exposures** <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-7-1.png" width="110%" style="display: block; margin: auto;" /> ] --- class: middle ### When do we say our model is "good"? - Variable specification - An active research problem in statistics - Model selection procedures try to simplify this task. -- **Evaluating Model Fit** To implement a model selection procedure, we first need a criterion to compare models. The goal is to select the model with the optimal value of the criterion. To that end, we assess the: **Goodness-of-Fit Statistics and use some statistical model metrics** --- class:middle ### Evaluating Model Fit: .red[Recall] .pull-left[ **R-squared `\(R^2\)`** - `\(R^2\)` represents the proportion of variance in the outcome explained by all the predictors in the model. - **.red[Criterion:]** Choose the model with the largest `\(R^2\)` - **.red[Problem:]** `\(R^2\)` always increases with model size - If only use `\(R^2\)` means simply choosing the largest model: .red[not very useful]. - Helpful when comparing models with the same number of parameters. ] -- .pull-right[ **Adj. R-squared `\(R^2\)`** - Adjusted `\(R^2\)` represents the proportion of variance in the outcome explained by all the predictors in the model, **.red[penalized for the number of variables included in the model]**. - **.red[Criterion:]** Choose the model with the largest adjusted `\(R^2\)`. - Here, the largest model is not necessarily the best model. ] --- class:middle .pull-left[ **AIC = Akaike Information Criterion** `\(AIC = -2ln(Likelihood) + 2p\)` - **.red[The best model is the one with the smallest AIC]**. - The AIC is formed by two terms: - The likelihood: measure of fit - The penalty term: `\(2p\)`, accounts for adding more terms to the model. The first term **.red[always]** decreases as more terms are added to the model, so `\(2p\)` is needed for "balance". - AIC can be used whenever we have a likelihood, so this generalizes to many statistical models. ] -- .pull-right[ **BIC = Bayesian Information Criterion** `\(BIC = -2ln(Likelihood) + p ln(n)\)` - **.red[The best model is the one with the smallest BIC]**. - AIC and BIC are very similar - only the last term changes - BIC will always choose a model as small or smaller than the AIC (if using the same search strategy). ] --- class:middle ## Selection Strategies - Prediction - Now that we know how to evaluate model fit, we need to figure out how to find the model with the **.red[best]** fit. - **.red[Best subset:]** Search all possible models and take the one with the highest `\(R^2\)`, or lowest MSE/AIC/BIC, etc. - Such searches are typically only feasible when you have less than 30 potential predictor variables. - **.red[Stepwise (forward, backward, or both) searches:]** Useful when the number of potential predictor variables is large. --- class:middle ## Illustration with the datasets used in 704 (1) ``` r library(epib.704.data) data("Amazonas_HQoL") glimpse(Amazonas_HQoL) ``` ``` ## Rows: 1,179 ## Columns: 47 ## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,… ## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,… ## $ morador <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,… ## $ FILTRO <int> 3, 3, 2, 3, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 2, 3, 3, 3, 2, 3,… ## $ P72 <int> 1, 1, 1, 2, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,… ## $ P73 <int> 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,… ## $ P74 <int> 1, 1, 1, 2, 1, 1, 3, 1, 2, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,… ## $ P75 <int> 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 2, 2, 2,… ## $ P76 <int> 1, 1, 1, 2, 2, 1, 2, 1, 1, 3, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2,… ## $ P77 <int> 90, 80, 80, 50, 40, 80, 50, 70, 50, 10, 80, 60, 80, 80, 70, 80… ## $ P1 <int> 9, 6, 7, 3, 4, 2, 5, 2, 3, 1, 1, 3, 4, 2, 2, 4, 7, 2, 1, 1, 1,… ## $ P11 <int> 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 2,… ## $ P12 <int> 6, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 1, 3,… ## $ P13 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1,… ## $ P14 <int> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1,… ## $ P15 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1,… ## $ P16 <int> 3, 5, 4, 5, 4, 5, 6, 5, 2, 4, 3, 2, 2, 1, 4, 4, 3, 6, 3, 3, 1,… ## $ P17 <int> 2, 1, 1, 2, 2, 3, 1, 1, 1, 1, 1, 2, 0, 0, 0, 1, 1, 1, 1, 1, 1,… ## $ P18 <int> 3, 4, 4, 4, 3, 3, 4, 3, 3, 3, 3, 3, NA, NA, NA, 3, 2, 4, 4, 1,… ## $ P19 <int> 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 1,… ## $ P20A <int> 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,… ## $ P20B <int> 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1,… ## $ P20C <int> 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,… ## $ P20D <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ P20E <int> 1, 1, 1, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,… ## $ P20F <int> 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ P20G <int> 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 2, 0, 0, 0,… ## $ P20H <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,… ## $ P20I <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,… ## $ P20J <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ P20K <int> 0, 2, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,… ## $ P20L <int> 2, 2, 4, 1, 1, 0, 1, 2, 3, 1, 1, 1, 3, 0, 2, 2, 4, 2, 2, 1, 1,… ## $ P20M <int> 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,… ## $ P20N <int> 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2,… ## $ P21 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… ## $ P24 <int> 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 9, 2,… ## $ P28 <int> 2, 7, 7, 2, 3, 2, 7, 7, 9, 3, 5, 7, 6, 2, 3, 7, 6, 2, 2, 4, 2,… ## $ P5 <int> 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2,… ## $ P4 <int> 44, 35, 32, 50, 50, 79, 46, 47, 31, 63, 59, 20, 31, 46, 39, 18… ## $ P78 <int> 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,… ## $ P81A <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1,… ## $ P81B <int> 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1,… ## $ P81C <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,… ## $ P81D <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2,… ## $ P82 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2,… ## $ P83 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,… ## $ ZONA <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,… ``` --- class:middle **Regression subset selection: Illustration with the "Amazonas_HQoL" data** ``` r library(leaps) #Regression subset selection package *regfit_full.1 <- regsubsets(binQoL ~ ., data = Amazonas_HQoL, method = "exhaustive", really.big=T) ``` ``` ## Reordering variables and trying again: ``` .red[Need to specify when >50 vars] -- ``` r reg_summary.1 <- summary(regfit_full.1) summary(regfit_full.1) ``` ``` ## Subset selection object ## Call: regsubsets.formula(binQoL ~ ., data = Amazonas_HQoL, method = "exhaustive", ## really.big = T) ## 58 Variables (and intercept) ## Forced in Forced out ## X FALSE FALSE ## ID FALSE FALSE ## morador FALSE FALSE ## FILTRO FALSE FALSE ## P72 FALSE FALSE ## P73 FALSE FALSE ## P74 FALSE FALSE ## P75 FALSE FALSE ## P76 FALSE FALSE ## P77 FALSE FALSE ## P1 FALSE FALSE ## P11 FALSE FALSE ## P12 FALSE FALSE ## P13 FALSE FALSE ## P14 FALSE FALSE ## P15 FALSE FALSE ## P16 FALSE FALSE ## P17 FALSE FALSE ## P18 FALSE FALSE ## P19 FALSE FALSE ## P20A FALSE FALSE ## P20B FALSE FALSE ## P20C FALSE FALSE ## P20D FALSE FALSE ## P20E FALSE FALSE ## P20F FALSE FALSE ## P20G FALSE FALSE ## P20H FALSE FALSE ## P20I FALSE FALSE ## P20J FALSE FALSE ## P20K FALSE FALSE ## P20L FALSE FALSE ## P20M FALSE FALSE ## P20N FALSE FALSE ## P21 FALSE FALSE ## P24 FALSE FALSE ## P28 FALSE FALSE ## P5 FALSE FALSE ## P4 FALSE FALSE ## P78 FALSE FALSE ## P81A FALSE FALSE ## P81B FALSE FALSE ## P81C FALSE FALSE ## P81D FALSE FALSE ## P82 FALSE FALSE ## P83 FALSE FALSE ## ZONA FALSE FALSE ## exposure FALSE FALSE ## age_category1 FALSE FALSE ## age_category2 FALSE FALSE ## age_category3 FALSE FALSE ## age_category4 FALSE FALSE ## educat FALSE FALSE ## chronic FALSE FALSE ## P77_1 FALSE FALSE ## education FALSE FALSE ## P24_1 FALSE FALSE ## P11_1 FALSE FALSE ## 1 subsets of each size up to 9 ## Selection Algorithm: exhaustive ## X ID morador FILTRO P72 P73 P74 P75 P76 P77 P1 P11 P12 P13 P14 P15 ## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " " " " " " " " " " " "*" " " " " " " " " " " " " ## 3 ( 1 ) " " " " " " " " "*" " " " " " " " " " " " " " " " " " " " " " " ## 4 ( 1 ) " " " " " " "*" "*" " " " " " " " " " " " " " " " " " " " " " " ## 5 ( 1 ) " " " " " " "*" "*" " " " " " " " " " " " " " " " " " " " " " " ## 6 ( 1 ) " " " " " " "*" "*" " " " " " " " " " " " " " " " " " " " " " " ## 7 ( 1 ) " " " " " " "*" "*" " " " " " " " " "*" " " " " " " " " " " " " ## 8 ( 1 ) " " " " " " "*" "*" " " " " " " " " " " " " " " " " " " " " " " ## 9 ( 1 ) " " " " " " "*" "*" " " " " " " " " " " " " " " " " " " " " " " ## P16 P17 P18 P19 P20A P20B P20C P20D P20E P20F P20G P20H P20I P20J P20K ## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 4 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 5 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " " " " " " " " ## 6 ( 1 ) "*" " " " " "*" " " " " " " " " " " " " " " " " " " " " " " ## 7 ( 1 ) "*" " " " " "*" " " " " " " " " " " " " " " " " " " " " " " ## 8 ( 1 ) "*" " " " " "*" " " " " " " "*" " " " " " " " " " " " " " " ## 9 ( 1 ) "*" " " " " "*" " " " " " " "*" " " " " " " " " " " " " " " ## P20L P20M P20N P21 P24 P28 P5 P4 P78 P81A P81B P81C P81D P82 P83 ## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " " " " " " " ## 3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 4 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 5 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 6 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 7 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 8 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## 9 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " " " " " " " " " " ## ZONA P77_1 exposure age_category1 age_category2 age_category3 ## 1 ( 1 ) " " "*" " " " " " " " " ## 2 ( 1 ) " " " " " " " " " " " " ## 3 ( 1 ) " " "*" " " " " " " " " ## 4 ( 1 ) " " "*" " " " " " " " " ## 5 ( 1 ) " " "*" " " " " " " " " ## 6 ( 1 ) " " "*" " " " " " " " " ## 7 ( 1 ) " " " " " " " " " " " " ## 8 ( 1 ) " " "*" " " " " " " " " ## 9 ( 1 ) " " "*" " " "*" " " " " ## age_category4 education educat chronic P24_1 P11_1 ## 1 ( 1 ) " " " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " " " " " ## 3 ( 1 ) " " "*" " " " " " " " " ## 4 ( 1 ) " " "*" " " " " " " " " ## 5 ( 1 ) " " "*" " " " " " " " " ## 6 ( 1 ) " " "*" " " " " " " " " ## 7 ( 1 ) " " "*" " " " " " " " " ## 8 ( 1 ) " " "*" " " " " " " " " ## 9 ( 1 ) " " "*" " " " " " " " " ``` --- class:middle ### Illustration with the "Amazonas_HQoL" data ``` r par(mfrow = c(1,2)) plot(reg_summary.1$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l") adj_r2_max.1 = which.max(reg_summary.1$adjr2) points(adj_r2_max.1, reg_summary.1$adjr2[adj_r2_max.1], col ="red", cex = 2, pch = 20) plot(reg_summary.1$bic, xlab = "Number of Variables", ylab = "BIC", type = "l") bic_min.1 = which.min(reg_summary.1$bic) points(bic_min.1, reg_summary.1$bic[bic_min.1], col = "red", cex = 2, pch = 20) ``` <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-12-1.png" width="60%" height="60%" style="display: block; margin: auto;" /> --- class:middle ### Illustration with the "Amazonas_HQoL" data ``` r plot(regfit_full.1, scale = "adjr2", main= "'adjr2' for 'Amazonas_HQoL' data") ``` <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-13-1.png" width="110%" height="50%" style="display: block; margin: auto;" /> --- class:middle ### Illustration with the "Amazonas_HQoL" data ``` r plot(regfit_full.1, scale = "bic", main= "'BIC' for 'Amazonas_HQoL' data") ``` <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-14-1.png" width="110%" height="50%" style="display: block; margin: auto;" /> --- class: middle ## What did we get from our "dream model"? |What dataset you used? | Write the equation of the model | |:---------------:|:----------------------------------------------------------| |Amazonas_HQoL| VAS ~ any_malaria + SES_Index_scaled + Age + Sex +ZONA | |Amazonas_HQoL| "logit(P(Yi=1))=β0+β1Xi, when adding in the weights | |Amazonas_HQoL | Prob(HRQoL_VeryHigh=1∣Malaria)]=β0+β1(Recent Malaria)| |Amazonas_HQoL |"Model <- glm(binQoL ~ Status2, data = Data1, family = binomial(), weights = w.out1$weights)" | |Amazonas_HQoL | "Model fitted: weighted logistic regression: log{(P(Y=0∣Matched Set)/P(Y=1∣Matched Set)} = B0 + Bexposure.Xexposure+ Bsex.Xsex +Bage.Xage + BSES.XSES + Bresidence.Xresidence | --- class:middle ### Illustration with the dataset used in 704 (2) ``` r glimpse(Covid_Bangladesh) ``` ``` ## Rows: 2,502 ## Columns: 27 ## $ ex <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1… ## $ hhid <int> 2, 2, 3, 3, 6, 6, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14… ## $ memid <int> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, … ## $ gih_4 <chr> "Dhaka", "Dhaka", "Dhaka", "Dhaka", "Dhaka", "Dhaka", "Dhaka… ## $ village <chr> "Duaripara", "Duaripara", "Duaripara", "Duaripara", "Duaripa… ## $ vdoses <int> 2, 2, 2, 2, 1, 2, 1, 3, 3, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, … ## $ dose2 <int> 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, … ## $ dose3 <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ age_cat <chr> "18-40 years old", "18-40 years old", "18-40 years old", "18… ## $ gender <chr> "Male", "Female", "Female", "Male", "Female", "Male", "Male"… ## $ marital <chr> "Married", "Married", "Married", "Separated/Divorced/Widow(e… ## $ educ <chr> "No education", "Primary or less", "Primary or less", "Above… ## $ occup2 <chr> "Day labor", "Service", "Day labor", "Business or self-emplo… ## $ relation2 <chr> "HH head", "Spouse", "Spouse", "Others", "Spouse", "Others",… ## $ sesh_15_5 <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, … ## $ sesh_15_7 <int> 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, … ## $ sesi_51 <chr> "No", "No", "Yes", "No", "No", "No", "No", "No", "No", "No",… ## $ chroill_1 <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", … ## $ acill_26 <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "No",… ## $ mi_1 <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", … ## $ gih_15 <int> 5, 5, 4, 4, 6, 6, 4, 4, 6, 6, 4, 4, 2, 3, 3, 11, 11, 4, 4, 2… ## $ income <dbl> 10000, 10000, 20000, 20000, 30000, 30000, 16000, 16000, 4000… ## $ wbscore <int> 12, 20, 76, 68, 80, 36, 52, 64, 32, 36, 40, 40, 20, 60, 80, … ## $ distance <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ wbbin <fct> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, … ## $ exposure <int> 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, … ## $ income1 <dbl> 10000, 10000, 20000, 20000, 30000, 30000, 16000, 16000, 4000… ``` --- class:middle **Regression subset selection: Illustration with the "Covid_Bangladesh" data** ``` r library(leaps) #Regression subset selection package Covid_Bangladesh1 <- Covid_Bangladesh %>% select(wbbin, exposure, age_cat, gender, marital, educ,occup2, vdoses, dose3, chroill_1, income1, wbscore, relation2, sesh_15_7, sesh_15_5, acill_26) *regfit_full.2 <- regsubsets(wbbin ~ ., data = Covid_Bangladesh1, method = "exhaustive") ``` -- ``` r reg_summary.2 <- summary(regfit_full.2) summary(regfit_full.2) ``` ``` ## Subset selection object ## Call: regsubsets.formula(wbbin ~ ., data = Covid_Bangladesh1, method = "exhaustive") ## 20 Variables (and intercept) ## Forced in Forced out ## exposure FALSE FALSE ## age_catAbove 40 FALSE FALSE ## genderMale FALSE FALSE ## maritalSeparated/Divorced/Widow(er)/Unmarried FALSE FALSE ## educNo education FALSE FALSE ## educPrimary or less FALSE FALSE ## occup2Business or self-employed FALSE FALSE ## occup2Day labor FALSE FALSE ## occup2Others FALSE FALSE ## occup2Service FALSE FALSE ## vdoses FALSE FALSE ## dose3 FALSE FALSE ## chroill_1Yes FALSE FALSE ## income1 FALSE FALSE ## wbscore FALSE FALSE ## relation2Others FALSE FALSE ## relation2Spouse FALSE FALSE ## sesh_15_7 FALSE FALSE ## sesh_15_5 FALSE FALSE ## acill_26Yes FALSE FALSE ## 1 subsets of each size up to 8 ## Selection Algorithm: exhaustive ## exposure age_catAbove 40 genderMale ## 1 ( 1 ) " " " " " " ## 2 ( 1 ) " " " " " " ## 3 ( 1 ) " " " " " " ## 4 ( 1 ) " " " " " " ## 5 ( 1 ) " " " " " " ## 6 ( 1 ) " " " " " " ## 7 ( 1 ) " " " " " " ## 8 ( 1 ) " " " " " " ## maritalSeparated/Divorced/Widow(er)/Unmarried educNo education ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " "*" ## 3 ( 1 ) " " "*" ## 4 ( 1 ) " " "*" ## 5 ( 1 ) " " "*" ## 6 ( 1 ) " " "*" ## 7 ( 1 ) " " "*" ## 8 ( 1 ) " " "*" ## educPrimary or less occup2Business or self-employed occup2Day labor ## 1 ( 1 ) " " " " " " ## 2 ( 1 ) " " " " " " ## 3 ( 1 ) " " " " "*" ## 4 ( 1 ) " " " " " " ## 5 ( 1 ) " " " " " " ## 6 ( 1 ) " " " " " " ## 7 ( 1 ) " " " " " " ## 8 ( 1 ) " " " " " " ## occup2Others occup2Service vdoses dose3 chroill_1Yes income1 wbscore ## 1 ( 1 ) " " " " " " " " " " " " "*" ## 2 ( 1 ) " " " " " " " " " " " " "*" ## 3 ( 1 ) " " " " " " " " " " " " "*" ## 4 ( 1 ) "*" "*" " " " " " " " " "*" ## 5 ( 1 ) "*" "*" " " " " " " "*" "*" ## 6 ( 1 ) "*" "*" " " " " " " "*" "*" ## 7 ( 1 ) "*" "*" " " " " "*" "*" "*" ## 8 ( 1 ) "*" "*" " " " " "*" "*" "*" ## relation2Others relation2Spouse sesh_15_7 sesh_15_5 acill_26Yes ## 1 ( 1 ) " " " " " " " " " " ## 2 ( 1 ) " " " " " " " " " " ## 3 ( 1 ) " " " " " " " " " " ## 4 ( 1 ) " " " " " " " " " " ## 5 ( 1 ) " " " " " " " " " " ## 6 ( 1 ) " " " " " " "*" " " ## 7 ( 1 ) " " " " " " "*" " " ## 8 ( 1 ) " " "*" " " "*" " " ``` --- class:middle ### Illustration with the "Covid_Bangladesh" data ``` r par(mfrow = c(1,2)) plot(reg_summary.2$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l") adj_r2_max.2 = which.max(reg_summary.2$adjr2) points(adj_r2_max.2, reg_summary.2$adjr2[adj_r2_max.2], col ="red", cex = 2, pch = 20) plot(reg_summary.2$bic, xlab = "Number of Variables", ylab = "BIC", type = "l") bic_min.2 = which.min(reg_summary.2$bic) points(bic_min.2, reg_summary.2$bic[bic_min.2], col = "red", cex = 2, pch = 20) ``` <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-19-1.png" width="60%" height="60%" style="display: block; margin: auto;" /> --- class:middle ### Illustration with the "Covid_Bangladesh" data ``` r plot(regfit_full.2, scale = "adjr2", main= "'adjr2' for 'Covid_Bangladesh' data") ``` <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-20-1.png" width="110%" height="50%" style="display: block; margin: auto;" /> --- class:middle ### Illustration with the "Covid_Bangladesh" data ``` r plot(regfit_full.2, scale = "bic", main= "'BIC' for 'Covid_Bangladesh' data") ``` <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-21-1.png" width="110%" height="50%" style="display: block; margin: auto;" /> --- class: middle ## What did we get from our "dream model"? | What dataset you used? | Write the equation of the model | |:-----------------:|:-------------------------------------------------------------| |Covid_Bangladesh| Mental score=B0+B1xCovid-19vaccine+B2xGender+B3xage + B4xarea + B5xEducation+ B6xChronic + B7xrelation_household+ B8xOccupation + B9xHousehold_size + B10xMigration + B11xAccess_TV + B12x Acces_smartphone.| |Covid_Bangladesh| log(pi) = beta0 + betaXi| |Covid_Bangladesh| "E (Yi∣Xi) = B0 + B1 x Exp+ B2 x Sex + B3 x Age+ B4 x Chronic Illness + B5 x Occupation + B6 x Income"| --- class:middle ### Illustration with the dataset used in 704 (3) ``` r glimpse(VL_Nigeria) ``` ``` ## Rows: 127,198 ## Columns: 30 ## $ S_No_ <int> 1, 2, 4, 5, 8, 9, 11, 12, 13, 15, 16,… ## $ State <chr> "Adamawa", "Adamawa", "Adamawa", "Ada… ## $ SEX <chr> "Female", "Male", "Male", "Male", "Fe… ## $ Age.group <chr> "20-29", "40+", "40+", "40+", "40+", … ## $ ART_Start <int> 2021, 2020, 2015, 2020, 2016, 2018, 2… ## $ Days_of_ARV_Refill <int> 180, 180, 180, 180, 180, 180, 180, 18… ## $ Current_Regimen_Line <chr> "Adult.1st.Line", "Adult.1st.Line", "… ## $ Current_ART_Regimen <chr> "TDF-3TC-DTG", "TDF-3TC-DTG", "TDF-3T… ## $ Functional_Status_at_last_Visit <chr> "Working", "Working", "Working", "Wor… ## $ Clinical_Staging_at_Last_Visit <chr> "Stage I", "Stage I", "Stage I", "Sta… ## $ VL.Result <dbl> NA, 38.1, 81.7, 419.0, 0.0, NA, 1617.… ## $ Date_of_current_viral_load__yyyy <int> NA, 2021, 2020, 2022, 2019, NA, 2021,… ## $ Viral_Load_Indication <chr> "", "Routine Monitoring", "Routine Mo… ## $ Current_ART_Status <chr> "Active", "Active", "Active", "Active… ## $ ART_enrollment_setting <chr> "Community", "Facility", "Facility", … ## $ Enhanced_Adherence_Counselling <chr> "No", "No", "No", "No", "No", "No", "… ## $ Date_of_commencement_of_EAC_yyyy <chr> "", "", "", "", "", "", "2021", "", "… ## $ Number_of_EAC_Sessions_Completed <int> NA, NA, NA, NA, NA, NA, 3, NA, 3, 3, … ## $ X <int> NA, NA, NA, NA, NA, NA, 2021, NA, 202… ## $ Repeat_Viral_Load___Post_EAC_VL <chr> "", "", "", "", "", "", "Yes", "", "Y… ## $ Date_of_Repeat_Viral_Load___Post <int> NA, NA, NA, NA, NA, NA, 2022, NA, 202… ## $ Repeated.VL <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 26.1,… ## $ X.1 <int> NA, NA, NA, NA, NA, NA, NA, NA, 2022,… ## $ TB_status_at_Last_Visit <chr> "No sign or symptoms of TB", "No sign… ## $ TB_Screening_Outcome <chr> "No sign or symptoms of TB", "No sign… ## $ exposure <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0… ## $ binSupress <dbl> NA, 1, 1, 1, 1, NA, 0, 1, 1, 1, 1, 1,… ## $ clinstage <chr> "Stage I", "Stage I", "Stage I", "Sta… ## $ funcstat <fct> Working, Working, Working, Working, W… ## $ log.outcome <dbl> NA, 3.640214, 4.403054, 6.037871, -In… ``` --- class:middle **Regression subset selection: Illustration with the "VL_Nigeria" data** ``` r library(leaps) #Regression subset selection package VL_Nigeria1 <- VL_Nigeria %>% select(binSupress, exposure, Age.group, SEX, clinstage, funcstat, TB_Screening_Outcome, ART_Start, Enhanced_Adherence_Counselling, ART_enrollment_setting, Viral_Load_Indication, Current_ART_Regimen) #Number_of_EAC_Sessions_Completed, Repeated.VL, *regfit_full.3 <- regsubsets(binSupress ~ ., data = VL_Nigeria1, method = "exhaustive") ``` ``` ## Reordering variables and trying again: ``` -- ``` r reg_summary.3 <- summary(regfit_full.3) summary(regfit_full.3) ``` ``` ## Subset selection object ## Call: regsubsets.formula(binSupress ~ ., data = VL_Nigeria1, method = "exhaustive") ## 44 Variables (and intercept) ## Forced in ## exposure FALSE ## Age.group10-19 FALSE ## Age.group19-20 FALSE ## Age.group20-29 FALSE ## Age.group30-39 FALSE ## Age.group40+ FALSE ## SEXMale FALSE ## clinstageStage II FALSE ## clinstageStage III FALSE ## clinstageStage IV FALSE ## funcstatBedridden FALSE ## funcstatAmbulatory FALSE ## TB_Screening_OutcomePresumptive TB FALSE ## ART_Start FALSE ## ART_enrollment_settingClinical Platforms (Community Pharmacy) FALSE ## ART_enrollment_settingClinical Platforms (Laboratories) FALSE ## ART_enrollment_settingClinical Platforms (PHCs/Private Clinics/Nursing Homes) FALSE ## ART_enrollment_settingClinical Platforms (TBAs) FALSE ## ART_enrollment_settingCommunity FALSE ## ART_enrollment_settingCommunity Based Organisation FALSE ## ART_enrollment_settingFacility FALSE ## ART_enrollment_settingOVC FALSE ## ART_enrollment_settingSelf Testing FALSE ## Viral_Load_IndicationTargeted Monitoring FALSE ## Current_ART_RegimenABC-3TC-DDI FALSE ## Current_ART_RegimenABC-3TC-DTG FALSE ## Current_ART_RegimenABC-3TC-EFV FALSE ## Current_ART_RegimenABC-3TC-LPV/r FALSE ## Current_ART_RegimenABC-3TC-NVP FALSE ## Current_ART_RegimenABC-3TC-pDTG FALSE ## Current_ART_RegimenAZT-3TC-ATV/r FALSE ## Current_ART_RegimenAZT-3TC-LPV/r FALSE ## Current_ART_RegimenAZT-3TC-NVP FALSE ## Current_ART_RegimenAZT-3TC-TDF FALSE ## Current_ART_RegimenAZT-TDF-3TC-LPV/r FALSE ## Current_ART_RegimenOther FALSE ## Current_ART_RegimenTDF-3TC-ATV/r FALSE ## Current_ART_RegimenTDF-3TC-DTG FALSE ## Current_ART_RegimenTDF-3TC-EFV FALSE ## Current_ART_RegimenTDF-3TC-LPV/r FALSE ## Current_ART_RegimenTDF-3TC-NVP FALSE ## Current_ART_RegimenTDF-FTC-ATV/r FALSE ## Current_ART_RegimenTDF-FTC-LPV/r FALSE ## Enhanced_Adherence_CounsellingYes FALSE ## Forced out ## exposure FALSE ## Age.group10-19 FALSE ## Age.group19-20 FALSE ## Age.group20-29 FALSE ## Age.group30-39 FALSE ## Age.group40+ FALSE ## SEXMale FALSE ## clinstageStage II FALSE ## clinstageStage III FALSE ## clinstageStage IV FALSE ## funcstatBedridden FALSE ## funcstatAmbulatory FALSE ## TB_Screening_OutcomePresumptive TB FALSE ## ART_Start FALSE ## ART_enrollment_settingClinical Platforms (Community Pharmacy) FALSE ## ART_enrollment_settingClinical Platforms (Laboratories) FALSE ## ART_enrollment_settingClinical Platforms (PHCs/Private Clinics/Nursing Homes) FALSE ## ART_enrollment_settingClinical Platforms (TBAs) FALSE ## ART_enrollment_settingCommunity FALSE ## ART_enrollment_settingCommunity Based Organisation FALSE ## ART_enrollment_settingFacility FALSE ## ART_enrollment_settingOVC FALSE ## ART_enrollment_settingSelf Testing FALSE ## Viral_Load_IndicationTargeted Monitoring FALSE ## Current_ART_RegimenABC-3TC-DDI FALSE ## Current_ART_RegimenABC-3TC-DTG FALSE ## Current_ART_RegimenABC-3TC-EFV FALSE ## Current_ART_RegimenABC-3TC-LPV/r FALSE ## Current_ART_RegimenABC-3TC-NVP FALSE ## Current_ART_RegimenABC-3TC-pDTG FALSE ## Current_ART_RegimenAZT-3TC-ATV/r FALSE ## Current_ART_RegimenAZT-3TC-LPV/r FALSE ## Current_ART_RegimenAZT-3TC-NVP FALSE ## Current_ART_RegimenAZT-3TC-TDF FALSE ## Current_ART_RegimenAZT-TDF-3TC-LPV/r FALSE ## Current_ART_RegimenOther FALSE ## Current_ART_RegimenTDF-3TC-ATV/r FALSE ## Current_ART_RegimenTDF-3TC-DTG FALSE ## Current_ART_RegimenTDF-3TC-EFV FALSE ## Current_ART_RegimenTDF-3TC-LPV/r FALSE ## Current_ART_RegimenTDF-3TC-NVP FALSE ## Current_ART_RegimenTDF-FTC-ATV/r FALSE ## Current_ART_RegimenTDF-FTC-LPV/r FALSE ## Enhanced_Adherence_CounsellingYes FALSE ## 1 subsets of each size up to 9 ## Selection Algorithm: exhaustive ## exposure Age.group10-19 Age.group19-20 Age.group20-29 Age.group30-39 ## 1 ( 1 ) "*" " " " " " " " " ## 2 ( 1 ) " " " " " " " " " " ## 3 ( 1 ) "*" " " " " " " " " ## 4 ( 1 ) "*" " " " " " " " " ## 5 ( 1 ) "*" " " " " " " " " ## 6 ( 1 ) "*" " " " " " " " " ## 7 ( 1 ) "*" " " " " " " " " ## 8 ( 1 ) " " " " " " " " " " ## 9 ( 1 ) " " " " " " " " " " ## Age.group40+ SEXMale clinstageStage II clinstageStage III ## 1 ( 1 ) " " " " " " " " ## 2 ( 1 ) " " " " " " " " ## 3 ( 1 ) " " " " " " " " ## 4 ( 1 ) " " " " " " " " ## 5 ( 1 ) " " " " " " " " ## 6 ( 1 ) "*" " " " " " " ## 7 ( 1 ) "*" " " "*" " " ## 8 ( 1 ) "*" " " "*" " " ## 9 ( 1 ) "*" " " "*" " " ## clinstageStage IV funcstatBedridden funcstatAmbulatory ## 1 ( 1 ) " " " " " " ## 2 ( 1 ) " " " " " " ## 3 ( 1 ) " " " " " " ## 4 ( 1 ) " " " " " " ## 5 ( 1 ) " " " " " " ## 6 ( 1 ) " " " " " " ## 7 ( 1 ) " " " " " " ## 8 ( 1 ) " " " " " " ## 9 ( 1 ) " " " " " " ## TB_Screening_OutcomePresumptive TB ART_Start ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) " " " " ## 4 ( 1 ) " " " " ## 5 ( 1 ) "*" " " ## 6 ( 1 ) "*" " " ## 7 ( 1 ) "*" " " ## 8 ( 1 ) "*" " " ## 9 ( 1 ) "*" " " ## Enhanced_Adherence_CounsellingYes ## 1 ( 1 ) " " ## 2 ( 1 ) "*" ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) "*" ## 9 ( 1 ) "*" ## ART_enrollment_settingClinical Platforms (Community Pharmacy) ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## ART_enrollment_settingClinical Platforms (Laboratories) ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## ART_enrollment_settingClinical Platforms (PHCs/Private Clinics/Nursing Homes) ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## ART_enrollment_settingClinical Platforms (TBAs) ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## ART_enrollment_settingCommunity ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) "*" ## 9 ( 1 ) "*" ## ART_enrollment_settingCommunity Based Organisation ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## ART_enrollment_settingFacility ART_enrollment_settingOVC ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) "*" " " ## 4 ( 1 ) "*" " " ## 5 ( 1 ) "*" " " ## 6 ( 1 ) "*" " " ## 7 ( 1 ) "*" " " ## 8 ( 1 ) "*" " " ## 9 ( 1 ) "*" " " ## ART_enrollment_settingSelf Testing ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## Viral_Load_IndicationTargeted Monitoring ## 1 ( 1 ) " " ## 2 ( 1 ) "*" ## 3 ( 1 ) "*" ## 4 ( 1 ) "*" ## 5 ( 1 ) "*" ## 6 ( 1 ) "*" ## 7 ( 1 ) "*" ## 8 ( 1 ) "*" ## 9 ( 1 ) "*" ## Current_ART_RegimenABC-3TC-DDI Current_ART_RegimenABC-3TC-DTG ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) " " " " ## 4 ( 1 ) " " " " ## 5 ( 1 ) " " " " ## 6 ( 1 ) " " " " ## 7 ( 1 ) " " " " ## 8 ( 1 ) " " " " ## 9 ( 1 ) " " " " ## Current_ART_RegimenABC-3TC-EFV Current_ART_RegimenABC-3TC-LPV/r ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) " " " " ## 4 ( 1 ) " " " " ## 5 ( 1 ) " " " " ## 6 ( 1 ) " " " " ## 7 ( 1 ) " " " " ## 8 ( 1 ) " " " " ## 9 ( 1 ) " " " " ## Current_ART_RegimenABC-3TC-NVP Current_ART_RegimenABC-3TC-pDTG ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) " " " " ## 4 ( 1 ) " " " " ## 5 ( 1 ) " " " " ## 6 ( 1 ) " " " " ## 7 ( 1 ) " " " " ## 8 ( 1 ) " " " " ## 9 ( 1 ) " " " " ## Current_ART_RegimenAZT-3TC-ATV/r Current_ART_RegimenAZT-3TC-LPV/r ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) " " " " ## 4 ( 1 ) " " " " ## 5 ( 1 ) " " " " ## 6 ( 1 ) " " " " ## 7 ( 1 ) " " " " ## 8 ( 1 ) " " " " ## 9 ( 1 ) " " " " ## Current_ART_RegimenAZT-3TC-NVP Current_ART_RegimenAZT-3TC-TDF ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) " " " " ## 4 ( 1 ) " " " " ## 5 ( 1 ) " " " " ## 6 ( 1 ) " " " " ## 7 ( 1 ) " " " " ## 8 ( 1 ) " " " " ## 9 ( 1 ) " " " " ## Current_ART_RegimenAZT-TDF-3TC-LPV/r Current_ART_RegimenOther ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) " " " " ## 4 ( 1 ) " " " " ## 5 ( 1 ) " " " " ## 6 ( 1 ) " " " " ## 7 ( 1 ) " " " " ## 8 ( 1 ) " " " " ## 9 ( 1 ) " " " " ## Current_ART_RegimenTDF-3TC-ATV/r Current_ART_RegimenTDF-3TC-DTG ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) " " " " ## 4 ( 1 ) " " "*" ## 5 ( 1 ) " " "*" ## 6 ( 1 ) " " "*" ## 7 ( 1 ) " " "*" ## 8 ( 1 ) " " "*" ## 9 ( 1 ) "*" "*" ## Current_ART_RegimenTDF-3TC-EFV Current_ART_RegimenTDF-3TC-LPV/r ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) " " " " ## 4 ( 1 ) " " " " ## 5 ( 1 ) " " " " ## 6 ( 1 ) " " " " ## 7 ( 1 ) " " " " ## 8 ( 1 ) " " " " ## 9 ( 1 ) " " " " ## Current_ART_RegimenTDF-3TC-NVP Current_ART_RegimenTDF-FTC-ATV/r ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) " " " " ## 4 ( 1 ) " " " " ## 5 ( 1 ) " " " " ## 6 ( 1 ) " " " " ## 7 ( 1 ) " " " " ## 8 ( 1 ) " " " " ## 9 ( 1 ) " " " " ## Current_ART_RegimenTDF-FTC-LPV/r ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ``` --- class:middle ### Illustration with the "VL_Nigeria" data ``` r par(mfrow = c(1,2)) plot(reg_summary.3$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l") adj_r2_max.3 = which.max(reg_summary.3$adjr2) points(adj_r2_max.3, reg_summary.3$adjr2[adj_r2_max.3], col ="red", cex = 2, pch = 20) plot(reg_summary.3$bic, xlab = "Number of Variables", ylab = "BIC", type = "l") bic_min.3 = which.min(reg_summary.3$bic) points(bic_min.3, reg_summary.3$bic[bic_min.3], col = "red", cex = 2, pch = 20) ``` <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-26-1.png" width="60%" height="60%" style="display: block; margin: auto;" /> --- class:middle ### Illustration with the "VL_Nigeria" data ``` r plot(regfit_full.3, scale = "adjr2", main= "'adjr2' for 'VL_Nigeria' data") ``` <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-27-1.png" width="120%" height="100%" style="display: block; margin: auto;" /> --- class:middle ### Illustration with the "VL_Nigeria" data ``` r plot(regfit_full.3, scale = "bic", main= "'BIC' for 'VL_Nigeria' data") ``` <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-28-1.png" width="120%" height="100%" style="display: block; margin: auto;" /> --- class: middle ## What did we get from our "dream model"? | What dataset you used? | Write the equation of the model | |:-----------------:|:-------------------------------------------------------------| |VL_Nigeria | PS Model (Weight Generation): l𝑜𝑔𝑖𝑡(𝑃 (𝐸𝑥𝑝𝑜𝑠𝑢𝑟𝑒 = 1)) = 𝛽0 +𝛽1(𝐴𝑔𝑒.𝑔𝑟𝑜𝑢𝑝) + 𝛽2(𝑆𝐸𝑋) + 𝛽3(𝐴𝑅𝑇_𝑆𝑡𝑎𝑟𝑡) + 𝛽4(𝐶𝑙𝑖𝑛𝑖𝑐𝑎𝑙_𝑆𝑡𝑎𝑔𝑖𝑛𝑔) + 𝛽5(𝑇𝐵_𝑆𝑐𝑟𝑒𝑒𝑛𝑖𝑛𝑔_𝑂𝑢𝑡𝑐𝑜𝑚𝑒) | |VL_Nigeria | log((P(Outcome=1∣Exposure)/P(Outcome=0∣Exposure))= log(Odds)= logit(P)=β0+β1xExposure; log((P(binSupress=1|enhanced_adherence_counselling/P(binSupress=0|enahanced_adherence_counselling)) =log (Odds) = β0+ β1xEnhanced_adherence_counselling + β2xsex +β3xage_group + β4xtb_screening_outcome + β5xclinical_staging_at_last_visit +β6xart_start | |VL_Nigeria | `\(ViralLoadSupression = \beta_0 + \beta_{EAC}X_{EAC} + \beta_{age_2}X_{age2} + \beta_{age_3}X_{age3} + \beta_{age_4}X_{age4} + \beta_{Male}X_{Male} + \beta_{AdvancedStage}X_{AdvancedStage} + \beta_{Clinical}X_{Clinical} + \beta_{Year}X_{Year} + \beta_{TB}X_{TB} + \epsilon\)` | --- class: middle ### Illustration of selection with the 'VL_Nigeria' dataset Subset selection object `\(\to\)` automatically selects the "best" sets: regsubsets.formula(binSupress ~ ., data = VL_Nigeria, method = "exhaustive") - .blue[44 Variables/ parameter (and intercept) with 1 subsets of each size up to 9 variables] -- - Based on this, we would probably choose to include a model with .red[3-8 variables]: - Likely exposure, age, SAB, clinstage, funcstat, TB_Screening_Outcome, Current_ART_Regimen depending on the choice of metrics... -- ## .red[Do you trust this approach?] .purple[Any concerns??] --- class:middle ### Even the _"smartest"_ software, program or technology will need, .red[until now], a bit of guidance... **Selecting a pool of variables to choose from, may help** ``` r VL_Nigeria2 <- VL_Nigeria %>% select(binSupress, exposure, Age.group, SEX, clinstage, funcstat, TB_Screening_Outcome, ART_Start, Enhanced_Adherence_Counselling, ART_enrollment_setting) #removing Current_ART_Regimen, State ``` -- **Then re-formulating the selection** ``` r regfit_full.4 <- regsubsets(binSupress ~ ., * data = VL_Nigeria2, method = "exhaustive") ``` ``` ## Reordering variables and trying again: ``` -- ``` r reg_summary.4 <- summary(regfit_full.4) ``` --- class: middle ### Illustration of selection with the "VL_Nigeria" data From the selected covariates, there are .blue[9 variables and 24 parameters] to be estimated. - The package proposes .red[9 models with several combinations] of the parameters/covariates. -- ``` ## Subset selection object ## Call: regsubsets.formula(binSupress ~ ., data = VL_Nigeria2, method = "exhaustive") ## 24 Variables (and intercept) ## Forced in ## exposure FALSE ## Age.group10-19 FALSE ## Age.group19-20 FALSE ## Age.group20-29 FALSE ## Age.group30-39 FALSE ## Age.group40+ FALSE ## SEXMale FALSE ## clinstageStage II FALSE ## clinstageStage III FALSE ## clinstageStage IV FALSE ## funcstatBedridden FALSE ## funcstatAmbulatory FALSE ## TB_Screening_OutcomePresumptive TB FALSE ## ART_Start FALSE ## ART_enrollment_settingClinical Platforms (Community Pharmacy) FALSE ## ART_enrollment_settingClinical Platforms (Laboratories) FALSE ## ART_enrollment_settingClinical Platforms (PHCs/Private Clinics/Nursing Homes) FALSE ## ART_enrollment_settingClinical Platforms (TBAs) FALSE ## ART_enrollment_settingCommunity FALSE ## ART_enrollment_settingCommunity Based Organisation FALSE ## ART_enrollment_settingFacility FALSE ## ART_enrollment_settingOVC FALSE ## ART_enrollment_settingSelf Testing FALSE ## Enhanced_Adherence_CounsellingYes FALSE ## Forced out ## exposure FALSE ## Age.group10-19 FALSE ## Age.group19-20 FALSE ## Age.group20-29 FALSE ## Age.group30-39 FALSE ## Age.group40+ FALSE ## SEXMale FALSE ## clinstageStage II FALSE ## clinstageStage III FALSE ## clinstageStage IV FALSE ## funcstatBedridden FALSE ## funcstatAmbulatory FALSE ## TB_Screening_OutcomePresumptive TB FALSE ## ART_Start FALSE ## ART_enrollment_settingClinical Platforms (Community Pharmacy) FALSE ## ART_enrollment_settingClinical Platforms (Laboratories) FALSE ## ART_enrollment_settingClinical Platforms (PHCs/Private Clinics/Nursing Homes) FALSE ## ART_enrollment_settingClinical Platforms (TBAs) FALSE ## ART_enrollment_settingCommunity FALSE ## ART_enrollment_settingCommunity Based Organisation FALSE ## ART_enrollment_settingFacility FALSE ## ART_enrollment_settingOVC FALSE ## ART_enrollment_settingSelf Testing FALSE ## Enhanced_Adherence_CounsellingYes FALSE ## 1 subsets of each size up to 9 ## Selection Algorithm: exhaustive ## exposure Age.group10-19 Age.group19-20 Age.group20-29 Age.group30-39 ## 1 ( 1 ) " " " " " " " " " " ## 2 ( 1 ) "*" " " " " " " " " ## 3 ( 1 ) "*" " " " " " " " " ## 4 ( 1 ) "*" " " " " " " " " ## 5 ( 1 ) "*" " " " " " " " " ## 6 ( 1 ) "*" "*" " " " " " " ## 7 ( 1 ) "*" "*" " " " " " " ## 8 ( 1 ) "*" "*" " " " " " " ## 9 ( 1 ) "*" "*" " " " " " " ## Age.group40+ SEXMale clinstageStage II clinstageStage III ## 1 ( 1 ) " " " " " " " " ## 2 ( 1 ) " " " " " " " " ## 3 ( 1 ) "*" " " " " " " ## 4 ( 1 ) "*" " " " " " " ## 5 ( 1 ) "*" " " " " " " ## 6 ( 1 ) "*" " " " " " " ## 7 ( 1 ) "*" " " " " " " ## 8 ( 1 ) "*" " " " " " " ## 9 ( 1 ) "*" " " "*" " " ## clinstageStage IV funcstatBedridden funcstatAmbulatory ## 1 ( 1 ) " " " " " " ## 2 ( 1 ) " " " " " " ## 3 ( 1 ) " " " " " " ## 4 ( 1 ) " " " " " " ## 5 ( 1 ) " " " " " " ## 6 ( 1 ) " " " " " " ## 7 ( 1 ) " " " " " " ## 8 ( 1 ) "*" " " " " ## 9 ( 1 ) "*" " " " " ## TB_Screening_OutcomePresumptive TB ART_Start ## 1 ( 1 ) " " " " ## 2 ( 1 ) " " " " ## 3 ( 1 ) " " " " ## 4 ( 1 ) " " "*" ## 5 ( 1 ) " " "*" ## 6 ( 1 ) " " "*" ## 7 ( 1 ) "*" "*" ## 8 ( 1 ) "*" "*" ## 9 ( 1 ) "*" "*" ## Enhanced_Adherence_CounsellingYes ## 1 ( 1 ) "*" ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## ART_enrollment_settingClinical Platforms (Community Pharmacy) ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## ART_enrollment_settingClinical Platforms (Laboratories) ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## ART_enrollment_settingClinical Platforms (PHCs/Private Clinics/Nursing Homes) ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## ART_enrollment_settingClinical Platforms (TBAs) ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## ART_enrollment_settingCommunity ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) "*" ## 6 ( 1 ) "*" ## 7 ( 1 ) "*" ## 8 ( 1 ) "*" ## 9 ( 1 ) "*" ## ART_enrollment_settingCommunity Based Organisation ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ## ART_enrollment_settingFacility ART_enrollment_settingOVC ## 1 ( 1 ) " " " " ## 2 ( 1 ) "*" " " ## 3 ( 1 ) "*" " " ## 4 ( 1 ) "*" " " ## 5 ( 1 ) "*" " " ## 6 ( 1 ) "*" " " ## 7 ( 1 ) "*" " " ## 8 ( 1 ) "*" " " ## 9 ( 1 ) "*" " " ## ART_enrollment_settingSelf Testing ## 1 ( 1 ) " " ## 2 ( 1 ) " " ## 3 ( 1 ) " " ## 4 ( 1 ) " " ## 5 ( 1 ) " " ## 6 ( 1 ) " " ## 7 ( 1 ) " " ## 8 ( 1 ) " " ## 9 ( 1 ) " " ``` --- class: middle **Illustration of selection with the "VL_Nigeria" data** <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-33-1.png" width="100%" height="60%" style="display: block; margin: auto;" /> --- class:middle <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-34-1.png" width="120%" height="60%" style="display: block; margin: auto;" /> --- class:middle <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-35-1.png" width="120%" style="display: block; margin: auto;" /> --- class: middle ### Other Subset Selection ``` r #Backwards selection regsubsets(outcome ~ ., data = VL_Nigeria, method = "backward") #Forwards selection regsubsets(outcome ~ ., data = VL_Nigeria, method = "forward") #Stepwise selection regsubsets(outcome ~ ., data = cVL_Nigeria, method = "seqrep") ``` <br> **.red[Still any decision should be made with caution!]** --- class:middle ## Selection Strategies: .red[Caution!] - Different model fit metrics (e.g., `\(R^2\)`, AIC, Mallow's Cp) can show different models as the "best". <br> -- <br> - Can be a recipe for a type I error (chance significant findings) .red[recall Ioannidis paper?] - Particularly when the number of candidate predictors is large relative to the sample size. <br> -- <br> - Automated selection procedures make it easy for researchers to ignore good practices for causal inference, like .blue[choosing variables based on a DAG]. <br> -- <br> - High .purple[risk of overfitting] to your data set. - Selected variables may strongly discriminate among individuals in your data set, but may have less ability to do so on other data. --- class: middle ##Overall framework for model specification: 1) Variable specification: > .blue[ - What is the universe of variables I would consider? - There is a limited number available, some of the ones we would need, will not be. - We need to think this through and choose. - Leaving variables out is a strong assumption about that not having an effect ( `\(\beta = 0\)` ) ] 2) Interaction assessment (including assessment of heterogeneity) 3) Confounding assessment (including consideration of precision) --- class: middle ## What did we get from our "dream model"? <img src="images/john_travolta.gif" width="65%" style="display: block; margin: auto;" /> -- ## .red[Where are the interactions at????] --- class: middle ## Selection and Specification > _"The approach is guided by several principles, including the adherence to a hierarchically defined initial (full) model, and a backward elimination strategy._ <br> -- <br> **Corollaries** - Collinearity, correlation (other hierarchical structures) - Moderate or correct for your own Degrees of Freedom - Multiple testing and Bonferroni corrections --- class: middle ## Summary - **.red[Prediction and inference are different types of problems]**. - If the goal is causal inference, we must think about confounding .red[(and other biases!)] and model interpretation is important. -- <br> - If the goal is prediction, model interpretation can be less important, and the choice of predictor variables is driven more by the data we have available and model evaluation metrics. -- <br> - Model selection strategies can be used for both, but much more care must be taken if you choose to use them for an inference question --- class:inverse, center, middle ##“All models are wrong, but some are useful” — George Box (1919-2013) --- class: middle ## What did we get from our "dream model"? .pull-left[ <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-39-1.png" width="125%" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-40-1.png" width="125%" style="display: block; margin: auto;" /> ] --- class:inverse, center, middle ### Read it again: ## “_All models are wrong, but some are useful_” — George Box (1919-2013) --- class: middle ### QUESTIONS? ## COMMENTS? # RECOMMENDATIONS? --- class: middle <img src="images/L25tweet.png" width="65%" style="display: block; margin: auto;" /> [Endogeneity tweet](https://twitter.com/BearCurd/status/1596590553323298816) --- class: middle ## Resources: - Greenland et al. 2016. "Outcome modelling strategies in epidemiology: traditional methods and basic alternatives." Int. J. Epidemiol. - Steyerberg, Ewout W. 2019. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer, Cham. - Harrell, Frank E., Jr. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer Series in Statistics. Cham: Springer International Publishing. --- class: middle ### Causal Inference and Prediction: Different Goals! - **.blue[Causal Inference]**: estimating the effect of a variable on an outcome - Usually adjusted for confounding - Want a model with good confounder selection, determined via a DAG and substantive knowledge - **.purple[Prediction]**: predicting a future outcome using a set of covariates - We want predictions that are close to the actual value, but avoid overfitting - May prioritize measurable predictors over strong predictors - **.red[The two goals do NOT require the same model selection approach!]** --- class:middle ## Prediction?? - **.red[Prediction]** aims to anticipate some future outcome using a set of covariates - Prediction models are typically built using data from an existing group of individuals, with the goal of being able to predict the outcome for future individuals. - Here, we are concerned with getting the best model that gives "optimal" predictions for future subjects. - A good predictive model doesn't necessarily tell you anything useful about how to intervene to change the outcome! - Age, sex, and prior hospitalization may be good predictors of heart failure, but we wouldn't stop admitting people to the hospital to prevent heart failure --- class:middle ## Miscelaneous --- class: middle ### Goodness-of-Fit Statistics and model metrics - **R-squared** `\(R^2\)` : % of variation in `Y` explained by `"predictors"` variables. The higher `\(R^2\)`, the better the model. - **Root Mean Squared Error (RMSE)**: the average error performed by the model in predicting the outcome for an observation. RMSE = sqrt(MSE); The `\(\uparrow\)` the RMSE, the better the model. - **Residual Standard Error (RSE):** the model sigma, a variant of RMSE adjusted # predictors in the model. The `\(\downarrow\)` RSE, the better the model. - **Mean Absolute Error (MAE)**, measures the prediction error. MAE = mean(abs(observed- predicted)). Less sensitive to outliers compared to RMSE. --- class: middle ### Goodness-of-Fit Statistics and model metrics - **AIC:** (Akaike’s Information Criteria) penalizes the inclusion of additional variables to a model. - **AICc:** is a version of AIC corrected for small sample sizes. - **BIC:** (or Bayesian information criteria) is a variant of AIC with a stronger penalty. - **Mallows Cp:** A variant of AIC developed by Colin Mallows. - **WAIC:** Widely (Watanabe) Application Criterion (Bayesian) - **PSIS:** Pareto-Smoothed Importance Sampling [(Bayesian)](https://mouse-imaging-centre.github.io/blog/post/2018-01-31_loo-intro/) - **DIC:** Deviance Information Criterion (Bayesian) --- class: middle ## Evaluating Prediction Models - The model fit in the **training data** (i.e., the data used to develop a predictive model) is not our primary interest - What we are primarily interested in is the accuracy of our model predictions **when our model is applied to new data** that was not used as part of the model development process (i.e., **test data**) - **.red[A good model fit in the training data doesn't necessarily ensure a good ability to predict using other data]**. --- class: middle ###Paramaterization, Trends, Dose-Response? **<span style="color:darkmagenta"> Recall this?</span>**
<caption class='gt_caption'><span class='gt_from_md'><strong>Table 1. Summary of Study Characteristics</strong></span></caption>
Characteristic
N = 355
1
covid_status
55 (16%)
adm_agemons
Median (Q1, Q3)
30 (15, 63)
muac_average
Median (Q1, Q3)
14.70 (13.40, 15.70)
caregiver_educ1
None
21 (5.9%)
Primary
216 (61%)
Secondary
87 (25%)
Above secondary
30 (8.5%)
nutrition
malnutrition
123 (35%)
nutrition
230 (65%)
1
Median (IQR) or Frequency (%); ‘adm_agemons’ is age in months; ‘muac_average’ is measured in cms; ‘nutrition’, and ‘muac’ are measures of nutritional status; ‘caregiver_edu’ is education level of caregiver
[Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle ###Paramaterization, Trends, Dose-Response? Let's consider Education of the caregiver and nutritional status measured by Mid Upper Arm Circumference (MUAC) in cm, as the continuous outcome. .pull-left[ <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-43-1.png" width="80%" style="display: block; margin: auto;" /> ] -- .pull-right[ ``` ## # A tibble: 5 × 2 ## caregiver_educ_n m.muac ## <dbl> <dbl> ## 1 0 15.2 ## 2 1 14.6 ## 3 2 14.7 ## 4 3 15.2 ## 5 NA 14.4 ``` ] [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle ###Paramaterization, Trends, Dose-Response? How can we assess their relationship? **Using a continuous variable <span style="color:red">assumptions?</span>** ``` r muac1<- glm(muac_average ~ caregiver_educ_n, data = L25data) round(j_summ(muac1, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% t val. p ## (Intercept) 14.67 14.11 15.22 52.15 0.00 ## caregiver_educ_n -0.03 -0.39 0.33 -0.18 0.86 ``` -- **Using a categorical version of the variable <span style="color:red">assumptions?</span>** ``` r muac2<- glm(muac_average ~ factor(caregiver_educ_n), data = L25data) round(j_summ(muac2, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% t val. p ## (Intercept) 15.23 14.18 16.29 28.23 0.00 ## factor(caregiver_educ_n)1 -0.65 -1.76 0.46 -1.15 0.25 ## factor(caregiver_educ_n)2 -0.79 -1.97 0.39 -1.31 0.19 ## factor(caregiver_educ_n)3 -0.24 -1.62 1.14 -0.35 0.73 ``` [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle ###Paramaterization, Trends, Dose-Response? Let's consider COVID-19 status (disease yes/no) and Education, can we assess their relationship? .pull-left[ <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-47-1.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-48-1.png" width="80%" style="display: block; margin: auto;" /> ] [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle ###Paramaterization, Trends, Dose-Response? **Dichotomous Outcome** .pull-left[ ``` r muac3<- glm(covid_status ~ caregiver_educ_n, data = L25data, family= binomial(link = "logit")) round(j_summ(muac3, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% z val. p ## (Intercept) -1.44 -2.05 -0.83 -4.64 0.00 ## caregiver_educ_n -0.18 -0.60 0.24 -0.85 0.39 ``` ``` r muac4<- glm(covid_status ~ factor(caregiver_educ_n), data = L25data, family= binomial(link = "logit")) round(j_summ(muac4, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% z val. p ## (Intercept) -1.16 -2.17 -0.16 -2.27 0.02 ## factor(caregiver_educ_n)1 -0.47 -1.54 0.59 -0.87 0.38 ## factor(caregiver_educ_n)2 -0.85 -2.05 0.35 -1.39 0.16 ## factor(caregiver_educ_n)3 -0.41 -1.80 0.99 -0.57 0.57 ``` ] -- .pull-right[ <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-51-1.png" width="80%" style="display: block; margin: auto;" /> ] [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle ###Paramaterization, Trends, Dose-Response? What happens when the referent group has a very small sample size? - OR estimates are all imprecise <span style="color:darkred">(since the imprecision for the referent group is propogated)</span> - One way are this is to use incremental coding of the dummy variables ``` r L25data$edu1<- L25data$edu2<-L25data$edu3<-0 L25data$edu1[L25data$caregiver_educ_n>=1]<- 1 L25data$edu2[L25data$caregiver_educ_n>=2]<- 1 L25data$edu3[L25data$caregiver_educ_n>=3]<- 1 L25data$edu1[is.na(L25data$caregiver_educ_n)==T]<- NA L25data$edu2[is.na(L25data$caregiver_educ_n)==T]<- NA L25data$edu3[is.na(L25data$caregiver_educ_n)==T]<- NA ``` [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) --- class: middle **Paramaterization, Trends, Dose-Response?** .pull-left[ **Incremental categories*** ``` r *muac5<- glm(covid_status ~ edu1+edu2+edu3, data = L25data, family= binomial(link = "logit")) round(j_summ(muac5, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% z val. p ## (Intercept) -1.16 -2.17 -0.16 -2.27 0.02 ## edu1 -0.47 -1.54 0.59 -0.87 0.38 ## edu2 -0.38 -1.13 0.38 -0.98 0.33 ## edu3 0.45 -0.72 1.61 0.75 0.45 ``` **OR what we usually do...** ``` r muac6<- glm(covid_status ~ caregiver_educ1, data = L25data, family= binomial(link = "logit")) round(j_summ(muac6, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% z val. p ## (Intercept) -1.16 -2.17 -0.16 -2.27 0.02 ## caregiver_educ1Primary -0.47 -1.54 0.59 -0.87 0.38 ## caregiver_educ1Secondary -0.85 -2.05 0.35 -1.39 0.16 ## caregiver_educ1Above secondary -0.41 -1.80 0.99 -0.57 0.57 ``` ] -- .pull-right[ <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-55-1.png" width="80%" style="display: block; margin: auto;" /> ] .small[ *Incremental means, that the reference is the immediately previous category/group [Clinical epidemiology of COVID-19 among hospitalized children in rural western Kenya](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0002011#abstract0) ] --- class: middle **Splines examples!** .pull-left[ - `Y`= muac_average (MUAC in cms); - `X`= hfaz (Height for age, Z-score) <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-56-1.png" width="80%" style="display: block; margin: auto;" /><img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-56-2.png" width="80%" style="display: block; margin: auto;" /> ] -- .pull-right[ ``` r require(splines) lmmod6.11a<- glm(muac_average~ breast_feeding1 + adm_sex + age_cat_new+ * ns(hfaz, df=2), data = covidkenya) #round(j_summ(lmmod6.11a, confint = T)$coeftable, 2) lmmod6.11b<- glm(muac_average~ breast_feeding1 + adm_sex + age_cat_new + * ns(hfaz, df=5), data = covidkenya) round(j_summ(lmmod6.11b, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% t val. p ## (Intercept) 12.95 9.55 16.35 7.47 0.00 ## breast_feeding1nobreastfeed -0.08 -0.74 0.59 -0.22 0.83 ## adm_sexMale 0.08 -0.34 0.49 0.36 0.72 ## age_cat_new>=5 years 3.99 3.01 4.96 8.02 0.00 ## age_cat_new12-23 months 1.58 0.66 2.50 3.38 0.00 ## age_cat_new2-5 years 2.53 1.59 3.46 5.31 0.00 ## age_cat_new6-11 months 1.24 0.26 2.22 2.48 0.01 ## ns(hfaz, df = 5)1 -0.64 -3.91 2.63 -0.38 0.70 ## ns(hfaz, df = 5)2 0.03 -3.32 3.39 0.02 0.98 ## ns(hfaz, df = 5)3 3.32 0.24 6.40 2.11 0.04 ## ns(hfaz, df = 5)4 -2.67 -9.94 4.60 -0.72 0.47 ## ns(hfaz, df = 5)5 7.34 3.22 11.46 3.49 0.00 ``` ``` r #attr(terms(lmmod6.11b), "predvars") ``` ] --- class: middle **Last Splines examples!** .pull-left[ - `Y`= muac_average (MUAC in cms); - `X`= adm_agemons (Age in months) <img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-58-1.png" width="80%" style="display: block; margin: auto;" /><img src="L22_EPIB704_Model_Building_25_files/figure-html/unnamed-chunk-58-2.png" width="80%" style="display: block; margin: auto;" /> ] -- .pull-right[ ``` r lmmod6.12a<- glm(muac_average~ breast_feeding1 + adm_sex + * ns(adm_agemons, df=5), data = covidkenya) #round(j_summ(lmmod6.12a, confint = T)$coeftable, 2) lmmod6.12b<- glm(muac_average~ breast_feeding1 + adm_sex + * poly(adm_agemons, 5, raw=T), data = covidkenya) round(j_summ(lmmod6.12b, confint = T)$coeftable, 2) ``` ``` ## Est. 2.5% 97.5% t val. p ## (Intercept) 11.70 10.75 12.66 23.96 0.00 ## breast_feeding1nobreastfeed -0.29 -0.97 0.38 -0.85 0.40 ## adm_sexMale 0.13 -0.30 0.56 0.59 0.56 ## poly(adm_agemons, 5, raw = T)1 0.21 0.09 0.34 3.31 0.00 ## poly(adm_agemons, 5, raw = T)2 0.00 -0.01 0.00 -1.90 0.06 ## poly(adm_agemons, 5, raw = T)3 0.00 0.00 0.00 1.11 0.27 ## poly(adm_agemons, 5, raw = T)4 0.00 0.00 0.00 -0.52 0.61 ## poly(adm_agemons, 5, raw = T)5 0.00 0.00 0.00 0.08 0.94 ``` ``` r #attr(terms(lmmod6.12b), "predvars") ``` ]