In this short analysis, we compare the prediction accuracy of several linear regression methods in the four simulation examples of Zou & Hastie (2005). The six methods compared are:
ridge regression;
the Lasso;
the Elastic Net;
Sum of Single Effects regression (SuSiE), described here;
variational inference for Bayesian variable selection, or “varbvs”, described here; and
“varbvsmix”, an elaboration of varbvs that replaces the single normal prior with a mixture-of-normals.
See here for setup instructions. A webpage was generated from this source code by running rmarkdown::render("linreg.Rmd") in the “analysis” directory.
Load a few packages and custom functions used in the analysis below.
library(dscrutils)
library(ggplot2)
library(cowplot)
source("../code/plots.R")
Here we use function “dscquery” from the dscrutils package to extract the DSC results we are interested in—the mean squared error in the predictions from each method and in each simulation scenario. After this call, the “dsc” data frame should contain results for 480 pipelines—6 methods times 4 scenarios times 20 data sets simulated in each scenario.
methods <- c("ridge","lasso","elastic_net","susie","varbvs","varbvsmix")
dsc <- dscquery("../dsc/linreg",c("simulate.scenario","fit","mse.err"),
                verbose = FALSE)
dsc <- transform(dsc,fit = factor(fit,methods))
nrow(dsc)
# [1] 480
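The row count alone does not confirm that every method was run on every data set, so it can be worth checking the counts per method and scenario as well. The sketch below shows one way to do this. It uses a toy data frame (`toy`, a hypothetical stand-in with the same three columns as “dsc” plus an illustrative `replicate` column) filled with placeholder errors, so that it runs without the DSC output; with real results, the same `table` call would be applied to “dsc”.

```r
# Toy stand-in for the "dsc" data frame: 6 methods x 4 scenarios x 20
# data sets. The errors are random placeholders, NOT real results.
methods <- c("ridge","lasso","elastic_net","susie","varbvs","varbvsmix")
toy <- expand.grid(replicate = 1:20,
                   simulate.scenario = 1:4,
                   fit = methods,
                   stringsAsFactors = FALSE)
set.seed(1)
toy$mse.err <- runif(nrow(toy))
toy <- transform(toy,fit = factor(fit,methods))

# Count the pipelines in each method-by-scenario cell. A complete,
# balanced run should have exactly 20 in every cell.
counts <- with(toy,table(fit,simulate.scenario))
print(counts)
stopifnot(nrow(toy) == 480,all(counts == 20))
```

If a cell has fewer than 20 entries, some pipelines probably failed or have not finished running.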
Note that you will need to run the DSC before querying the results; see here for instructions on running the DSC. If you did not run the DSC to generate these results, you can replace the dscquery call above with this line, which loads the pre-extracted results stored in a CSV file:
dsc <- read.csv("../output/linreg_mse.csv")
This is how the CSV file was created:
write.csv(dsc,"../output/linreg_mse.csv",row.names = FALSE,quote = FALSE)
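One detail to keep in mind with the CSV route: a CSV file does not store the order of factor levels. After read.csv, the “fit” column comes back either as a character vector or as a factor with alphabetically sorted levels, depending on the R version and the stringsAsFactors setting. The method ordering therefore needs to be applied again after loading. The sketch below illustrates this with a small toy data frame of placeholder values written to a temporary file; the names `toy`, `toy2` and `csv.file` are illustrative only.

```r
# Toy data frame with the same columns as "dsc". The mse.err values
# are placeholders, NOT real results.
methods <- c("ridge","lasso","elastic_net","susie","varbvs","varbvsmix")
toy <- data.frame(simulate.scenario = rep(1:2,each = 6),
                  fit = factor(rep(methods,2),methods),
                  mse.err = seq(1,12))

# Write to a temporary CSV file, using the same arguments as above.
csv.file <- tempfile(fileext = ".csv")
write.csv(toy,csv.file,row.names = FALSE,quote = FALSE)

# Read the file back. The original level order is not stored in the
# file, so we restore it explicitly.
toy2 <- read.csv(csv.file,stringsAsFactors = FALSE)
toy2 <- transform(toy2,fit = factor(fit,methods))
stopifnot(identical(levels(toy2$fit),methods))
```

Restoring the level order keeps the methods in the same left-to-right order in any plots made from the loaded data.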
The boxplots summarize the prediction errors (MSE) of each method in each of the four simulation scenarios.
p1 <- mse.boxplot(subset(dsc,simulate.scenario == 1)) + ggtitle("Scenario 1")
p2 <- mse.boxplot(subset(dsc,simulate.scenario == 2)) + ggtitle("Scenario 2")
p3 <- mse.boxplot(subset(dsc,simulate.scenario == 3)) + ggtitle("Scenario 3")
p4 <- mse.boxplot(subset(dsc,simulate.scenario == 4)) + ggtitle("Scenario 4")
plot_grid(p1,p2,p3,p4)
Here are a few initial (i.e., imprecise) impressions from these plots.
In most cases, the Elastic Net does at least as well as, or better than, the Lasso, which is what we would expect.
Ridge regression actually achieves excellent accuracy in all cases except Scenario 4. Ridge regression is expected to do less well in Scenario 4 because the majority of the true coefficients are zero, so a sparse model would be favoured.
In Scenario 4, where the predictors are correlated in a structured way and the effects are sparse, varbvs and varbvsmix perform better than the other methods.
varbvsmix yields competitive predictions in all four scenarios.
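To make impressions like these more precise, the boxplots could be complemented by a numerical summary, for example the median MSE of each method in each scenario. The following is a minimal sketch of such a summary in base R. It runs on a toy data frame of random placeholder errors (the names `toy`, `med` and `best` are illustrative), so the numbers it prints say nothing about the methods; with real results, “dsc” would take the place of `toy`.

```r
# Toy stand-in for the "dsc" data frame. The errors are random
# placeholders, NOT real results.
methods <- c("ridge","lasso","elastic_net","susie","varbvs","varbvsmix")
set.seed(1)
toy <- expand.grid(replicate = 1:20,
                   simulate.scenario = 1:4,
                   fit = methods,
                   stringsAsFactors = FALSE)
toy$mse.err <- rexp(nrow(toy))
toy$fit <- factor(toy$fit,methods)

# Median error for each method (rows) in each scenario (columns).
med <- with(toy,tapply(mse.err,list(fit,simulate.scenario),median))
print(round(med,2))

# The method with the smallest median error in each scenario.
best <- rownames(med)[apply(med,2,which.min)]
names(best) <- colnames(med)
print(best)
```

The median is used here because it is the centre line drawn in each boxplot; the mean or another summary could be substituted in the tapply call.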
This is the version of R and the packages that were used to generate these results.
sessionInfo()
# R version 3.4.3 (2017-11-30)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS High Sierra 10.13.6
#
# Matrix products: default
# BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] cowplot_0.9.4 ggplot2_3.1.0 dscrutils_0.3.5
#
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.0 knitr_1.20 magrittr_1.5 tidyselect_0.2.5
# [5] munsell_0.4.3 colorspace_1.4-0 R6_2.2.2 rlang_0.3.1
# [9] dplyr_0.8.0.1 stringr_1.3.1 plyr_1.8.4 tools_3.4.3
# [13] grid_3.4.3 gtable_0.2.0 withr_2.1.2 htmltools_0.3.6
# [17] assertthat_0.2.0 yaml_2.2.0 lazyeval_0.2.1 rprojroot_1.3-2
# [21] digest_0.6.17 tibble_2.1.1 crayon_1.3.4 purrr_0.2.5
# [25] glue_1.3.0 evaluate_0.11 rmarkdown_1.10 labeling_0.3
# [29] stringi_1.2.4 compiler_3.4.3 pillar_1.3.1 scales_0.5.0
# [33] backports_1.1.2 pkgconfig_2.0.2