Last updated: 2018-07-27
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20180714)
The command set.seed(20180714)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 2d81ef4
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/warmstart2/
Ignored: docs/.DS_Store
Ignored: docs/figure/.DS_Store
Untracked files:
Untracked: analysis/warmstart2.Rmd
Untracked: data/greedy19.rds
Untracked: data/init_fn2/
Unstaged changes:
Modified: analysis/init_fn.Rmd
Modified: analysis/warmstart.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 2d81ef4 | Jason Willwerscheid | 2018-07-27 | wflow_publish(c(“analysis/index.Rmd”, “analysis/init_fn2.Rmd”)) |
Here, I return to a question I asked in a previous investigation: does the choice of init_fn
affect the final objective attained?
I perform a very simple (if time-consuming) experiment. I fit FLASH to the GTEx dataset from previous investigations using init_fn = udv_si
with ten different seeds, and compare the final objective in each case to the objective I get using init_fn = udv_svd
.
I pre-run the code and load the results from file. I am using warmstarts to avoid the problem described here (at present, this functionality is only available in branch trackObj
).
# devtools::install_github("stephenslab/flashr", ref="trackObj")
devtools::load_all("/Users/willwerscheid/GitHub/flashr")
# devtools::install_github("stephenslab/ebnm")
devtools::load_all("/Users/willwerscheid/GitHub/ebnm")
gtex <- readRDS(gzcon(url("https://github.com/stephenslab/gtexresults/blob/master/data/MatrixEQTLSumStats.Portable.Z.rds?raw=TRUE")))
strong <- t(gtex$strong.z)
niter <- 10
obj.udv_si <- rep(0, niter)
nfactor.udv_si <- rep(0, niter)
for (i in 1:niter) {
res <- flash_add_greedy(strong, Kmax=50, init_fn="udv_si",
warmstart=TRUE, verbose=TRUE, seed=i)
obj.udv_si[i] <- flash_get_objective(strong, res$f)
nfactor.udv_si[i] <- flash_get_nfactors(res$f)
}
res2 <- flash_add_greedy(strong, Kmax=50, init_fn="udv_svd",
warmstart=TRUE, verbose=TRUE, seed=i)
obj.udv_svd <- flash_get_objective(strong, res2$f)
nfactor.udv_svd <- flash_get_nfactors(res2$f)
all_res <- list(obj.udv_si = obj.udv_si,
obj.udv_svd = obj.udv_svd,
nfactor.udv_si = nfactor.udv_si)
nfactor.udv_svd = nfactor.udv_svd)
saveRDS(all_res, "../data/init_fn2/res.rds")
all_res <- readRDS("./data/init_fn2/res.rds")
Results are as follows.
obj.diff <- all_res$obj.udv_si - all_res$obj.udv_svd
col <- c("lightgreen", "skyblue", "royalblue", "purple4")
plot.col <- col[all_res$nfactor.udv_si - 22]
plot(1:10, obj.diff,
xlab="seed", ylab="difference in objective",
xlim=c(1, 11), ylim=c(-1200, 200),
pch=19, col=plot.col,
main="Obj. attained for udv_si (relative to udv_svd)")
abline(0, 0, lty=2)
legend("topright", as.character(23:26), pch=19, col=col,
title="# factors")
So, six seeds yield an objective that is much worse than the objective attained using udv_svd
, and four seeds yield an objective that is as good or slightly better. The latter all include 25 factor/loading pairs, which is also the number of factor/loading pairs given by udv_svd
.
These results lend further support to my previous conclusion that we should make udv_si_svd
the default initialization function when there is missing data.
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.0.1 Rcpp_0.12.17 digest_0.6.15
[4] rprojroot_1.3-2 R.methodsS3_1.7.1 backports_1.1.2
[7] git2r_0.21.0 magrittr_1.5 evaluate_0.10.1
[10] stringi_1.1.6 whisker_0.3-2 R.oo_1.21.0
[13] R.utils_2.6.0 rmarkdown_1.8 tools_3.4.3
[16] stringr_1.3.0 yaml_2.1.17 compiler_3.4.3
[19] htmltools_0.3.6 knitr_1.20
This reproducible R Markdown analysis was created with workflowr 1.0.1