Last updated: 2019-01-12
workflowr checks:
✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20180714)
The command set.seed(20180714) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
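As a small illustration of why the seed matters, resetting it before a random draw reproduces the draw exactly. The subsample below is hypothetical and is not part of the analysis.

```r
# Hypothetical example: draw the same random subsample twice.
set.seed(20180714)
first.draw <- sample(100, 5)

# Resetting the seed restores the random number generator state,
# so the second draw matches the first.
set.seed(20180714)
second.draw <- sample(100, 5)

identical(first.draw, second.draw)
```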
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 7951c3c
Note that all relevant files for the analysis need to be committed to Git before the results are generated (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: docs/.DS_Store
Ignored: docs/figure/.DS_Store
Untracked files:
Untracked: analysis/gd_notes.Rmd
Unstaged changes:
Modified: analysis/brain.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 7951c3c | Jason Willwerscheid | 2019-01-12 | workflowr::wflow_publish("analysis/flashier_bench.Rmd") |
First, I fit the “strong” subset of SNP-gene association statistics used in Urbut, Wang, Carbonetto, and Stephens (2018) (the strong.z dataset found here). I fit five FLASH factors with normal-mixture priors using flashr and using flashier with backfit.order set to "dropout", "sequential", and "montaigne".
The data is a dense 16k x 44 matrix that takes up 7.0 MB of memory when loaded into R. I used the broadwl partition of the midway2 RCC cluster with 4 CPUs and 8 GB of memory, and I used Gao Wang’s monitor_memory.py script to test memory usage, as recommended in Peter Carbonetto’s large-scale data analysis tutorial.
Fit | VMS (GB) | RSS (GB) | Greedy (s/iter) | Backfit (s/iter) | Backfit iter | Obj diff | Time (min) |
---|---|---|---|---|---|---|---|
flashr | 0.67 | 0.42 | 0.56 | 0.41 | 195 | 1894 | 2.22 |
dropout | 0.48 | 0.30 | 0.41 | 0.41 | 182 | 1894 | 1.98 |
sequential | | | | 0.40 | 210 | 1894 | 2.15 |
montaigne | | | | 0.41 | 324 | 1894 | 2.97 |
Next, I fit the droplet-based 3’ scRNA-seq dataset analyzed in Montoro et al. (2018) (the data can be obtained here). I applied a log-plus-one transform to the data and then fit five FLASH factors using normal-mixture priors.
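The log-plus-one transform can be sketched on a toy matrix. The counts below are hypothetical, and base R's log1p is used as one convenient way to compute log(1 + x); the original analysis may have computed it differently.

```r
# Hypothetical toy count matrix (genes x cells); not the Montoro et al. data.
counts <- matrix(c(0, 1, 3, 0, 0, 7), nrow = 2)

# log1p(x) computes log(1 + x), so zero counts remain exactly zero.
Y <- log1p(counts)
Y
```

Because zero counts map to zero, the transformed data has the same sparsity pattern as the raw counts.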
The data matrix is 18k x 7k and takes up 1011 MB of memory when loaded into R as a dense matrix. However, only 9.3% of entries are nonzero, so the data can also be loaded as a sparse Matrix object, in which case the data takes up 143 MB of memory. I fit FLASH objects to the larger (dense) matrix object using the same four approaches used to fit the GTEx dataset, then I fit a FLASH object to the sparse Matrix object using flashier with backfit.order = "dropout" (flashr does not support objects of class Matrix). All fits were performed on the broadwl partition of the midway2 RCC cluster using 4 CPUs and 32 GB of memory.
Fit | VMS (GB) | RSS (GB) | Greedy (s/iter) | Backfit (s/iter) | Backfit iter | Obj diff | Time (min) |
---|---|---|---|---|---|---|---|
flashr | 18.4 | 18.1 | 23.49 | 15.64 | 250 | 27002 | 101.97 |
dropout | 3.9 | 3.7 | 1.91 | 1.92 | 285 | 26982 | 11.27 |
sequential | | | | 2.71 | 250 | 27109 | 13.45 |
montaigne | | | | 1.82 | 100 | 13004 | 5.20 |
sparse | 1.1 | 0.9 | 0.90 | 0.88 | 285 | 26982 | 5.22 |
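The gap between the dense and sparse storage sizes quoted above can be reproduced in miniature. The sketch below uses simulated data with much smaller, illustrative dimensions and the same 9.3% density; the variable names are hypothetical, and it assumes the Matrix package is installed.

```r
library(Matrix)

set.seed(1)
n <- 2000  # rows (illustrative, far smaller than the 18k x 7k data)
p <- 500   # columns
nnz <- round(0.093 * n * p)  # 9.3% nonzero entries, as in the text

# Pick distinct positions for the nonzero entries and fill them with
# strictly positive simulated values.
idx <- sample(n * p, nnz)
Y.sparse <- sparseMatrix(i = (idx - 1) %% n + 1,
                         j = (idx - 1) %/% n + 1,
                         x = rpois(nnz, 2) + 1,
                         dims = c(n, p))
Y.dense <- as.matrix(Y.sparse)

# Compare the memory taken by the two representations.
dense.bytes <- as.numeric(object.size(Y.dense))
sparse.bytes <- as.numeric(object.size(Y.sparse))
dense.bytes / sparse.bytes
```

A dense matrix stores 8 bytes for every entry, whereas a compressed sparse column matrix stores roughly 12 bytes (an 8-byte value plus a 4-byte row index) per nonzero entry. At 9.3% density the ratio should therefore come out near 7, in line with the 1011 MB and 143 MB figures reported for the real data.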
Finally, I fit the larger full-length scRNA “PulseSeq” dataset from Montoro et al. (2018). The dataset is about ten times larger than the droplet-based scRNA-seq dataset, so it was not feasible to use flashr. I again fit five factors using normal-mixture priors, and I set backfit.order = "dropout".
The dataset is 22k x 66k, with 9.3% of entries not equal to zero, and occupies 1.49 GB of memory when loaded into R as a sparse Matrix object. (I did not attempt to fit a larger, dense matrix object.) The fit was again performed on broadwl using 4 CPUs and 32 GB of memory.
Fit | VMS (GB) | RSS (GB) | Greedy (s/iter) | Backfit (s/iter) | Backfit iter | Obj diff | Time (min) |
---|---|---|---|---|---|---|---|
sparse | 6 | 5.8 | 7.74 | 3.9 | 334 | 300023 | 27.23 |
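For context on why only the sparse representation was used here, the size of the equivalent dense matrix can be estimated with simple arithmetic. This is a rough sketch: the helper name is hypothetical, the dimensions are the rounded ones quoted above, and 1 GB is taken to be 10^9 bytes.

```r
# Hypothetical helper: approximate size in GB of a dense matrix of doubles
# (8 bytes per entry), ignoring the small fixed overhead of the R object.
dense_size_gb <- function(nrow, ncol) {
  nrow * ncol * 8 / 1e9
}

# Rounded dimensions quoted in the text, so the result is only approximate.
pulseseq.dense.gb <- dense_size_gb(22e3, 66e3)
pulseseq.dense.gb
```

Under these assumptions the dense matrix would need about 11.6 GB before any fitting begins, compared with the 1.49 GB reported for the sparse Matrix object.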
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.0.1 Rcpp_1.0.0 digest_0.6.18
[4] rprojroot_1.3-2 R.methodsS3_1.7.1 backports_1.1.2
[7] magrittr_1.5 git2r_0.21.0 evaluate_0.12
[10] highr_0.7 stringi_1.2.4 whisker_0.3-2
[13] R.oo_1.21.0 R.utils_2.6.0 rmarkdown_1.10
[16] tools_3.4.3 stringr_1.3.1 xfun_0.4
[19] yaml_2.2.0 compiler_3.4.3 htmltools_0.3.6
[22] knitr_1.20.22
This reproducible R Markdown analysis was created with workflowr 1.0.1