Last updated: 2018-08-17

workflowr checks: (Click a bullet for more information)
  • R Markdown file: up-to-date

    Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

  • Environment: empty

    Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

  • Seed: set.seed(20180714)

    The command set.seed(20180714) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

  • Session information: recorded

    Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

  • Repository version: 76185c6

    Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

    Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
    
    Ignored files:
        Ignored:    .DS_Store
        Ignored:    .Rhistory
        Ignored:    .Rproj.user/
        Ignored:    docs/.DS_Store
        Ignored:    docs/figure/.DS_Store
    
    Untracked files:
        Untracked:  data/greedy19.rds
    
    Unstaged changes:
        Modified:   analysis/index.Rmd
    
    
    Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
Expand here to see past versions:
    File Version Author Date Message
    Rmd 76185c6 Jason Willwerscheid 2018-08-17 wflow_publish(“analysis/squarem.Rmd”)
    html 022af3c Jason Willwerscheid 2018-08-17 Build site.
    Rmd b9ef007 Jason Willwerscheid 2018-08-17 wflow_publish(“analysis/squarem.Rmd”)


Introduction

This analysis is complementary to my investigations into parallelization (see here and here) in that I further explore ways to speed up FLASH backfits.

I use two off-the-shelf EM (and MM) accelerators, SQUAREM and DAAREM. For SQUAREM details, see Varadhan and Roland (2008). For DAAREM, see Henderson and Varadhan (2018).

Experiments

I use the same GTEx dataset that I use here and in Investigation 8.

I use flash_add_greedy to create three flash fit objects with, respectively, 5, 10, and 20 factor/loading pairs. I then refine each fit using flash_backfit with: 1. no acceleration; 2. acceleration via SQUAREM; and 3. acceleration via DAAREM.

Results

Since the experiments take a long time to run, I pre-run the code below and load the results from file.

res5 <- readRDS("./data/squarem/res5.rds")
res10 <- readRDS("./data/squarem/res10.rds")
res20 <- readRDS("./data/squarem/res20.rds")

Convergence behavior

In every case, DAAREM takes the fewest iterations to converge and convergence is nearly monotonic. In constrast, SQUAREM evinces highly non-monotonic behavior and tends to take more iterations than backfitting with no acceleration at all.

plot_obj <- function(res, main) {
  data <- c(res$backfit_obj, res$squarem_obj$V1, res$daarem_obj)
  plot(1:length(res$backfit_obj), res$backfit_obj,
       type='l', col='red', ylim=c(min(data), max(data)),
       xlab="Iteration", ylab="Objective", main=main)
  lines(1:length(res$squarem_obj$V1), res$squarem_obj$V1,
        col='blue')
  lines(1:length(res$daarem_obj), res$daarem_obj,
        col='green')
  legend("bottomright", legend=c("DAAREM", "No acceleration", "SQUAREM"),
         lty=1, col=c('green', 'red', 'blue'))
}

plot_obj(res5, "5 factor model")

Expand here to see past versions of conv-1.png:
Version Author Date
022af3c Jason Willwerscheid 2018-08-17

plot_obj(res10, "10 factor model")

Expand here to see past versions of conv-2.png:
Version Author Date
022af3c Jason Willwerscheid 2018-08-17

plot_obj(res20, "20 factor model")

Expand here to see past versions of conv-3.png:
Version Author Date
022af3c Jason Willwerscheid 2018-08-17

Final objective

Since the final objectives attained are difficult to see in the plots above, I list the differences in the table below. In every case, DAAREM beats the final objective attained with no acceleration. SQUAREM does much worse on the 5- and 10-factor models, but better on the 20-factor model.

final_obj <- function(res) {
  backfit = res$backfit_obj[length(res$backfit_obj)]
  return(c(res$daarem_obj[length(res$daarem_obj)] - backfit,
           res$squarem_obj$V1[length(res$squarem_obj$V1)] - backfit))
}
table_data <- cbind("5_factors" = final_obj(res5), 
                    "10_factors" = final_obj(res10), 
                    "20_factors" = final_obj(res20))
rownames(table_data) = c("DAAREM (diff from backfit)", 
                         "SQUAREM (diff from backfit)")
knitr::kable(table_data, digits=1)
5_factors 10_factors 20_factors
DAAREM (diff from backfit) 0.1 3.8 9.1
SQUAREM (diff from backfit) -56.5 -48.4 12.4

Time per iteration

The acceleration methods add a bit of overhead in terms of time required per iteration, but not much. The following are seconds required per iteration:

times <- function(res) {
  return(c(res$backfit_t[3] / length(res$backfit_obj),
           res$daarem_t[3] / length(res$daarem_obj),
           res$squarem_t[3] / length(res$squarem_obj$V1)))
}
table_data <- cbind("5_factors" = times(res5), 
                    "10_factors" = times(res10), 
                    "20_factors" = times(res20))
rownames(table_data) <- c("backfit", "DAAREM", "SQUAREM")
knitr::kable(table_data, digits = 2)
5_factors 10_factors 20_factors
backfit 1.35 2.50 5.57
DAAREM 1.37 2.71 5.53
SQUAREM 1.47 2.79 5.60

Total iterations to convergence

Perhaps the most important consideration is how quickly we can obtain an estimate that is reasonable close to convergence, where “reasonably close” is defined by some stopping rule. Below I give the number of iterations required given different tolerance parameters (i.e., the number of iterations required before the difference in objective from one iteration to the next is less than tol). (I omit results for SQUAREM since it does so poorly in other respects.)

niter <- function(res, tols) {
  backfit_diff <- (res$backfit_obj[2:length(res$backfit_obj)] -
                     res$backfit_obj[1:(length(res$backfit_obj) - 1)])
  daarem_diff <- (res$daarem_obj[2:length(res$daarem_obj)] -
                     res$daarem_obj[1:(length(res$daarem_obj) - 1)])
  res <- matrix(NA, nrow = length(tols), ncol = 2)
  for (i in 1:length(tols)) {
    res[i, 1] <- min(which(backfit_diff < tols[i])) + 1
    res[i, 2] <- min(which(abs(daarem_diff) < tols[i])) + 1
  }
  rownames(res) = paste("tol =", as.character(tols))
  colnames(res) = c("backfit", "DAAREM")
  return(t(res))
}

table_data <- niter(res5, c(0.5, 0.1, 0.05))
knitr::kable(table_data, caption = "5 factors")
5 factors
tol = 0.5 tol = 0.1 tol = 0.05
backfit 240 283 286
DAAREM 142 142 174
table_data <- niter(res10, c(0.5, 0.1, 0.05))
knitr::kable(table_data, caption = "10 factors")
10 factors
tol = 0.5 tol = 0.1 tol = 0.05
backfit 192 201 208
DAAREM 111 116 116
table_data <- niter(res20, c(0.5, 0.1, 0.05))
knitr::kable(table_data, caption = "20 factors")
20 factors
tol = 0.5 tol = 0.1 tol = 0.05
backfit 183 452 483
DAAREM 127 129 129

Code

Click “Code” to view the code used to produce the above results.

# Load packages ---------------------------------------------------------

# devtools::install_github("stephenslab/flashr")
devtools::load_all("/Users/willwerscheid/GitHub/flashr/")
# devtools::install_github("stephenslab/ebnm")
devtools::load_all("/Users/willwerscheid/GitHub/ebnm/")
library(SQUAREM)
library(daarem)

# Load data -------------------------------------------------------------

gtex <- readRDS(gzcon(url("https://github.com/stephenslab/gtexresults/blob/master/data/MatrixEQTLSumStats.Portable.Z.rds?raw=TRUE")))
data <- flash_set_data(t(gtex$strong.z), S = 1)

# Create flash objects for backfitting ----------------------------------

fl_g5 <- flash_add_greedy(data, Kmax = 5,
                          var_type = "zero", init_fn = "udv_svd")

fl_g10 <- flash_add_greedy(data, Kmax = 5, f_init = fl_g5,
                           var_type = "zero", init_fn = "udv_svd")

fl_g20 <- flash_add_greedy(data, Kmax = 10, f_init = fl_g10,
                           var_type = "zero", init_fn = "udv_svd")

# Testing functions -----------------------------------------------------

logit <- function(x) {log(x / (1 - x))}
inv_logit <- function(x) {exp(x) / (1 + exp(x))}

fl.to.param <- function(f) {
  c(as.vector(f$EL), as.vector(f$EF),
    as.vector(f$EL2), as.vector(f$EF2),
    sapply(f$gl, function(k) {logit(k$pi0)}),
    sapply(f$gf, function(k) {logit(k$pi0)}),
    sapply(f$gl, function(k) {log(k$a)}),
    sapply(f$gf, function(k) {log(k$a)}))
}

param.to.fl <- function(param, m, n, k) {
  LL = matrix(param[1:(m*k)], ncol=k)
  curr_idx = m*k
  FF = matrix(param[(curr_idx + 1):(curr_idx + n*k)], ncol=k)
  curr_idx = curr_idx + n*k

  f = flashr:::flash_init_lf(LL, FF)

  f$EL2 = matrix(param[(curr_idx + 1):(curr_idx + m*k)], ncol=k)
  curr_idx = curr_idx + m*k
  f$EF2 = matrix(param[(curr_idx + 1):(curr_idx + n*k)], ncol=k)
  curr_idx = curr_idx + n*k

  f$gl = list()
  f$gf = list()
  for (i in 1:k) {
    f$gl[[i]] = list()
    f$gl[[i]]$pi0 = inv_logit(param[curr_idx + 1])
    curr_idx = curr_idx + 1
  }
  for (i in 1:k) {
    f$gf[[i]] = list()
    f$gf[[i]]$pi0 = inv_logit(param[curr_idx + 1])
    curr_idx = curr_idx + 1
  }
  for (i in 1:k) {
    f$gl[[i]]$a = exp(param[curr_idx + 1])
    curr_idx = curr_idx + 1
  }
  for (i in 1:k) {
    f$gf[[i]]$a = exp(param[curr_idx + 1])
    curr_idx = curr_idx + 1
  }

  return(f)
}

flash.iter <- function(p, data, m, n, k) {
  init_fl = param.to.fl(p, m, n, k)
  fl = flash_backfit(data, init_fl,
                     ebnm_fn = "ebnm_pn", var_type = "zero",
                     nullcheck = FALSE, verbose = FALSE,
                     maxiter = 1)
  obj = flash_get_objective(data, fl)
  message(obj)
  return(c(fl.to.param(fl), obj))
}

flash.obj <- function(p, data, m, n, k) {
  return(p[length(p)])
}

run_test <- function(f_init, data, niter) {
  m <- flash_get_n(f_init)
  n <- flash_get_p(f_init)
  k <- flash_get_k(f_init)

  backfit_obj <- rep(NA, niter)
  f <- f_init
  backfit_t <- system.time({
    for (i in 1:niter) {
      f <- flash_backfit(data, f,
                         ebnm_fn = "ebnm_pn", var_type = "zero",
                         nullcheck = FALSE, verbose = FALSE,
                         maxiter = 1)
      obj <- flash_get_objective(data, f)
      message(obj)
      backfit_obj[i] <- obj
    }
  })

  message("Sinking SQUAREM results to file...")
  zz <- file("tmp.txt", open="wt")
  sink(zz, type="message")
  squarem_t <- system.time(
    squarem_res <- squarem(c(fl.to.param(f_init),
                             flash_get_objective(data, f_init)),
                           flash.iter, flash.obj,
                           data = data, m = m, n = n, k = k,
                           control = (list(tol=1, maxiter=niter)))
  )
  sink(type="message")
  squarem_obj <- read.csv("tmp.txt", header = FALSE, sep = '\n')
  file.remove("tmp.txt")

  daarem_t <- system.time(
    daarem_res <- daarem(c(fl.to.param(f_init),
                           flash_get_objective(data, f_init)),
                         flash.iter, flash.obj,
                         data = data, m = m, n = n, k = k,
                         control = (list(tol=1, maxiter=niter)))
  )

  return(list(backfit_obj = backfit_obj, backfit_t = backfit_t,
              squarem_obj = squarem_obj, squarem_t = squarem_t,
              daarem_obj = daarem_res$objfn.track, daarem_t = daarem_t))
}

# Run tests -------------------------------------------------------------

fpath <- "./data/squarem/"

# Normal backfit of fl_g5 takes 304 iterations.
res_5 <- run_test(fl_g5, data, niter = 300)
saveRDS(res_5, paste0(fpath, "res5.rds"))

# Normal backfit of fl_g10 takes 228 iterations.
res_10 <- run_test(fl_g10, data, niter = 225)
saveRDS(res_10, paste0(fpath, "res10.rds"))

# Normal backfit of fl_g20 takes 497 iterations.
res_20 <- run_test(fl_g20, data, niter = 495)
saveRDS(res_20, paste0(fpath, "res20.rds"))


# Use this function if gf and gl parameters aren't converted to a
#   log/logit scale.
#
# flash.iter <- function(p) {
#   init_fl = param.to.fl(p[1:(length(p) - 1)])
#   fl = try(flash_backfit(data, init_fl, ebnm_fn="ebnm_pn",
#                          ebnm_param=list(warmstart=TRUE),
#                          var_type="zero", nullcheck=F, maxiter=1))
#   if (class(fl) == "try-error") {
#     fl = flash_backfit(data, init_fl, ebnm_fn="ebnm_pn",
#                        ebnm_param=list(warmstart=FALSE),
#                        var_type="zero", nullcheck=F, maxiter=1)
#   }
#   return(c(f.to.param(fl), flash_get_objective(data, fl)))
# }

plot_obj <- function(res) {
  data <- c(res$backfit_obj, res$squarem_obj$V1, res$daarem_obj)
  plot(1:length(res$backfit_obj), res$backfit_obj,
       type='l', col='red', ylim=c(min(data), max(data)),
       xlab="Iteration", ylab="Objective")
  lines(1:length(res$squarem_obj$V1), res$squarem_obj$V1,
        col='blue')
  lines(1:length(res$daarem_obj), res$daarem_obj,
        col='green')
}

plot_obj_zoom <- function(res, yrange) {
  data <- c(res$backfit_obj, res$squarem_obj$V1, res$daarem_obj)
  max_obj <- max(data)
  t <- max_obj - yrange
  begin_iter <- min(c(which(res$backfit_obj > t),
                      which(res$squarem_obj > t),
                      which(res$daarem_obj > t)))
  plot(1:length(res$backfit_obj), res$backfit_obj,
       type='l', col='red',
       xlim=c(begin_iter, length(res$backfit_obj)),
       ylim=c(t, max_obj),
       xlab="Iteration", ylab="Objective")
  lines(1:length(res$squarem_obj$V1), res$squarem_obj$V1,
        col='blue')
  lines(1:length(res$daarem_obj), res$daarem_obj,
        col='green')
}

Session information

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.0.1   Rcpp_0.12.17      digest_0.6.15    
 [4] rprojroot_1.3-2   R.methodsS3_1.7.1 backports_1.1.2  
 [7] git2r_0.21.0      magrittr_1.5      evaluate_0.10.1  
[10] highr_0.6         stringi_1.1.6     whisker_0.3-2    
[13] R.oo_1.21.0       R.utils_2.6.0     rmarkdown_1.8    
[16] tools_3.4.3       stringr_1.3.0     yaml_2.1.17      
[19] compiler_3.4.3    htmltools_0.3.6   knitr_1.20       

This reproducible R Markdown analysis was created with workflowr 1.0.1