Last updated: 2018-10-09
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date 
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
 ✔ Environment: empty 
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
 ✔ Seed: 
set.seed(12345) 
The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
 ✔ Session information: recorded 
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
 ✔ Repository version: cdfe5c5 
wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.RData
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/ALStruct_cache/
    Ignored:    data/.Rhistory
    Ignored:    docs/figure/.DS_Store
Untracked files:
    Untracked:  .dropbox
    Untracked:  Icon
    Untracked:  _workflowr.yml
    Untracked:  analysis/GTEX-cogaps.Rmd
    Untracked:  analysis/SPCAvRP.rmd
    Untracked:  analysis/eQTL.perm.rand.pdf
    Untracked:  analysis/ieQTL.perm.rand.pdf
    Untracked:  analysis/mash_bhat_z.Rmd
    Untracked:  analysis/mash_ieqtl_permutations.Rmd
    Untracked:  analysis/pseudodata.Rmd
    Untracked:  analysis/sc_bimodal.Rmd
    Untracked:  analysis/susie_example.Rmd
    Untracked:  analysis/svd-timing.Rmd
    Untracked:  analysis/test_sparse.Rmd
    Untracked:  analysis/z.txt
    Untracked:  code/multivariate_testfuncs.R
    Untracked:  data/4matthew/
    Untracked:  data/4matthew2/
    Untracked:  data/E-MTAB-2805.processed.1/
    Untracked:  data/ENSG00000156738.Sim_Y2.RDS
    Untracked:  data/GDS5363_full.soft.gz
    Untracked:  data/GSE41265_allGenesTPM.txt
    Untracked:  data/Muscle_Skeletal.ACTN3.pm1Mb.RDS
    Untracked:  data/Thyroid.FMO2.pm1Mb.RDS
    Untracked:  data/bmass.HaemgenRBC2016.MAF01.Vs2.MergedDataSources.200kRanSubset.ChrBPMAFMarkerZScores.vs1.txt.gz
    Untracked:  data/bmass.HaemgenRBC2016.Vs2.NewSNPs.ZScores.hclust.vs1.txt
    Untracked:  data/bmass.HaemgenRBC2016.Vs2.PreviousSNPs.ZScores.hclust.vs1.txt
    Untracked:  data/finemap_data/fmo2.sim/b.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out2.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out2_snp.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out_snp.txt
    Untracked:  data/finemap_data/fmo2.sim/data
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.config
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.k
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.k4.config
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.k4.snp
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.ld
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.snp
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.z
    Untracked:  data/finemap_data/fmo2.sim/pos.txt
    Untracked:  data/logm.csv
    Untracked:  data/m.cd.RDS
    Untracked:  data/m.cdu.old.RDS
    Untracked:  data/m.new.cd.RDS
    Untracked:  data/m.old.cd.RDS
    Untracked:  data/mainbib.bib.old
    Untracked:  data/mat.csv
    Untracked:  data/mat.txt
    Untracked:  data/mat_new.csv
    Untracked:  data/paintor_data/
    Untracked:  data/temp.txt
    Untracked:  data/y.txt
    Untracked:  data/y_f.txt
    Untracked:  docs/figure/eigen.Rmd/
    Untracked:  docs/figure/fmo2.sim.Rmd/
    Untracked:  docs/figure/newVB.elbo.Rmd/
    Untracked:  docs/figure/rbc_zscore_mash2.Rmd/
    Untracked:  docs/figure/rbc_zscore_mash2_analysis.Rmd/
    Untracked:  docs/figure/rbc_zscores.Rmd/
    Untracked:  docs/trend_files/
    Untracked:  docs/z.txt
    Untracked:  explore_udi.R
    Untracked:  output/fit.varbvs.RDS
    Untracked:  output/glmnet.fit.RDS
    Untracked:  output/test.bv.txt
    Untracked:  output/test.gamma.txt
    Untracked:  output/test.hyp.txt
    Untracked:  output/test.log.txt
    Untracked:  output/test.param.txt
    Untracked:  output/test2.bv.txt
    Untracked:  output/test2.gamma.txt
    Untracked:  output/test2.hyp.txt
    Untracked:  output/test2.log.txt
    Untracked:  output/test2.param.txt
    Untracked:  output/test3.bv.txt
    Untracked:  output/test3.gamma.txt
    Untracked:  output/test3.hyp.txt
    Untracked:  output/test3.log.txt
    Untracked:  output/test3.param.txt
    Untracked:  output/test4.bv.txt
    Untracked:  output/test4.gamma.txt
    Untracked:  output/test4.hyp.txt
    Untracked:  output/test4.log.txt
    Untracked:  output/test4.param.txt
    Untracked:  output/test5.bv.txt
    Untracked:  output/test5.gamma.txt
    Untracked:  output/test5.hyp.txt
    Untracked:  output/test5.log.txt
    Untracked:  output/test5.param.txt
Unstaged changes:
    Modified:   analysis/_site.yml
    Deleted:    analysis/chunks.R
    Modified:   analysis/eigen.Rmd
    Modified:   analysis/fmo2.sim.Rmd
    Modified:   analysis/newVB.Rmd
    Modified:   analysis/wSVD.Rmd
I’m going to try to do a version of Poisson low rank approximation in R so I can play with it, based on Abhishek Sarkar’s python code (here)[https://github.com/aksarkar/wlra/blob/master/wlra/wlra.py].
# return truncated svd (truncated to rank principal components)
lra = function(x, rank){
  x.svd = svd(x,nu=rank,nv=rank)
  
  return(x.svd$u %*% diag(x.svd$d[1:rank],nrow=rank) %*% t(x.svd$v))
}
  
# Return the weighted low rank approximation of x
# Minimize the weighted Frobenius norm between x and the approximation z using EM [SJ03].
#
#' @param x input data (n, p)
#' @param w input weights (n, p)
#' @param rank rank of the approximation (non-negative)
#' @param max_iters - maximum number of EM iterations
#' @param atol - minimum absolute difference in objective function for convergence
#' @param verbose - print objective function updates
#' @return an n by p matrix
wlra = function(x, w, rank, max_iters=1000, atol=1e-3,verbose=FALSE){
  n = nrow(x)
  p = ncol(x)
  # Important: WLRA requires weights 0 <= w <= 1
  w = w/max(w)
  
    # Important: the procedure is deterministic, so initialization
  # matters.
  #
  # Srebro and Jaakkola suggest the best strategy is to initialize
  # from zero, but go from a full rank down to a rank k approximation in
  # the first iterations
  #
  # For now, take the simpler strategy of just initializing to zero. Srebro and
  # Jaakkola suggest this can underfit.
  z = matrix(0,nrow=n,ncol=p)
  obj = mean(w * x^2)
  if(verbose){
    print(paste0("wsvd [0]=",obj))
  }
  for(i in 0:max_iters){
    z1 = lra(w * x + (1 - w) * z, rank)
    update = mean(w * (x - z1)^2)
    if(verbose){print(paste0("wsvd [",i+1,"]=",update))}
    if(update > obj){
      stop("objective increased")
    } else if(max(abs(update-obj))<atol){
      return(z1)
    } else {
      z = z1
      obj = update
    }
  }
  stop("failed to converge")
}
# return log(p(Y|lambda=exp(eta))) for Y \sim Poi(lambda)
pois_llik= function(y, eta){
  sum(dpois(y,exp(eta),log=TRUE))
}
#' @details Assume x_ij ~ Poisson(exp(eta_ij)), eta_ij = L_ik F_kj
#'  Maximize the log likelihood by using Taylor approximation to rewrite the problem as WLRA.
#' @param x: input data (n, p)
#' @param rank: rank of the approximation
#' @param max_outer_iters: maximum number of calls to WLRA
#' @param max_iters: maximum number of EM iterations in WLRA
#' @param verbose: print objective function updates
#' @return low rank approximation (n, p) matrix
pois_lra= function(x, rank, max_outer_iters=50, max_iters=1000, atol=1e-3, verbose=FALSE){
  n = nrow(x)
  p = ncol(x)
  nmf = NNLM::nnmf(x, rank)
  eta = log(nmf$W %*% nmf$H)
  obj = mean(pois_llik(x, eta))
  if(verbose)
    print(paste0("pois_lra [0]:",obj))
  for(i in 0:max_outer_iters){
    lam = exp(eta)
    w = lam
    target = eta + x / lam - 1
    w[is.na(x)]=0 # Mark missing data with weight 0
    
      # Now we can go ahead and fill in the missing values with something
      # computationally convenient, because the WLRA EM update will ignore the
      # value for weight zero.
    target[is.na(x)] = 0
    eta1 = wlra(target, w, rank, max_iters=max_iters, atol=atol, verbose=verbose)
    update = mean(pois_llik(x, eta1))
    if(verbose){
      print(paste0("pois_lra [",i + 1,"]:",update))
    }
    if(max(abs(update-obj))<atol){
      return(list(fit=eta1,w=w,target=target))
    } else {
      eta = eta1
      obj = update
    }
  }
  stop("failed to converge")
}
# this just does a simple TSE about lam and runs wSVD once without any iteration
pois_lra1 = function(x, rank, lam = ifelse(x>0,x,0.5), max_iters=1000, atol=1e-3, verbose=FALSE){
  target = log(lam) + x / lam - 1
  w = lam
  w[is.na(x)]=0 # Mark missing data with weight 0
    
      # Now we can go ahead and fill in the missing values with something
      # computationally convenient, because the WLRA EM update will ignore the
      # value for weight zero.
    target[is.na(x)] = 0
    eta1 = wlra(target, w, rank, max_iters=max_iters, atol=atol, verbose=verbose)
    return(list(fit=eta1,w=w,target=target))
}First simulate some data:
set.seed(1)
l = rnorm(100)
f = rnorm(100)
eta = outer(l,f)
lambda = exp(eta)
x = matrix(rpois(length(lambda),lambda),nrow=nrow(lambda))Now fit various models: the plra, plra1 with the “naive” expansion (around x, or 0.5 for x=0) and around the true value of lambda:
x.plra=pois_lra(x,rank=1,verbose=TRUE)[1] "pois_lra [0]:-56393.8383721883"
[1] "wsvd [0]=9.11187664142563"
[1] "wsvd [1]=9.09767330084035"
[1] "wsvd [2]=9.0942294282996"
[1] "wsvd [3]=9.09252076154906"
[1] "wsvd [4]=9.09146781601123"
[1] "wsvd [5]=9.09073663864168"
[1] "pois_lra [1]:-20730.5265718132"
[1] "wsvd [0]=0.0254032600526617"
[1] "wsvd [1]=0.018730302153258"
[1] "wsvd [2]=0.0173472213350827"
[1] "wsvd [3]=0.0166205649043371"
[1] "pois_lra [2]:-24476.2310334721"
[1] "wsvd [0]=0.0342958650072393"
[1] "wsvd [1]=0.0302492341697698"
[1] "wsvd [2]=0.0295946545829222"
[1] "pois_lra [3]:-26631.5708456328"
[1] "wsvd [0]=0.0594742391599983"
[1] "wsvd [1]=0.0558114666026009"
[1] "wsvd [2]=0.0551636349735505"
[1] "pois_lra [4]:-26775.5968189102"
[1] "wsvd [0]=0.0654476543467692"
[1] "wsvd [1]=0.0617080003798833"
[1] "wsvd [2]=0.0609634116407461"
[1] "pois_lra [5]:-26663.0720352895"
[1] "wsvd [0]=0.0639419959608682"
[1] "wsvd [1]=0.0601941802731442"
[1] "wsvd [2]=0.0594430040900086"
[1] "pois_lra [6]:-26653.5190878185"
[1] "wsvd [0]=0.0637737656022713"
[1] "wsvd [1]=0.0600262104698666"
[1] "wsvd [2]=0.0592754713652051"
[1] "pois_lra [7]:-26653.9093042342"
[1] "wsvd [0]=0.063778632660418"
[1] "wsvd [1]=0.0600311044827268"
[1] "wsvd [2]=0.0592803862918574"
[1] "pois_lra [8]:-26653.9404182592"
[1] "wsvd [0]=0.0637791754208205"
[1] "wsvd [1]=0.0600316465363189"
[1] "wsvd [2]=0.0592809270276872"
[1] "pois_lra [9]:-26653.9393089239"
[1] "wsvd [0]=0.0637791618254631"
[1] "wsvd [1]=0.0600316328646063"
[1] "wsvd [2]=0.0592809132945037"
[1] "pois_lra [10]:-26653.9392199259"x.plra1.naive=pois_lra1(x,rank=1,verbose=TRUE)[1] "wsvd [0]=0.0357370934894206"
[1] "wsvd [1]=0.0216973393395743"
[1] "wsvd [2]=0.0183559836838193"
[1] "wsvd [3]=0.0166671902679346"
[1] "wsvd [4]=0.0156058206162763"
[1] "wsvd [5]=0.0148556454379835"x.plra1.true = pois_lra1(x,1,lam=lambda) pois_llik(x,eta) #log likelihood at true value of eta[1] -13269.62pois_llik(x,x.plra$fit)[1] -26653.94pois_llik(x,x.plra1.naive$fit)[1] -20770.27pois_llik(x,x.plra1.true$fit)[1] -20830.33This was weird that expansion around the truth was so bad. So I tried more stringent convergence:
x.plra1.true2 = pois_lra1(x,1,lam=lambda,max_iters = 10000,atol=1e-9) 
x.plra1.naive2 = pois_lra1(x,1,max_iters = 10000,atol=1e-9) 
pois_llik(x,x.plra1.naive2$fit)[1] -13163.59pois_llik(x,x.plra1.true2$fit)[1] -13145.19sessionInfo()R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
loaded via a namespace (and not attached):
 [1] workflowr_1.1.1   Rcpp_0.12.19      NNLM_0.4.2       
 [4] digest_0.6.17     rprojroot_1.3-2   R.methodsS3_1.7.1
 [7] backports_1.1.2   git2r_0.23.0      magrittr_1.5     
[10] evaluate_0.11     stringi_1.2.4     whisker_0.3-2    
[13] R.oo_1.22.0       R.utils_2.7.0     rmarkdown_1.10   
[16] tools_3.5.1       stringr_1.3.1     yaml_2.2.0       
[19] compiler_3.5.1    htmltools_0.3.6   knitr_1.20       
This reproducible R Markdown analysis was created with workflowr 1.1.1