Last updated: 2020-03-13

Checks: 6 passed, 0 failed

Knit directory: scFLASH/

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20181103) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    code/initialization/
    Ignored:    data-raw/10x_assigned_cell_types.R
    Ignored:    data/.DS_Store
    Ignored:    data/10x/
    Ignored:    data/Ensembl2Reactome.txt
    Ignored:    data/droplet.rds
    Ignored:    data/mus_pathways.rds
    Ignored:    output/backfit/
    Ignored:    output/final_montoro/
    Ignored:    output/lowrank/
    Ignored:    output/prior_type/
    Ignored:    output/pseudocount/
    Ignored:    output/pseudocount_redux/
    Ignored:    output/size_factors/
    Ignored:    output/var_type/

Untracked files:
    Untracked:  analysis/NBapprox.Rmd
    Untracked:  analysis/trachea4.Rmd
    Untracked:  code/alt_montoro/
    Untracked:  code/missing_data.R
    Untracked:  code/pulseseq/
    Untracked:  code/trachea4.R
    Untracked:  code/var_reg/
    Untracked:  fl_tmp.rds
    Untracked:  output/alt_montoro/
    Untracked:  output/pulseseq_fit.rds
    Untracked:  output/var_reg/

Unstaged changes:
    Modified:   code/utils.R
    Modified:   data-raw/pbmc.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd 0608852 Jason Willwerscheid 2020-03-13 wflow_publish(“analysis/var_reg_pbmc2.Rmd”)
html 73c62dd Jason Willwerscheid 2020-03-12 Build site.
Rmd f8991ba Jason Willwerscheid 2020-03-12 wflow_publish(“./analysis/var_reg_pbmc2.Rmd”)

Introduction

In a previous analysis, I put a prior on the residual variance parameters to prevent them from going to zero during the backfit (in the past, I had just been thresholding them). I fixed the prior rather than estimating it using empirical Bayes: as it turns out, empirical Bayes estimation of this prior is simply not effective (I tried a range of prior families, including exponential and gamma priors).

Here, I combine the two approaches. I reason about Poisson mixtures to set a minimum threshold and then, after thresholding, I use empirical Bayes to shrink the variance estimates towards their mean.

As in the previous analysis, all fits add 20 factors greedily using point-normal priors and then backfit. Here, however, I don’t “pre-scale” cells (that is, I scale using library size normalization, but I don’t do any additional scaling based on cell-wise variance estimates). The code used to produce the fits can be viewed here.

Residual variance threshold

The “true” distribution of gene \(j\) can be modeled as \[ X_{ij} \sim \text{Poisson}(s_i \lambda_{ij}), \] where \(s_i\) is the size factor for cell \(i\) and \(\lambda_{ij}\) depends on, for example, cell type. Using a Taylor approximation, the transformed entry \(Y_{ij} = \log(X_{ij} / s_i + 1)\) can be written \[ Y_{ij} \approx \log(\lambda_{ij} + 1) + \frac{1}{s_i(\lambda_{ij} + 1)}(X_{ij} - s_i \lambda_{ij}) - \frac{1}{2s_i^2(\lambda_{ij} + 1)^2}(X_{ij} - s_i \lambda_{ij})^2 \] so that \[ \mathbb{E}Y_{ij} \approx \log(\lambda_{ij} + 1) - \frac{\lambda_{ij}}{2s_i(\lambda_{ij} + 1)^2}\] and \[ \text{Var}(Y_{ij}) \approx \frac{\lambda_{ij}}{s_i(\lambda_{ij} + 1)^2}\]
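
As a quick sanity check on these approximations, here is a small simulation (the values of \(\lambda\), \(s\), and \(n\) are arbitrary illustrations, not taken from the data):

set.seed(1)
lambda <- 0.5; s <- 2; n <- 1e6
X <- rpois(n, s * lambda)
Y <- log(X / s + 1)
# Empirical mean and variance of Y versus the Taylor approximations:
c(empirical = mean(Y), approx = log(lambda + 1) - lambda / (2 * s * (lambda + 1)^2))
c(empirical = var(Y), approx = lambda / (s * (lambda + 1)^2))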

The law of total variance gives a simple lower bound: \[ \text{Var}(Y_j) \ge \mathbb{E}_i(\text{Var}(Y_{ij})) \approx \frac{1}{n} \sum_{i} \frac{\lambda_{ij}}{s_i(\lambda_{ij} + 1)^2} \] Plugging in the estimator \(\hat{\lambda}_{ij} = X_{ij} / s_i\): \[ \text{Var}(Y_j) \ge \frac{1}{n} \sum_i \frac{X_{ij}}{s_i^2(\exp(Y_{ij}))^2} \]

Thus a reasonable lower bound for the residual variance estimates is \[ \min_j \frac{1}{n} \sum_i \frac{X_{ij}}{s_i^2(\exp(Y_{ij}))^2} \]
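
To make the threshold concrete, here is a sketch of the computation on a small simulated genes-by-cells count matrix (counts, size.factors, and the simulation parameters are stand-ins, not the PBMC data analyzed below):

set.seed(1)
n.genes <- 100; n.cells <- 200
counts <- matrix(rpois(n.genes * n.cells, lambda = 0.3), nrow = n.genes)
size.factors <- colSums(counts) / mean(colSums(counts))

Y <- log1p(t(t(counts) / size.factors))
# Gene-wise plug-in lower bounds (1/n) sum_i X_ij / (s_i^2 exp(Y_ij)^2):
per.gene.bound <- rowMeans(t(t(counts / exp(Y)^2) / size.factors^2))
# A single threshold for all genes is the minimum of the gene-wise bounds:
threshold <- min(per.gene.bound)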

Prior family

I’ll use the family of gamma priors: they have the advantage of being fast to fit (unlike mixtures of gammas and exponentials) while still being more flexible than one-parameter exponential priors.

Results: Variance Estimates

The regularization step doesn’t seem to be hugely important. Variance estimates are very similar with and without it (the solid lines indicate the threshold):

suppressMessages(library(tidyverse))
suppressMessages(library(Matrix))

source("./code/utils.R")
orig.data <- readRDS("./data/10x/pbmc.rds")
pbmc <- preprocess.pbmc(orig.data)

res <- readRDS("./output/var_reg/varreg_fits2.rds")

var_df <- tibble(thresholded = res$unreg$fl$residuals.sd^2,
                 regularized = res$reg$fl$residuals.sd^2)

ggplot(var_df, aes(x = thresholded, y = regularized)) + 
  geom_point(size = 1) +
  scale_x_log10() + 
  scale_y_log10() +
  geom_abline(slope = 1, linetype = "dashed") +
  ggtitle("Variance Estimates") +
  labs(x = "Without regularization", y = "With regularization") +
  geom_vline(xintercept = 1 / res$unreg$fl$flash.fit$given.tau) +
  geom_hline(yintercept = 1 / res$unreg$fl$flash.fit$given.tau)


The regularized fit is slower, but it does a bit better with respect to both the ELBO (surprisingly!) and the log likelihood of the implied discrete distribution.

res_df <- tibble(Fit = c("Without Regularization", "With Regularization"),
                 ELBO = sapply(res, function(x) x$fl$elbo),
                 Discrete.Llik = sapply(res, function(x) x$p.vals$llik),
                 Elapsed.Time = sapply(res, function(x) x$elapsed.time))
knitr::kable(res_df, digits = 0)
Fit                          ELBO   Discrete.Llik   Elapsed.Time
Without Regularization   32325446       -12497124            740
With Regularization      32326656       -12496881           1190

Results: Gene-wise thresholding

The argument I made above can in fact be applied gene by gene. That is, I can impose a gene-wise threshold rather than a single threshold for all genes: \[ \text{Var}(Y_j) \ge \frac{1}{n} \sum_i \frac{X_{ij}}{s_i^2(\exp(Y_{ij}))^2} \]
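
Continuing the simulated sketch above, the only change relative to the single threshold is that the floor is applied elementwise rather than taking the minimum over genes first. (In the actual fits the bound is imposed inside flash; the pmax and the simulated variance estimates here are only to illustrate the distinction.)

var.estimates <- rgamma(n.genes, shape = 2, rate = 20)   # hypothetical estimates
floored.single   <- pmax(var.estimates, threshold)       # single threshold
floored.genewise <- pmax(var.estimates, per.gene.bound)  # gene-wise thresholds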

Indeed, I could go a bit further and estimate the sampling variance directly rather than calculating a rough lower bound: \[ \text{Var}(Y_j) = \mathbb{E}_i(\text{Var}(Y_{ij})) + \text{Var}_i(\mathbb{E} Y_{ij}) \approx \frac{1}{n} \sum_{i} \frac{\lambda_{ij}}{s_i(\lambda_{ij} + 1)^2} + \text{Var}_i \left( \log(\lambda_{ij} + 1) - \frac{\lambda_{ij}}{2s_i(\lambda_{ij} + 1)^2} \right) \] The problem, however, is that if the “true” rate \(\lambda_{ij}\) is the same for all \(i\) (and, for the sake of argument, let all \(s_i\) be identical), then \(\text{Var}_i(\mathbb{E} Y_{ij})\) should be zero. If, however, the plug-in estimators \(\hat{\lambda}_{ij} = X_{ij} / s_i\) are used, then \(\text{Var}_i(\mathbb{E} Y_{ij})\) will be estimated as positive and the residual variance estimate for gene \(j\) risks being too large. For this reason, I think that the best one can do is to use the rough lower bound and then let flash estimate the residual variance.
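
A toy example of this inflation, in the same simulated vein (a single gene with a constant rate \(\lambda\) across cells and all size factors equal to one; values are purely illustrative):

set.seed(1)
n <- 5000; s <- rep(1, n); lambda <- 0.5
X <- rpois(n, s * lambda)
Y <- log(X / s + 1)
samp.var.term <- X / (s^2 * exp(Y)^2)   # plug-in estimates of Var(Y_ij)
lower.bd <- mean(samp.var.term)         # the E_i(Var(Y_ij)) term
direct.est <- lower.bd + var(Y - 0.5 * samp.var.term)   # adds plug-in Var_i(E Y_ij)
c(true.sampling.var = var(Y), lower.bd = lower.bd, direct.est = direct.est)

Even though \(\lambda_{ij}\) is constant, the second term is estimated as positive, so the direct estimate overshoots the true sampling variance, while the lower bound stays below it.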

Note, however, that all but one flash estimate is already greater than this lower bound (for this reason, I won’t bother to re-fit):

# Per-cell terms X_ij / (s_i^2 exp(Y_ij)^2), with genes in rows:
tmp.mat <- t(t(orig.data[-pbmc$dropped.genes, ] / (exp(pbmc$data))^2) / pbmc$size.factors^2)
# Gene-wise lower bound: the mean over cells of the terms above.
var.lower.bd <- apply(tmp.mat, 1, mean)
# Direct estimate: the lower bound plus the variance of the plug-in means E(Y_ij).
var.est <- var.lower.bd + apply(pbmc$data - 0.5 * tmp.mat, 1, var)

var_df <- tibble(flash.estimate = res$reg$fl$residuals.sd^2,
                 var.lower.bd = var.lower.bd,
                 var.est = var.est)

ggplot(var_df, aes(x = var.lower.bd, y = flash.estimate)) + 
  geom_point(size = 1) +
  scale_x_log10() + 
  scale_y_log10() +
  geom_abline(slope = 1, linetype = "dashed")

In contrast, all of the flash estimates are less than the direct sampling variance estimates obtained using the plug-in estimators, which strongly suggests that this approach would be inappropriate:

ggplot(var_df, aes(x = var.est, y = flash.estimate)) + 
  geom_point(size = 1) +
  scale_x_log10() + 
  scale_y_log10() +
  geom_abline(slope = 1, linetype = "dashed")

Results: Factor comparisons

Factor plots are nearly identical with and without regularization. I like regularization on principle, but I’m not sure that it actually makes much of a difference.

set.seed(666)
plot.factors(res$unreg, pbmc$cell.type, kset = order(res$unreg$fl$pve, decreasing = TRUE),
             title = "Without regularization")

set.seed(666)
plot.factors(res$reg, pbmc$cell.type, kset = order(res$reg$fl$pve, decreasing = TRUE), 
             title = "With regularization")



sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] flashier_0.2.4  Matrix_1.2-15   forcats_0.4.0   stringr_1.4.0  
 [5] dplyr_0.8.0.1   purrr_0.3.2     readr_1.3.1     tidyr_0.8.3    
 [9] tibble_2.1.1    ggplot2_3.2.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1        lubridate_1.7.4   lattice_0.20-38  
 [4] assertthat_0.2.1  rprojroot_1.3-2   digest_0.6.18    
 [7] foreach_1.4.4     truncnorm_1.0-8   R6_2.4.0         
[10] cellranger_1.1.0  plyr_1.8.4        backports_1.1.3  
[13] evaluate_0.13     httr_1.4.0        highr_0.8        
[16] pillar_1.3.1      rlang_0.4.2       lazyeval_0.2.2   
[19] pscl_1.5.2        readxl_1.3.1      rstudioapi_0.10  
[22] ebnm_0.1-24       irlba_2.3.3       whisker_0.3-2    
[25] rmarkdown_1.12    labeling_0.3      munsell_0.5.0    
[28] mixsqp_0.3-17     broom_0.5.1       compiler_3.5.3   
[31] modelr_0.1.5      xfun_0.6          pkgconfig_2.0.2  
[34] SQUAREM_2017.10-1 htmltools_0.3.6   tidyselect_0.2.5 
[37] workflowr_1.2.0   codetools_0.2-16  crayon_1.3.4     
[40] withr_2.1.2       MASS_7.3-51.1     grid_3.5.3       
[43] nlme_3.1-137      jsonlite_1.6      gtable_0.3.0     
[46] git2r_0.25.2      magrittr_1.5      scales_1.0.0     
[49] cli_1.1.0         stringi_1.4.3     reshape2_1.4.3   
[52] fs_1.2.7          doParallel_1.0.14 xml2_1.2.0       
[55] generics_0.0.2    iterators_1.0.10  tools_3.5.3      
[58] glue_1.3.1        hms_0.4.2         parallel_3.5.3   
[61] yaml_2.2.0        colorspace_1.4-1  ashr_2.2-38      
[64] rvest_0.3.4       knitr_1.22        haven_2.1.1