Variance Structures

Introduction

Gene-wise variance and sparsity

Cell-wise variance and library size

Kronecker variance and pre-scaling

Results: ELBO

Results: Distribution of p-values

Constant variance structure
Gene-wise variance structure
Kronecker variance structure

Results: Log likelihood of implied distribution

Results: Factors and cell types

Ionocytes factor
Tuft cells factors
Goblet cells factors
Hillock factor

Last updated: 2019-09-03

Checks: 6 0

Knit directory: scFLASH/

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20181103)

The command set.seed(20181103) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Repository version: 723ee88

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/droplet.rds
    Ignored:    output/backfit/
    Ignored:    output/prior_type/
    Ignored:    output/size_factors/
    Ignored:    output/var_type/

Untracked files:
    Untracked:  analysis/NBapprox.Rmd
    Untracked:  analysis/pseudocount_redux.Rmd
    Untracked:  analysis/trachea4.Rmd
    Untracked:  code/missing_data.R
    Untracked:  code/pseudocount/
    Untracked:  code/pseudocounts.R
    Untracked:  code/trachea4.R
    Untracked:  data/Ensembl2Reactome.txt
    Untracked:  data/hard_bimodal1.txt
    Untracked:  data/hard_bimodal2.txt
    Untracked:  data/hard_bimodal3.txt
    Untracked:  data/mus_pathways.rds
    Untracked:  docs/figure/pseudocount2.Rmd/
    Untracked:  docs/figure/pseudocount_redux.Rmd/
    Untracked:  output/pseudocount/

Unstaged changes:
    Modified:   analysis/index.Rmd
    Modified:   analysis/pseudocount.Rmd
    Modified:   code/sc_comparisons.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	723ee88	Jason Willwerscheid	2019-09-03	wflow_publish(“analysis/var_type.Rmd”)
html	851baf8	Jason Willwerscheid	2019-09-01	Build site.
Rmd	fc78232	Jason Willwerscheid	2019-09-01	wflow_publish(c(“analysis/backfit.Rmd”, “analysis/var_type.Rmd”))
html	e2d99bb	Jason Willwerscheid	2019-08-31	Build site.
Rmd	f608bb0	Jason Willwerscheid	2019-08-31	wflow_publish(“analysis/var_type.Rmd”)
html	26d29a3	Jason Willwerscheid	2019-08-16	Build site.
Rmd	bd61d1a	Jason Willwerscheid	2019-08-16	wflow_publish(“analysis/var_type.Rmd”)

Introduction

Recall that EBMF fits the model $Y_{ij} = \sum_k L_{ik}F_{jk} + E_{ij},$ where $E_{ij} \sim N(0, \sigma^2_{ij})$ By assuming that the matrix $\Sigma$ has a given structure, one can estimate the $\sigma_{ij}^2$ s via maximum likelihood. In flashier, one can assume:

A constant variance structure: $\sigma_{ij}^2 = \sigma^2$ .
A gene-wise variance structure: $\sigma_{ij}^2 = \sigma_i^2$ .
A cell-wise variance structure: $\sigma_{ij}^2 = \sigma_j^2$ .
A “Kronecker” (rank-one) variance structure: $\sigma_{ij}^2 = \sigma_{(1)i}^2 \sigma_{(2)j}^2$ .

The question, then, is how much the “errors” in the transformed data matrix $Y$ vary from gene to gene and from cell to cell, and whether accounting for this heteroskedasticity makes any difference to the quality of the factors obtained via flashier.

In this analysis, I’ll argue that accounting for differences in variability among both genes and cells is crucial, but I’ll advocate for a quick approximation to a Kronecker variance structure (which I’ll describe shortly) rather than a full Kronecker fit.

All fits were produced by adding 20 “greedy” factors to the Montoro et al. droplet dataset. The code can be viewed here.

source("./code/utils.R")
droplet <- readRDS("./data/droplet.rds")
droplet <- preprocess.droplet(droplet)
res <- readRDS("./output/var_type/vartype_fits.rds")

Gene-wise variance and sparsity

I calculate the “sparsity” of a gene as the proportion of cells that have a nonzero number of transcripts for that gene. (I prefer this measure to, say, mean expression because it doesn’t depend on how size factors are calculated.)

There is a very clear relationship between sparsity and gene-wise precision: in general, sparser genes yield more precise estimates. This is probably not surprising. If a gene is very sparse, then most estimates will be close to zero with high precision.

The reversal of this trend for the least sparse genes is more surprising. I think that this is at least partially an effect of library-size normalization: since library size is strongly correlated with the expression levels of the most highly expressed genes, the variability of the least sparse genes can be (artificially) reduced by scaling. I will investigate the effects of using size factors to normalize counts in a subsequent analysis.

gene.df <- data.frame(sparsity = droplet$gene.sparsity,
                      genewise.sd = res$bygene$fl$residuals.sd,
                      kronecker.sd = res$kron$fl$residuals.sd[[1]],
                      approx.sd = res$scale.both$gene.prescaling.factors)
ggplot(gene.df, aes(x = sparsity, y = genewise.sd)) +
  geom_point() +
  labs(y = "estimated sd")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

Cell-wise variance and library size

Again the relationship is clear: a larger library size yields higher precision. This is to be expected, since a larger library size essentially means that we have more and better information about a cell.

cell.df <- data.frame(library.size = droplet$lib.size,
                      cellwise.sd = res$bycell$fl$residuals.sd,
                      kronecker.sd = res$kron$fl$residuals.sd[[2]],
                      approx.sd = res$scale.both$cell.prescaling.factors)
ggplot(cell.df, aes(x = library.size, y = cellwise.sd)) +
  geom_point() +
  labs(x = "library size", y = "estimated sd")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

Kronecker variance and pre-scaling

A Kronecker variance structure allows us to account for both gene-wise and cell-wise variance, but it has the disadvantage of being much slower to fit.

To get most of the benefits of the Kronecker variance structure without sacrificing too much in performance, notice that the Kronecker model $Y_{ij} = \sum_k L_{ik}F_{jk} + E_{ij},\ E_{ij} \sim N(0, \sigma_i^{2(1)} \sigma_j^{2(2)})$ is nearly equivalent to the model $Y_{ij} / \sigma_j^{(2)} = \sum_k L_{ik}\frac{F_{jk}}{\sigma_j^{(2)}} + E_{ij},\ E_{ij} \sim N(0, \sigma_i^{2(1)}),$ so that, if one has cell-wise variance estimates, one can simply scale the columns in advance and then fit a gene-wise variance structure.

The difference is that the meaning of the prior on $F$ (which I have ignored until now) changes. Instead of fitting empirical Bayes priors $F_{jk} \sim g_k^{(f)},$ the pre-scaling approach fits the “p-value prior” $\frac{F_{jk}}{\sigma_j^{(2)}} \sim g_k^{(f)}.$ Similarly, by scaling the rows in advance, it is possible to fit $\frac{L_{ik}}{\sigma_i^{(1)}} \sim g_k^{(\ell)}.$ These observations suggest three possible ways of proceeding: scaling the columns in advance and fitting gene-wise variance estimates, scaling the rows and fitting cell-wise estimates, and scaling both and fitting gene-wise estimates. (One could also scale both and fit cell-wise estimates, but I’m convinced that getting variance estimates right for genes is more important than for cells: see below.) All approaches have the important advantage of being able to account for heteroskedasticity among both genes and cells, but they require nearly as little time to fit as the simpler cell- and gene-wise approaches (some additional time is needed to obtain the initial scaling factors):

t.df <- data.frame(elapsed.time = sapply(res, `[[`, "elapsed.time"))
t.df$fit <- names(res)
the.limits <- c("const", "bygene", "bycell", 
                "scale.genes", "scale.cells", "scale.both", "kron")
ggplot(t.df, aes(x = fit, y = elapsed.time)) + 
  geom_bar(stat = "identity") +
  scale_x_discrete(limits = the.limits) +
  scale_y_log10() +
  labs(x = NULL, y = "elapsed time (s)")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

In this analysis, I fit a rank-one Kronecker model to get initial gene- and cell-wise variance estimates. Indeed, these estimates track those obtained using a full Kronecker fit fairly closely (note that the estimates can all be scaled by a constant without changing the model, so I’m looking for a linear relationship here rather than the relationship $y = x$ ).

ggplot(gene.df, aes(x = kronecker.sd, y = approx.sd)) +
  geom_point() +
  labs(x = "gene sd (full Kronecker)", y = "gene sd (rank-one Kronecker)")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

ggplot(cell.df, aes(x = kronecker.sd, y = approx.sd)) +
  geom_point() +
  labs(x = "cell sd (full Kronecker)", y = "cell sd (rank-one Kronecker)")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

Results: ELBO

A comparison of ELBOs suggests that accounting for heteroskedasticity among genes is more important than accounting for differences in variability among cells.

n.genes <- nrow(droplet$data)
n.cells <- ncol(droplet$data)
elbo <- sapply(res, function(x) x$fl$elbo + x$elbo.adj)
addl.par <- c(n.cells - 1, n.genes - 1, n.genes, n.genes, n.cells, n.cells)
nested.df <- data.frame(comparison = c("constant -> cell-wise",
                                       "constant -> gene-wise",
                                       "cell-wise -> kronecker",
                                       "cell-wise -> pre-scale genes",
                                       "gene-wise -> kronecker",
                                       "gene-wise -> pre-scale cells"),
                        addl.par = addl.par,
                        elbo.per.par = (round(c(elbo["bycell"] - elbo["const"],
                                                elbo["bygene"] - elbo["const"],
                                                elbo["kron"] - elbo["bycell"],
                                                elbo["scale.genes"] - elbo["bycell"],
                                                elbo["kron"] - elbo["bygene"],
                                                elbo["scale.cells"] - elbo["bygene"])
                                              / addl.par)))
col.names =  c("Nested models", 
               "Number of additional parameters", 
               "Improvement in ELBO per parameter")
knitr::kable(nested.df[1:2, ], col.names = col.names)

Nested models	Number of additional parameters	Improvement in ELBO per parameter
constant -> cell-wise	7190	893
constant -> gene-wise	14480	3644

A comparison of the Kronecker fit and pre-scaling approaches (being careful to adjust the ELBO of the latter to account for the scaling of the data) confirms that the quality of the two is similar.

knitr::kable(nested.df[3:6, ], col.names = col.names, row.names = FALSE)

Nested models	Number of additional parameters	Improvement in ELBO per parameter
cell-wise -> kronecker	14481	3814
cell-wise -> pre-scale genes	14481	3789
gene-wise -> kronecker	7191	1236
gene-wise -> pre-scale cells	7191	1227

In fact, pre-scaling both genes and cells yields the best ELBO among all fits.

elbo.df <- data.frame(fit = names(elbo), elbo = elbo)
ggplot(subset(elbo.df, elbo > 40000000), aes(x = fit, y = elbo)) + 
  geom_point() +
  scale_x_discrete(limits = c("scale.genes", "scale.cells", "scale.both", "kron")) +
  labs(x = NULL, y = "ELBO")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

Results: Distribution of p-values

A comparison of implied $p$ -value distributions (as described here) offers further evidence for the superiority of the gene-wise fit relative to the constant one. The fits that account for heteroskedasticity among both genes and cells (which I will refer to as “Kronecker-like” fits) show further improvement, at least in terms of the KL-divergence of the distribution of $p$ -values relative to a uniform distribution.

KL <- sapply(lapply(res, `[[`, "p.vals"), `[[`, "KL.divergence")
KL.df <- data.frame(fit = names(KL), KL = KL)
ggplot(KL.df, aes(x = fit, y = KL)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(limits = the.limits) +
  labs(x = NULL, y = "KL divergence relative to uniform")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

Constant variance structure

plot.p.vals(res$const$p.vals)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

Gene-wise variance structure

plot.p.vals(res$bygene$p.vals)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

Kronecker variance structure

I omit the other Kronecker-like fits here; to the eye, they are indistinguishable.

plot.p.vals(res$kron$p.vals)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

Results: Log likelihood of implied distribution

Since the change-of-variables ELBO adjustments are not entirely reliable, it’s useful to compare the log likelihoods of the implied discrete distributions as well.

From this perspective, the Kronecker fit does best, followed by the Kronecker-like fit that pre-scales cells. I’m inclined to believe these statistics more than the ELBOs. In effect, since sparse genes have very small residual standard deviations, the $p$ -value prior on genes implies that nonzero loadings are expected to be much larger for sparse genes than for genes that are highly expressed. The $p$ -value prior on cells doesn’t make much sense either, but the range of residual standard deviations for cells is much smaller, so there’s less of a difference between the $p$ -value prior and the usual prior.

On the other hand, the $p$ -value prior on genes has the welcome effect of inducing less shrinkage among sparser genes than among highly expressed genes. When I’m looking at gene loadings for a particular factor, I’m typically more interested in genes that are uniquely highly expressed in that factor rather than genes that are highly expressed across all factors. Since the former tend to be on the sparser side, the $p$ -value prior can help to make these factor-specific genes more visible in a top- $n$ loadings plot. Some examples are below.

llik <- sapply(lapply(res, `[[`, "p.vals"), `[[`, "llik")
llik.df <- data.frame(fit = names(llik), llik = llik)
ggplot(llik.df, aes(x = fit, y = llik)) + 
  geom_point() +
  scale_x_discrete(limits = c("const", "bycell", "bygene", "kron", 
                              "scale.genes", "scale.cells", "scale.both")) +
  labs(x = NULL, y = "Log likelihood of implied distribution")

Results: Factors and cell types

Here I plot factors in decreasing order of proportion of variance explained.

The constant variance structure is by far the least successful in producing sparse factors that can easily discriminate among cell types. The differences among the other fits are intriguing, but it’s not clear why one rather than another might be preferred. I examine a few individual factors in detail below.

plot.factors(res$const$fl, 
             droplet$cell.type, 
             title = "Constant variance structure")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.factors(res$bygene$fl, 
             droplet$cell.type, 
             title = "Gene-wise variance structure")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.factors(res$kron$fl,
             droplet$cell.type,
             title = "Kronecker variance structure")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.factors(res$scale.both$fl,
             droplet$cell.type,
             title = "Pre-scaling both genes and cells")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

Ionocytes factor

In all fits except for the one that assumes a constant variance structure, there is a factor that clearly picks out ionocytes. (In the above plots of the gene-wise, Kronecker, and pre-scaled fits, it’s factor 18, 19, and 17: look for the big blue dots).

The pre-scaling approach yields the sparsest factor with respect to gene loadings. It’s not as sparse with respect to cell loadings, but the above plot makes clear that the ionocyte-specific effects are entangled with goblet-specific effects. Backfitting or a semi-nonnegative factorization might be able to disentangle the two sets of effects.

fls <- list(res$bygene$fl, res$kron$fl, res$scale.both$fl)
ion.k <- mapply(get.orig.k, fls, c(18, 19, 17))
ion.signif.genes <- mapply(function(fl, k) sum(fl$loadings.lfsr[[1]][, k] < .01), 
                           fls, ion.k)
ion.signif.cells <- mapply(function(fl, k) sum(fl$loadings.lfsr[[2]][, k] < .01), 
                           fls, ion.k)

ion.df <- data.frame(fit = c("Gene-wise", "Kronecker", "Pre-scaled"),
                     signif.genes = ion.signif.genes,
                     signif.cells = ion.signif.cells)
knitr::kable(ion.df, 
             col.names = c("Fit", "Number of significant genes (LFSR < 0.01)",
                           "Number of significant cells (LFSR < 0.01)"))

Fit	Number of significant genes (LFSR < 0.01)	Number of significant cells (LFSR < 0.01)
Gene-wise	501	80
Kronecker	203	28
Pre-scaled	168	73

Below, I plot the top 100 loadings for each fit, labelling genes that Montoro et al. identify as important for ionocyte function. The pre-scaling approach yields especially nice results here, with Cftr and Atp6v0d2 given especially prominent place with respect to both rank and absolute value. Atp6v1c2 is de-emphasized, but Figure 5 in Montoro et al. suggests that the gene is not particularly specifically expressed in ionocytes anyway.

plot.one.factor(res$bygene$fl, ion.k[1], droplet$ionocyte.genes, 
                title = "Gene-wise variance structure")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.one.factor(res$kron$fl, ion.k[2], droplet$ionocyte.genes, 
                title = "Kronecker variance structure")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.one.factor(res$scale.both$fl, ion.k[3], droplet$ionocyte.genes, 
                title = "Pre-scaling approach", invert = TRUE)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

Tuft cells factors

In each of the above factor plots, factor 6 is primarily loaded on tuft cells, and to a lesser extent on neuroendocrine cells. Again, the factor given by the pre-scaling approach does the best job of picking out genes that Montoro et al. claim are important to that cell type’s function.

top.n <- 200
plot.one.factor(res$bygene$fl, get.orig.k(res$bygene$fl, 6), droplet$tuft.genes, 
                title = "Gene-wise variance structure", top.n = top.n)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.one.factor(res$kron$fl, get.orig.k(res$kron$fl, 6), droplet$tuft.genes, 
                title = "Kronecker variance structure", top.n = top.n)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.one.factor(res$scale.both$fl, get.orig.k(res$scale.both$fl, 6), droplet$tuft.genes, 
                title = "Pre-scaling approach", top.n = top.n)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

The gene-wise fit produces a second factor that might serve to differentiate among tuft cells (factor 19), which is particularly intriguing given Montoro et al.’s claim that there exist two distinct subpopulations of tuft cells. Indeed, nearly all genes that are identified as more specific to tuft-1 cells are negatively loaded, while all genes identified as specific to tuft-2 cells are positively loaded:

plot.factor.subpops(res$bygene$fl, get.orig.k(res$bygene$fl, 19),
                    droplet$tuft1.genes, droplet$tuft2.genes, "tuft-1", "tuft-2",
                    title = "Gene-wise variance structure, Factor 19")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

This is an appealing factor to have, especially since Montoro et al. only discover the tuft-1 vs. tuft-2 distinction when investigating the larger “pulse-seq” dataset. In effect, it would allow us to label tuft-1 and tuft-2 cells with some confidence even though those labels aren’t provided for the droplet-based dataset. While it’s disappointing that the factor doesn’t appear in the Kronecker-like fits shown here, it does eventually get added if greedy.Kmax is increased (when scaling both genes and cells, 29 factors are required).

One final thing to note is that with both constant and Kronecker variance structures, factor 4 identifies an axis with tuft cells at one extreme and club cells at the other. I haven’t been able to figure out what this factor is indexing.

Goblet cells factors

Here there are large differences among fits. The gene-wise fit yields a single factor that cleanly picks out goblet cells (Factor 17); the Kronecker fit gives an additional factor that indexes goblet-1 and goblet-2 subpopulations (Factor 20); and the pre-scaling approach yields multiple factors that do a relatively poor job of separating out goblet cells (Factors 11-13).

The pre-scaling approach clearly does worst here, although one hopes that backfitting would be able to clean up the results. Factor 20 from the Kronecker fit plays a similar role with respect to goblet cells as does Factor 19 from the gene-wise fit with respect to tuft cells, and is a very desirable factor to have. (A similar factor gets added to the pre-scaling fit if greedy.Kmax is increased to 21.)

plot.one.factor(res$bygene$fl, get.orig.k(res$bygene$fl, 17), droplet$goblet.genes,
                title = "Gene-wise variance structure", invert = TRUE)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.one.factor(res$kron$fl, get.orig.k(res$kron$fl, 14), droplet$goblet.genes,
                title = "Kronecker variance structure", invert = TRUE)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.factor.subpops(res$kron$fl, get.orig.k(res$kron$fl, 20),
                    droplet$goblet1.genes, droplet$goblet2.genes, "goblet-1", "goblet-2",
                    title = "Kronecker variance structure, Factor 20")

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.one.factor(res$scale.both$fl, get.orig.k(res$scale.both$fl, 11), droplet$goblet.genes,
                title = "Pre-scaling approach, Factor 11", invert = TRUE)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.one.factor(res$scale.both$fl, get.orig.k(res$scale.both$fl, 12), droplet$goblet.genes,
                title = "Pre-scaling approach, Factor 12", invert = FALSE)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.one.factor(res$scale.both$fl, get.orig.k(res$scale.both$fl, 13), droplet$goblet.genes,
                title = "Pre-scaling approach, Factor 13", invert = TRUE)

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

Hillock factor

Another feature that is identified in the pulse-seq dataset but not the droplet-based dataset is a subpopulation of transitional cells along the basal-to-club trajectory organized in “hillocks.”

To identify the most likely candidate for a hillock factor, I choose the factor with the largest mean loading among genes that Montoro et al. identify as particularly highly expressed in hillock cells. In each case, this turns out be Factor 5, with hillock cells negatively loaded in all cases but the gene-wise one. In the factor plots, one notes a large mass of basal and club cells centered around zero, which gently tails off into a smaller population of club cells.

There’s not a lot of daylight between fits here. All of them match up pretty well with the description given in Montoro et al.

hillock.k <- sapply(fls, function(fl) { 
  which.max(abs(colMeans(fl$loadings.pm[[1]][droplet$hillock.genes, ])))
})
invert <- mapply(function(fl, k) mean(fl$loadings.pm[[1]][droplet$hillock.genes, k]) < 0,
                 fls, hillock.k)

top.n <- 50
plot.one.factor(res$bygene$fl, hillock.k[1], droplet$hillock.genes, 
                title = "Gene-wise variance structure", top.n = top.n, invert = invert[1])

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.one.factor(res$kron$fl, hillock.k[2], droplet$hillock.genes, 
                title = "Kronecker variance structure", top.n = top.n, invert = invert[2])

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

plot.one.factor(res$scale.both$fl, hillock.k[3], droplet$hillock.genes, 
                title = "Pre-scaling approach", top.n = top.n, invert = invert[3])

Version	Author	Date
e2d99bb	Jason Willwerscheid	2019-08-31
26d29a3	Jason Willwerscheid	2019-08-16

sessionInfo()

#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] flashier_0.1.15 ggplot2_3.2.0   Matrix_1.2-15  
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.1        plyr_1.8.4        highr_0.8        
#>  [4] compiler_3.5.3    pillar_1.3.1      git2r_0.25.2     
#>  [7] workflowr_1.2.0   iterators_1.0.10  tools_3.5.3      
#> [10] digest_0.6.18     evaluate_0.13     tibble_2.1.1     
#> [13] gtable_0.3.0      lattice_0.20-38   pkgconfig_2.0.2  
#> [16] rlang_0.3.1       foreach_1.4.4     parallel_3.5.3   
#> [19] yaml_2.2.0        ebnm_0.1-24       xfun_0.6         
#> [22] withr_2.1.2       stringr_1.4.0     dplyr_0.8.0.1    
#> [25] knitr_1.22        fs_1.2.7          rprojroot_1.3-2  
#> [28] grid_3.5.3        tidyselect_0.2.5  glue_1.3.1       
#> [31] R6_2.4.0          rmarkdown_1.12    mixsqp_0.1-119   
#> [34] reshape2_1.4.3    ashr_2.2-38       purrr_0.3.2      
#> [37] magrittr_1.5      whisker_0.3-2     MASS_7.3-51.1    
#> [40] codetools_0.2-16  backports_1.1.3   scales_1.0.0     
#> [43] htmltools_0.3.6   assertthat_0.2.1  colorspace_1.4-1 
#> [46] labeling_0.3      stringi_1.4.3     pscl_1.5.2       
#> [49] doParallel_1.0.14 lazyeval_0.2.2    munsell_0.5.0    
#> [52] truncnorm_1.0-8   SQUAREM_2017.10-1 crayon_1.3.4