Revisiting semi-NMF for the pancreas data

Last updated: 2025-01-14

Checks: 7 0

Knit directory: single-cell-jamboree/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(1)

The command set.seed(1) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 1797def

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 1797def. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Untracked files:
    Untracked:  data/GSE132188_adata.h5ad.h5
    Untracked:  data/Immune_ALL_human.h5ad
    Untracked:  data/pancreas_endocrine.RData
    Untracked:  data/pancreas_endocrine_alldays.h5ad

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/pancreas_snmf.Rmd) and HTML (docs/pancreas_snmf.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	1797def	Peter Carbonetto	2025-01-14	wflow_publish("pancreas_snmf.Rmd", view = FALSE, verbose = TRUE)
Rmd	28e9760	Peter Carbonetto	2025-01-14	Added more structure plots to the pancreas_snmf analysis.
html	af8ba37	Peter Carbonetto	2025-01-14	First build of the pancreas_snmf analysis.
Rmd	054ee83	Peter Carbonetto	2025-01-14	Added some structure plots to the pancreas_snmf analysis.
Rmd	0f2a8a6	Peter Carbonetto	2025-01-14	Reworking the script fit_pancreas_celseq2_snmf.R for fitting semi-NMFs to the pancrease celseq2 data.

In the more detailed analysis of the pancreas data, the semi-NMF results looked quite interesting and interpretable. As we show here, increasing the number of semi-NMF factors yields additional interesting structure.

First, load the packages needed for this analysis.

library(Matrix)
library(fastTopics)
library(ggplot2)
library(cowplot)

Set the seed for reproducibility.

set.seed(1)

CEL-Seq2 data

Let’s start with the “CEL-Seq2” data. First load the CEL-Seq2 pancreas data and the outputs generated by running the fit_pancreas_celseq2_snmf.R script.

load("../data/pancreas.RData")
load("../output/pancreas_celseq2_snmf.RData")
i           <- which(sample_info$tech == "celseq2")
sample_info <- sample_info[i,]
counts      <- counts[i,]
sample_info <- transform(sample_info,celltype = factor(celltype))
celltype <- sample_info$celltype
celltype <-
 factor(celltype,
        c("acinar","ductal","activated_stellate","quiescent_stellate",
          "endothelial","macrophage","mast","schwann","alpha","beta",
          "delta","gamma","epsilon"))

flashier

Here is the semi-NMF fit generated by flashier:

other_colors <- c("#66c2a5","#fc8d62","#8da0cb")
L <- fl_snmf_ldf$L
k <- ncol(L)
colnames(L) <- paste0("k",1:k)
celltype_factors  <- 2:9
other_factors <- c(1,10,11)
p1 <- structure_plot(L[,celltype_factors],grouping = celltype,
                     gap = 20,perplexity = 70,n = Inf) +
  labs(y = "membership",fill = "factor",color = "factor",
       title = "cell-type factors")
p2 <- structure_plot(L[,other_factors],grouping = celltype,
                     gap = 20,perplexity = 70,n = Inf) +
  scale_color_manual(values = other_colors) +
  scale_fill_manual(values = other_colors) +
  labs(y = "membership",fill = "factor",color = "factor",
       title = "other factors")
plot_grid(p1,p2,nrow = 2,ncol = 1)

As before, the semi-NMF captures cell types at different levels of specificity, but with \(K = 11\) factors it also identifies additional factors for acinar cells and for quiescent vs. activated stellate cells.
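The split into "cell-type factors" and "other factors" above was made by eye from the structure plots. As a rough complement, the sketch below summarizes each factor by its average loading within each cell type. It runs on simulated loadings, not the flashier output; the toy matrix `L`, the three toy cell types and the 0.8 threshold are all assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-ins (hypothetical): 60 cells, 3 cell types, 4 factors.
celltype <- factor(rep(c("acinar","ductal","alpha"),each = 20))
n <- length(celltype)
L <- matrix(0.05 * runif(n * 4),n,4)
colnames(L) <- paste0("k",1:4)

# k1 is shared by all cells; k2, k3, k4 each mark a single cell type.
L[,1] <- L[,1] + 1
L[celltype == "acinar",2] <- L[celltype == "acinar",2] + 1
L[celltype == "ductal",3] <- L[celltype == "ductal",3] + 1
L[celltype == "alpha",4]  <- L[celltype == "alpha",4] + 1

# Mean loading of each factor within each cell type
# (rows: cell types, columns: factors).
avg <- apply(L,2,function (x) tapply(x,celltype,mean))
print(round(avg,2))

# Flag a factor as cell-type specific when a single cell type accounts
# for most of its total average loading (0.8 is an arbitrary cutoff).
specific <- apply(avg,2,function (x) max(x)/sum(x) > 0.8)
print(specific)
```

In this toy example only `k1` is not flagged, since it is loaded equally on all cells. On real loadings a summary like this is a starting point for inspection rather than a replacement for looking at the plots.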

Covariance decomposition

Another approach to generating a semi-NMF is to decompose the cell-by-cell covariance matrix, which implicitly assumes that the log-fold changes (LFCs, the “gene signatures”) are orthogonal to each other. This is implemented by the gbcd package. (By default, gbcd also encourages the loadings to be binary through a prior, but here we used a point-exponential prior, which does not encourage this.)

For fair comparison, this decomposition was generated with the same number of factors.
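To see why decomposing the cell-by-cell matrix goes hand in hand with orthogonal gene signatures, here is a small numerical sketch. It is not the gbcd procedure: it uses simulated, noiseless data and the uncentered matrix \(XX^T\) as a stand-in for the covariance, and the names `L`, `Fmat` and `X` are made up for this example. If \(X = LF^T\) and \(F^TF = I\), then \(XX^T = LL^T\), so the loadings can be recovered from the cell-by-cell matrix alone.

```r
set.seed(1)
n <- 8    # cells
m <- 50   # genes
k <- 3    # factors

# Nonnegative loadings (cells x factors).
L <- matrix(rexp(n * k),n,k)

# Gene signatures with orthonormal columns (genes x factors),
# obtained from the QR decomposition of a random matrix.
Fmat <- qr.Q(qr(matrix(rnorm(m * k),m,k)))

# Noiseless data, X = L F'.
X <- tcrossprod(L,Fmat)

# With F'F = I, the cell-by-cell matrix X X' equals L L'.
err_orth <- max(abs(tcrossprod(X) - tcrossprod(L)))
print(err_orth)

# With signatures that are not orthonormal, the identity fails.
Fmat2 <- matrix(rnorm(m * k),m,k)
X2 <- tcrossprod(L,Fmat2)
err_general <- max(abs(tcrossprod(X2) - tcrossprod(L)))
print(err_general)
```

The first error is at the level of floating-point rounding, whereas the second is not, which is the sense in which the covariance-based approach relies on the orthogonality assumption.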

L <- fl_cd_ldf$L
k <- ncol(L)
colnames(L) <- paste0("k",1:k)
celltype_factors  <- c(2,3,5,6,4,9,11)
other_factors <- c(1,8,10)
p1 <- structure_plot(L[,celltype_factors],grouping = celltype,
                     gap = 20,perplexity = 70,n = Inf) +
  labs(y = "membership",fill = "factor",color = "factor",
       title = "cell-type factors")
p2 <- structure_plot(L[,other_factors],grouping = celltype,
                     gap = 20,perplexity = 70,n = Inf) +
  scale_color_manual(values = other_colors) +
  scale_fill_manual(values = other_colors) +
  labs(y = "membership",fill = "factor",color = "factor",
       title = "other factors")
plot_grid(p1,p2,nrow = 2,ncol = 1)

This covariance decomposition captures much of the same structure as the semi-NMF above (I tried to show this by using the same colors for the factors), but it also misses some interesting structure: for example, it has no separate factor for endothelial cells, and it does not distinguish alpha cells well from delta and gamma cells.
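The factors in the two decompositions were matched by hand when choosing the colors. One simple way to suggest such a matching is to correlate the loadings of the two fits, as in the sketch below. It uses simulated loadings only; `LA`, `LB` and the permutation are hypothetical, and this is not how the matching above was actually done.

```r
set.seed(1)
n <- 100

# Simulated loadings from "fit A" (3 factors).
LA <- matrix(rexp(n * 3),n,3)
colnames(LA) <- paste0("k",1:3)

# "Fit B": the same factors in a different order, plus a little noise.
perm <- c(3,1,2)
LB <- LA[,perm] + 0.05 * matrix(rexp(n * 3),n,3)
colnames(LB) <- paste0("k",1:3)

# Correlations between all pairs of factors
# (rows: fit A, columns: fit B).
R <- cor(LA,LB)
print(round(R,2))

# For each factor in fit A, the most correlated factor in fit B.
match_B <- apply(R,1,which.max)
print(match_B)
```

Taking the row-wise maximum does not guarantee a one-to-one matching, and a factor with no good counterpart (such as the endothelial factor here) will still be assigned a match, so the correlations themselves should be checked as well.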

Smart-seq2 data

Let’s redo the comparisons above on the Smart-seq2 data set.

Load the Smart-seq2 data and the outputs generated from running the fit_pancreas_smartseq2_snmf.R script.

load("../data/pancreas.RData")
load("../output/pancreas_smartseq2_snmf.RData")
i           <- which(sample_info$tech == "smartseq2")
sample_info <- sample_info[i,]
counts      <- counts[i,]
sample_info <- transform(sample_info,celltype = factor(celltype))
celltype <- sample_info$celltype
celltype <-
 factor(celltype,
        c("acinar","ductal","activated_stellate","quiescent_stellate",
          "endothelial","macrophage","mast","schwann","alpha",
          "beta","delta","gamma","epsilon"))

flashier

Here is the semi-NMF decomposition generated by flashier, with 11 factors:

L <- fl_snmf_ldf$L
k <- ncol(L)
colnames(L) <- paste0("k",1:k)
celltype_factors  <- c(3,4,7,9:11)
other_factors <- c(1,2,5,6,8)
p1 <- structure_plot(L[,celltype_factors],grouping = celltype,
                     gap = 20,perplexity = 70,n = Inf) +
  labs(y = "membership",fill = "factor",color = "factor",
       title = "cell-type factors")
p2 <- structure_plot(L[,other_factors],grouping = celltype,
                     gap = 20,perplexity = 70,n = Inf) +
  labs(y = "membership",fill = "factor",color = "factor",
       title = "other factors")
plot_grid(p1,p2,nrow = 2,ncol = 1)

Covariance decomposition

The covariance decomposition with the same number of factors again captures a lot of the same structure, the only obvious exception being the missing factor for the endothelial cells.

L <- fl_cd_ldf$L
k <- ncol(L)
colnames(L) <- paste0("k",1:k)
celltype_factors  <- c(2,4,5,8:11)
other_factors <- c(1,3,6,7)
p1 <- structure_plot(L[,celltype_factors],grouping = celltype,
                     gap = 20,perplexity = 70,n = Inf) +
  labs(y = "membership",fill = "factor",color = "factor",
       title = "cell-type factors")
p2 <- structure_plot(L[,other_factors],grouping = celltype,
                     gap = 20,perplexity = 70,n = Inf) +
  labs(y = "membership",fill = "factor",color = "factor",
       title = "other factors")
plot_grid(p1,p2,nrow = 2,ncol = 1)

sessionInfo()
# R version 4.3.3 (2024-02-29)
# Platform: aarch64-apple-darwin20 (64-bit)
# Running under: macOS Sonoma 14.7.1
# 
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
# LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# time zone: America/Chicago
# tzcode source: internal
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] cowplot_1.1.3      ggplot2_3.5.0      fastTopics_0.6-193 Matrix_1.6-5      
# 
# loaded via a namespace (and not attached):
#  [1] gtable_0.3.4        xfun_0.42           bslib_0.6.1        
#  [4] htmlwidgets_1.6.4   ggrepel_0.9.5       lattice_0.22-5     
#  [7] quadprog_1.5-8      vctrs_0.6.5         tools_4.3.3        
# [10] generics_0.1.3      parallel_4.3.3      tibble_3.2.1       
# [13] fansi_1.0.6         highr_0.10          pkgconfig_2.0.3    
# [16] data.table_1.15.2   SQUAREM_2021.1      RcppParallel_5.1.7 
# [19] lifecycle_1.0.4     truncnorm_1.0-9     farver_2.1.1       
# [22] compiler_4.3.3      stringr_1.5.1       git2r_0.33.0       
# [25] progress_1.2.3      munsell_0.5.0       RhpcBLASctl_0.23-42
# [28] httpuv_1.6.14       htmltools_0.5.7     sass_0.4.8         
# [31] yaml_2.3.8          lazyeval_0.2.2      plotly_4.10.4      
# [34] crayon_1.5.2        later_1.3.2         pillar_1.9.0       
# [37] jquerylib_0.1.4     whisker_0.4.1       tidyr_1.3.1        
# [40] uwot_0.2.2.9000     cachem_1.0.8        gtools_3.9.5       
# [43] tidyselect_1.2.1    digest_0.6.34       Rtsne_0.17         
# [46] stringi_1.8.3       dplyr_1.1.4         purrr_1.0.2        
# [49] ashr_2.2-66         labeling_0.4.3      rprojroot_2.0.4    
# [52] fastmap_1.1.1       grid_4.3.3          colorspace_2.1-0   
# [55] cli_3.6.2           invgamma_1.1        magrittr_2.0.3     
# [58] utf8_1.2.4          withr_3.0.0         prettyunits_1.2.0  
# [61] scales_1.3.0        promises_1.2.1      rmarkdown_2.26     
# [64] httr_1.4.7          workflowr_1.7.1     hms_1.1.3          
# [67] pbapply_1.7-2       evaluate_0.23       knitr_1.45         
# [70] viridisLite_0.4.2   irlba_2.3.5.1       rlang_1.1.3        
# [73] Rcpp_1.0.12         mixsqp_0.3-54       glue_1.7.0         
# [76] jsonlite_1.8.8      R6_2.5.1            fs_1.6.3

Peter Carbonetto
