Loading [MathJax]/jax/output/HTML-CSS/jax.js

Last updated: 2025-02-20

Checks: 7 0

Knit directory: single-cell-jamboree/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(1) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 9440e05. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Untracked files:
    Untracked:  data/GSE132188_adata.h5ad.h5
    Untracked:  data/Immune_ALL_human.h5ad
    Untracked:  data/pancreas_endocrine.RData
    Untracked:  data/pancreas_endocrine_alldays.h5ad

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/pancreas_annotate.Rmd) and HTML (docs/pancreas_annotate.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 9440e05 Peter Carbonetto 2025-02-20 workflowr::wflow_publish("pancreas_annotate.Rmd", verbose = TRUE,
Rmd 07d482e Peter Carbonetto 2025-02-19 Fixed a few words in the pancreas_annotate analysis.
Rmd b264152 Peter Carbonetto 2025-02-19 A bit of formatting adjustment to the pancreas_annotate rmarkdown.
Rmd 20ddb47 Peter Carbonetto 2025-02-19 Added another note.
Rmd 190f7c3 Peter Carbonetto 2025-02-19 Edited a note about interpretation to the pancreas_annotate analysis.
html 1f93686 Peter Carbonetto 2025-02-19 Added an ‘interpretable’ annotation plot to the pancreas_annotate analysis.
Rmd b36508b Peter Carbonetto 2025-02-19 wflow_publish("pancreas_annotate.Rmd", verbose = TRUE, view = FALSE)
Rmd a8d4576 Peter Carbonetto 2025-02-19 A few edits to the text of the pancreas_annotate analysis.
Rmd 9dd5b4a Peter Carbonetto 2025-02-19 Added annotation plots for flashier NMF result to pancreas_annotate analysis.
Rmd 335dd81 Peter Carbonetto 2025-02-19 Added structure plot to pancreas_annotate analysis.
html e771738 Peter Carbonetto 2025-02-18 First build of the pancreas_annotate analysis.
Rmd 371108b Peter Carbonetto 2025-02-18 workflowr::wflow_publish("pancreas_annotate.Rmd", verbose = TRUE)

Here we re-examine some of the matrix factorization results from the pancreas CEL-seq2 data, with the goal of understanding how best to annotate the pancreas factors. As we will see, there isn’t a single “one-size-fits-all” strategy that works best, and so we recommend exploring different annotation strategies. Also, careful interpretation of the matrix factorization is discussed.

The plotting functions used in this analysis are from fastTopics.

First, load the packages needed for this analysis.

library(Matrix)
library(flashier)
library(fastTopics)
library(ggplot2)
library(cowplot)

Set the seed for reproducibility.

set.seed(1)

Load the CEL-Seq2 pancreas data and the outputs generated by running the compute_pancreas_celseq2_factors.R script.

load("../data/pancreas.RData")
load("../output/pancreas_celseq2_factors.RData")
i           <- which(sample_info$tech == "celseq2")
sample_info <- sample_info[i,]
counts      <- counts[i,]
sample_info <- transform(sample_info,celltype = factor(celltype))

We will first focus on the non-negative matrix factorization (NMF) produced by flashier.

Structure plot

The Structure plot (also shown in the previous analysis) shows that many of the factors correspond closely to the cell-type assignments that were estimated in the published analysis:

celltype <- sample_info$celltype
celltype <-
 factor(celltype,
        c("acinar","ductal","activated_stellate","quiescent_stellate",
          "endothelial","macrophage","mast","schwann","alpha","beta",
          "delta","gamma","epsilon"))
L <- fl_nmf_ldf$L
colnames(L) <- paste0("k",1:9)
structure_plot(L[,-1],grouping = celltype,gap = 10,perplexity = 70,n = Inf) +
  labs(y = "membership",fill = "factor",color = "factor")

Version Author Date
1f93686 Peter Carbonetto 2025-02-19

Note that the first factor was omitted in the Structure plot because it is a “baseline” factor, and not particularly interesting to look at.

A note about interpretation: For visualization purposes, the columns of the L matrix—the “membership matrix”—were scaled so that the largest membership for a given factor (column) was always exactly 1. However, please note this normalization is arbitrary. Therefore, it is not meaningful to compare memberships across factors (i.e., colors in the Structure plot); it is only meaningful to compare memberships within a given factor (a single color in the Structure plot).

Annotating the factors by “driving genes”

To illustrate annotating the factors, let’s focus on factors 4, 5 and 6—these are the factors that largely capture the islet cells (alpha, beta, etc). Let’s consider two different selection strategies: (i) choosing genes j with the largest fjk; (ii) choosing genes j with the largest differences fjkfjk with other factors k (“distinctive genes”). These two selection strategies are implemented in the annotation_heatmap function:

F <- fl_nmf_ldf$F
colnames(F) <- paste0("k",1:9)
kset <- paste0("k",4:6)
p1 <- annotation_heatmap(F,n = 8,dims = kset,
                         select_features = "largest",
                         font_size = 9) +
  labs(title = "select_features = \"largest\"") +
  theme(plot.title = element_text(face = "plain",size = 9))
p2 <- annotation_heatmap(F,n = 8,dims = kset,
                         select_features = "distinctive",
                         compare_dims = kset,
                         font_size = 9) +
  labs(title = "select_features = \"distinctive\"") +
  theme(plot.title = element_text(face = "plain",size = 9))
plot_grid(p1,p2,nrow = 1,ncol = 2)

Version Author Date
1f93686 Peter Carbonetto 2025-02-19
# Features selected for plot: INS IAPP SCGN SLC30A8 ABCC8 G6PC2 NPTX2 HADH GCG CHGB TM4SF4 TTR SCG2 SCG5 ALDH1A1 PCSK2 SST RBP4 PCSK1 CPE PPY SEC11C ISL1 
c("INS", "IAPP", "SCGN", "SLC30A8", "ABCC8", "G6PC2", "NPTX2", 
"HADH", "GCG", "CHGB", "TM4SF4", "TTR", "SCG2", "SCG5", "ALDH1A1", 
"PCSK2", "SST", "RBP4", "PCSK1", "CPE", "PPY", "SEC11C", "ISL1"
)
# Features selected for plot: INS IAPP NPTX2 MAFA MEG3 ADCYAP1 PFKFB2 DLK1 GCG GC TTR TM4SF4 FAP LOXL4 ALDH1A1 CRYBA2 SST AQP3 PPY LEPR EGR1 RBP4 DPYSL3 AKAP12 
c("INS", "IAPP", "NPTX2", "MAFA", "MEG3", "ADCYAP1", "PFKFB2", 
"DLK1", "GCG", "GC", "TTR", "TM4SF4", "FAP", "LOXL4", "ALDH1A1", 
"CRYBA2", "SST", "AQP3", "PPY", "LEPR", "EGR1", "RBP4", "DPYSL3", 
"AKAP12")

Strategy (i) picks out some canonical marker genes for islet cells such as INS for beta cells and GCG for alpha cells. But it also picks out other genes that are highly expressed in multiple islet cell types, such as TTR and CHGB. Strategy (ii) focusses more strongly on genes that distinguish one cell type from another, and as a result marker genes such as MAFA (beta cells) and GC (alpha cells) are ranked more highly with this strategy.

The better strategy will depend on the setting and on the goals of the analysis, which is why the annotation_heatmap function provides both options. These selection strategies can also reveal complementary insights and so in many situations it may be better to use both.

A more interpretable annotation heatmap

Above we sounded a note of caution about interpreting elements of L across factors/columns. The same applies to the F matrix. To provide a more even footing, above we employed the simple heuristic of scaling the columns of F so that the maximum element in each column was 1. That was helpful for selecting “distinctive” gene, but made the effect sizes difficult to interpret. To produce more easily interpretable effect sizes, we recommend visualizing this F matrix (in this code, fl is a “flash” object, e.g., the return value from a call to flashier::flash()):

out <- ldf(fl)
F <- with(out,F %*% diag(D))

This is what this rescaled F matrix looks like for the pancreas data:

genes <- c("INS","IAPP","NPTX2","MAFA","MEG3","ADCYAP1","PFKFB2", 
           "DLK1","GCG","GC","TTR","TM4SF4","FAP","LOXL4","ALDH1A1", 
           "CRYBA2","SST","AQP3","PPY","LEPR","EGR1","RBP4","DPYSL3", 
           "AKAP12")
F <- with(fl_nmf_ldf,F %*% diag(D))
colnames(F) <- paste0("k",1:9)
annotation_heatmap(F,select_features = genes,font_size = 9)

Version Author Date
1f93686 Peter Carbonetto 2025-02-19

Visually, this plot looks quite similar to before, but now the effect sizes are on a different scale. With this rescaling, the effect sizes have the following interpretation:

fjk is (approximately) the log-fold change (LFC) of gene j in a hypothetical cell i with the largest possible membership to factor k (lik=1) relative to another hypothetical cell i with no membership to factor k (lik=0).

Also note that this same idea could have been also applied to the other annotation heatmap (using the first selection strategy)—we showed this with just the second annotation heatmap for illustration.

Annotating the semi-NMF

Previously, we found the semi-non-negative matrix factorization (semi-NMF) produced by flashier yields a more hierarchical decomposition of the cell types, capturing structure at different levels of cell-type-specifity:

L <- fl_snmf_ldf$L
colnames(L) <- paste0("k",1:9)
structure_plot(L[,-c(1,5)],grouping = celltype,gap = 10,
               perplexity = 70,n = Inf) +
  labs(y = "membership",fill = "factor",color = "factor")

As a result, we would expect that the factors themselves would tend to pick more “distinctive” features; for example, factor 8 capturing expression specific to dselta, gamma and epsilon cells doesn’t need to include the expression that is shared with alpha cells because factor 4 already captures this. Still, it is helpful to explore these two selection strategies for the islet cell factors (3, 4 and 8) in the annotation heatmap:

F <- fl_snmf_ldf$F
colnames(F) <- paste0("k",1:9)
kset <- paste0("k",c(3,4,8))
F <- F[,c(3,4,8,1,2,5,6,7,9)]
p1 <- annotation_heatmap(F,n = 8,dims = kset,
                         select_features = "largest",
                         feature_sign = "positive",
                         font_size = 9) +
  labs(title = "select_features = \"largest\"") +
  theme(plot.title = element_text(face = "plain",size = 9))
p2 <- annotation_heatmap(F,n = 8,dims = kset,
                         select_features = "distinctive",
                         compare_dims = kset,
                         feature_sign = "positive",
                         font_size = 9) +
  labs(title = "select_features = \"distinctive\"") +
  theme(plot.title = element_text(face = "plain",size = 9))
plot_grid(p1,p2,nrow = 1,ncol = 2)

# Features selected for plot: GCG TTR TM4SF4 GC CHGB PCSK2 MALAT1 IGFBP7 INS IAPP HADH NPTX2 MAFA RBP4 PCSK1 SCD5 SST AQP3 PPY LEPR DPYSL3 AKAP12 
c("GCG", "TTR", "TM4SF4", "GC", "CHGB", "PCSK2", "MALAT1", "IGFBP7", 
"INS", "IAPP", "HADH", "NPTX2", "MAFA", "RBP4", "PCSK1", "SCD5", 
"SST", "AQP3", "PPY", "LEPR", "DPYSL3", "AKAP12")
# Features selected for plot: GCG GC TTR TM4SF4 FAP LOXL4 IRX2 ALDH1A1 INS IAPP MAFA NPTX2 ADCYAP1 PFKFB2 MEG3 DLK1 SST AQP3 LEPR AKAP12 MTUS1 EGR1 PPY S100A6 
c("GCG", "GC", "TTR", "TM4SF4", "FAP", "LOXL4", "IRX2", "ALDH1A1", 
"INS", "IAPP", "MAFA", "NPTX2", "ADCYAP1", "PFKFB2", "MEG3", 
"DLK1", "SST", "AQP3", "LEPR", "AKAP12", "MTUS1", "EGR1", "PPY", 
"S100A6")

sessionInfo()
# R version 4.3.3 (2024-02-29)
# Platform: aarch64-apple-darwin20 (64-bit)
# Running under: macOS Sonoma 14.7.1
# 
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
# LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# time zone: America/Chicago
# tzcode source: internal
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] cowplot_1.1.3     ggplot2_3.5.0     fastTopics_0.7-24 flashier_1.0.55  
# [5] ebnm_1.1-34       Matrix_1.6-5     
# 
# loaded via a namespace (and not attached):
#  [1] tidyselect_1.2.1     viridisLite_0.4.2    farver_2.1.1        
#  [4] dplyr_1.1.4          fastmap_1.1.1        lazyeval_0.2.2      
#  [7] promises_1.2.1       digest_0.6.34        lifecycle_1.0.4     
# [10] invgamma_1.1         magrittr_2.0.3       compiler_4.3.3      
# [13] rlang_1.1.3          sass_0.4.8           progress_1.2.3      
# [16] tools_4.3.3          utf8_1.2.4           yaml_2.3.8          
# [19] data.table_1.15.2    knitr_1.45           labeling_0.4.3      
# [22] prettyunits_1.2.0    htmlwidgets_1.6.4    scatterplot3d_0.3-44
# [25] plyr_1.8.9           RColorBrewer_1.1-3   Rtsne_0.17          
# [28] workflowr_1.7.1      withr_3.0.0          purrr_1.0.2         
# [31] grid_4.3.3           fansi_1.0.6          git2r_0.33.0        
# [34] colorspace_2.1-0     scales_1.3.0         gtools_3.9.5        
# [37] cli_3.6.2            rmarkdown_2.26       crayon_1.5.2        
# [40] generics_0.1.3       RcppParallel_5.1.7   httr_1.4.7          
# [43] reshape2_1.4.4       pbapply_1.7-2        cachem_1.0.8        
# [46] stringr_1.5.1        splines_4.3.3        parallel_4.3.3      
# [49] softImpute_1.4-1     vctrs_0.6.5          jsonlite_1.8.8      
# [52] hms_1.1.3            mixsqp_0.3-54        ggrepel_0.9.5       
# [55] irlba_2.3.5.1        horseshoe_0.2.0      trust_0.1-8         
# [58] plotly_4.10.4        jquerylib_0.1.4      tidyr_1.3.1         
# [61] glue_1.7.0           uwot_0.2.2.9000      stringi_1.8.3       
# [64] Polychrome_1.5.1     gtable_0.3.4         later_1.3.2         
# [67] quadprog_1.5-8       munsell_0.5.0        tibble_3.2.1        
# [70] pillar_1.9.0         htmltools_0.5.7      truncnorm_1.0-9     
# [73] R6_2.5.1             rprojroot_2.0.4      evaluate_0.23       
# [76] lattice_0.22-5       highr_0.10           RhpcBLASctl_0.23-42 
# [79] SQUAREM_2021.1       ashr_2.2-66          httpuv_1.6.14       
# [82] bslib_0.6.1          Rcpp_1.0.12          deconvolveR_1.2-1   
# [85] whisker_0.4.1        xfun_0.42            fs_1.6.3            
# [88] pkgconfig_2.0.3