Annotation of pancreas factors

Last updated: 2025-02-19

Checks: 7 0

Knit directory: single-cell-jamboree/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(1)

The command set.seed(1) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: b36508b

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version b36508b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Untracked files:
    Untracked:  data/GSE132188_adata.h5ad.h5
    Untracked:  data/Immune_ALL_human.h5ad
    Untracked:  data/pancreas_endocrine.RData
    Untracked:  data/pancreas_endocrine_alldays.h5ad

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/pancreas_annotate.Rmd) and HTML (docs/pancreas_annotate.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	b36508b	Peter Carbonetto	2025-02-19	wflow_publish("pancreas_annotate.Rmd", verbose = TRUE, view = FALSE)
Rmd	a8d4576	Peter Carbonetto	2025-02-19	A few edits to the text of the pancreas_annotate analysis.
Rmd	9dd5b4a	Peter Carbonetto	2025-02-19	Added annotation plots for flashier NMF result to pancreas_annotate analysis.
Rmd	335dd81	Peter Carbonetto	2025-02-19	Added structure plot to pancreas_annotate analysis.
html	e771738	Peter Carbonetto	2025-02-18	First build of the pancreas_annotate analysis.
Rmd	371108b	Peter Carbonetto	2025-02-18	workflowr::wflow_publish("pancreas_annotate.Rmd", verbose = TRUE)

Here we re-examine some of the matrix factorization results from the pancreas CEL-seq2 data, with the goal of understanding how best to annotate the pancreas factors. As we will see, there isn’t a single “one-size-fits-all” strategy that works best, and so we recommend exploring different annotation strategies. Also, careful interpretation of the matrix factorization is discussed.

The plotting functions used in this analysis are from fastTopics.

First, load the packages needed for this analysis.

library(Matrix)
library(flashier)
library(fastTopics)
library(ggplot2)
library(cowplot)

Set the seed for reproducibility.

set.seed(1)

Load the CEL-Seq2 pancreas data and the outputs generated by running the compute_pancreas_celseq2_factors.R script.

load("../data/pancreas.RData")
load("../output/pancreas_celseq2_factors.RData")
i           <- which(sample_info$tech == "celseq2")
sample_info <- sample_info[i,]
counts      <- counts[i,]
sample_info <- transform(sample_info,celltype = factor(celltype))

We will first focus on the non-negative matrix factorization (NMF) produced by flashier.

Structure plot

The Structure plot (also shown in the previous analysis) shows that many of the factors correspond closely to the cell-type assignments that were estimated in the published analysis:

celltype <- sample_info$celltype
celltype <-
 factor(celltype,
        c("acinar","ductal","activated_stellate","quiescent_stellate",
          "endothelial","macrophage","mast","schwann","alpha","beta",
          "delta","gamma","epsilon"))
L <- fl_nmf_ldf$L
colnames(L) <- paste0("k",1:9)
structure_plot(L[,-1],grouping = celltype,gap = 10,perplexity = 70,n = Inf) +
  labs(y = "membership",fill = "factor",color = "factor")

Note that the first factor was omitted in the Structure plot because it is a “baseline” factor, and not particularly interesting to look at.

A note about interpretation: For visualization purposes, the columns of the L matrix—the “membership matrix”—were scaled so that the largest membership for a given factor (column) was always exactly 1. However, please note this normalization is arbitrary. Therefore, it is not meaningful to compare memberships across factors (i.e., colors in the Structure plot); it is only meaningful to compare memberships within a given factor (a single color in the Structure plot).

Annotating the factors by “driving genes”

To illustrate annotating the factors, let’s focus on factors 4, 5 and 6—these are the factors that largely capture the islet cells (alpha, beta, etc). Let’s consider two different selection strategies: (i) choosing genes \(j\) with the largest \(f_{jk}\); (ii) choosing genes \(j\) with the largest differences \(f_{jk} - f_{jk'}\) with other factors \(k'\) (“distinctive genes”). These two selection strategies are implemented in the annotation_heatmap function:

F <- fl_nmf_ldf$F
colnames(F) <- paste0("k",1:9)
kset <- paste0("k",4:6)
p1 <- annotation_heatmap(F,n = 8,dims = kset,
                         select_features = "largest",
                         font_size = 9) +
  labs(title = "select_features = \"largest\"") +
  theme(plot.title = element_text(face = "plain",size = 9))
p2 <- annotation_heatmap(F,n = 8,dims = kset,
                         select_features = "distinctive",
                         compare_dims = kset,
                         font_size = 9) +
  labs(title = "select_features = \"distinctive\"") +
  theme(plot.title = element_text(face = "plain",size = 9))
plot_grid(p1,p2,nrow = 1,ncol = 2)

# Features selected for plot: INS IAPP SCGN SLC30A8 ABCC8 G6PC2 NPTX2 HADH GCG CHGB TM4SF4 TTR SCG2 SCG5 ALDH1A1 PCSK2 SST RBP4 PCSK1 CPE PPY SEC11C ISL1 
c("INS", "IAPP", "SCGN", "SLC30A8", "ABCC8", "G6PC2", "NPTX2", 
"HADH", "GCG", "CHGB", "TM4SF4", "TTR", "SCG2", "SCG5", "ALDH1A1", 
"PCSK2", "SST", "RBP4", "PCSK1", "CPE", "PPY", "SEC11C", "ISL1"
)
# Features selected for plot: INS IAPP NPTX2 MAFA MEG3 ADCYAP1 PFKFB2 DLK1 GCG GC TTR TM4SF4 FAP LOXL4 ALDH1A1 CRYBA2 SST AQP3 PPY LEPR EGR1 RBP4 DPYSL3 AKAP12 
c("INS", "IAPP", "NPTX2", "MAFA", "MEG3", "ADCYAP1", "PFKFB2", 
"DLK1", "GCG", "GC", "TTR", "TM4SF4", "FAP", "LOXL4", "ALDH1A1", 
"CRYBA2", "SST", "AQP3", "PPY", "LEPR", "EGR1", "RBP4", "DPYSL3", 
"AKAP12")

Strategy (i) picks out some canonical marker genes for islet cells such as INS for beta cells and GCG for alpha cells. But it also picks out other genes that are highly expressed in multiple islet cell types, such as TTR and CHGB. Strategy (ii) focusses more strongly on genes that distinguish one cell type from another, and as a result marker genes such as MAFA (beta cells) and GC (alpha cells) are ranked more highly with this strategy.

The better strategy will depend on the setting and on the goals of the analysis, which is why the annotation_heatmap function provides both options. These selection strategies can also reveal complementary insights and so in many situations it may be better to use both.

A more interpretable annotation plot

Above we sounded a note of caution about interpreting elements of L across factors/columns. The same applies to the F matrix. To provide a more even footing, above we employed the simple heuristic of scaling the columns of F so that the maximum element in each column was 1. That was helpful for selecting “distinctive” gene, but made the effect sizes difficult to interpret. To produce more easily interpretable effect sizes, we recommend visualizing this F matrix (in this code, fl is a “flash” object, e.g., the return value from a call to flashier::flash()):

out <- ldf(fl)
F <- with(out,F %*% diag(D))

This is what this rescaled F matrix looks like for the pancreas data:

genes <- c("INS","IAPP","NPTX2","MAFA","MEG3","ADCYAP1","PFKFB2", 
           "DLK1","GCG","GC","TTR","TM4SF4","FAP","LOXL4","ALDH1A1", 
           "CRYBA2","SST","AQP3","PPY","LEPR","EGR1","RBP4","DPYSL3", 
           "AKAP12")
F <- with(fl_nmf_ldf,F %*% diag(D))
colnames(F) <- paste0("k",1:9)
annotation_heatmap(F,select_features = genes,font_size = 9)

Visually, this plot looks quite similar to before, but now the effect sizes are on a different scale. With this rescaling, the effect sizes have the following interpretation:

\(f_{jk}\) is (approximately) the log-fold change (LFC) of gene \(j\) in a cell \(i\) with the largest membership in factor \(k\) (\(l_{ik} = 1\)) relative to a cell \(i'\) with no membership in factor \(k\) (\(l_{i'k} = 0\)).

sessionInfo()
# R version 4.3.3 (2024-02-29)
# Platform: aarch64-apple-darwin20 (64-bit)
# Running under: macOS Sonoma 14.7.1
# 
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
# LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# time zone: America/Chicago
# tzcode source: internal
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] cowplot_1.1.3     ggplot2_3.5.0     fastTopics_0.7-21 flashier_1.0.55  
# [5] ebnm_1.1-34       Matrix_1.6-5     
# 
# loaded via a namespace (and not attached):
#  [1] tidyselect_1.2.1     viridisLite_0.4.2    farver_2.1.1        
#  [4] dplyr_1.1.4          fastmap_1.1.1        lazyeval_0.2.2      
#  [7] promises_1.2.1       digest_0.6.34        lifecycle_1.0.4     
# [10] invgamma_1.1         magrittr_2.0.3       compiler_4.3.3      
# [13] rlang_1.1.3          sass_0.4.8           progress_1.2.3      
# [16] tools_4.3.3          utf8_1.2.4           yaml_2.3.8          
# [19] data.table_1.15.2    knitr_1.45           labeling_0.4.3      
# [22] prettyunits_1.2.0    htmlwidgets_1.6.4    scatterplot3d_0.3-44
# [25] plyr_1.8.9           RColorBrewer_1.1-3   Rtsne_0.17          
# [28] workflowr_1.7.1      withr_3.0.0          purrr_1.0.2         
# [31] grid_4.3.3           fansi_1.0.6          git2r_0.33.0        
# [34] colorspace_2.1-0     scales_1.3.0         gtools_3.9.5        
# [37] cli_3.6.2            rmarkdown_2.26       crayon_1.5.2        
# [40] generics_0.1.3       RcppParallel_5.1.7   httr_1.4.7          
# [43] reshape2_1.4.4       pbapply_1.7-2        cachem_1.0.8        
# [46] stringr_1.5.1        splines_4.3.3        parallel_4.3.3      
# [49] softImpute_1.4-1     vctrs_0.6.5          jsonlite_1.8.8      
# [52] hms_1.1.3            mixsqp_0.3-54        ggrepel_0.9.5       
# [55] irlba_2.3.5.1        horseshoe_0.2.0      trust_0.1-8         
# [58] plotly_4.10.4        jquerylib_0.1.4      tidyr_1.3.1         
# [61] glue_1.7.0           uwot_0.2.2.9000      stringi_1.8.3       
# [64] Polychrome_1.5.1     gtable_0.3.4         later_1.3.2         
# [67] quadprog_1.5-8       munsell_0.5.0        tibble_3.2.1        
# [70] pillar_1.9.0         htmltools_0.5.7      truncnorm_1.0-9     
# [73] R6_2.5.1             rprojroot_2.0.4      evaluate_0.23       
# [76] lattice_0.22-5       highr_0.10           RhpcBLASctl_0.23-42 
# [79] SQUAREM_2021.1       ashr_2.2-66          httpuv_1.6.14       
# [82] bslib_0.6.1          Rcpp_1.0.12          deconvolveR_1.2-1   
# [85] whisker_0.4.1        xfun_0.42            fs_1.6.3            
# [88] pkgconfig_2.0.3

Annotation of pancreas factors

Peter Carbonetto

Structure plot

Annotating the factors by “driving genes”

A more interpretable annotation plot