The paper "Comparison and evaluation of statistical error models for scRNA-seq" (Choudhary and Satija, 2022) is the basis for the default SCTransform approach used in Seurat version 5. The following text is taken from the paper:

  • Heterogeneity in single-cell RNA-seq (scRNA-seq) data is driven by multiple sources, including biological variation in cellular state as well as technical variation introduced during experimental processing. Deconvolving these effects is a key challenge for preprocessing workflows.
    • Separating biological heterogeneity across cells that corresponds to differences in cell type and state from alternative sources of variation represents a key analytical challenge in the normalization and preprocessing of single-cell RNA-seq data.
  • Data normalization aims to adjust for differences in cellular sequencing depth, which collectively arise from fluctuations in cellular RNA content, efficiency in lysis and reverse transcription, and stochastic sampling during next-generation sequencing.
  • Variance stabilization aims to address the confounding relationship between gene abundance and gene variance, and to ensure that both lowly and highly expressed genes can contribute to the downstream definition of cellular state.
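
As a rough illustration of the depth adjustment described above, the sketch below log-normalises a toy count matrix by hand in base R. The matrix, the names `toy_counts` and `log_norm`, and the scale factor of 10,000 are assumptions made for illustration. The formula follows the documented behaviour of Seurat's `LogNormalize` method (counts divided by the cell total, multiplied by the scale factor, then `log1p`), but it is not Seurat's implementation.

```r
# Toy genes x cells count matrix (made-up values for illustration only).
# cell2 is cell1 sequenced at twice the depth; cell4 is cell3 at twice the depth.
toy_counts <- matrix(
  c(10, 0,  5,
    20, 0, 10,
     1, 3,  0,
     2, 6,  0),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4))
)

# Hypothetical helper: depth-normalise each cell, then log-transform
log_norm <- function(counts, scale_factor = 1e4) {
  depth <- colSums(counts)
  # sweep() divides every column (cell) by its total count
  scaled <- sweep(counts, 2, depth, "/") * scale_factor
  log1p(scaled)
}

normalised <- log_norm(toy_counts)

# Cells that differ only in depth end up with identical normalised values
all.equal(normalised[, "cell1"], normalised[, "cell2"])
```

This only removes the depth difference; it does not by itself address the dependence of variance on abundance that variance stabilization targets.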

Using statistical models like Generalised Linear Models:

  • Two recent studies proposed to use generalized linear models (GLMs), where cellular sequencing depth was included as a covariate, as part of scRNA-seq preprocessing workflows.
  • The sctransform approach utilizes the Pearson residuals from negative binomial regression as input to standard dimensional reduction techniques, while GLM-PCA focuses on a generalized version of principal component analysis (PCA) for data with Poisson-distributed errors.
  • More broadly, multiple techniques aim to learn a latent state that captures biologically relevant cellular heterogeneity using either matrix factorization or neural networks, alongside a defined error model that describes the variation that is not captured by the latent space.
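
To make the idea of Pearson residuals from a negative binomial model more concrete, here is a minimal base R sketch. It is a deliberately simplified version and not what sctransform does: the expected count is taken to be the gene total times the cell depth divided by the grand total, and θ is fixed at an arbitrary value of 100, whereas sctransform estimates its parameters by regularised regression. The function name `pearson_residuals_nb` and the simulated matrix are assumptions for illustration.

```r
# Hypothetical helper: Pearson residuals under a depth-only NB model.
# Genes with a total count of zero should be removed first (mu would be 0).
pearson_residuals_nb <- function(counts, theta = 100) {
  # Expected count per gene and cell: gene total * cell depth / grand total
  mu <- outer(rowSums(counts), colSums(counts)) / sum(counts)
  # NB variance is mu + mu^2 / theta
  (counts - mu) / sqrt(mu + mu^2 / theta)
}

# Simulated genes x cells matrix (made-up data for illustration only)
set.seed(1984)
toy_counts <- matrix(
  rpois(50 * 20, lambda = 5),
  nrow = 50,
  dimnames = list(paste0("gene", 1:50), paste0("cell", 1:20))
)

res <- pearson_residuals_nb(toy_counts)
dim(res)
```

The residual matrix has the same shape as the count matrix and could be passed to a standard PCA in place of log-normalised values.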

Parameterising statistical models:

  • Likelihood-based approaches require an explicit definition of a statistical error model for scRNA-seq, and there is little consensus on how to define or parameterize this model.
  • Multiple groups have utilized a Poisson error model but others argue that the data exhibit evidence of overdispersion, requiring the use of a negative-binomial (NB) distribution.
  • Methods that assume a NB distribution have different methods to parameterize their model.
    • A recent study argued that fixing the NB inverse overdispersion parameter θ to a single value is an appropriate estimate of technical overdispersion for all genes in all scRNA-seq datasets, while others propose learning unique parameter values for each gene in each dataset.
  • This lack of consensus is further exemplified by the scvi-tools suite, which supports nine different methods for parameterizing error models.
  • The purpose of this error model is to describe and quantify heterogeneity that is not captured by biologically relevant differences in cell state, and highlights a specific question: How can we model the observed variation in gene expression for an scRNA-seq experiment conducted on a biologically “homogeneous” population?
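
The difference between the two error models can be seen in a small simulation. This is a sketch under assumed values (a mean of 5 and θ = 2, both arbitrary), using base R's `rpois()` and `rnbinom()`; with the `size` and `mu` parameterisation of `rnbinom()`, `size` plays the role of the inverse overdispersion parameter θ and the variance is μ + μ²/θ.

```r
set.seed(1984)
n_cells <- 1e5
mu <- 5
theta <- 2  # inverse overdispersion; arbitrary value for illustration

pois_counts <- rpois(n_cells, lambda = mu)
nb_counts <- rnbinom(n_cells, size = theta, mu = mu)

# Poisson: the variance is expected to equal the mean (5)
c(mean = mean(pois_counts), variance = var(pois_counts))

# NB: the variance is expected to be mu + mu^2 / theta = 5 + 25 / 2 = 17.5
c(mean = mean(nb_counts), variance = var(nb_counts))
```

As θ grows, the μ²/θ term shrinks and the NB distribution approaches the Poisson, which is why the choice of θ matters so much for how much variation is attributed to technical noise.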

Seurat object

Import the raw pbmc3k dataset from my server.

seurat_obj <- readRDS(url("", "rb"))
seurat_obj
An object of class Seurat 
32738 features across 2700 samples within 1 assay 
Active assay: RNA (32738 features, 0 variable features)
 1 layer present: counts


# Keep genes detected in at least 3 cells and cells with at least 200 detected genes
pbmc3k <- CreateSeuratObject(
  counts = seurat_obj@assays$RNA$counts,
  min.cells = 3,
  min.features = 200,
  project = "pbmc3k"
)
pbmc3k
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
 1 layer present: counts

Seurat workflows

Process with the Seurat 4 workflow.

seurat_wf_v4 <- function(seurat_obj, scale_factor = 1e4, num_features = 2000, num_pcs = 30, cluster_res = 0.5, debug_flag = FALSE){
  # Log-normalise, select variable features, and scale before PCA
  seurat_obj <- NormalizeData(seurat_obj, normalization.method = "LogNormalize", scale.factor = scale_factor, verbose = debug_flag)
  seurat_obj <- FindVariableFeatures(seurat_obj, selection.method = 'vst', nfeatures = num_features, verbose = debug_flag)
  seurat_obj <- ScaleData(seurat_obj, verbose = debug_flag)
  seurat_obj <- RunPCA(seurat_obj, verbose = debug_flag)
  # Embed and cluster using the first num_pcs principal components
  seurat_obj <- RunUMAP(seurat_obj, dims = 1:num_pcs, verbose = debug_flag)
  seurat_obj <- FindNeighbors(seurat_obj, dims = 1:num_pcs, verbose = debug_flag)
  seurat_obj <- FindClusters(seurat_obj, resolution = cluster_res, verbose = debug_flag)
  # Return the processed object
  seurat_obj
}

pbmc3k_v4 <- seurat_wf_v4(pbmc3k)
pbmc3k_v4
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 2 dimensional reductions calculated: pca, umap


DimPlot(pbmc3k_v4, reduction = "umap")

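Version Author Date
0d51a69 Dave Tang 2025-03-10

Process with the Seurat 5 workflow.

seurat_wf_v5 <- function(seurat_obj, scale_factor = 1e4, num_features = 2000, num_pcs = 30, cluster_res = 0.5, debug_flag = FALSE){
  # SCTransform replaces NormalizeData, FindVariableFeatures, and ScaleData;
  # scale_factor and num_features are not used in this workflow
  seurat_obj <- SCTransform(seurat_obj, verbose = debug_flag)
  seurat_obj <- RunPCA(seurat_obj, verbose = debug_flag)
  # Embed and cluster using the first num_pcs principal components
  seurat_obj <- RunUMAP(seurat_obj, dims = 1:num_pcs, verbose = debug_flag)
  seurat_obj <- FindNeighbors(seurat_obj, dims = 1:num_pcs, verbose = debug_flag)
  seurat_obj <- FindClusters(seurat_obj, resolution = cluster_res, verbose = debug_flag)
  # Return the processed object
  seurat_obj
}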

pbmc3k_v5 <- seurat_wf_v5(pbmc3k)
pbmc3k_v5
An object of class Seurat 
26286 features across 2700 samples within 2 assays 
Active assay: SCT (12572 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
 2 dimensional reductions calculated: pca, umap


DimPlot(pbmc3k_v5, reduction = "umap")


Data layer

The version 4 workflow stores log-normalised data in the data layer of the RNA assay.

        1605.823         2027.859         2040.169         1902.960 
        1388.125         1653.061 

In the version 5 workflow, the data layer is in the SCT assay.

        786.2686        1024.4731        1029.3032         934.4454 
        666.1142         764.8101 

Compare clustering

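Version 5 clusters the cells more finely than version 4: in the table below (rows are version 4 clusters, columns are version 5 clusters), version 4's cluster 0 is split across several version 5 clusters.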

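# Cell barcodes must be in the same order before cross-tabulating
stopifnot(all(row.names(pbmc3k_v4@meta.data) == row.names(pbmc3k_v5@meta.data)))
table(pbmc3k_v4$seurat_clusters, pbmc3k_v5$seurat_clusters)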

      0   1   2   3   4   5   6   7   8   9
  0 970   0  71   2   0   0 100  44   0   0
  1   0 479   0   0   0   9   0   0   3   0
  2   1   0   0 349   0   0   0   1   0   0
  3   4   0 290   1   5   0   0   1   0   0
  4   0   0   5   6 152   0   0   0   0   0
  5   0  16   0   0   0 145   0   0   0   0
  6   0   1   0   0   0   0   0   0  31   0
  7   0   1   0   1   0   0   0   0   0  12
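
One way to summarise the table above is to convert it to row proportions. The sketch below re-enters the counts by hand in base R, assuming rows are version 4 clusters and columns are version 5 clusters; the names `cluster_tab` and `row_props` are made up for illustration.

```r
# Cross-tabulation copied from the table above
# (assumed layout: rows are version 4 clusters, columns are version 5 clusters)
cluster_tab <- matrix(
  c(970,   0,  71,   2,   0,   0, 100,  44,   0,   0,
      0, 479,   0,   0,   0,   9,   0,   0,   3,   0,
      1,   0,   0, 349,   0,   0,   0,   1,   0,   0,
      4,   0, 290,   1,   5,   0,   0,   1,   0,   0,
      0,   0,   5,   6, 152,   0,   0,   0,   0,   0,
      0,  16,   0,   0,   0, 145,   0,   0,   0,   0,
      0,   1,   0,   0,   0,   0,   0,   0,  31,   0,
      0,   1,   0,   1,   0,   0,   0,   0,   0,  12),
  nrow = 8,
  byrow = TRUE,
  dimnames = list(v4 = as.character(0:7), v5 = as.character(0:9))
)

# Fraction of each version 4 cluster assigned to each version 5 cluster
row_props <- prop.table(cluster_tab, margin = 1)

# Largest fraction per version 4 cluster; lower values indicate a split cluster
round(apply(row_props, 1, max), 3)
```

Version 4's cluster 0 has the lowest value (970 of its 1187 cells, about 0.82, fall in a single version 5 cluster), which is consistent with it being the cluster that is split most in version 5.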

