Last updated: 2026-02-03
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The command set.seed(20200712) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
The results in this page were generated with repository version e760666. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/edger_pb.Rmd) and HTML
(docs/edger_pb.html) files. If you’ve configured a remote
Git repository (see ?wflow_git_remote), click on the
hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | e760666 | Dave Tang | 2026-02-03 | Include background information |
| html | 8793f37 | Dave Tang | 2026-02-03 | Build site. |
| Rmd | 158ac67 | Dave Tang | 2026-02-03 | Pseudobulk analysis using edgeR |
Differential expression (DE) analysis of scRNA-seq data is challenging, primarily because expression measurements from individual cells are highly variable.
Pseudobulking addresses this variability by aggregating counts from cells that share the same biological replicate (e.g., donor/sample) and cell type, so that each replicate contributes one expression profile per cell type.
This notebook demonstrates how to perform pseudobulk differential expression analysis using edgeR, following the workflow described in Section 4.10 of the edgeR User’s Guide.
The single-cell RNA-seq data used in this notebook is from the human breast single cell RNA atlas generated by Pal et al. The preprocessing of the data and the complete bioinformatics analyses of the entire atlas study are described in detail in Chen et al. Most of the single-cell analysis, such as dimensionality reduction and integration, was performed using Seurat. All the generated Seurat objects are publicly available on Figshare.
The Seurat object used in this notebook was downloaded directly from the website of the edgeR maintainers. It contains breast tissue micro-environment samples from 13 healthy donors and has been subset to 10,000 of the 24,751 cells in the original object.
so <- readRDS("data/SeuratObj.rds")
so
An object of class Seurat
15527 features across 10000 samples within 2 assays
Active assay: integrated (2000 features, 2000 variable features)
1 layer present: data
1 other assay present: RNA
2 dimensional reductions calculated: pca, tsne
Distribution of cell counts across the 13 healthy donors and 7 clusters; note that some donors have no cells in certain clusters.
table(so@meta.data$group, so@meta.data$seurat_clusters)
0 1 2 3 4 5 6
N_0019_total 346 183 100 36 33 14 9
N_0021_total 25 214 41 4 2 9 8
N_0064_total 72 93 41 1 0 1 0
N_0092_total 207 102 67 18 2 12 0
N_0093_total 305 433 282 7 11 5 36
N_0123_total 364 189 63 24 3 18 5
N_0169_total 739 220 165 151 115 7 19
N_0230.17_total 657 147 117 12 18 11 6
N_0233_total 622 148 169 72 127 21 11
N_0275_total 56 128 57 1 2 1 0
N_0288_total 58 225 129 1 0 3 0
N_0342_total 567 692 331 19 9 57 10
N_0372_total 355 169 72 34 64 3 18
Pseudo-bulk samples are created by aggregating read counts together
for all the cells with the same combination of human donor and cluster.
Here, we generate pseudo-bulk expression profiles from the Seurat object
using the Seurat2PB() function. The human donor and cell
cluster information of the integrated single cell data is stored in the
group and seurat_clusters columns of the
meta.data component of the Seurat object.
y <- Seurat2PB(so, sample="group", cluster="seurat_clusters")
dim(y$samples)
[1] 85 5
sum(table(so@meta.data$group, so@meta.data$seurat_clusters) > 0)
[1] 85
Counts are aggregated per donor-cluster combination. There are 85 pseudobulk samples rather than 13 × 7 = 91 because, as the table above shows, six combinations contain no cells.
colnames(y$counts)
[1] "N_0019_total_cluster0" "N_0019_total_cluster1"
[3] "N_0019_total_cluster2" "N_0019_total_cluster3"
[5] "N_0019_total_cluster4" "N_0019_total_cluster5"
[7] "N_0019_total_cluster6" "N_0021_total_cluster0"
[9] "N_0021_total_cluster1" "N_0021_total_cluster2"
[11] "N_0021_total_cluster3" "N_0021_total_cluster4"
[13] "N_0021_total_cluster5" "N_0021_total_cluster6"
[15] "N_0064_total_cluster0" "N_0064_total_cluster1"
[17] "N_0064_total_cluster2" "N_0064_total_cluster3"
[19] "N_0064_total_cluster5" "N_0092_total_cluster0"
[21] "N_0092_total_cluster1" "N_0092_total_cluster2"
[23] "N_0092_total_cluster3" "N_0092_total_cluster4"
[25] "N_0092_total_cluster5" "N_0093_total_cluster0"
[27] "N_0093_total_cluster1" "N_0093_total_cluster2"
[29] "N_0093_total_cluster3" "N_0093_total_cluster4"
[31] "N_0093_total_cluster5" "N_0093_total_cluster6"
[33] "N_0123_total_cluster0" "N_0123_total_cluster1"
[35] "N_0123_total_cluster2" "N_0123_total_cluster3"
[37] "N_0123_total_cluster4" "N_0123_total_cluster5"
[39] "N_0123_total_cluster6" "N_0169_total_cluster0"
[41] "N_0169_total_cluster1" "N_0169_total_cluster2"
[43] "N_0169_total_cluster3" "N_0169_total_cluster4"
[45] "N_0169_total_cluster5" "N_0169_total_cluster6"
[47] "N_0230.17_total_cluster0" "N_0230.17_total_cluster1"
[49] "N_0230.17_total_cluster2" "N_0230.17_total_cluster3"
[51] "N_0230.17_total_cluster4" "N_0230.17_total_cluster5"
[53] "N_0230.17_total_cluster6" "N_0233_total_cluster0"
[55] "N_0233_total_cluster1" "N_0233_total_cluster2"
[57] "N_0233_total_cluster3" "N_0233_total_cluster4"
[59] "N_0233_total_cluster5" "N_0233_total_cluster6"
[61] "N_0275_total_cluster0" "N_0275_total_cluster1"
[63] "N_0275_total_cluster2" "N_0275_total_cluster3"
[65] "N_0275_total_cluster4" "N_0275_total_cluster5"
[67] "N_0288_total_cluster0" "N_0288_total_cluster1"
[69] "N_0288_total_cluster2" "N_0288_total_cluster3"
[71] "N_0288_total_cluster5" "N_0342_total_cluster0"
[73] "N_0342_total_cluster1" "N_0342_total_cluster2"
[75] "N_0342_total_cluster3" "N_0342_total_cluster4"
[77] "N_0342_total_cluster5" "N_0342_total_cluster6"
[79] "N_0372_total_cluster0" "N_0372_total_cluster1"
[81] "N_0372_total_cluster2" "N_0372_total_cluster3"
[83] "N_0372_total_cluster4" "N_0372_total_cluster5"
[85] "N_0372_total_cluster6"
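The aggregation itself is a grouped sum over cells. As a minimal sketch in base R (this is not the Seurat2PB() implementation; the toy count matrix, the donor and cluster labels, and the name pb_id are all made up for illustration), counts can be summed per donor-cluster combination with rowsum():

```r
# Toy data: 4 genes x 6 cells (hypothetical values, not from the atlas)
set.seed(1)
counts <- matrix(rpois(4 * 6, lambda = 5), nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6)))

# Per-cell metadata (hypothetical donors and clusters)
donor   <- c("D1", "D1", "D1", "D2", "D2", "D2")
cluster <- c("0", "0", "1", "0", "1", "1")

# One pseudobulk label per cell: donor + cluster
pb_id <- paste0(donor, "_cluster", cluster)

# rowsum() sums rows by group, so transpose to group the cells (columns),
# then transpose back to a genes x pseudobulk-samples matrix
pb <- t(rowsum(t(counts), group = pb_id))
pb
```

Each column of pb is one pseudobulk sample, and combinations with no cells simply never appear as a column, which is why the real data yield 85 rather than 91 samples.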
The total UMI counts per pseudobulk sample vary considerably, reflecting differences in the number of cells aggregated and their sequencing depth. Importantly, the minimum is greater than zero, confirming that all sample-cluster combinations retained after aggregation contain actual expression data.
summary(colSums(y$counts))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1352 42181 165537 651543 776854 5011510
Before differential expression analysis, we apply two filtering steps to remove low-quality data that could compromise statistical inference.
Pseudobulk samples with very few total counts are unreliable because they may represent too few cells or low-quality aggregations. We remove samples with fewer than 50,000 total UMI counts.
keep.samples <- y$samples$lib.size > 5e4
y <- y[, keep.samples]
dim(y$samples)
[1] 59 5
Genes with very low counts across samples provide little statistical
information and can adversely affect the multiple testing correction.
The filterByExpr() function implements edgeR’s recommended
filtering strategy: it keeps genes that have sufficiently large counts
to be statistically meaningful in at least some samples. By default, it
requires a gene to reach a CPM cutoff equivalent to about 10 counts
(min.count = 10) in a minimum number of samples (determined by the
smallest group size), and to have a total count of at least 15 across
all samples (min.total.count = 15).
keep.genes <- filterByExpr(y, group=y$samples$cluster)
y <- y[keep.genes, , keep.lib.sizes=FALSE]
Trimmed Mean of M-values (TMM) normalisation corrects for compositional biases between samples. This is important because differences in library size alone don’t account for situations where a few highly-expressed genes consume a disproportionate share of sequencing reads, making other genes appear artificially down-regulated. TMM calculates scaling factors that adjust for these composition effects.
y <- normLibSizes(y)
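To see why scaling by library size alone can mislead, consider a toy example in base R. The counts are hypothetical and cpm_simple() is a helper defined here for illustration; this shows the problem that TMM is designed to correct, not the TMM algorithm itself.

```r
# Two samples with identical counts for genes A-C; gene D is much
# higher in sample b and takes up most of that sample's library
a <- c(geneA = 100, geneB = 100, geneC = 100, geneD = 100)
b <- c(geneA = 100, geneB = 100, geneC = 100, geneD = 1700)

# Naive counts per million: scale by total library size only
cpm_simple <- function(x) x / sum(x) * 1e6

# log2 fold change of b versus a after naive scaling
lfc <- log2(cpm_simple(b) / cpm_simple(a))
lfc
```

Genes A to C have the same raw counts in both samples, yet their naive log fold changes are negative because gene D inflates the library size of sample b. A composition-aware scaling factor is meant to remove this artefact.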
To perform differential expression analysis between cell clusters, we create a design matrix that models both the biological effect of interest (cluster identity) and a blocking factor (donor). Including donor in the model accounts for individual-to-individual variation, ensuring that detected cluster differences are not confounded by donor-specific effects.
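The formula ~ cluster + donor creates an additive model
in which each cluster and each donor contributes its own coefficient on
the log scale, with no cluster-by-donor interaction.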
donor <- factor(y$samples$sample)
cluster <- as.factor(y$samples$cluster)
design <- model.matrix(~ cluster + donor)
colnames(design) <- gsub("donor", "", colnames(design))
colnames(design)[1] <- "Int"
dim(design)
[1] 59 19
The design matrix has 19 columns: 1 (intercept) + 6 (cluster coefficients, with cluster 0 as reference) + 12 (donor coefficients, with the first donor as reference) = 19 parameters to estimate. Each column represents one model coefficient.
The 59 rows correspond to the 59 pseudobulk samples (unique sample-cluster combinations that passed filtering).
head(design)
Int cluster1 cluster2 cluster3 cluster4 cluster5 cluster6 N_0021_total
1 1 0 0 0 0 0 0 0
2 1 1 0 0 0 0 0 0
3 1 0 1 0 0 0 0 0
4 1 0 0 1 0 0 0 0
5 1 0 0 0 0 1 0 0
6 1 0 0 0 0 0 1 0
N_0064_total N_0092_total N_0093_total N_0123_total N_0169_total
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
N_0230.17_total N_0233_total N_0275_total N_0288_total N_0342_total
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
N_0372_total
1 0
2 0
3 0
4 0
5 0
6 0
In the design matrix above, the first row represents a sample from
cluster 0 and donor N_0019_total. It shows
Int=1 (the intercept) with all other coefficients as 0
because both the cluster and donor for this sample are the reference
levels. Subsequent rows have 1s in the appropriate cluster and donor
columns to indicate which combination each pseudobulk sample
represents.
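The column count can be checked on a small example. The sketch below uses base R only, with hypothetical factors for three clusters and two donors, and confirms that an additive design has 1 + (clusters - 1) + (donors - 1) columns:

```r
# Hypothetical pseudobulk samples: 3 clusters x 2 donors
cluster <- factor(c("0", "1", "2", "0", "1", "2"))
donor   <- factor(c("D1", "D1", "D1", "D2", "D2", "D2"))

# Additive model: intercept + cluster effects + donor effects
design <- model.matrix(~ cluster + donor)
colnames(design)

# Reference levels (cluster 0, donor D1) are absorbed into the intercept
expected_cols <- 1 + (nlevels(cluster) - 1) + (nlevels(donor) - 1)
ncol(design) == expected_cols
```

With 7 clusters and 13 donors the same formula gives 1 + 6 + 12 = 19, matching the dimensions reported above.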
RNA-seq count data exhibits overdispersion (variance exceeds the mean), which the negative binomial distribution models through a dispersion parameter. edgeR estimates dispersion using an empirical Bayes approach that shares information across genes, improving estimates especially when sample sizes are small.
The robust=TRUE option protects against outlier genes
that might otherwise inflate dispersion estimates. The quasi-likelihood
framework (glmQLFit) adds an additional layer of variance
modelling that accounts for gene-specific variability beyond the
negative binomial assumption, providing more reliable statistical
inference.
y <- estimateDisp(y, design, robust=TRUE)
fit <- glmQLFit(y, design, robust=TRUE)
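As a small illustration of overdispersion, negative binomial counts can be simulated in base R. The mean mu and dispersion phi below are arbitrary values chosen for the sketch, using the parameterisation in which the variance equals mu + phi * mu^2:

```r
set.seed(1)
mu  <- 100   # hypothetical mean count
phi <- 0.2   # hypothetical dispersion

# rnbinom() takes size = 1 / dispersion in this parameterisation
x <- rnbinom(10000, size = 1 / phi, mu = mu)

# The sample variance is far larger than the mean, unlike Poisson counts
c(mean = mean(x), variance = var(x), expected_variance = mu + phi * mu^2)
```

The square root of the dispersion is the biological coefficient of variation, so phi = 0.2 corresponds to roughly 45% variation between replicates beyond Poisson noise.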
To identify marker genes for each cell cluster, we compare each cluster against all other clusters combined. This “one versus rest” approach reveals genes that are specifically up- or down-regulated in each cluster relative to the overall population.
A contrast is a linear combination of model coefficients that defines a specific comparison. For 7 clusters, we need 7 contrasts (one per cluster). Each contrast tests: “Is this cluster different from the average of all other clusters?”
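Mathematically, if we want to compare cluster \(k\) against the average of the other 6 clusters, the contrast weights are \(1\) for cluster \(k\) and \(-1/6\) for each of the other six clusters.
Taken over the seven cluster means, these weights sum to zero (as required for a contrast between groups), so the comparison is between cluster \(k\) and the mean of the remaining clusters. Cluster 0 is the reference level and has no coefficient of its own, so its weight does not appear explicitly, and the intercept weight is set to 0.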
The donor coefficients are set to 0 in the contrast because we’re not interested in donor differences; we only want to test cluster effects while controlling for donor variation.
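The reason the intercept receives a weight of 0 can be shown with a short derivation. The notation is introduced here for illustration: \(\mu_j\) is the expected log-expression in cluster \(j\) for a fixed donor, \(\beta_{\mathrm{Int}}\) is the intercept, and \(\beta_j\) is the coefficient for cluster \(j\), with \(\beta_0 = 0\) for the reference cluster.

```latex
\begin{aligned}
\mu_j &= \beta_{\mathrm{Int}} + \beta_j, \qquad \beta_0 = 0, \qquad j = 0, \dots, 6 \\[4pt]
\mu_k - \frac{1}{6}\sum_{j \neq k} \mu_j
  &= (\beta_{\mathrm{Int}} + \beta_k)
     - \frac{1}{6}\Bigl(6\,\beta_{\mathrm{Int}} + \sum_{j \neq k} \beta_j\Bigr) \\
  &= \beta_k - \frac{1}{6}\sum_{j \neq k} \beta_j
\end{aligned}
```

The intercept cancels, and a donor coefficient added to every \(\mu_j\) cancels in the same way, which is why only the cluster coefficients carry non-zero weights.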
ncls <- nlevels(cluster)
contr <- rbind( matrix(1/(1-ncls), ncls, ncls),
matrix(0, ncol(design)-ncls, ncls) )
diag(contr) <- 1
contr[1,] <- 0
rownames(contr) <- colnames(design)
colnames(contr) <- paste0("cluster", levels(cluster))
contr
cluster0 cluster1 cluster2 cluster3 cluster4
Int 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
cluster1 -0.1666667 1.0000000 -0.1666667 -0.1666667 -0.1666667
cluster2 -0.1666667 -0.1666667 1.0000000 -0.1666667 -0.1666667
cluster3 -0.1666667 -0.1666667 -0.1666667 1.0000000 -0.1666667
cluster4 -0.1666667 -0.1666667 -0.1666667 -0.1666667 1.0000000
cluster5 -0.1666667 -0.1666667 -0.1666667 -0.1666667 -0.1666667
cluster6 -0.1666667 -0.1666667 -0.1666667 -0.1666667 -0.1666667
N_0021_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
N_0064_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
N_0092_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
N_0093_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
N_0123_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
N_0169_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
N_0230.17_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
N_0233_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
N_0275_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
N_0288_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
N_0342_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
N_0372_total 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
cluster5 cluster6
Int 0.0000000 0.0000000
cluster1 -0.1666667 -0.1666667
cluster2 -0.1666667 -0.1666667
cluster3 -0.1666667 -0.1666667
cluster4 -0.1666667 -0.1666667
cluster5 1.0000000 -0.1666667
cluster6 -0.1666667 1.0000000
N_0021_total 0.0000000 0.0000000
N_0064_total 0.0000000 0.0000000
N_0092_total 0.0000000 0.0000000
N_0093_total 0.0000000 0.0000000
N_0123_total 0.0000000 0.0000000
N_0169_total 0.0000000 0.0000000
N_0230.17_total 0.0000000 0.0000000
N_0233_total 0.0000000 0.0000000
N_0275_total 0.0000000 0.0000000
N_0288_total 0.0000000 0.0000000
N_0342_total 0.0000000 0.0000000
N_0372_total 0.0000000 0.0000000
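In this matrix, each column defines one comparison and each row gives the weight applied to one model coefficient. Cluster 0 has no row of its own because it is the reference level.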
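As a sanity check, the same construction can be applied to a toy model with three clusters and no donor terms. The coefficient values in beta are hypothetical. The sketch verifies that applying the contrast to the coefficients gives the same number as subtracting the mean of the other clusters from each cluster mean:

```r
ncls <- 3

# Hypothetical coefficients: intercept, then clusters 1 and 2
# (cluster 0 is the reference level)
beta <- c(Int = 2, cluster1 = 0.5, cluster2 = -1)

# Implied cluster means on the log scale
mu <- unname(c(beta["Int"],
               beta["Int"] + beta["cluster1"],
               beta["Int"] + beta["cluster2"]))

# Same construction as above, without the donor rows
contr <- matrix(1 / (1 - ncls), ncls, ncls)
diag(contr) <- 1
contr[1, ] <- 0
rownames(contr) <- names(beta)

# Contrast applied to coefficients versus direct comparison of means
via_coef  <- drop(t(contr) %*% beta)
via_means <- sapply(seq_len(ncls), function(k) mu[k] - mean(mu[-k]))
all.equal(unname(via_coef), via_means)
```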
We perform a quasi-likelihood F-test for each contrast using
glmQLFTest(). The quasi-likelihood F-test is preferred over
the likelihood ratio test for bulk RNA-seq-like data because it accounts
for uncertainty in dispersion estimation, providing better control of
the false discovery rate when sample sizes are moderate.
qlf <- list()
for(i in 1:ncls){
qlf[[i]] <- glmQLFTest(fit, contrast=contr[,i])
qlf[[i]]$comparison <- paste0("cluster", levels(cluster)[i], "_vs_others")
}
length(qlf)
[1] 7
The topTags() function returns the most significant DE
genes, sorted by p-value. Here are the top 10 genes distinguishing
cluster 0 from all other clusters:
topTags(qlf[[1]], n=10L)
Coefficient: cluster0_vs_others
gene logFC logCPM F PValue FDR
FBLN1 FBLN1 5.983506 6.782442 759.4003 2.316922e-39 1.822722e-35
OGN OGN 5.726554 5.839392 607.2557 1.674505e-36 6.586667e-33
IGFBP6 IGFBP6 5.374590 6.786631 558.2772 9.989689e-35 2.619630e-31
DPT DPT 5.893382 6.312554 472.2406 7.413967e-34 1.458142e-30
CFD CFD 4.978584 8.900624 552.3340 8.403495e-33 1.322206e-29
SERPINF1 SERPINF1 5.169235 6.919634 595.2697 1.350858e-32 1.771200e-29
MFAP4 MFAP4 4.611371 5.947151 451.5254 1.779196e-32 1.999562e-29
CRABP2 CRABP2 3.948766 6.351154 449.3824 2.162987e-32 2.127027e-29
CLMP CLMP 5.951695 7.510977 502.3276 4.276494e-32 3.393508e-29
MMP2 MMP2 5.377426 6.789111 475.9323 4.313598e-32 3.393508e-29
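The output columns are the gene symbol (gene), the log2 fold change of the cluster relative to the average of the other clusters (logFC), the average log2 counts per million across all samples (logCPM), the quasi-likelihood F-statistic (F), the raw p-value (PValue), and the p-value adjusted for multiple testing (FDR).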
The decideTests() function classifies genes as
significantly up-regulated (1), down-regulated (-1), or not significant
(0) at FDR < 0.05. The table below shows how many genes fall into
each category for each cluster comparison:
dt <- lapply(lapply(qlf, decideTests), summary)
dt.all <- do.call("cbind", dt)
dt.all
cluster0_vs_others cluster1_vs_others cluster2_vs_others
Down 1478 790 1453
NotSig 3980 4852 4276
Up 2409 2225 2138
cluster3_vs_others cluster4_vs_others cluster5_vs_others
Down 1588 1605 249
NotSig 4408 4942 6573
Up 1871 1320 1045
cluster6_vs_others
Down 1410
NotSig 4880
Up 1577
The “Down” row indicates genes significantly lower in that cluster compared to others, while “Up” shows genes significantly higher. “NotSig” genes show no significant difference.
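The classification logic can be sketched in base R. The p-values and log fold changes are made up, and the sketch assumes Benjamini-Hochberg adjustment with a 0.05 cutoff; it mimics the idea behind decideTests() rather than reproducing its code:

```r
# Hypothetical test results for five genes
pvals <- c(0.0001, 0.004, 0.04, 0.2, 0.8)
logFC <- c(2.1, -1.5, 0.4, -0.3, 0.1)

# Benjamini-Hochberg adjustment controls the false discovery rate
fdr <- p.adjust(pvals, method = "BH")

# 1 = up, -1 = down, 0 = not significant at FDR < 0.05
calls <- ifelse(fdr < 0.05, sign(logFC), 0)
data.frame(pvals, fdr, logFC, calls)
```

Note that the third gene has a raw p-value below 0.05 but is not called significant after adjustment.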
To visualise cluster-specific expression patterns, we extract the top 20 up-regulated genes (positive logFC) from each cluster comparison. These represent potential marker genes that characterise each cell population.
top <- 20
topMarkers <- list()
for(i in 1:ncls) {
ord <- order(qlf[[i]]$table$PValue, decreasing=FALSE)
up <- qlf[[i]]$table$logFC[ord] > 0
topMarkers[[i]] <- rownames(y)[ord[up][1:top]]
}
topMarkers <- unique(unlist(topMarkers))
topMarkers
[1] "FBLN1" "OGN" "IGFBP6" "DPT" "CFD" "SERPINF1"
[7] "MFAP4" "CRABP2" "CLMP" "MMP2" "SFRP2" "LUM"
[13] "GPC3" "PTGDS" "C1S" "GFPT2" "LRP1" "MEG8"
[19] "PCOLCE" "CCDC80" "PLVAP" "RBP7" "INHBB" "FLT1"
[25] "PECAM1" "SOX17" "EMCN" "S1PR1" "IFI27" "PCAT19"
[31] "RAPGEF4" "SELE" "ADGRL4" "ESAM" "MYCT1" "CDH5"
[37] "SPARCL1" "ADAMTS9" "CALCRL" "AQP1" "MYL9" "TPM2"
[43] "CRISPLD2" "ADAMTS4" "ACTA2" "TAGLN" "MT1A" "KCNE4"
[49] "ADIRF" "CALD1" "ADAMTS1" "CRYAB" "GJA4" "MCAM"
[55] "CPE" "PLN" "AXL" "NDUFA4L2" "STEAP4" "EFHD1"
[61] "HLA-DQB1" "HLA-DPA1" "ACSL1" "CD68" "C5AR1" "HLA-DPB1"
[67] "LAPTM5" "HLA-DRB1" "CXCL16" "IL4I1" "CD74" "KYNU"
[73] "C15orf48" "HLA-DQA1" "FCER1G" "C1QB" "SAMSN1" "MPP1"
[79] "SLC16A10" "TLR2" "KLRD1" "PIK3IP1" "LEPROTL1" "CCL5"
[85] "CLEC2D" "CD7" "IL7R" "PARP8" "KIAA1551" "PTPRC"
[91] "AKNA" "SARAF" "CRYBG1" "CXCR4" "RUNX3" "PPP2R5C"
[97] "SMAP2" "FYN" "CHST12" "CNOT6L" "KRT17" "KRT14"
[103] "KRT5" "SFN" "S100A2" "DST" "KRT6B" "LAMA3"
[109] "ACTG2" "S100A14" "LIMA1" "KRT7" "FHL2" "TPM1"
[115] "DMKN" "GDF15" "CD200" "HEY1" "CNKSR3" "PPFIBP1"
[121] "SCN3B" "GATA2" "CLDN5" "C2CD4B" "TFF3" "ANGPT2"
[127] "TSPAN12" "PRRG4" "BBC3" "RASGRP3" "ARL4A" "RAB32"
[133] "C6orf141" "RAI14" "PDPN"
The combined list of unique marker genes across all clusters provides a gene set that should discriminate between cell populations.
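The selection step can be illustrated on a toy results table (hypothetical values). This variant uses head() instead of indexing with 1:top, an assumption-free way to avoid NA entries when a cluster has fewer up-regulated genes than requested:

```r
# Hypothetical DE table for one comparison
tab <- data.frame(
  logFC  = c(1.2, -2, 0.5, 3, -0.1),
  PValue = c(0.01, 0.001, 0.2, 0.0005, 0.9),
  row.names = paste0("gene", 1:5)
)

# Order genes by p-value, then keep only those with positive logFC
ord <- order(tab$PValue)
up  <- tab$logFC[ord] > 0

# head() returns at most `top` genes and never pads with NA
top <- 2
top_up <- head(rownames(tab)[ord[up]], top)
top_up
```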
A heatmap of the marker genes allows us to visually confirm that these genes show cluster-specific expression patterns. We use log-transformed CPM values (log counts per million) to account for library size differences and apply row scaling (z-scores) to highlight relative expression patterns rather than absolute expression levels.
lcpm <- edgeR::cpm(y, log=TRUE)
annot <- data.frame(cluster=paste0("cluster ", cluster))
rownames(annot) <- colnames(y)
ann_colors <- list(cluster=2:8)
names(ann_colors$cluster) <- paste0("cluster ", levels(cluster))
pheatmap::pheatmap(lcpm[topMarkers, ], breaks=seq(-2,2,length.out=101),
color=colorRampPalette(c("blue","white","red"))(100), scale="row",
cluster_cols=TRUE, border_color="NA", fontsize_row=5,
treeheight_row=70, treeheight_col=70, cutree_cols=7,
clustering_method="ward.D2", show_colnames=FALSE,
annotation_col=annot, annotation_colors=ann_colors)

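In this heatmap, rows are marker genes and columns are pseudobulk samples. Colours show row-scaled expression (z-scores) on a blue-white-red scale spanning -2 to 2, the annotation bar gives the cluster of each sample, and the column dendrogram is cut into seven groups (cutree_cols=7) to highlight cluster separation.
Genes that are good markers should show high expression (red) in their target cluster and low expression (blue) in other clusters.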
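Row scaling can be reproduced by hand on a small matrix. The values below are hypothetical log-CPMs; the sketch shows that after z-scoring, each gene has mean 0 and standard deviation 1, so genes with very different absolute expression become comparable:

```r
# Hypothetical log-CPM values: 2 genes x 4 samples
lcpm_toy <- matrix(c(1, 2, 3, 6,
                     10, 10, 40, 20),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB"), paste0("s", 1:4)))

# scale() works on columns, so transpose to centre and scale each gene
z <- t(scale(t(lcpm_toy)))

# Each row now has mean 0 and standard deviation 1
round(cbind(mean = rowMeans(z), sd = apply(z, 1, sd)), 3)
```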
sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] pheatmap_1.0.13 Seurat_5.3.0 SeuratObject_5.1.0 sp_2.2-0
[5] edgeR_4.6.3 limma_3.64.3 lubridate_1.9.4 forcats_1.0.0
[9] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4 readr_2.1.5
[13] tidyr_1.3.1 tibble_3.3.0 ggplot2_3.5.2 tidyverse_2.0.0
[17] workflowr_1.7.1
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.17.1 jsonlite_2.0.0
[4] magrittr_2.0.3 spatstat.utils_3.1-5 farver_2.1.2
[7] rmarkdown_2.29 fs_1.6.6 vctrs_0.6.5
[10] ROCR_1.0-11 spatstat.explore_3.5-2 htmltools_0.5.8.1
[13] sass_0.4.10 sctransform_0.4.2 parallelly_1.45.0
[16] KernSmooth_2.23-26 bslib_0.9.0 htmlwidgets_1.6.4
[19] ica_1.0-3 plyr_1.8.9 plotly_4.11.0
[22] zoo_1.8-14 cachem_1.1.0 whisker_0.4.1
[25] igraph_2.1.4 mime_0.13 lifecycle_1.0.4
[28] pkgconfig_2.0.3 Matrix_1.7-3 R6_2.6.1
[31] fastmap_1.2.0 fitdistrplus_1.2-4 future_1.58.0
[34] shiny_1.11.1 digest_0.6.37 colorspace_2.1-1
[37] patchwork_1.3.0 ps_1.9.1 rprojroot_2.0.4
[40] tensor_1.5.1 RSpectra_0.16-2 irlba_2.3.5.1
[43] progressr_0.15.1 spatstat.sparse_3.1-0 timechange_0.3.0
[46] httr_1.4.7 polyclip_1.10-7 abind_1.4-8
[49] compiler_4.5.0 withr_3.0.2 fastDummies_1.7.5
[52] MASS_7.3-65 tools_4.5.0 lmtest_0.9-40
[55] httpuv_1.6.16 future.apply_1.20.0 goftest_1.2-3
[58] glue_1.8.0 callr_3.7.6 nlme_3.1-168
[61] promises_1.3.3 grid_4.5.0 Rtsne_0.17
[64] getPass_0.2-4 cluster_2.1.8.1 reshape2_1.4.4
[67] generics_0.1.4 gtable_0.3.6 spatstat.data_3.1-6
[70] tzdb_0.5.0 data.table_1.17.4 hms_1.1.3
[73] spatstat.geom_3.5-0 RcppAnnoy_0.0.22 ggrepel_0.9.6
[76] RANN_2.6.2 pillar_1.10.2 spam_2.11-1
[79] RcppHNSW_0.6.0 later_1.4.2 splines_4.5.0
[82] lattice_0.22-6 deldir_2.0-4 survival_3.8-3
[85] tidyselect_1.2.1 locfit_1.5-9.12 miniUI_0.1.2
[88] pbapply_1.7-4 knitr_1.50 git2r_0.36.2
[91] gridExtra_2.3 scattermore_1.2 xfun_0.52
[94] statmod_1.5.0 matrixStats_1.5.0 stringi_1.8.7
[97] lazyeval_0.2.2 yaml_2.3.10 evaluate_1.0.3
[100] codetools_0.2-20 cli_3.6.5 uwot_0.2.3
[103] xtable_1.8-4 reticulate_1.43.0 processx_3.8.6
[106] jquerylib_0.1.4 Rcpp_1.0.14 spatstat.random_3.4-1
[109] globals_0.18.0 png_0.1-8 spatstat.univar_3.1-4
[112] parallel_4.5.0 dotCall64_1.2 listenv_0.9.1
[115] viridisLite_0.4.2 scales_1.4.0 ggridges_0.5.6
[118] rlang_1.1.6 cowplot_1.2.0
Time taken to render notebook.
end_time <- Sys.time()
end_time - start_time
Time difference of 31.12815 secs