Last updated: 2025-02-14
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility, it’s best to always run the code in an empty environment.
The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version e6f0a05. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rproj.user/
Ignored: data/1M_neurons_filtered_gene_bc_matrices_h5.h5
Ignored: data/brain_counts/
Ignored: data/seurat_1m_neuron.rds
Ignored: r_packages_4.4.1/
Unstaged changes:
Modified: analysis/seurat_bpcells.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/singler.Rmd) and HTML (docs/singler.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | e6f0a05 | Dave Tang | 2025-02-14 | Using SingleR |
Performs unbiased cell type recognition from single-cell RNA sequencing data, by leveraging reference transcriptomic datasets of pure cell types to infer the cell of origin of each single cell independently.
Install SingleR.
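# Install BiocManager first if it is not already available.
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# Bioconductor packages used in this analysis.
BiocManager::install("SingleR")
# celldex provides the reference datasets loaded later on this page.
BiocManager::install("celldex")
BiocManager::install("scRNAseq")
BiocManager::install("scuttle")
BiocManager::install("scran")
# CRAN packages used by the SingleR plotting functions.
install.packages("viridis")
install.packages("pheatmap")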
This analysis follows the SingleR vignette, Using SingleR to annotate single-cell RNA-seq data.
SingleR is an automatic annotation method for single-cell RNA sequencing (scRNAseq) data (Aran et al. 2019). Given a reference dataset of samples (single-cell or bulk) with known labels, it labels new cells from a test dataset based on similarity to the reference. Thus, the burden of manually interpreting clusters and defining marker genes only has to be done once, for the reference dataset, and this biological knowledge can be propagated to new datasets in an automated manner.
The easiest way to use SingleR is to annotate cells against built-in references. In particular, the celldex package provides access to several reference datasets (mostly derived from bulk RNA-seq or microarray data) through dedicated retrieval functions. Here, we will use the Human Primary Cell Atlas (Mabbott et al. 2013), represented as a SummarizedExperiment object containing a matrix of log-expression values with sample-level labels.
suppressPackageStartupMessages(library(celldex))
hpca.se <- HumanPrimaryCellAtlasData()
hpca.se
class: SummarizedExperiment
dim: 19363 713
metadata(0):
assays(1): logcounts
rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
colData names(3): label.main label.fine label.ont
Our test dataset consists of some human embryonic stem cells (La Manno et al. 2016) from the scRNAseq package. For the sake of speed, we will only label the first 100 cells from this dataset.
suppressPackageStartupMessages(library(scRNAseq))
hESCs <- LaMannoBrainData('human-es')
hESCs <- hESCs[,1:100]
We use our hpca.se reference to annotate each cell in hESCs via the SingleR() function. This identifies marker genes from the reference and uses them to compute assignment scores (based on the Spearman correlation across markers) for each cell in the test dataset against each label in the reference. The label with the highest score is then assigned to the test cell, possibly with further fine-tuning to resolve closely related labels.
suppressPackageStartupMessages(library(SingleR))
pred.hesc <- SingleR(
test = hESCs,
ref = hpca.se,
assay.type.test=1,
labels = hpca.se$label.main
)
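The scoring step described above can be illustrated with a small base R sketch. This is a toy under stated assumptions, not SingleR's implementation: the marker genes, reference profiles and test cell are simulated, all object names are hypothetical, and the use of the 0.8 quantile of the per-label correlations is an assumption about the default scoring. Marker detection and fine-tuning are not reproduced here.

```r
# Toy sketch of correlation-based label assignment (not SingleR's actual code).
set.seed(1)
markers <- paste0("gene", 1:20)

# Hypothetical reference: 3 samples per label, log-expression over marker genes.
profile_a <- seq(1, 10, length.out = 20)  # label A: expression rises across genes
profile_b <- rev(profile_a)               # label B: the opposite pattern
ref <- cbind(
  sapply(1:3, function(i) profile_a + rnorm(20, sd = 0.5)),
  sapply(1:3, function(i) profile_b + rnorm(20, sd = 0.5))
)
rownames(ref) <- markers
ref_labels <- rep(c("A", "B"), each = 3)

# Hypothetical test cell that resembles label A.
test_cell <- profile_a + rnorm(20, sd = 1)

# Spearman correlation between the test cell and every reference sample.
rho <- apply(ref, 2, function(x) cor(test_cell, x, method = "spearman"))

# Per-label score: a high quantile of the correlations to that label's samples
# (0.8 is an assumed setting for this sketch).
scores <- sapply(split(rho, ref_labels),
                 function(r) unname(quantile(r, probs = 0.8)))

# The label with the highest score is assigned to the cell.
assigned <- names(which.max(scores))
```

Because Spearman correlation depends only on ranks, the score is insensitive to monotonic differences in scale between the test and reference data, which is part of why a bulk reference can be used to label single cells.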
Each row of the output DataFrame contains prediction results for a single cell. Labels are shown before (labels) and after pruning (pruned.labels), along with the associated scores.
pred.hesc
DataFrame with 100 rows and 4 columns
scores labels delta.next
<matrix> <character> <numeric>
1772122_301_C02 0.347652:0.139036:0.109547:... Neuroepithelial_cell 0.08332864
1772122_180_E05 0.361187:0.155395:0.134934:... Neurons 0.07283500
1772122_300_H02 0.446411:0.218052:0.190084:... Neuroepithelial_cell 0.13882912
1772122_180_B09 0.373512:0.172438:0.143537:... Neuroepithelial_cell 0.00317443
1772122_180_G04 0.357341:0.157275:0.126511:... Neuroepithelial_cell 0.09717938
... ... ... ...
1772122_299_E07 0.371989:0.202363:0.169379:... Neuroepithelial_cell 0.0837521
1772122_180_D02 0.353314:0.146049:0.115864:... Neuroepithelial_cell 0.0842804
1772122_300_D09 0.348789:0.129193:0.136732:... Neuroepithelial_cell 0.0595056
1772122_298_F09 0.332361:0.173357:0.141439:... Neuroepithelial_cell 0.1200606
1772122_302_A11 0.324928:0.127518:0.101609:... Astrocyte 0.0509478
pruned.labels
<character>
1772122_301_C02 Neuroepithelial_cell
1772122_180_E05 Neurons
1772122_300_H02 Neuroepithelial_cell
1772122_180_B09 Neuroepithelial_cell
1772122_180_G04 Neuroepithelial_cell
... ...
1772122_299_E07 Neuroepithelial_cell
1772122_180_D02 Neuroepithelial_cell
1772122_300_D09 Neuroepithelial_cell
1772122_298_F09 Neuroepithelial_cell
1772122_302_A11 Astrocyte
SingleR is workflow/package agnostic. The above example uses SummarizedExperiment objects, but the same functions will accept any (log-)normalized expression matrix.
Here, we will use two human pancreas datasets from the scRNAseq package. The aim is to use one pre-labelled dataset to annotate the other unlabelled dataset. First, we set up the Muraro et al. (2016) dataset to be our reference.
suppressPackageStartupMessages(library(scuttle))
sceM <- MuraroPancreasData()
# One should normally do cell-based quality control at this point, but for
# brevity's sake, we will just remove the unlabelled libraries here.
sceM <- sceM[,!is.na(sceM$label)]
# SingleR() expects reference datasets to be normalized and log-transformed.
sceM <- logNormCounts(sceM)
We then set up our test dataset from Grun et al. (2016). Libraries with no counts are removed before normalization, and all remaining cells are kept, so the results below cover 1,718 cells rather than a small subset.
sceG <- GrunPancreasData()
sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts.
sceG <- logNormCounts(sceG)
We then run SingleR() as described previously but with a marker detection mode that considers the variance of expression across cells. Here, we will use the Wilcoxon rank sum test to identify the top markers for each pairwise comparison between labels. This is slower but more appropriate for single-cell data compared to the default marker detection algorithm (which may fail for low-coverage data where the median is frequently zero).
pred.grun <- SingleR(
test=sceG,
ref=sceM,
labels=sceM$label,
de.method="wilcox"
)
table(pred.grun$labels)
acinar alpha beta delta duct endothelial
657 245 276 57 367 34
epsilon mesenchymal pp unclear
1 41 35 5
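To make the point about zero medians concrete, here is a minimal base R sketch on made-up values for a single gene in two labels. The vectors and their names are hypothetical, and the sketch only contrasts a median-based difference with a rank-based test; how SingleR ranks and selects the top markers internally is not reproduced here.

```r
# Hypothetical log-expression of one gene in two labels. Most cells are zero
# in both labels, but the gene is detected far more often in "beta" cells.
alpha_expr <- c(rep(0, 190), seq(0.5, 5, length.out = 10))
beta_expr  <- c(rep(0, 140), seq(0.5, 5, length.out = 60))

# Median-based comparison: both medians are zero, so there is no signal.
median_diff <- median(beta_expr) - median(alpha_expr)

# The Wilcoxon rank sum test uses the ranks of all cells, not just the median.
# exact = FALSE requests the normal approximation, since the data contain ties.
wt <- wilcox.test(beta_expr, alpha_expr, exact = FALSE)

# W / (n1 * n2) estimates the probability that a random beta cell has higher
# expression than a random alpha cell, with ties counted as one half.
auc <- unname(wt$statistic) / (length(beta_expr) * length(alpha_expr))
```

In this toy case the median difference is exactly zero while the rank-based effect size is above 0.5, so the gene would only be picked up as a candidate marker by the rank-based comparison.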
plotScoreHeatmap() displays the scores for all cells across all reference labels, which allows users to inspect the confidence of the predicted labels across the dataset. Ideally, each cell (i.e., column of the heatmap) should have one score that is obviously larger than the rest, indicating that it is unambiguously assigned to a single label. A spread of similar scores for a given cell indicates that the assignment is uncertain, though this may be acceptable if the uncertainty is distributed across similar cell types that cannot be easily resolved.
plotScoreHeatmap(pred.grun)
Another diagnostic is based on the per-cell “deltas”, i.e., the difference between the score for the assigned label and the median across all labels for each cell. Low deltas indicate that the assignment is uncertain, which is especially relevant if the cell’s true label does not exist in the reference. We can inspect these deltas across cells for each label using the plotDeltaDistribution() function.
plotDeltaDistribution(pred.grun, ncol = 3)
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Warning in max(data$density, na.rm = TRUE): no non-missing arguments to max;
returning -Inf
Warning: Computation failed in `stat_ydensity()`.
Caused by error in `$<-.data.frame`:
! replacement has 1 row, data has 0
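The delta defined above is simple to compute by hand, which can help when reading the plots. The following base R sketch uses a made-up score matrix with hypothetical cell names; it also computes the gap between the best and second-best score, which is assumed here to correspond to the delta.next column seen in the SingleR() output earlier.

```r
# Toy score matrix: rows are cells, columns are reference labels.
scores <- rbind(
  cell1 = c(acinar = 0.62, alpha = 0.21, beta = 0.18, duct = 0.25),
  cell2 = c(acinar = 0.31, alpha = 0.29, beta = 0.30, duct = 0.28)
)

# Assigned label: the column with the highest score in each row.
assigned <- colnames(scores)[max.col(scores, ties.method = "first")]

# Delta from the median: assigned (maximum) score minus the median score
# across all labels for that cell.
delta_med <- apply(scores, 1, max) - apply(scores, 1, median)

# Gap between the best and second-best score for each cell.
delta_next <- apply(scores, 1, function(s) {
  s <- sort(unname(s), decreasing = TRUE)
  s[1] - s[2]
})
```

Both toy cells are assigned to acinar, but cell1 has a delta of 0.39 while cell2 has a delta of only 0.015, which marks the second assignment as far less certain.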
The pruneScores() function will remove potentially poor-quality or ambiguous assignments based on the deltas. The minimum threshold on the deltas is defined using an outlier-based approach that accounts for differences in the scale of the correlations in various contexts - see ?pruneScores for more details. SingleR() will also report the pruned labels automatically in the pruned.labels field, where low-quality assignments are replaced with NA.
summary(is.na(pred.grun$pruned.labels))
Mode FALSE TRUE
logical 1651 67
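The outlier-based idea behind pruning can be sketched in a few lines of base R. This is a simplified illustration rather than the pruneScores() implementation: the deltas are made up, all cells are assumed to share one label, and the cutoff of three median absolute deviations (MADs) below the median is an assumed setting chosen for the example.

```r
# Hypothetical deltas for cells that were all assigned the same label.
deltas <- c(0.30, 0.32, 0.29, 0.31, 0.33, 0.28, 0.30, 0.05)
labels <- rep("beta", length(deltas))

# Outlier-based threshold: a fixed number of MADs below the median delta.
nmads <- 3  # assumed setting for this sketch
threshold <- median(deltas) - nmads * mad(deltas)

# Assignments whose delta falls below the threshold are replaced with NA.
pruned <- ifelse(deltas < threshold, NA_character_, labels)

# Same kind of summary as for the real results.
n_pruned <- sum(is.na(pruned))
```

Because the threshold is derived from the distribution of deltas rather than fixed in advance, it adapts to datasets where all correlations are uniformly lower or higher. Only the last toy cell, with a delta of 0.05, is pruned here.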
Finally, a simple yet effective diagnostic is to examine the expression of the marker genes for each label in the test dataset. We extract the identity of the markers from the metadata of the SingleR() results and use them in the plotMarkerHeatmap() function, as shown below for beta cell markers. If a cell in the test dataset is confidently assigned to a particular label, we would expect it to have strong expression of that label’s markers. At the very least, it should exhibit upregulation of those markers relative to cells assigned to other labels.
plotMarkerHeatmap(pred.grun, sceG, label="beta")
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] scuttle_1.16.0 SingleR_2.8.0
[3] scRNAseq_2.20.0 SingleCellExperiment_1.28.1
[5] celldex_1.16.0 SummarizedExperiment_1.36.0
[7] Biobase_2.66.0 GenomicRanges_1.58.0
[9] GenomeInfoDb_1.42.3 IRanges_2.40.1
[11] S4Vectors_0.44.0 BiocGenerics_0.52.0
[13] MatrixGenerics_1.18.1 matrixStats_1.4.1
[15] workflowr_1.7.1
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.17.1
[3] jsonlite_1.8.9 magrittr_2.0.3
[5] GenomicFeatures_1.58.0 gypsum_1.2.0
[7] farver_2.1.2 rmarkdown_2.28
[9] fs_1.6.4 BiocIO_1.16.0
[11] zlibbioc_1.52.0 vctrs_0.6.5
[13] memoise_2.0.1 Rsamtools_2.22.0
[15] DelayedMatrixStats_1.28.1 RCurl_1.98-1.16
[17] htmltools_0.5.8.1 S4Arrays_1.6.0
[19] AnnotationHub_3.14.0 curl_5.2.3
[21] BiocNeighbors_2.0.1 Rhdf5lib_1.28.0
[23] SparseArray_1.6.1 rhdf5_2.50.2
[25] sass_0.4.9 alabaster.base_1.6.1
[27] bslib_0.8.0 alabaster.sce_1.6.0
[29] httr2_1.0.5 cachem_1.1.0
[31] GenomicAlignments_1.42.0 igraph_2.1.1
[33] whisker_0.4.1 lifecycle_1.0.4
[35] pkgconfig_2.0.3 rsvd_1.0.5
[37] Matrix_1.7-0 R6_2.5.1
[39] fastmap_1.2.0 GenomeInfoDbData_1.2.13
[41] digest_0.6.37 colorspace_2.1-1
[43] AnnotationDbi_1.68.0 ps_1.8.1
[45] rprojroot_2.0.4 dqrng_0.4.1
[47] irlba_2.3.5.1 ExperimentHub_2.14.0
[49] RSQLite_2.3.7 beachmat_2.22.0
[51] labeling_0.4.3 filelock_1.0.3
[53] fansi_1.0.6 httr_1.4.7
[55] abind_1.4-8 compiler_4.4.1
[57] withr_3.0.2 bit64_4.5.2
[59] BiocParallel_1.40.0 viridis_0.6.5
[61] DBI_1.2.3 highr_0.11
[63] HDF5Array_1.34.0 alabaster.ranges_1.6.0
[65] alabaster.schemas_1.6.0 rappdirs_0.3.3
[67] DelayedArray_0.32.0 bluster_1.16.0
[69] rjson_0.2.23 tools_4.4.1
[71] httpuv_1.6.15 glue_1.8.0
[73] restfulr_0.0.15 callr_3.7.6
[75] rhdf5filters_1.18.0 promises_1.3.0
[77] grid_4.4.1 getPass_0.2-4
[79] cluster_2.1.6 generics_0.1.3
[81] gtable_0.3.6 ensembldb_2.30.0
[83] metapod_1.14.0 BiocSingular_1.22.0
[85] ScaledMatrix_1.14.0 utf8_1.2.4
[87] XVector_0.46.0 BiocVersion_3.20.0
[89] pillar_1.9.0 stringr_1.5.1
[91] limma_3.62.2 later_1.3.2
[93] dplyr_1.1.4 BiocFileCache_2.14.0
[95] lattice_0.22-6 rtracklayer_1.66.0
[97] bit_4.5.0 tidyselect_1.2.1
[99] locfit_1.5-9.10 Biostrings_2.74.1
[101] knitr_1.48 git2r_0.35.0
[103] gridExtra_2.3 ProtGenerics_1.38.0
[105] edgeR_4.4.2 xfun_0.48
[107] statmod_1.5.0 pheatmap_1.0.12
[109] stringi_1.8.4 UCSC.utils_1.2.0
[111] lazyeval_0.2.2 yaml_2.3.10
[113] evaluate_1.0.1 codetools_0.2-20
[115] tibble_3.2.1 alabaster.matrix_1.6.1
[117] BiocManager_1.30.25 cli_3.6.3
[119] munsell_0.5.1 processx_3.8.4
[121] jquerylib_0.1.4 Rcpp_1.0.13
[123] dbplyr_2.5.0 png_0.1-8
[125] XML_3.99-0.17 parallel_4.4.1
[127] ggplot2_3.5.1 blob_1.2.4
[129] scran_1.34.0 AnnotationFilter_1.30.0
[131] sparseMatrixStats_1.18.0 bitops_1.0-9
[133] viridisLite_0.4.2 alabaster.se_1.6.0
[135] scales_1.3.0 crayon_1.5.3
[137] rlang_1.1.4 KEGGREST_1.46.0