Last updated: 2022-02-27
Checks: 7 0
Knit directory: BASS-analysis/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(0)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version ada7b84. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
working directory clean
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/simu.Rmd
) and HTML (docs/simu.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | ada7b84 | zhengli09 | 2022-02-27 | Modify links |
html | 9a68511 | zhengli09 | 2022-02-27 | Build site. |
Rmd | 4ec8df1 | zhengli09 | 2022-02-27 | Publish the introduction, simulation analysis and STARmap analysis |
This page provides code to reproduce all the simulation results for BASS. Hopefully, this can also serve as a resource for building up your simulation studies and evaluate your statistical methods.
The simulation study relies on a few R packages and python softwares (e.g. SpaGCN and FICT), which need to be installed beforehand. The specific packages and all necessary functions to conduct the simulation study can be found at simulation.
source("code/simu_utils.R")
# data for inferring parameters for splatter
load("data/starmap_mpfc.RData") # starmap_cnts, starmap_info
cnts <- starmap_cnts[["20180417_BZ5_control"]]
info <- starmap_info[["20180417_BZ5_control"]]
xy <- as.matrix(info[, c("x", "y")])
starmap <- list(cnts = cnts, info = info)
# we perform 50 simulation replicates for each setting
# The random seeds used for each replicate are generated as below
NREP <- 50
set.seed(0)
seeds <- sample(20201230, NREP)
We generate the simulated data under the baseline simulation setting. Refer to the params file for a complete list of simulation settings.
# baseline settings for all four scenarios for the
# single tissue section analysis (L = 1)
scenarios <- 1:4 # simulation scenarios
C <- 4 # number of cell types
R <- 4 # number of spatial domains
J <- 200 # number of genes
de_prop <- 0.2 # proportion of DE genes for each cell type
de_facLoc <- 1.1 # DE gene strength
L <- 1 # number of tissue sections
# illustrate the data at the first replicate
rep <- 1
sim_dats <- lapply(scenarios, function(scenario){
simu(starmap = starmap, scenario = scenario, C = C, J = J, L = L,
batch_facLoc = 0, de_prop = de_prop, de_facLoc = de_facLoc,
de_facScale = 0.4, sim_seed = seeds[rep], debug = FALSE)
})
# The simulated data include a list of L gene expression count matrices,
# a list of L true cell type label vectors for evaluation purpose, and
# the random seed used for generating the data
scenario <- 3
sim_dat <- sim_dats[[scenario]]
# gene expression count matrix of the first tissue section under scenario 3
sim_dat[[1]][[1]][1:5, 1:5]
Cell1 Cell2 Cell3 Cell4 Cell5
Gene1 0 0 0 0 0
Gene2 16 0 0 0 0
Gene3 128 3 2 5 0
Gene4 17 0 0 1 2
Gene5 0 0 0 2 0
We visualize the cell type distributions under scenarios I-IV. You can refer to visualization for some useful plotting functions or you can write your own code for plotting.
library(cowplot)
source("code/viz.R")
p <- lapply(scenarios, function(scenario){
plotClusters(xy, labels = sim_dats[[scenario]][[2]][[1]],
title = paste("Scenario", c("I", "II", "III", "IV")[scenario]))
})
legend <- get_legend(p[[1]] +
theme(legend.position = "bottom") +
guides(color = guide_legend(title = "Cell type")))
p <- plot_grid(plotlist = p, ncol = 2)
plot_grid(p, legend, ncol = 1, rel_heights = c(1, 0.1))
Version | Author | Date |
---|---|---|
9a68511 | zhengli09 | 2022-02-27 |
We have wrapped up code for running different methods into different functions. Those functions take in simulated data and output adjusted random index (ARI) measuring the estimate cell type or spatial domain labels with the underlying truth. Functions for running different methods can be found at simulation. We take the simulated data from scenario 3 as an example to illustrate running different methods.
scenario <- 3
BASS_out <- run_BASS(sim_dats[[scenario]], xy, "ACCUR_EST", beta = 1, C, R,
init_method = "kmeans", cov_struc = "EEE")
***************************************
Bayesian Analytics for Spatial Segmentation (BASS)
Authors: Zheng Li, Xiang Zhou
Affiliate: Department of Biostatistics, University of Michigan
INPUT INFO:
- Number of samples: 1
- Number of spots/cells: 1049
- Number of genes: 200
- Potts interaction parameter estimation approach: ACCUR_EST
- Variance-covariance structure of gene expression features: EEE
To list all hyper-parameters, Type listAllHyper(BASS_object)
***************************************
***** Log-normalize gene expression data *****
***** Exclude genes with 0 expression *****
***** Reduce data dimension with PCA *****
......................................................................................
. Method Time (sec) Status .
......................................................................................
. ECR-ITERATIVE-1 17.481 Converged (3 iterations) .
......................................................................................
Relabelling all methods according to method ECR-ITERATIVE-1 ... done!
Retrieve the 1 permutation arrays by typing:
[...]$permutations$"ECR-ITERATIVE-1"
Retrieve the 1 best clusterings: [...]$clusters
Retrieve the 1 CPU times: [...]$timings
Retrieve the 1 X 1 similarity matrix: [...]$similarity
Label switching finished. Total time: 19.1 seconds.
......................................................................................
. Method Time (sec) Status .
......................................................................................
. ECR-ITERATIVE-1 10.205 Converged (2 iterations) .
......................................................................................
Relabelling all methods according to method ECR-ITERATIVE-1 ... done!
Retrieve the 1 permutation arrays by typing:
[...]$permutations$"ECR-ITERATIVE-1"
Retrieve the 1 best clusterings: [...]$clusters
Retrieve the 1 CPU times: [...]$timings
Retrieve the 1 X 1 similarity matrix: [...]$similarity
Label switching finished. Total time: 11.8 seconds.
BASS_out
$c_ari
[1] 0.9957436
$z_ari
[1] 0.9190263
$pi_est
[,1] [,2] [,3] [,4]
[1,] 0.5411765 0.0000000 0.2703297 0.2582583
[2,] 0.2000000 0.4829545 0.0000000 0.2762763
[3,] 0.2588235 0.2272727 0.4703297 0.0000000
[4,] 0.0000000 0.2897727 0.2593407 0.4654655
$mse_pi
[1] 0.05127128
Please note that running HMRF requires you to specify a path for python and will generate intermediate and final results into a specified folder. Please modify those directories accordingly in the run_HMRF
function if you want to use the function for your own analysis.
# Run HMRF
# Note that case and rep parameters below are used to generate a unique
# temporary directories for running HMRF.
HMRF_out <- run_HMRF(sim_dat, xy, ztrue = info$z, R, case = 3,
rep, usePCs = F, dosearchSEgenes = T)
external python path provided and will be used
Consider to install these (optional) packages to run all possible Giotto commands for spatial analyses: trendsceek multinet RTriangle FactoMiner
Giotto does not automatically install all these packages as they are not absolutely required and this reduces the number of dependencies
This is the single parameter version of binSpect
1. matrix binarization complete
2. spatial enrichment test completed
3. (optional) average expression of high expressing cells calculated
4. (optional) number of high expressing cells calculated
[1] "/net/mulan/home/zlisph//softwares/miniconda3/bin/python /net/mulan/home/zlisph/R/x86_64-pc-linux-gnu-library/4.0/Giotto/python/reader2.py -l \"/net/mulan/home/zlisph/BASS_pjt/2_simu/HMRF_tmp/case3/rep1/result_k4/spatial_cell_locations.txt\" -g \"/net/mulan/home/zlisph/BASS_pjt/2_simu/HMRF_tmp/case3/rep1/result_k4/spatial_genes.txt\" -n \"/net/mulan/home/zlisph/BASS_pjt/2_simu/HMRF_tmp/case3/rep1/result_k4/spatial_network.txt\" -e \"/net/mulan/home/zlisph/BASS_pjt/2_simu/HMRF_tmp/case3/rep1/result_k4/expression_matrix.txt\" -o \"/net/mulan/home/zlisph/BASS_pjt/2_simu/HMRF_tmp/case3/rep1/result_k4/result.spatial.zscore\" -a test -k 4 -b 0 2 26 -t 1e-10 -z none -s 100 -i 100"
# ARI for spatial domain detection across different
# spatial interaction parameter betas.
HMRF_out
0.0 2.0 4.0 6.0 8.0 10.0 12.0 14.0
0.1508578 0.1508578 0.1576335 0.1648563 0.2533123 0.3601968 0.4037333 0.4220421
16.0 18.0 20.0 22.0 24.0 26.0 28.0 30.0
0.4942260 0.4725915 0.5156322 0.5522958 0.5351690 0.5768655 0.6255737 0.5726423
32.0 34.0 36.0 38.0 40.0 42.0 44.0 46.0
0.5815424 0.5794783 0.5912112 0.5985681 0.6235051 0.6552240 0.6506761 0.6563171
48.0 50.0
0.6426422 0.6458700
Note that BayesSpace cannot identify any neighbors because it was developed for analyzing ordered lattice structure of spots from ST and 10x Visium platforms.
# Run BayesSpace
BayesSpace_out <- run_BayesSpace(sim_dat, xy, ztrue = info$z, R)
Neighbors were identified for 0 out of 1049 spots.
Fitting model...
Calculating labels using iterations 1000 through 10000.
# ARI for spatial domain detection
BayesSpace_out
[1] 0.1678712
Because SpaGCN was originally developed with Python, we wrapped up the code into a function and have it imported into R such that we can conveniently analyze the simulated data in R. The python function can be found at simulation.
# Run SpaGCN
source_python("code/run_SpaGCN.py")
SpaGCN_out <- run_SpaGCN(sim_dat, xy, info$z, R)
# ARI for spatial domain detection
SpaGCN_out
[1] 0.2550737
Because Seurat uses a resolution parameter to indirectly determine the number of clusters, we run Seurat on a range of resolution parameters and identify the first value that resulted in the desired number of cell types.
# Run Seurat
resolutions <- seq(0.3, 1.5, by = 0.1)
seu_out <- seu_cluster(sim_dat, resolutions)
# ARI for cell type clustering and the number of identified clusters
seu_out
[1] "0.939455349549594,4" "0.948825866306038,4" "0.948825866306038,4"
[4] "0.948825866306038,4" "0.951196703238985,4" "0.951196703238985,4"
[7] "0.956035282669879,4" "0.956035282669879,4" "0.956035282669879,4"
[10] "0.958501798605292,4" "0.958501798605292,4" "0.84056426004742,5"
[13] "0.762544765890606,6"
Note that when the number of cells exceeds 5,000, SC3 randomly selects 5,000 cells for clustering, trains a support vector machine with the cluster labels obtained for the 5,000 cells, and then predict the cluster labels of the remaining cells.
# Run SC3
# Note SC3 algorithm is slow
sc3_out <- sc3_cluster(sim_dat, C)
Setting SC3 parameters...
Calculating distances between the cells...
Performing transformations and calculating eigenvectors...
Performing k-means clustering...
Calculating consensus matrix...
# ARI for cell type clustering
sc3_out
[1] 0.9670091
Please install the software according to the FICT github and modify the paths to the FICT software and a temporary folder accordingly in the fict_cluster
function.
# Note that case and rep parameters below are used to generate a unique
# temporary directories for running FICT.
fict_out <- fict_cluster(sim_dat, xy, C, case = 3, rep)
# ARI for cell type clustering
fict_out
[1] 0.9829696
Put the link to all the simulation results here later and plot the two main figures on simulation analysis.
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] cowplot_1.1.1 reticulate_1.22
[3] splatter_1.16.1 gtools_3.9.2
[5] scater_1.16.2 forcats_0.5.0
[7] stringr_1.4.0 dplyr_1.0.8
[9] purrr_0.3.4 readr_1.3.1
[11] tidyr_1.1.1 tibble_3.1.6
[13] ggplot2_3.3.5 tidyverse_1.3.0
[15] SC3_1.20.0 Seurat_3.2.3
[17] BayesSpace_1.2.1 SingleCellExperiment_1.14.1
[19] SummarizedExperiment_1.22.0 Biobase_2.48.0
[21] GenomicRanges_1.44.0 GenomeInfoDb_1.24.2
[23] IRanges_2.26.0 S4Vectors_0.30.2
[25] BiocGenerics_0.38.0 MatrixGenerics_1.4.3
[27] matrixStats_0.61.0 Giotto_1.0.4
[29] BASS_1.0 mclust_5.4.9
[31] GIGrvg_0.5 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] rappdirs_0.3.3 scattermore_0.7
[3] maxLik_1.5-2 coda_0.19-4
[5] bit64_4.0.2 knitr_1.37
[7] irlba_2.3.3 DelayedArray_0.18.0
[9] data.table_1.14.2 rpart_4.1-15
[11] RCurl_1.98-1.5 doParallel_1.0.15
[13] generics_0.1.2 callr_3.7.0
[15] RSQLite_2.2.0 combinat_0.0-8
[17] RANN_2.6.1 proxy_0.4-26
[19] future_1.22.1 bit_4.0.4
[21] xml2_1.3.3 lubridate_1.7.9
[23] spatstat.data_2.1-0 httpuv_1.5.4
[25] assertthat_0.2.1 viridis_0.5.1
[27] xfun_0.29 hms_0.5.3
[29] jquerylib_0.1.4 evaluate_0.15
[31] promises_1.1.1 DEoptimR_1.0-9
[33] fansi_1.0.2 readxl_1.3.1
[35] dbplyr_1.4.4 igraph_1.2.7
[37] DBI_1.1.1 htmlwidgets_1.5.1
[39] ellipsis_0.3.2 backports_1.2.1
[41] deldir_1.0-6 sparseMatrixStats_1.4.2
[43] vctrs_0.3.8 ROCR_1.0-11
[45] abind_1.4-5 withr_2.4.3
[47] cachem_1.0.6 robustbase_0.93-9
[49] checkmate_2.0.0 sctransform_0.3.2
[51] scran_1.20.1 goftest_1.2-3
[53] cluster_2.1.2 lazyeval_0.2.2
[55] crayon_1.5.0 labeling_0.4.2
[57] edgeR_3.34.1 pkgconfig_2.0.3
[59] nlme_3.1-153 vipor_0.4.5
[61] rlang_1.0.1 globals_0.14.0
[63] lifecycle_1.0.1 miniUI_0.1.1.1
[65] sandwich_3.0-1 BiocFileCache_1.12.0
[67] modelr_0.1.8 rsvd_1.0.3
[69] cellranger_1.1.0 rprojroot_2.0.2
[71] polyclip_1.10-0 lmtest_0.9-38
[73] rngtools_1.5.2 Matrix_1.3-4
[75] Rhdf5lib_1.10.1 zoo_1.8-9
[77] reprex_0.3.0 beeswarm_0.4.0
[79] whisker_0.4 ggridges_0.5.3
[81] processx_3.5.2 pheatmap_1.0.12
[83] png_0.1-7 viridisLite_0.4.0
[85] bitops_1.0-7 getPass_0.2-2
[87] KernSmooth_2.23-20 blob_1.2.1
[89] DelayedMatrixStats_1.14.3 doRNG_1.8.2
[91] parallelly_1.28.1 lpSolve_5.6.15
[93] beachmat_2.8.1 scales_1.1.1
[95] memoise_2.0.1 magrittr_2.0.2
[97] plyr_1.8.6 ica_1.0-2
[99] zlibbioc_1.34.0 DirichletReg_0.7-1
[101] compiler_4.1.2 miscTools_0.6-26
[103] dqrng_0.3.0 RColorBrewer_1.1-2
[105] rrcov_1.6-0 fitdistrplus_1.1-6
[107] cli_3.2.0 XVector_0.32.0
[109] listenv_0.8.0 patchwork_1.1.1
[111] pbapply_1.5-0 ps_1.6.0
[113] Formula_1.2-4 MASS_7.3-54
[115] mgcv_1.8-38 tidyselect_1.1.2
[117] stringi_1.7.6 highr_0.9
[119] yaml_2.3.5 BiocSingular_1.4.0
[121] locfit_1.5-9.4 ggrepel_0.9.1
[123] grid_4.1.2 sass_0.4.0
[125] tools_4.1.2 future.apply_1.8.1
[127] label.switching_1.8 rstudioapi_0.13
[129] bluster_1.2.1 foreach_1.5.0
[131] git2r_0.28.0 metapod_1.0.0
[133] gridExtra_2.3 farver_2.1.0
[135] Rtsne_0.15 digest_0.6.29
[137] pracma_2.2.9 shiny_1.5.0
[139] Rcpp_1.0.8 broom_0.7.10
[141] scuttle_1.2.1 later_1.1.0.1
[143] RcppAnnoy_0.0.19 WriteXLS_6.3.0
[145] httr_1.4.2 Rdpack_2.1.2
[147] colorspace_2.0-3 rvest_0.3.6
[149] fs_1.5.2 tensor_1.5
[151] splines_4.1.2 uwot_0.1.10
[153] statmod_1.4.36 spatstat.utils_2.2-0
[155] xgboost_1.4.1.1 plotly_4.9.2.1
[157] xtable_1.8-4 jsonlite_1.8.0
[159] spatstat_1.64-1 R6_2.5.1
[161] pillar_1.7.0 htmltools_0.5.2
[163] mime_0.12 glue_1.6.2
[165] fastmap_1.1.0 BiocParallel_1.22.0
[167] BiocNeighbors_1.6.0 class_7.3-19
[169] codetools_0.2-18 pcaPP_1.9-74
[171] mvtnorm_1.1-3 utf8_1.2.2
[173] lattice_0.20-45 bslib_0.3.1
[175] curl_4.3.2 ggbeeswarm_0.6.0
[177] leiden_0.3.9 survival_3.2-13
[179] limma_3.48.3 rmarkdown_2.11
[181] munsell_0.5.0 e1071_1.7-9
[183] rhdf5_2.32.2 GenomeInfoDbData_1.2.6
[185] iterators_1.0.13 smfishHmrf_0.1
[187] haven_2.3.1 reshape2_1.4.4
[189] gtable_0.3.0 rbibutils_2.2.4