Introduction

Simulation study

Generate simulated data
Visualize of the simulated data
Running all methods

Run BASS
Run HMRF
Run BayesSpace
Run SpaGCN
Run Seurat
Run SC3
Run FICT

Multi-sample simulation

Find all the simulation results

Last updated: 2022-02-27

Checks: 7 0

Knit directory: BASS-analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(0)

The command set.seed(0) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: ada7b84

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ada7b84. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


working directory clean

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/simu.Rmd) and HTML (docs/simu.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	ada7b84	zhengli09	2022-02-27	Modify links
html	9a68511	zhengli09	2022-02-27	Build site.
Rmd	4ec8df1	zhengli09	2022-02-27	Publish the introduction, simulation analysis and STARmap analysis

Introduction

This page provides code to reproduce all the simulation results for BASS. Hopefully, this can also serve as a resource for building up your simulation studies and evaluate your statistical methods.

Simulation study

Generate simulated data

The simulation study relies on a few R packages and python softwares (e.g. SpaGCN and FICT), which need to be installed beforehand. The specific packages and all necessary functions to conduct the simulation study can be found at simulation.

source("code/simu_utils.R")
# data for inferring parameters for splatter
load("data/starmap_mpfc.RData") # starmap_cnts, starmap_info
cnts <- starmap_cnts[["20180417_BZ5_control"]]
info <- starmap_info[["20180417_BZ5_control"]]
xy <- as.matrix(info[, c("x", "y")])
starmap <- list(cnts = cnts, info = info)

# we perform 50 simulation replicates for each setting
# The random seeds used for each replicate are generated as below
NREP <- 50
set.seed(0)
seeds <- sample(20201230, NREP)

We generate the simulated data under the baseline simulation setting. Refer to the params file for a complete list of simulation settings.

# baseline settings for all four scenarios for the 
# single tissue section analysis (L = 1)
scenarios <- 1:4 # simulation scenarios
C <- 4 # number of cell types
R <- 4 # number of spatial domains
J <- 200 # number of genes
de_prop <- 0.2 # proportion of DE genes for each cell type
de_facLoc <- 1.1 # DE gene strength
L <- 1 # number of tissue sections

# illustrate the data at the first replicate
rep <- 1
sim_dats <- lapply(scenarios, function(scenario){
  simu(starmap = starmap, scenario = scenario, C = C, J = J, L = L, 
       batch_facLoc = 0, de_prop = de_prop, de_facLoc = de_facLoc, 
       de_facScale = 0.4, sim_seed = seeds[rep], debug = FALSE)
})

# The simulated data include a list of L gene expression count matrices,
# a list of L true cell type label vectors for evaluation purpose, and 
# the random seed used for generating the data
scenario <- 3
sim_dat <- sim_dats[[scenario]]
# gene expression count matrix of the first tissue section under scenario 3
sim_dat[[1]][[1]][1:5, 1:5]

      Cell1 Cell2 Cell3 Cell4 Cell5
Gene1     0     0     0     0     0
Gene2    16     0     0     0     0
Gene3   128     3     2     5     0
Gene4    17     0     0     1     2
Gene5     0     0     0     2     0

Visualize of the simulated data

We visualize the cell type distributions under scenarios I-IV. You can refer to visualization for some useful plotting functions or you can write your own code for plotting.

library(cowplot)
source("code/viz.R")
p <- lapply(scenarios, function(scenario){
  plotClusters(xy, labels = sim_dats[[scenario]][[2]][[1]],
               title = paste("Scenario", c("I", "II", "III", "IV")[scenario]))
})
legend <- get_legend(p[[1]] + 
  theme(legend.position = "bottom") +
  guides(color = guide_legend(title = "Cell type")))
p <- plot_grid(plotlist = p, ncol = 2)
plot_grid(p, legend, ncol = 1, rel_heights = c(1, 0.1))

Version	Author	Date
9a68511	zhengli09	2022-02-27

Running all methods

We have wrapped up code for running different methods into different functions. Those functions take in simulated data and output adjusted random index (ARI) measuring the estimate cell type or spatial domain labels with the underlying truth. Functions for running different methods can be found at simulation. We take the simulated data from scenario 3 as an example to illustrate running different methods.

scenario <- 3

Run BASS

BASS_out <- run_BASS(sim_dats[[scenario]], xy, "ACCUR_EST", beta = 1, C, R, 
                     init_method = "kmeans", cov_struc = "EEE")

***************************************
  Bayesian Analytics for Spatial Segmentation (BASS)
  Authors: Zheng Li, Xiang Zhou
  Affiliate: Department of Biostatistics, University of Michigan
  INPUT INFO:
    - Number of samples: 1 
    - Number of spots/cells: 1049 
    - Number of genes: 200 
    - Potts interaction parameter estimation approach: ACCUR_EST 
    - Variance-covariance structure of gene expression features: EEE 
  To list all hyper-parameters, Type listAllHyper(BASS_object)
***************************************
***** Log-normalize gene expression data *****
***** Exclude genes with 0 expression *****
***** Reduce data dimension with PCA *****

    ......................................................................................
    . Method                         Time (sec)           Status                         . 
    ......................................................................................
    . ECR-ITERATIVE-1                17.481               Converged (3 iterations)       . 
    ......................................................................................

    Relabelling all methods according to method ECR-ITERATIVE-1 ... done!
    Retrieve the 1 permutation arrays by typing:
        [...]$permutations$"ECR-ITERATIVE-1"
    Retrieve the 1 best clusterings: [...]$clusters
    Retrieve the 1 CPU times: [...]$timings
    Retrieve the 1 X 1 similarity matrix: [...]$similarity
    Label switching finished. Total time: 19.1 seconds. 

    ......................................................................................
    . Method                         Time (sec)           Status                         . 
    ......................................................................................
    . ECR-ITERATIVE-1                10.205               Converged (2 iterations)       . 
    ......................................................................................

    Relabelling all methods according to method ECR-ITERATIVE-1 ... done!
    Retrieve the 1 permutation arrays by typing:
        [...]$permutations$"ECR-ITERATIVE-1"
    Retrieve the 1 best clusterings: [...]$clusters
    Retrieve the 1 CPU times: [...]$timings
    Retrieve the 1 X 1 similarity matrix: [...]$similarity
    Label switching finished. Total time: 11.8 seconds.

BASS_out

$c_ari
[1] 0.9957436

$z_ari
[1] 0.9190263

$pi_est
          [,1]      [,2]      [,3]      [,4]
[1,] 0.5411765 0.0000000 0.2703297 0.2582583
[2,] 0.2000000 0.4829545 0.0000000 0.2762763
[3,] 0.2588235 0.2272727 0.4703297 0.0000000
[4,] 0.0000000 0.2897727 0.2593407 0.4654655

$mse_pi
[1] 0.05127128

Run HMRF

Please note that running HMRF requires you to specify a path for python and will generate intermediate and final results into a specified folder. Please modify those directories accordingly in the run_HMRF function if you want to use the function for your own analysis.

# Run HMRF
# Note that case and rep parameters below are used to generate a unique
# temporary directories for running HMRF.
HMRF_out <- run_HMRF(sim_dat, xy, ztrue = info$z, R, case = 3, 
                     rep, usePCs = F, dosearchSEgenes = T)


 external python path provided and will be used 
Consider to install these (optional) packages to run all possible Giotto commands for spatial analyses:  trendsceek multinet RTriangle FactoMiner
 Giotto does not automatically install all these packages as they are not absolutely required and this reduces the number of dependencies
 This is the single parameter version of binSpect
 1. matrix binarization complete 

 2. spatial enrichment test completed 

 3. (optional) average expression of high expressing cells calculated 

 4. (optional) number of high expressing cells calculated 
[1] "/net/mulan/home/zlisph//softwares/miniconda3/bin/python /net/mulan/home/zlisph/R/x86_64-pc-linux-gnu-library/4.0/Giotto/python/reader2.py -l \"/net/mulan/home/zlisph/BASS_pjt/2_simu/HMRF_tmp/case3/rep1/result_k4/spatial_cell_locations.txt\" -g \"/net/mulan/home/zlisph/BASS_pjt/2_simu/HMRF_tmp/case3/rep1/result_k4/spatial_genes.txt\" -n \"/net/mulan/home/zlisph/BASS_pjt/2_simu/HMRF_tmp/case3/rep1/result_k4/spatial_network.txt\" -e \"/net/mulan/home/zlisph/BASS_pjt/2_simu/HMRF_tmp/case3/rep1/result_k4/expression_matrix.txt\" -o \"/net/mulan/home/zlisph/BASS_pjt/2_simu/HMRF_tmp/case3/rep1/result_k4/result.spatial.zscore\" -a test -k 4 -b 0 2 26 -t 1e-10 -z none -s 100 -i 100"

# ARI for spatial domain detection across different
# spatial interaction parameter betas.
HMRF_out

      0.0       2.0       4.0       6.0       8.0      10.0      12.0      14.0 
0.1508578 0.1508578 0.1576335 0.1648563 0.2533123 0.3601968 0.4037333 0.4220421 
     16.0      18.0      20.0      22.0      24.0      26.0      28.0      30.0 
0.4942260 0.4725915 0.5156322 0.5522958 0.5351690 0.5768655 0.6255737 0.5726423 
     32.0      34.0      36.0      38.0      40.0      42.0      44.0      46.0 
0.5815424 0.5794783 0.5912112 0.5985681 0.6235051 0.6552240 0.6506761 0.6563171 
     48.0      50.0 
0.6426422 0.6458700

Run BayesSpace

Note that BayesSpace cannot identify any neighbors because it was developed for analyzing ordered lattice structure of spots from ST and 10x Visium platforms.

# Run BayesSpace
BayesSpace_out <- run_BayesSpace(sim_dat, xy, ztrue = info$z, R)

Neighbors were identified for 0 out of 1049 spots.

Fitting model...

Calculating labels using iterations 1000 through 10000.

# ARI for spatial domain detection
BayesSpace_out

[1] 0.1678712

Run SpaGCN

Because SpaGCN was originally developed with Python, we wrapped up the code into a function and have it imported into R such that we can conveniently analyze the simulated data in R. The python function can be found at simulation.

# Run SpaGCN
source_python("code/run_SpaGCN.py")
SpaGCN_out <- run_SpaGCN(sim_dat, xy, info$z, R)
# ARI for spatial domain detection
SpaGCN_out

[1] 0.2550737

Run Seurat

Because Seurat uses a resolution parameter to indirectly determine the number of clusters, we run Seurat on a range of resolution parameters and identify the first value that resulted in the desired number of cell types.

# Run Seurat
resolutions <- seq(0.3, 1.5, by = 0.1)
seu_out <- seu_cluster(sim_dat, resolutions)
# ARI for cell type clustering and the number of identified clusters 
seu_out

 [1] "0.939455349549594,4" "0.948825866306038,4" "0.948825866306038,4"
 [4] "0.948825866306038,4" "0.951196703238985,4" "0.951196703238985,4"
 [7] "0.956035282669879,4" "0.956035282669879,4" "0.956035282669879,4"
[10] "0.958501798605292,4" "0.958501798605292,4" "0.84056426004742,5" 
[13] "0.762544765890606,6"

Run SC3

Note that when the number of cells exceeds 5,000, SC3 randomly selects 5,000 cells for clustering, trains a support vector machine with the cluster labels obtained for the 5,000 cells, and then predict the cluster labels of the remaining cells.

# Run SC3
# Note SC3 algorithm is slow
sc3_out <- sc3_cluster(sim_dat, C)

Setting SC3 parameters...

Calculating distances between the cells...

Performing transformations and calculating eigenvectors...

Performing k-means clustering...

Calculating consensus matrix...

# ARI for cell type clustering
sc3_out

[1] 0.9670091

Run FICT

Please install the software according to the FICT github and modify the paths to the FICT software and a temporary folder accordingly in the fict_cluster function.

# Note that case and rep parameters below are used to generate a unique
# temporary directories for running FICT.
fict_out <- fict_cluster(sim_dat, xy, C, case = 3, rep)
# ARI for cell type clustering
fict_out

[1] 0.9829696

Multi-sample simulation

Find all the simulation results

Put the link to all the simulation results here later and plot the two main figures on simulation analysis.

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] cowplot_1.1.1               reticulate_1.22            
 [3] splatter_1.16.1             gtools_3.9.2               
 [5] scater_1.16.2               forcats_0.5.0              
 [7] stringr_1.4.0               dplyr_1.0.8                
 [9] purrr_0.3.4                 readr_1.3.1                
[11] tidyr_1.1.1                 tibble_3.1.6               
[13] ggplot2_3.3.5               tidyverse_1.3.0            
[15] SC3_1.20.0                  Seurat_3.2.3               
[17] BayesSpace_1.2.1            SingleCellExperiment_1.14.1
[19] SummarizedExperiment_1.22.0 Biobase_2.48.0             
[21] GenomicRanges_1.44.0        GenomeInfoDb_1.24.2        
[23] IRanges_2.26.0              S4Vectors_0.30.2           
[25] BiocGenerics_0.38.0         MatrixGenerics_1.4.3       
[27] matrixStats_0.61.0          Giotto_1.0.4               
[29] BASS_1.0                    mclust_5.4.9               
[31] GIGrvg_0.5                  workflowr_1.7.0            

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3            scattermore_0.7          
  [3] maxLik_1.5-2              coda_0.19-4              
  [5] bit64_4.0.2               knitr_1.37               
  [7] irlba_2.3.3               DelayedArray_0.18.0      
  [9] data.table_1.14.2         rpart_4.1-15             
 [11] RCurl_1.98-1.5            doParallel_1.0.15        
 [13] generics_0.1.2            callr_3.7.0              
 [15] RSQLite_2.2.0             combinat_0.0-8           
 [17] RANN_2.6.1                proxy_0.4-26             
 [19] future_1.22.1             bit_4.0.4                
 [21] xml2_1.3.3                lubridate_1.7.9          
 [23] spatstat.data_2.1-0       httpuv_1.5.4             
 [25] assertthat_0.2.1          viridis_0.5.1            
 [27] xfun_0.29                 hms_0.5.3                
 [29] jquerylib_0.1.4           evaluate_0.15            
 [31] promises_1.1.1            DEoptimR_1.0-9           
 [33] fansi_1.0.2               readxl_1.3.1             
 [35] dbplyr_1.4.4              igraph_1.2.7             
 [37] DBI_1.1.1                 htmlwidgets_1.5.1        
 [39] ellipsis_0.3.2            backports_1.2.1          
 [41] deldir_1.0-6              sparseMatrixStats_1.4.2  
 [43] vctrs_0.3.8               ROCR_1.0-11              
 [45] abind_1.4-5               withr_2.4.3              
 [47] cachem_1.0.6              robustbase_0.93-9        
 [49] checkmate_2.0.0           sctransform_0.3.2        
 [51] scran_1.20.1              goftest_1.2-3            
 [53] cluster_2.1.2             lazyeval_0.2.2           
 [55] crayon_1.5.0              labeling_0.4.2           
 [57] edgeR_3.34.1              pkgconfig_2.0.3          
 [59] nlme_3.1-153              vipor_0.4.5              
 [61] rlang_1.0.1               globals_0.14.0           
 [63] lifecycle_1.0.1           miniUI_0.1.1.1           
 [65] sandwich_3.0-1            BiocFileCache_1.12.0     
 [67] modelr_0.1.8              rsvd_1.0.3               
 [69] cellranger_1.1.0          rprojroot_2.0.2          
 [71] polyclip_1.10-0           lmtest_0.9-38            
 [73] rngtools_1.5.2            Matrix_1.3-4             
 [75] Rhdf5lib_1.10.1           zoo_1.8-9                
 [77] reprex_0.3.0              beeswarm_0.4.0           
 [79] whisker_0.4               ggridges_0.5.3           
 [81] processx_3.5.2            pheatmap_1.0.12          
 [83] png_0.1-7                 viridisLite_0.4.0        
 [85] bitops_1.0-7              getPass_0.2-2            
 [87] KernSmooth_2.23-20        blob_1.2.1               
 [89] DelayedMatrixStats_1.14.3 doRNG_1.8.2              
 [91] parallelly_1.28.1         lpSolve_5.6.15           
 [93] beachmat_2.8.1            scales_1.1.1             
 [95] memoise_2.0.1             magrittr_2.0.2           
 [97] plyr_1.8.6                ica_1.0-2                
 [99] zlibbioc_1.34.0           DirichletReg_0.7-1       
[101] compiler_4.1.2            miscTools_0.6-26         
[103] dqrng_0.3.0               RColorBrewer_1.1-2       
[105] rrcov_1.6-0               fitdistrplus_1.1-6       
[107] cli_3.2.0                 XVector_0.32.0           
[109] listenv_0.8.0             patchwork_1.1.1          
[111] pbapply_1.5-0             ps_1.6.0                 
[113] Formula_1.2-4             MASS_7.3-54              
[115] mgcv_1.8-38               tidyselect_1.1.2         
[117] stringi_1.7.6             highr_0.9                
[119] yaml_2.3.5                BiocSingular_1.4.0       
[121] locfit_1.5-9.4            ggrepel_0.9.1            
[123] grid_4.1.2                sass_0.4.0               
[125] tools_4.1.2               future.apply_1.8.1       
[127] label.switching_1.8       rstudioapi_0.13          
[129] bluster_1.2.1             foreach_1.5.0            
[131] git2r_0.28.0              metapod_1.0.0            
[133] gridExtra_2.3             farver_2.1.0             
[135] Rtsne_0.15                digest_0.6.29            
[137] pracma_2.2.9              shiny_1.5.0              
[139] Rcpp_1.0.8                broom_0.7.10             
[141] scuttle_1.2.1             later_1.1.0.1            
[143] RcppAnnoy_0.0.19          WriteXLS_6.3.0           
[145] httr_1.4.2                Rdpack_2.1.2             
[147] colorspace_2.0-3          rvest_0.3.6              
[149] fs_1.5.2                  tensor_1.5               
[151] splines_4.1.2             uwot_0.1.10              
[153] statmod_1.4.36            spatstat.utils_2.2-0     
[155] xgboost_1.4.1.1           plotly_4.9.2.1           
[157] xtable_1.8-4              jsonlite_1.8.0           
[159] spatstat_1.64-1           R6_2.5.1                 
[161] pillar_1.7.0              htmltools_0.5.2          
[163] mime_0.12                 glue_1.6.2               
[165] fastmap_1.1.0             BiocParallel_1.22.0      
[167] BiocNeighbors_1.6.0       class_7.3-19             
[169] codetools_0.2-18          pcaPP_1.9-74             
[171] mvtnorm_1.1-3             utf8_1.2.2               
[173] lattice_0.20-45           bslib_0.3.1              
[175] curl_4.3.2                ggbeeswarm_0.6.0         
[177] leiden_0.3.9              survival_3.2-13          
[179] limma_3.48.3              rmarkdown_2.11           
[181] munsell_0.5.0             e1071_1.7-9              
[183] rhdf5_2.32.2              GenomeInfoDbData_1.2.6   
[185] iterators_1.0.13          smfishHmrf_0.1           
[187] haven_2.3.1               reshape2_1.4.4           
[189] gtable_0.3.0              rbibutils_2.2.4

Simulation

Zheng Li

2022-02-06