SymSim is an R package made to simulate single cell RNA-seq data. It can be used to generate a single population of cells with similar statistical properties to real data, or to generate multiple discrete or continuous populations of cells, where users can input a tree to represent relationships between multiple populations. SymSim has the following applications:
Each cell has an EVF vector which defines the identify of the cell. Each gene has an gene effect vector of the same length as the EVF vector, and can be thought of as the weights of EVFs. The EVFs allow us to simulate discrete or continuous populations. We use the kinetic model to sample the true transcript count for each gene in each cell, and the parameters of the kinetic model (k_on, k_off, s) are calculated from external variability factors (EVFs) and gene-specific effects (or gene effect). The gene effect vectors allow us to simulate differentially expressed genes or co-expressed gene modules. From true transcript counts to observed counts (which can be read counts or UMI counts) we consider capture efficiency, amplification bias, sequencing depth and batch effects. At different steps, users have control on the amount of extrinsic variation, intrinsic variation and technical variation.
Main functions
SimulateTrueCounts( ) and True2ObservedCounts( ) are the two major functions to generate datasets. SimulateTrueCounts generates true transcript counts for the given number of genes and cells, where the cells can come from one single population, or multiple discrete populations, or continuous populations. True2ObservedCounts then simulates the library preparation and sequencing procedures, and convert the true transcript counts into observed read counts or UMI counts.
The input parameters of the package allow users to control intrinsic variation, technical variation and biological variation in the data. Intrinsic variation is modeled through the kinetic model; technical variation takes into account dropouts, amplification biases including length bias and GC bias, and batch effects; biological variation is modeled by EVFs (extrinsic variation factors).
SimulateTrueCounts( ) returns a list of four elements:
1 a matrix containing the true transcript counts;
2 gene level meta information;
3 cell level meta information, including a matrix of EVFs and a vector of cell identity (for example, the population the cell belongs to);
4 the parameters kon, koff and s used to simulation the true counts.
True2ObservedCounts( ) returns a list of two elements:
1 a matrix containing the observed read counts or UMI counts;
2 cell level meta information;
DivideBatches( ) takes the output of True2ObservedCounts( ) as input and split the data into multiple batches while adding batch effect for each batch. The batch information of each cell is in the output cell level meta information.
We provide functions for users to retrieve goldstandard information to benchmark computational methods for clustering, differential expression and trajectory inference. The “true” cluster information for each cell can be simply obtained from the output of function SimulateTrueCounts( ). Function getDEgenes( ) can be used to obtain differential expression information for genes, given two populations. Function getTrajectoryGenes( ) returns the simulated trajectory information of cells.
For a single population, BestMatchParams( ) returns parameters to generate datasets with similar statistical properties as a given input real dataset.
Description of input parameters for these functions can be found at the end of this document.
Find simulations which match a given experimental dataset from our database
Given an experimental dataset, users can find the simulation which match the experimental dataset in terms of statistics including the mean expression, the fano factor and the percentage of expressing cells of genes. The function BestMatchParams() returns the parameter configurations which generate the top matching simulations. In the following example, we obtain the top 5 best matching parameter configurations for respectively the cortex and the Th17 datasets. The function also generates Q-Q plots of mean, percent-non-zero, standard deviation (sd) of genes between the given experimental dataset and simulated dataset with top ranking. Given the cortex dataset:
## gene_effects_sd Sigma scale_s alpha_mean alpha_sd depth_mean depth_sd
## 11355 2 0.2 0.4 0.05 0.025 3e+05 150000
## 11356 2 0.2 0.4 0.05 0.025 3e+05 150000
## 7204 2 0.2 0.5 0.03 0.009 3e+05 90000
## 11344 2 0.2 0.4 0.05 0.015 3e+05 90000
## 7225 2 0.2 0.5 0.03 0.009 5e+05 150000
## nPCR1 prop_hge mean_hge protocol gene_effect_prob dist
## 11355 14 0.01 3 UMI 0.3 0.2846441
## 11356 14 0.02 3 UMI 0.3 0.3194173
## 7204 10 0.02 4 UMI 0.3 0.3478997
## 11344 14 0.02 3 UMI 0.3 0.3489926
## 7225 10 0.03 4 UMI 0.3 0.3538994
Given the Th17 dataset:
## gene_effects_sd Sigma scale_s alpha_mean alpha_sd depth_mean depth_sd
## 1651 1 0.2 0.5 0.15 0.045 2e+06 6e+05
## 1649 1 0.2 0.5 0.15 0.045 2e+06 6e+05
## 5110 2 0.2 0.5 0.15 0.045 2e+06 6e+05
## 7480 2 0.1 0.3 0.20 0.060 2e+06 6e+05
## 1654 1 0.2 0.5 0.15 0.045 2e+06 6e+05
## nPCR1 prop_hge mean_hge protocol gene_effect_prob dist
## 1651 18 0.020 4 nonUMI 0.3 0.8500673
## 1649 18 0.015 5 nonUMI 0.3 0.8535507
## 5110 18 0.025 4 nonUMI 0.3 0.8593795
## 7480 18 0.020 4 nonUMI 0.3 0.8649737
## 1654 18 0.025 4 nonUMI 0.3 0.8699528