Last updated: 2019-01-11

workflowr checks: (Click a bullet for more information)
Expand here to see past versions:

Introduction

Koen Van den Berge, Fanny Perraudeau, Charlotte Soneson, Michael I. Love, Davide Risso, Jean-Philippe Vert, Mark D. Robinson, Sandrine Dudiot, and Lieven Clement. Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell application. Genome Biology (2018) 19:24.


Analysis

Load functions/packages.

source("/project2/gilad/joycehsiao/dsc-log-fold-change/dsc/code/zinbwaveZinger/zingeRsimulationFunctions/simulationHelpFunctions_v7_diffInZero.R")

Load pre-computed object. See here for steps. Note that the cluster assignment here is different from that in the Seurat tutorial.

df <- readRDS("../dsc/data/pbmc.rds")

counts <- data.frame(assay(df))

counts_sub <- assay(df)[,which(colData(df)$seurat == 1 | colData(df)$seurat == 2)]
dim(counts_sub)

cellType <- colData(df)$seurat[which(colData(df)$seurat == 1 | colData(df)$seurat == 2)]

saveRDS(cellType, "../dsc/data/pbmc_counts_sub_celltype.rds")
saveRDS(counts, file = "../dsc/data/pbmc_counts.rds")
saveRDS(counts_sub, file = "../dsc/data/pbmc_counts_sub.rds")

Below was adpated from Berge et al., 2018 code (https://github.com/statOmics/zinbwaveZinger), specifically zinbwaveZinger/zinbwaveSimulations/tenX_sims_fc2/tenX_sims_fc2.Rmd.

Compute parameters.

paramsTenx <- getDatasetMoMPositive(counts = counts_sub)
saveRDS(paramsTenx, file = "../output/sim_power_berge_pbmc.Rmd/paramsTenx.rds")

Simulate data.

paramsTenx <- readRDS(file = "../output/sim_power_berge_pbmc.Rmd/paramsTenx.rds")
tenXData <- readRDS(file = "../dsc/data/pbmc_counts_sub.rds")

nSamples <- ncol(counts_sub)
grp <- as.factor(rep(1:2, each = nSamples/2)) #two-group comparison
nTags <- 10000 #nr of features
set.seed(11)
DEind <- sample(1:nTags,floor(nTags*.1),replace=FALSE) #10% DE
fcSim <- (2 + rexp(length(DEind), rate = 1/2)) #fold changes
libSizes <- sample(colSums(tenXData),nSamples,replace=TRUE) #library sizes
simData <- NBsimSingleCell(foldDiff = fcSim, ind = DEind,
                           dataset = tenXData, nTags = nTags,
                           group = grp,
                           verbose = TRUE, params = paramsTenx,
                           lib.size = libSizes, cpm="AveLogCPM", normalizeLambda=TRUE,
                           min.dispersion=1e-3)
simData$counts[1:5,1:5]
saveRDS(simData, file = "../output/sim_power_berge_pbmc.Rmd/simData.rds")
saveRDS(simData, file = "../dsc/data/pbmc_simdata_berge.rds")

Evalute some properties of the simulate data.

simData <- readRDS("output/sim_power_berge_pbmc.Rmd/simData.rds")

tenXData <- readRDS("dsc/data/pbmc_counts_sub.rds")
cellType <- readRDS("dsc/data/pbmc_counts_sub_celltype.rds")

dOrig <- suppressWarnings(edgeR::calcNormFactors(DGEList(tenXData)))
dOrig <- estimateDisp(dOrig, design=model.matrix(~cellType))
d <- suppressWarnings(edgeR::calcNormFactors(DGEList(simData$counts)))
d <- estimateDisp(d, design=model.matrix(~simData$group))
par(mfrow=c(1,2))
plotBCV(dOrig,ylim=c(0,12), xlim=c(9,16))
plotBCV(d,ylim=c(0,12), xlim=c(9,16))

Association of library size with zeros

par(mfrow=c(1,2))
plot(x=log(colSums(tenXData)), y=colMeans(tenXData==0), xlab="Log library size", ylab="Fraction of zeros", xlim=c(5.5,10), ylim=c(0,.9))
points(x=log(colSums(simData$counts)), y=colMeans(simData$counts==0), col=2)


plot(x=colSums(tenXData), y=colMeans(tenXData==0), xlab="library size", ylab="Fraction of zeros", xlim=c(300,3000), ylim=c(.2,1))
points(x=colSums(simData$counts), y=colMeans(simData$counts==0), col=2)

Session information

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS/LAPACK: /software/openblas-0.2.19-el7-x86_64/lib/libopenblas_haswellp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mgcv_1.8-25  nlme_3.1-137 edgeR_3.24.0 limma_3.38.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        knitr_1.20        whisker_0.3-2    
 [4] magrittr_1.5      workflowr_1.1.1   splines_3.5.1    
 [7] lattice_0.20-38   stringr_1.3.1     tools_3.5.1      
[10] grid_3.5.1        R.oo_1.22.0       git2r_0.23.0     
[13] htmltools_0.3.6   yaml_2.2.0        rprojroot_1.3-2  
[16] digest_0.6.18     Matrix_1.2-15     R.utils_2.7.0    
[19] evaluate_0.12     rmarkdown_1.10    stringi_1.2.4    
[22] compiler_3.5.1    backports_1.1.2   R.methodsS3_1.7.1
[25] locfit_1.5-9.1   

This reproducible R Markdown analysis was created with workflowr 1.1.1