Single cell differential expression analysis

Introduction

NULL sc-dataset DE

Summary - NULL

RUV methods for RNA-seq

Summary - RUV

Last updated: 2020-01-06

Checks: 7 0

Knit directory: misc/

This reproducible R Markdown analysis was created with workflowr (version 1.5.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20191122)

The command set.seed(20191122) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 7808f0c

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/MAST.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	7808f0c	Dongyue Xie	2020-01-06	wflow_publish(“analysis/scde.Rmd”)
html	42c073c	Dongyue Xie	2020-01-06	Build site.
Rmd	fb35a52	Dongyue Xie	2020-01-06	wflow_publish(“analysis/scde.Rmd”)

Introduction

Two review papers on single cell differential expression analysis I found: Wang et al, and Sonenson and Robinson.

One review paper on RNA-seq bulk data differential expression analysis: Sonenson and Delorenzi.

Single cell RNA seq datasets: conquer

What are UMI data sets?, Full length vs UMI

The data object contains: TPM, counts, length-scaled TPMs, the average length of the transcripts expressed in each sample for each gene.

Task:

Whether we need to RUV? Apply some single cell DE method like MAST, D3E, scDD, SCDE, DEsngle etc, maybe also try methods for RNA seq data like edgeR, Deseq2, voomlimma.

– in paper Sonenson and Robinson(2018), the did an experiment on null datasets and found that for unfiltered data sets, many methods struggled to correctly control the type I error.

If need to RUV, how does current methods like SVA and MOUTHWASH work?
If current method does not work, why? How to improve?

For this GSE45719 datasets: mouse, total 291 cells, among which 50 cells are 16-cell stage blastomere and 50 cells are Mid blastocyst cell (92-94h post-fertilization).

To create NULL datasets, we can use only 16-cell stage cells and randomly partition the cells into two groups.

NULL sc-dataset DE

library(ggplot2)
suppressPackageStartupMessages(library(SummarizedExperiment))
suppressPackageStartupMessages(library(MultiAssayExperiment))
datax<- readRDS("~/Downloads/GSE45719.rds")
datax_gene = experiments(datax)[["gene"]]
#head(assays(datax_gene)[["count"]])

cell16_idx = 1:50

cell16_readcounts = (assays(datax_gene)[["count"]])[,1:50]



#filter out genes that are not expressed

cell16_reads = cell16_readcounts[rowSums(cell16_readcounts)!=0,]

cell16=cell16_reads 

dim(cell16)

[1] 27517    50

sum(cell16==0)/prod(dim(cell16))

[1] 0.5125835

#

# task 1
# randomly partition the data into 2 groups, then simply use a two sample t test

# first partition
set.seed(12345)
group1_idx = sample(1:ncol(cell16),ncol(cell16)/2)
group1 = cell16[,group1_idx]
group2 = cell16[,-group1_idx]
## for each gene, run a two-sample t test

p_values1 = c()
for(i in 1:nrow(cell16)){
  p_values1[i] = t.test(group1[i,],group2[i,],alternative='two.sided')$p.value
}
hist(p_values1,breaks = 15)

Version	Author	Date
42c073c	Dongyue Xie	2020-01-06

summary(p_values1)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0003378 0.2663421 0.3651364 0.4512739 0.6531414 1.0000000

A lot of p-values are around 0.3-0.35. Take a look at gene 12:

as.numeric(cell16[12,group1_idx])

 [1] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

as.numeric(cell16[12,-group1_idx])

 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

t.test(cell16[12,group1_idx],cell16[12,-group1_idx],alternative = 'two.sided')$p.value

[1] 0.3272869

Only one cell has non-zero read counts among two groups. To apply two-sample t-test, we should try to avoid this situation. So choose the top 1000 expressed genes:

#Choose the top 1000 expressed genes 
top_expressed_genes = (order(rowSums(cell16_reads),decreasing = TRUE))[1:1000]
cell16 = cell16_reads[top_expressed_genes,]

dim(cell16)

[1] 1000   50

sum(cell16==0)/prod(dim(cell16))

[1] 0.00876

#

# task 1
# randomly partition the data into 2 groups, then simply use a two sample t test

# first partition
set.seed(12345)
group1_idx = sample(1:ncol(cell16),ncol(cell16)/2)
group1 = cell16[,group1_idx]
group2 = cell16[,-group1_idx]
## for each gene, run a two-sample t test

p_values1 = c()
for(i in 1:nrow(cell16)){
  p_values1[i] = t.test(group1[i,],group2[i,],alternative='two.sided')$p.value
}

ks.test(p_values1,'punif',0,1)


    One-sample Kolmogorov-Smirnov test

data:  p_values1
D = 0.090775, p-value = 1.392e-07
alternative hypothesis: two-sided

hist(p_values1,breaks = 15)

Version	Author	Date
42c073c	Dongyue Xie	2020-01-06

# second partition

group1_idx = sample(1:ncol(cell16),ncol(cell16)/2)
group1 = cell16[,group1_idx]
group2 = cell16[,-group1_idx]
## for each gene, run a two-sample t test

p_values1 = c()
for(i in 1:nrow(cell16)){
  p_values1[i] = t.test(group1[i,],group2[i,],alternative='two.sided')$p.value
}
ks.test(p_values1,'punif',0,1)


    One-sample Kolmogorov-Smirnov test

data:  p_values1
D = 0.24902, p-value < 2.2e-16
alternative hypothesis: two-sided

hist(p_values1,breaks = 15)

Version	Author	Date
42c073c	Dongyue Xie	2020-01-06

# third partition

group1_idx = sample(1:ncol(cell16),ncol(cell16)/2)
group1 = cell16[,group1_idx]
group2 = cell16[,-group1_idx]
## for each gene, run a two-sample t test

p_values1 = c()
for(i in 1:nrow(cell16)){
  p_values1[i] = t.test(group1[i,],group2[i,],alternative='two.sided')$p.value
}
ks.test(p_values1,'punif',0,1)


    One-sample Kolmogorov-Smirnov test

data:  p_values1
D = 0.044142, p-value = 0.0406
alternative hypothesis: two-sided

hist(p_values1,breaks = 15)

Version	Author	Date
42c073c	Dongyue Xie	2020-01-06

Summary - NULL

If not accounting for 0’s in scData, then the null distribution of p-values is certainly not uniform. Mainly because two-sample t test is not applicable.

If removing genes with two many 0’s, then the p-value distributions deviate from uniform.

RUV methods for RNA-seq

First apply on NULL data then add signals to genes using Poisson thinning.

library(vicar)

set.seed(12345)
group1_idx = sample(1:ncol(cell16),ncol(cell16)/2)
group1 = cell16[,group1_idx]
group2 = cell16[,-group1_idx]

group_indicator = rep(0,ncol(cell16))
group_indicator[group1_idx] = 1

X = model.matrix(~group_indicator)

Y = t(cell16)

num_sv     <- sva::num.sv(dat = t(Y), mod = X, method = "be")
num_sv_l   <- sva::num.sv(dat = t(Y), mod = X, method = "leek")

num_sv

[1] 4

num_sv_l

[1] 48

Two approaches give very different number of surrogate variables! Try to use 4 SVs in the following analysis.

mout = mouthwash(Y,X,k=num_sv,cov_of_interest = 2,include_intercept = FALSE)

Running mouthwash on 50 x 2 matrix X and 50 x 1000 matrix Y.
 - Computing independent basis using QR decomposition.
 - Computation took 0.011 seconds.
 - Running additional preprocessing steps.
 - Computation took 0.001 seconds.
 - Running second step of mouthwash:
    + Estimating model parameters using EM.
    + Computation took 2.67 seconds.
    + Generating adaptive shrinkage (ash) output.
    + Computation took 0.216 seconds.
 - Second step took 3.693 seconds.
 - Estimating additional hidden confounders.
 - Computation took 0.016 seconds.

mout$pi0

[1] 0.9922996

library(cate)
#library(leapp)

cate_cate   <- cate::cate.fit(X.primary = X[, 2, drop = FALSE], X.nuis = X[, -2, drop = FALSE],
                              Y = Y, r = num_sv, adj.method = "rr")

# this method is vey slow!
#leapp_leapp <- leapp::leapp(data = t(Y), pred.prim = X[, 2, drop = FALSE], 
#                            pred.covar = X[, -2, drop = FALSE], num.fac = num_sv)

sva_sva     <- sva::sva(dat = t(Y), mod = X, mod0 = X[, -2, drop = FALSE], n.sv = num_sv)

Number of significant surrogate variables is:  4 
Iteration (out of 5 ):1  2  3  4  5

X.sva <- cbind(X, sva_sva$sv)
lmout <- limma::lmFit(object = t(Y), design = X.sva)
eout  <- limma::eBayes(lmout)
svaout           <- list()
svaout$betahat   <- lmout$coefficients[, 2]
svaout$sebetahat <- lmout$stdev.unscaled[, 2] * sqrt(eout$s2.post)
svaout$pvalues   <- eout$p.value[, 2]

hist(svaout$pvalues,breaks=15)

Version	Author	Date
42c073c	Dongyue Xie	2020-01-06

ks.test(svaout$pvalues,'punif',0,1)


    One-sample Kolmogorov-Smirnov test

data:  svaout$pvalues
D = 0.032173, p-value = 0.2518
alternative hypothesis: two-sided

hist(cate_cate$beta.p.value,breaks = 15)

Version	Author	Date
42c073c	Dongyue Xie	2020-01-06

ks.test(cate_cate$beta.p.value,'punif',0,1)


    One-sample Kolmogorov-Smirnov test

data:  cate_cate$beta.p.value
D = 0.085223, p-value = 9.83e-07
alternative hypothesis: two-sided

Add some signal to the NULL dataset.

library(seqgendiff)
#tt = thin_diff(round(cell16), design_fixed = X[,2,drop=FALSE])
set.seed(12345)
thinout = thin_2group(round(cell16),0.9,signal_fun = stats::rexp,signal_params = list(rate=0.5))

#check null groups

group1 = cell16[,which(thinout$designmat==1)]
group2 = cell16[,which(thinout$designmat==0)]
## for each gene, run a two-sample t test

p_values1 = c()
for(i in 1:nrow(cell16)){
  p_values1[i] = t.test(group1[i,],group2[i,],alternative='two.sided')$p.value
}
ks.test(p_values1,'punif',0,1)


    One-sample Kolmogorov-Smirnov test

data:  p_values1
D = 0.1421, p-value < 2.2e-16
alternative hypothesis: two-sided

hist(p_values1,breaks = 15)

Version	Author	Date
42c073c	Dongyue Xie	2020-01-06

Y = t(thinout$mat)

X = model.matrix(~thinout$designmat)

num_sv     <- sva::num.sv(dat = t(Y), mod = X, method = "be")
num_sv_l   <- sva::num.sv(dat = t(Y), mod = X, method = "leek")

num_sv

[1] 4

num_sv_l

[1] 48

mean(abs(thinout$coef) < 10^-6)

[1] 0.9

mout = mouthwash(Y,X,k=num_sv,cov_of_interest = 2,include_intercept = FALSE)

Running mouthwash on 50 x 2 matrix X and 50 x 1000 matrix Y.
 - Computing independent basis using QR decomposition.
 - Computation took 0.015 seconds.
 - Running additional preprocessing steps.
 - Computation took 0.001 seconds.
 - Running second step of mouthwash:
    + Estimating model parameters using EM.
    + Computation took 2.345 seconds.
    + Generating adaptive shrinkage (ash) output.
    + Computation took 0.098 seconds.
 - Second step took 3.217 seconds.
 - Estimating additional hidden confounders.
 - Computation took 0.019 seconds.

mout$pi0

[1] 0.806857

bout <- backwash(Y = Y, X = X, k = num_sv, cov_of_interest = 2, include_intercept = FALSE)

 - Computing independent basis using QR decomposition.
 - Computation took 0.016 seconds.
 - Running additional preprocessing steps.
 - Computation took 0.001 seconds.
 - Running second step of backwash:
    + Initializing parameters for EM algorithm.
    + Computation took 0.179 seconds.
    + Running one round of parameter updates.
    + Computation took 0.029 seconds.
    + Estimating model parameters using EM.
    + Computation took 9.455 seconds.
    + Generating posterior statistics.
    + Computation took 0.003 seconds.
 - Second step took 11.349 seconds.
 - Generating final backwash outputs.
 - Computation took 0.013 seconds.

bout$pi0

[1] 0.8127092

cate_cate = cate::cate.fit(X.primary = X[, 2, drop = FALSE], X.nuis = X[, -2, drop = FALSE],
                              Y = Y, r = num_sv, adj.method = "rr")

sva_sva     <- sva::sva(dat = t(Y), mod = X, mod0 = X[, -2, drop = FALSE], n.sv = num_sv)

Number of significant surrogate variables is:  4 
Iteration (out of 5 ):1  2  3  4  5

X.sva <- cbind(X, sva_sva$sv)
lmout <- limma::lmFit(object = t(Y), design = X.sva)
eout  <- limma::eBayes(lmout)
svaout           <- list()
svaout$betahat   <- lmout$coefficients[, 2]
svaout$sebetahat <- lmout$stdev.unscaled[, 2] * sqrt(eout$s2.post)
svaout$pvalues   <- eout$p.value[, 2]

which_null = c(1*(abs(thinout$coef) < 10^-6))



# plot ROC curve
roc_out <- list(
  pROC::roc(response = which_null, predictor = c(mout$result$lfdr)),
  pROC::roc(response = which_null, predictor = c(bout$result$lfdr)),
  pROC::roc(response = which_null, predictor = c(cate_cate$beta.p.value)),
  pROC::roc(response = which_null, predictor = c(svaout$pvalues)))
name_vec <- c("MOUTHWASH", 'BACKWASH',"CATErr", "SVA")
names(roc_out) <- name_vec

sout <- lapply(roc_out, function(x) { data.frame(TPR = x$sensitivities, FPR = 1 - x$specificities)})
for (index in 1:length(sout)) {
  sout[[index]]$Method <- name_vec[index]
}
longdat <- do.call(rbind, sout)

shortdat <- dplyr::filter(longdat, Method == "MOUTHWASH" | Method == "BACKWASH" |
                            Method == "CATErr" | Method == "SVA" | Method == "LEAPP")
ggplot(data = shortdat, mapping = aes(x = FPR, y = TPR, col = Method)) +
  geom_path() + theme_bw() + ggtitle("ROC Curves")

Version	Author	Date
42c073c	Dongyue Xie	2020-01-06

auc_vec <- sapply(roc_out, FUN = function(x) { x$auc })
knitr::kable(sort(auc_vec, decreasing = TRUE), col.names = "AUC", digits = 3)

	AUC
SVA	0.943
CATErr	0.918
BACKWASH	0.887
MOUTHWASH	0.887

# estimate pi0
method_list <- list()
method_list$CATErr           <- list()
method_list$CATErr$betahat   <- c(cate_cate$beta)
method_list$CATErr$sebetahat <- c(sqrt(cate_cate$beta.cov.row * c(cate_cate$beta.cov.col)) / sqrt(nrow(X)))

method_list$SVA             <- list()
method_list$SVA$betahat     <- c(svaout$betahat)
method_list$SVA$sebetahat   <- c(svaout$sebetahat)

ashfit <- lapply(method_list, FUN = function(x) { ashr::ash(x$betahat, x$sebetahat)})
api0 <- sapply(ashfit, FUN = ashr::get_pi0)
api0 <- c(api0, MOUTHWASH = mout$pi0)
api0 <- c(api0, BACKWASH = bout$pi0)

knitr::kable(sort(api0, decreasing = TRUE), col.names = "Estimate of Pi0")

E	stimate of Pi0
BACKWASH	0.8127092
SVA	0.8126893
MOUTHWASH	0.8068570
CATErr	0.4149302

Weaker signal: rexp(,rate=1)

set.seed(12345)
thinout = thin_2group(round(cell16),0.9,signal_fun = stats::rexp,signal_params = list(rate=1))

Y = t(thinout$mat)

X = model.matrix(~thinout$designmat)


mout = mouthwash(Y,X,k=num_sv,cov_of_interest = 2,include_intercept = FALSE)

Running mouthwash on 50 x 2 matrix X and 50 x 1000 matrix Y.
 - Computing independent basis using QR decomposition.
 - Computation took 0.015 seconds.
 - Running additional preprocessing steps.
 - Computation took 0 seconds.
 - Running second step of mouthwash:
    + Estimating model parameters using EM.
    + Computation took 6.007 seconds.
    + Generating adaptive shrinkage (ash) output.
    + Computation took 0.103 seconds.
 - Second step took 7.091 seconds.
 - Estimating additional hidden confounders.
 - Computation took 0.019 seconds.

mout$pi0

[1] 0.8178142

bout <- backwash(Y = Y, X = X, k = num_sv, cov_of_interest = 2, include_intercept = FALSE)

 - Computing independent basis using QR decomposition.
 - Computation took 0.017 seconds.
 - Running additional preprocessing steps.
 - Computation took 0 seconds.
 - Running second step of backwash:
    + Initializing parameters for EM algorithm.
    + Computation took 0.198 seconds.
    + Running one round of parameter updates.
    + Computation took 0.029 seconds.
    + Estimating model parameters using EM.
    + Computation took 8.311 seconds.
    + Generating posterior statistics.
    + Computation took 0.004 seconds.
 - Second step took 10.307 seconds.
 - Generating final backwash outputs.
 - Computation took 0.013 seconds.

bout$pi0

[1] 0.8210579

cate_cate = cate::cate.fit(X.primary = X[, 2, drop = FALSE], X.nuis = X[, -2, drop = FALSE],
                              Y = Y, r = num_sv, adj.method = "rr")

sva_sva     <- sva::sva(dat = t(Y), mod = X, mod0 = X[, -2, drop = FALSE], n.sv = num_sv)

Number of significant surrogate variables is:  4 
Iteration (out of 5 ):1  2  3  4  5

X.sva <- cbind(X, sva_sva$sv)
lmout <- limma::lmFit(object = t(Y), design = X.sva)
eout  <- limma::eBayes(lmout)
svaout           <- list()
svaout$betahat   <- lmout$coefficients[, 2]
svaout$sebetahat <- lmout$stdev.unscaled[, 2] * sqrt(eout$s2.post)
svaout$pvalues   <- eout$p.value[, 2]

which_null = c(1*(abs(thinout$coef) < 10^-6))

roc_out <- list(
  pROC::roc(response = which_null, predictor = c(mout$result$lfdr)),
  pROC::roc(response = which_null, predictor = c(bout$result$lfdr)),
  pROC::roc(response = which_null, predictor = c(cate_cate$beta.p.value)),
  pROC::roc(response = which_null, predictor = c(svaout$pvalues)))
name_vec <- c("MOUTHWASH", 'BACKWASH',"CATErr", "SVA")
names(roc_out) <- name_vec

sout <- lapply(roc_out, function(x) { data.frame(TPR = x$sensitivities, FPR = 1 - x$specificities)})
for (index in 1:length(sout)) {
  sout[[index]]$Method <- name_vec[index]
}
longdat <- do.call(rbind, sout)

shortdat <- dplyr::filter(longdat, Method == "MOUTHWASH" | Method == "BACKWASH" |
                            Method == "CATErr" | Method == "SVA" | Method == "LEAPP")
ggplot(data = shortdat, mapping = aes(x = FPR, y = TPR, col = Method)) +
  geom_path() + theme_bw() + ggtitle("ROC Curves")

Version	Author	Date
42c073c	Dongyue Xie	2020-01-06

auc_vec <- sapply(roc_out, FUN = function(x) { x$auc })
knitr::kable(sort(auc_vec, decreasing = TRUE), col.names = "AUC", digits = 3)

	AUC
SVA	0.888
CATErr	0.859
BACKWASH	0.805
MOUTHWASH	0.805

method_list <- list()
method_list$CATErr           <- list()
method_list$CATErr$betahat   <- c(cate_cate$beta)
method_list$CATErr$sebetahat <- c(sqrt(cate_cate$beta.cov.row * c(cate_cate$beta.cov.col)) / sqrt(nrow(X)))

method_list$SVA             <- list()
method_list$SVA$betahat     <- c(svaout$betahat)
method_list$SVA$sebetahat   <- c(svaout$sebetahat)

ashfit <- lapply(method_list, FUN = function(x) { ashr::ash(x$betahat, x$sebetahat)})
api0 <- sapply(ashfit, FUN = ashr::get_pi0)
api0 <- c(api0, MOUTHWASH = mout$pi0)
api0 <- c(api0, BACKWASH = bout$pi0)

knitr::kable(sort(api0, decreasing = TRUE), col.names = "Estimate of Pi0")

E	stimate of Pi0
SVA	0.8212793
BACKWASH	0.8210579
MOUTHWASH	0.8178142
CATErr	0.4781507

Stronger signal: rexp(,rate = 0.2)

set.seed(12345)
thinout = thin_2group(round(cell16),0.9,signal_fun = stats::rexp,signal_params = list(rate=0.2))



Y = t(thinout$mat)

X = model.matrix(~thinout$designmat)



mean(abs(thinout$coef) < 10^-6)

[1] 0.9

mout = mouthwash(Y,X,k=num_sv,cov_of_interest = 2,include_intercept = FALSE)

Running mouthwash on 50 x 2 matrix X and 50 x 1000 matrix Y.
 - Computing independent basis using QR decomposition.
 - Computation took 0.011 seconds.
 - Running additional preprocessing steps.
 - Computation took 0.001 seconds.
 - Running second step of mouthwash:
    + Estimating model parameters using EM.
    + Computation took 2.24 seconds.
    + Generating adaptive shrinkage (ash) output.
    + Computation took 0.109 seconds.
 - Second step took 3.353 seconds.
 - Estimating additional hidden confounders.
 - Computation took 0.02 seconds.

mout$pi0

[1] 0.7604766

bout <- backwash(Y = Y, X = X, k = num_sv, cov_of_interest = 2, include_intercept = FALSE)

 - Computing independent basis using QR decomposition.
 - Computation took 0.015 seconds.
 - Running additional preprocessing steps.
 - Computation took 0.001 seconds.
 - Running second step of backwash:
    + Initializing parameters for EM algorithm.
    + Computation took 0.182 seconds.
    + Running one round of parameter updates.
    + Computation took 0.03 seconds.
    + Estimating model parameters using EM.
    + Computation took 18.932 seconds.
    + Generating posterior statistics.
    + Computation took 0.003 seconds.
 - Second step took 20.758 seconds.
 - Generating final backwash outputs.
 - Computation took 0.013 seconds.

bout$pi0

[1] 0.8159548

cate_cate = cate::cate.fit(X.primary = X[, 2, drop = FALSE], X.nuis = X[, -2, drop = FALSE],
                              Y = Y, r = num_sv, adj.method = "rr")

sva_sva     <- sva::sva(dat = t(Y), mod = X, mod0 = X[, -2, drop = FALSE], n.sv = num_sv)

Number of significant surrogate variables is:  4 
Iteration (out of 5 ):1  2  3  4  5

X.sva <- cbind(X, sva_sva$sv)
lmout <- limma::lmFit(object = t(Y), design = X.sva)
eout  <- limma::eBayes(lmout)
svaout           <- list()
svaout$betahat   <- lmout$coefficients[, 2]
svaout$sebetahat <- lmout$stdev.unscaled[, 2] * sqrt(eout$s2.post)
svaout$pvalues   <- eout$p.value[, 2]

which_null = c(1*(abs(thinout$coef) < 10^-6))

roc_out <- list(
  pROC::roc(response = which_null, predictor = c(mout$result$lfdr)),
  pROC::roc(response = which_null, predictor = c(bout$result$lfdr)),
  pROC::roc(response = which_null, predictor = c(cate_cate$beta.p.value)),
  pROC::roc(response = which_null, predictor = c(svaout$pvalues)))
name_vec <- c("MOUTHWASH", 'BACKWASH',"CATErr", "SVA")
names(roc_out) <- name_vec

sout <- lapply(roc_out, function(x) { data.frame(TPR = x$sensitivities, FPR = 1 - x$specificities)})
for (index in 1:length(sout)) {
  sout[[index]]$Method <- name_vec[index]
}
longdat <- do.call(rbind, sout)

shortdat <- dplyr::filter(longdat, Method == "MOUTHWASH" | Method == "BACKWASH" |
                            Method == "CATErr" | Method == "SVA" | Method == "LEAPP")
ggplot(data = shortdat, mapping = aes(x = FPR, y = TPR, col = Method)) +
  geom_path() + theme_bw() + ggtitle("ROC Curves")

Version	Author	Date
42c073c	Dongyue Xie	2020-01-06

auc_vec <- sapply(roc_out, FUN = function(x) { x$auc })
knitr::kable(sort(auc_vec, decreasing = TRUE), col.names = "AUC", digits = 3)

	AUC
SVA	0.974
MOUTHWASH	0.953
CATErr	0.951
BACKWASH	0.950

method_list <- list()
method_list$CATErr           <- list()
method_list$CATErr$betahat   <- c(cate_cate$beta)
method_list$CATErr$sebetahat <- c(sqrt(cate_cate$beta.cov.row * c(cate_cate$beta.cov.col)) / sqrt(nrow(X)))

method_list$SVA             <- list()
method_list$SVA$betahat     <- c(svaout$betahat)
method_list$SVA$sebetahat   <- c(svaout$sebetahat)

ashfit <- lapply(method_list, FUN = function(x) { ashr::ash(x$betahat, x$sebetahat)})
api0 <- sapply(ashfit, FUN = ashr::get_pi0)
api0 <- c(api0, MOUTHWASH = mout$pi0)
api0 <- c(api0, BACKWASH = bout$pi0)

knitr::kable(sort(api0, decreasing = TRUE), col.names = "Estimate of Pi0")

E	stimate of Pi0
SVA	0.8503571
BACKWASH	0.8159548
MOUTHWASH	0.7604766
CATErr	0.5565965

Summary - RUV

Applying SVA on NULL data gives uniformly distributed p-values. Cate with robust regression didn’t make p-values uniformly distributed but considerabley more small p-values. MOUTHWASH outputs $\hat\pi_0$ very close to 1.
Adding signals to NULL dataset.
What i didnn’t look at: two many 0’s in the dataset.

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] seqgendiff_1.1.1            cate_1.1                   
 [3] vicar_0.1-10                MultiAssayExperiment_1.10.4
 [5] SummarizedExperiment_1.14.1 DelayedArray_0.10.0        
 [7] BiocParallel_1.18.1         matrixStats_0.55.0         
 [9] Biobase_2.44.0              GenomicRanges_1.36.1       
[11] GenomeInfoDb_1.20.0         IRanges_2.18.3             
[13] S4Vectors_0.22.1            BiocGenerics_0.30.0        
[15] ggplot2_3.2.1              

loaded via a namespace (and not attached):
 [1] nlme_3.1-141           svd_0.5                bitops_1.0-6          
 [4] fs_1.3.1               bit64_0.9-7            doParallel_1.0.15     
 [7] rprojroot_1.3-2        tools_3.6.1            backports_1.1.5       
[10] R6_2.4.0               DBI_1.0.0              lazyeval_0.2.2        
[13] mgcv_1.8-29            colorspace_1.4-1       withr_2.1.2           
[16] gridExtra_2.3          tidyselect_0.2.5       bit_1.1-14            
[19] compiler_3.6.1         git2r_0.26.1           labeling_0.3          
[22] scales_1.0.0           SQUAREM_2017.10-1      genefilter_1.66.0     
[25] mixsqp_0.1-97          stringr_1.4.0          esaBcv_1.2.1          
[28] digest_0.6.21          rmarkdown_1.16         XVector_0.24.0        
[31] pscl_1.5.2             pkgconfig_2.0.3        htmltools_0.4.0       
[34] highr_0.8              ruv_0.9.7.1            limma_3.40.6          
[37] rlang_0.4.0            RSQLite_2.1.2          leapp_1.2             
[40] dplyr_0.8.3            RCurl_1.95-4.12        magrittr_1.5          
[43] GenomeInfoDbData_1.2.1 Matrix_1.2-17          Rcpp_1.0.2            
[46] munsell_0.5.0          pROC_1.15.3            stringi_1.4.3         
[49] whisker_0.4            yaml_2.2.0             MASS_7.3-51.4         
[52] zlibbioc_1.30.0        plyr_1.8.4             grid_3.6.1            
[55] blob_1.2.0             promises_1.1.0         crayon_1.3.4          
[58] lattice_0.20-38        splines_3.6.1          annotate_1.62.0       
[61] zeallot_0.1.0          knitr_1.25             pillar_1.4.2          
[64] corpcor_1.6.9          codetools_0.2-16       XML_3.98-1.20         
[67] glue_1.3.1             evaluate_0.14          vctrs_0.2.0           
[70] httpuv_1.5.2           foreach_1.4.7          gtable_0.3.0          
[73] purrr_0.3.2            assertthat_0.2.1       ashr_2.2-38           
[76] xfun_0.10              xtable_1.8-4           later_1.0.0           
[79] survival_2.44-1.1      truncnorm_1.0-8        tibble_2.1.3          
[82] iterators_1.0.12       AnnotationDbi_1.46.1   memoise_1.1.0         
[85] workflowr_1.5.0        sva_3.32.1