Last updated: 2020-01-06
Checks: 7 0
Knit directory: misc/
This reproducible R Markdown analysis was created with workflowr (version 1.5.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20191122)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Untracked files:
Untracked: analysis/MAST.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 7808f0c | Dongyue Xie | 2020-01-06 | wflow_publish(“analysis/scde.Rmd”) |
html | 42c073c | Dongyue Xie | 2020-01-06 | Build site. |
Rmd | fb35a52 | Dongyue Xie | 2020-01-06 | wflow_publish(“analysis/scde.Rmd”) |
Two review papers on single cell differential expression analysis I found: Wang et al, and Sonenson and Robinson.
One review paper on RNA-seq bulk data differential expression analysis: Sonenson and Delorenzi.
Single cell RNA seq datasets: conquer
What are UMI data sets?, Full length vs UMI
The data object contains: TPM, counts, length-scaled TPMs, the average length of the transcripts expressed in each sample for each gene.
Task:
– in paper Sonenson and Robinson(2018), the did an experiment on null datasets and found that for unfiltered data sets, many methods struggled to correctly control the type I error.
If need to RUV, how does current methods like SVA and MOUTHWASH work?
If current method does not work, why? How to improve?
For this GSE45719 datasets: mouse, total 291 cells, among which 50 cells are 16-cell stage blastomere and 50 cells are Mid blastocyst cell (92-94h post-fertilization).
To create NULL datasets, we can use only 16-cell stage cells and randomly partition the cells into two groups.
library(ggplot2)
suppressPackageStartupMessages(library(SummarizedExperiment))
suppressPackageStartupMessages(library(MultiAssayExperiment))
datax<- readRDS("~/Downloads/GSE45719.rds")
datax_gene = experiments(datax)[["gene"]]
#head(assays(datax_gene)[["count"]])
cell16_idx = 1:50
cell16_readcounts = (assays(datax_gene)[["count"]])[,1:50]
#filter out genes that are not expressed
cell16_reads = cell16_readcounts[rowSums(cell16_readcounts)!=0,]
cell16=cell16_reads
dim(cell16)
[1] 27517 50
sum(cell16==0)/prod(dim(cell16))
[1] 0.5125835
#
# task 1
# randomly partition the data into 2 groups, then simply use a two sample t test
# first partition
set.seed(12345)
group1_idx = sample(1:ncol(cell16),ncol(cell16)/2)
group1 = cell16[,group1_idx]
group2 = cell16[,-group1_idx]
## for each gene, run a two-sample t test
p_values1 = c()
for(i in 1:nrow(cell16)){
p_values1[i] = t.test(group1[i,],group2[i,],alternative='two.sided')$p.value
}
hist(p_values1,breaks = 15)
Version | Author | Date |
---|---|---|
42c073c | Dongyue Xie | 2020-01-06 |
summary(p_values1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0003378 0.2663421 0.3651364 0.4512739 0.6531414 1.0000000
A lot of p-values are around 0.3-0.35. Take a look at gene 12:
as.numeric(cell16[12,group1_idx])
[1] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
as.numeric(cell16[12,-group1_idx])
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t.test(cell16[12,group1_idx],cell16[12,-group1_idx],alternative = 'two.sided')$p.value
[1] 0.3272869
Only one cell has non-zero read counts among two groups. To apply two-sample t-test, we should try to avoid this situation. So choose the top 1000 expressed genes:
#Choose the top 1000 expressed genes
top_expressed_genes = (order(rowSums(cell16_reads),decreasing = TRUE))[1:1000]
cell16 = cell16_reads[top_expressed_genes,]
dim(cell16)
[1] 1000 50
sum(cell16==0)/prod(dim(cell16))
[1] 0.00876
#
# task 1
# randomly partition the data into 2 groups, then simply use a two sample t test
# first partition
set.seed(12345)
group1_idx = sample(1:ncol(cell16),ncol(cell16)/2)
group1 = cell16[,group1_idx]
group2 = cell16[,-group1_idx]
## for each gene, run a two-sample t test
p_values1 = c()
for(i in 1:nrow(cell16)){
p_values1[i] = t.test(group1[i,],group2[i,],alternative='two.sided')$p.value
}
ks.test(p_values1,'punif',0,1)
One-sample Kolmogorov-Smirnov test
data: p_values1
D = 0.090775, p-value = 1.392e-07
alternative hypothesis: two-sided
hist(p_values1,breaks = 15)
Version | Author | Date |
---|---|---|
42c073c | Dongyue Xie | 2020-01-06 |
# second partition
group1_idx = sample(1:ncol(cell16),ncol(cell16)/2)
group1 = cell16[,group1_idx]
group2 = cell16[,-group1_idx]
## for each gene, run a two-sample t test
p_values1 = c()
for(i in 1:nrow(cell16)){
p_values1[i] = t.test(group1[i,],group2[i,],alternative='two.sided')$p.value
}
ks.test(p_values1,'punif',0,1)
One-sample Kolmogorov-Smirnov test
data: p_values1
D = 0.24902, p-value < 2.2e-16
alternative hypothesis: two-sided
hist(p_values1,breaks = 15)
Version | Author | Date |
---|---|---|
42c073c | Dongyue Xie | 2020-01-06 |
# third partition
group1_idx = sample(1:ncol(cell16),ncol(cell16)/2)
group1 = cell16[,group1_idx]
group2 = cell16[,-group1_idx]
## for each gene, run a two-sample t test
p_values1 = c()
for(i in 1:nrow(cell16)){
p_values1[i] = t.test(group1[i,],group2[i,],alternative='two.sided')$p.value
}
ks.test(p_values1,'punif',0,1)
One-sample Kolmogorov-Smirnov test
data: p_values1
D = 0.044142, p-value = 0.0406
alternative hypothesis: two-sided
hist(p_values1,breaks = 15)
Version | Author | Date |
---|---|---|
42c073c | Dongyue Xie | 2020-01-06 |
If not accounting for 0’s in scData, then the null distribution of p-values is certainly not uniform. Mainly because two-sample t test is not applicable.
If removing genes with two many 0’s, then the p-value distributions deviate from uniform.
First apply on NULL data then add signals to genes using Poisson thinning.
library(vicar)
set.seed(12345)
group1_idx = sample(1:ncol(cell16),ncol(cell16)/2)
group1 = cell16[,group1_idx]
group2 = cell16[,-group1_idx]
group_indicator = rep(0,ncol(cell16))
group_indicator[group1_idx] = 1
X = model.matrix(~group_indicator)
Y = t(cell16)
num_sv <- sva::num.sv(dat = t(Y), mod = X, method = "be")
num_sv_l <- sva::num.sv(dat = t(Y), mod = X, method = "leek")
num_sv
[1] 4
num_sv_l
[1] 48
Two approaches give very different number of surrogate variables! Try to use 4 SVs in the following analysis.
mout = mouthwash(Y,X,k=num_sv,cov_of_interest = 2,include_intercept = FALSE)
Running mouthwash on 50 x 2 matrix X and 50 x 1000 matrix Y.
- Computing independent basis using QR decomposition.
- Computation took 0.011 seconds.
- Running additional preprocessing steps.
- Computation took 0.001 seconds.
- Running second step of mouthwash:
+ Estimating model parameters using EM.
+ Computation took 2.67 seconds.
+ Generating adaptive shrinkage (ash) output.
+ Computation took 0.216 seconds.
- Second step took 3.693 seconds.
- Estimating additional hidden confounders.
- Computation took 0.016 seconds.
mout$pi0
[1] 0.9922996
library(cate)
#library(leapp)
cate_cate <- cate::cate.fit(X.primary = X[, 2, drop = FALSE], X.nuis = X[, -2, drop = FALSE],
Y = Y, r = num_sv, adj.method = "rr")
# this method is vey slow!
#leapp_leapp <- leapp::leapp(data = t(Y), pred.prim = X[, 2, drop = FALSE],
# pred.covar = X[, -2, drop = FALSE], num.fac = num_sv)
sva_sva <- sva::sva(dat = t(Y), mod = X, mod0 = X[, -2, drop = FALSE], n.sv = num_sv)
Number of significant surrogate variables is: 4
Iteration (out of 5 ):1 2 3 4 5
X.sva <- cbind(X, sva_sva$sv)
lmout <- limma::lmFit(object = t(Y), design = X.sva)
eout <- limma::eBayes(lmout)
svaout <- list()
svaout$betahat <- lmout$coefficients[, 2]
svaout$sebetahat <- lmout$stdev.unscaled[, 2] * sqrt(eout$s2.post)
svaout$pvalues <- eout$p.value[, 2]
hist(svaout$pvalues,breaks=15)
Version | Author | Date |
---|---|---|
42c073c | Dongyue Xie | 2020-01-06 |
ks.test(svaout$pvalues,'punif',0,1)
One-sample Kolmogorov-Smirnov test
data: svaout$pvalues
D = 0.032173, p-value = 0.2518
alternative hypothesis: two-sided
hist(cate_cate$beta.p.value,breaks = 15)
Version | Author | Date |
---|---|---|
42c073c | Dongyue Xie | 2020-01-06 |
ks.test(cate_cate$beta.p.value,'punif',0,1)
One-sample Kolmogorov-Smirnov test
data: cate_cate$beta.p.value
D = 0.085223, p-value = 9.83e-07
alternative hypothesis: two-sided
Add some signal to the NULL dataset.
library(seqgendiff)
#tt = thin_diff(round(cell16), design_fixed = X[,2,drop=FALSE])
set.seed(12345)
thinout = thin_2group(round(cell16),0.9,signal_fun = stats::rexp,signal_params = list(rate=0.5))
#check null groups
group1 = cell16[,which(thinout$designmat==1)]
group2 = cell16[,which(thinout$designmat==0)]
## for each gene, run a two-sample t test
p_values1 = c()
for(i in 1:nrow(cell16)){
p_values1[i] = t.test(group1[i,],group2[i,],alternative='two.sided')$p.value
}
ks.test(p_values1,'punif',0,1)
One-sample Kolmogorov-Smirnov test
data: p_values1
D = 0.1421, p-value < 2.2e-16
alternative hypothesis: two-sided
hist(p_values1,breaks = 15)
Version | Author | Date |
---|---|---|
42c073c | Dongyue Xie | 2020-01-06 |
Y = t(thinout$mat)
X = model.matrix(~thinout$designmat)
num_sv <- sva::num.sv(dat = t(Y), mod = X, method = "be")
num_sv_l <- sva::num.sv(dat = t(Y), mod = X, method = "leek")
num_sv
[1] 4
num_sv_l
[1] 48
mean(abs(thinout$coef) < 10^-6)
[1] 0.9
mout = mouthwash(Y,X,k=num_sv,cov_of_interest = 2,include_intercept = FALSE)
Running mouthwash on 50 x 2 matrix X and 50 x 1000 matrix Y.
- Computing independent basis using QR decomposition.
- Computation took 0.015 seconds.
- Running additional preprocessing steps.
- Computation took 0.001 seconds.
- Running second step of mouthwash:
+ Estimating model parameters using EM.
+ Computation took 2.345 seconds.
+ Generating adaptive shrinkage (ash) output.
+ Computation took 0.098 seconds.
- Second step took 3.217 seconds.
- Estimating additional hidden confounders.
- Computation took 0.019 seconds.
mout$pi0
[1] 0.806857
bout <- backwash(Y = Y, X = X, k = num_sv, cov_of_interest = 2, include_intercept = FALSE)
- Computing independent basis using QR decomposition.
- Computation took 0.016 seconds.
- Running additional preprocessing steps.
- Computation took 0.001 seconds.
- Running second step of backwash:
+ Initializing parameters for EM algorithm.
+ Computation took 0.179 seconds.
+ Running one round of parameter updates.
+ Computation took 0.029 seconds.
+ Estimating model parameters using EM.
+ Computation took 9.455 seconds.
+ Generating posterior statistics.
+ Computation took 0.003 seconds.
- Second step took 11.349 seconds.
- Generating final backwash outputs.
- Computation took 0.013 seconds.
bout$pi0
[1] 0.8127092
cate_cate = cate::cate.fit(X.primary = X[, 2, drop = FALSE], X.nuis = X[, -2, drop = FALSE],
Y = Y, r = num_sv, adj.method = "rr")
sva_sva <- sva::sva(dat = t(Y), mod = X, mod0 = X[, -2, drop = FALSE], n.sv = num_sv)
Number of significant surrogate variables is: 4
Iteration (out of 5 ):1 2 3 4 5
X.sva <- cbind(X, sva_sva$sv)
lmout <- limma::lmFit(object = t(Y), design = X.sva)
eout <- limma::eBayes(lmout)
svaout <- list()
svaout$betahat <- lmout$coefficients[, 2]
svaout$sebetahat <- lmout$stdev.unscaled[, 2] * sqrt(eout$s2.post)
svaout$pvalues <- eout$p.value[, 2]
which_null = c(1*(abs(thinout$coef) < 10^-6))
# plot ROC curve
roc_out <- list(
pROC::roc(response = which_null, predictor = c(mout$result$lfdr)),
pROC::roc(response = which_null, predictor = c(bout$result$lfdr)),
pROC::roc(response = which_null, predictor = c(cate_cate$beta.p.value)),
pROC::roc(response = which_null, predictor = c(svaout$pvalues)))
name_vec <- c("MOUTHWASH", 'BACKWASH',"CATErr", "SVA")
names(roc_out) <- name_vec
sout <- lapply(roc_out, function(x) { data.frame(TPR = x$sensitivities, FPR = 1 - x$specificities)})
for (index in 1:length(sout)) {
sout[[index]]$Method <- name_vec[index]
}
longdat <- do.call(rbind, sout)
shortdat <- dplyr::filter(longdat, Method == "MOUTHWASH" | Method == "BACKWASH" |
Method == "CATErr" | Method == "SVA" | Method == "LEAPP")
ggplot(data = shortdat, mapping = aes(x = FPR, y = TPR, col = Method)) +
geom_path() + theme_bw() + ggtitle("ROC Curves")
Version | Author | Date |
---|---|---|
42c073c | Dongyue Xie | 2020-01-06 |
auc_vec <- sapply(roc_out, FUN = function(x) { x$auc })
knitr::kable(sort(auc_vec, decreasing = TRUE), col.names = "AUC", digits = 3)
AUC | |
---|---|
SVA | 0.943 |
CATErr | 0.918 |
BACKWASH | 0.887 |
MOUTHWASH | 0.887 |
# estimate pi0
method_list <- list()
method_list$CATErr <- list()
method_list$CATErr$betahat <- c(cate_cate$beta)
method_list$CATErr$sebetahat <- c(sqrt(cate_cate$beta.cov.row * c(cate_cate$beta.cov.col)) / sqrt(nrow(X)))
method_list$SVA <- list()
method_list$SVA$betahat <- c(svaout$betahat)
method_list$SVA$sebetahat <- c(svaout$sebetahat)
ashfit <- lapply(method_list, FUN = function(x) { ashr::ash(x$betahat, x$sebetahat)})
api0 <- sapply(ashfit, FUN = ashr::get_pi0)
api0 <- c(api0, MOUTHWASH = mout$pi0)
api0 <- c(api0, BACKWASH = bout$pi0)
knitr::kable(sort(api0, decreasing = TRUE), col.names = "Estimate of Pi0")
E | stimate of Pi0 |
---|---|
BACKWASH | 0.8127092 |
SVA | 0.8126893 |
MOUTHWASH | 0.8068570 |
CATErr | 0.4149302 |
Weaker signal: rexp(,rate=1)
set.seed(12345)
thinout = thin_2group(round(cell16),0.9,signal_fun = stats::rexp,signal_params = list(rate=1))
Y = t(thinout$mat)
X = model.matrix(~thinout$designmat)
mout = mouthwash(Y,X,k=num_sv,cov_of_interest = 2,include_intercept = FALSE)
Running mouthwash on 50 x 2 matrix X and 50 x 1000 matrix Y.
- Computing independent basis using QR decomposition.
- Computation took 0.015 seconds.
- Running additional preprocessing steps.
- Computation took 0 seconds.
- Running second step of mouthwash:
+ Estimating model parameters using EM.
+ Computation took 6.007 seconds.
+ Generating adaptive shrinkage (ash) output.
+ Computation took 0.103 seconds.
- Second step took 7.091 seconds.
- Estimating additional hidden confounders.
- Computation took 0.019 seconds.
mout$pi0
[1] 0.8178142
bout <- backwash(Y = Y, X = X, k = num_sv, cov_of_interest = 2, include_intercept = FALSE)
- Computing independent basis using QR decomposition.
- Computation took 0.017 seconds.
- Running additional preprocessing steps.
- Computation took 0 seconds.
- Running second step of backwash:
+ Initializing parameters for EM algorithm.
+ Computation took 0.198 seconds.
+ Running one round of parameter updates.
+ Computation took 0.029 seconds.
+ Estimating model parameters using EM.
+ Computation took 8.311 seconds.
+ Generating posterior statistics.
+ Computation took 0.004 seconds.
- Second step took 10.307 seconds.
- Generating final backwash outputs.
- Computation took 0.013 seconds.
bout$pi0
[1] 0.8210579
cate_cate = cate::cate.fit(X.primary = X[, 2, drop = FALSE], X.nuis = X[, -2, drop = FALSE],
Y = Y, r = num_sv, adj.method = "rr")
sva_sva <- sva::sva(dat = t(Y), mod = X, mod0 = X[, -2, drop = FALSE], n.sv = num_sv)
Number of significant surrogate variables is: 4
Iteration (out of 5 ):1 2 3 4 5
X.sva <- cbind(X, sva_sva$sv)
lmout <- limma::lmFit(object = t(Y), design = X.sva)
eout <- limma::eBayes(lmout)
svaout <- list()
svaout$betahat <- lmout$coefficients[, 2]
svaout$sebetahat <- lmout$stdev.unscaled[, 2] * sqrt(eout$s2.post)
svaout$pvalues <- eout$p.value[, 2]
which_null = c(1*(abs(thinout$coef) < 10^-6))
roc_out <- list(
pROC::roc(response = which_null, predictor = c(mout$result$lfdr)),
pROC::roc(response = which_null, predictor = c(bout$result$lfdr)),
pROC::roc(response = which_null, predictor = c(cate_cate$beta.p.value)),
pROC::roc(response = which_null, predictor = c(svaout$pvalues)))
name_vec <- c("MOUTHWASH", 'BACKWASH',"CATErr", "SVA")
names(roc_out) <- name_vec
sout <- lapply(roc_out, function(x) { data.frame(TPR = x$sensitivities, FPR = 1 - x$specificities)})
for (index in 1:length(sout)) {
sout[[index]]$Method <- name_vec[index]
}
longdat <- do.call(rbind, sout)
shortdat <- dplyr::filter(longdat, Method == "MOUTHWASH" | Method == "BACKWASH" |
Method == "CATErr" | Method == "SVA" | Method == "LEAPP")
ggplot(data = shortdat, mapping = aes(x = FPR, y = TPR, col = Method)) +
geom_path() + theme_bw() + ggtitle("ROC Curves")
Version | Author | Date |
---|---|---|
42c073c | Dongyue Xie | 2020-01-06 |
auc_vec <- sapply(roc_out, FUN = function(x) { x$auc })
knitr::kable(sort(auc_vec, decreasing = TRUE), col.names = "AUC", digits = 3)
AUC | |
---|---|
SVA | 0.888 |
CATErr | 0.859 |
BACKWASH | 0.805 |
MOUTHWASH | 0.805 |
method_list <- list()
method_list$CATErr <- list()
method_list$CATErr$betahat <- c(cate_cate$beta)
method_list$CATErr$sebetahat <- c(sqrt(cate_cate$beta.cov.row * c(cate_cate$beta.cov.col)) / sqrt(nrow(X)))
method_list$SVA <- list()
method_list$SVA$betahat <- c(svaout$betahat)
method_list$SVA$sebetahat <- c(svaout$sebetahat)
ashfit <- lapply(method_list, FUN = function(x) { ashr::ash(x$betahat, x$sebetahat)})
api0 <- sapply(ashfit, FUN = ashr::get_pi0)
api0 <- c(api0, MOUTHWASH = mout$pi0)
api0 <- c(api0, BACKWASH = bout$pi0)
knitr::kable(sort(api0, decreasing = TRUE), col.names = "Estimate of Pi0")
E | stimate of Pi0 |
---|---|
SVA | 0.8212793 |
BACKWASH | 0.8210579 |
MOUTHWASH | 0.8178142 |
CATErr | 0.4781507 |
Stronger signal: rexp(,rate = 0.2)
set.seed(12345)
thinout = thin_2group(round(cell16),0.9,signal_fun = stats::rexp,signal_params = list(rate=0.2))
Y = t(thinout$mat)
X = model.matrix(~thinout$designmat)
mean(abs(thinout$coef) < 10^-6)
[1] 0.9
mout = mouthwash(Y,X,k=num_sv,cov_of_interest = 2,include_intercept = FALSE)
Running mouthwash on 50 x 2 matrix X and 50 x 1000 matrix Y.
- Computing independent basis using QR decomposition.
- Computation took 0.011 seconds.
- Running additional preprocessing steps.
- Computation took 0.001 seconds.
- Running second step of mouthwash:
+ Estimating model parameters using EM.
+ Computation took 2.24 seconds.
+ Generating adaptive shrinkage (ash) output.
+ Computation took 0.109 seconds.
- Second step took 3.353 seconds.
- Estimating additional hidden confounders.
- Computation took 0.02 seconds.
mout$pi0
[1] 0.7604766
bout <- backwash(Y = Y, X = X, k = num_sv, cov_of_interest = 2, include_intercept = FALSE)
- Computing independent basis using QR decomposition.
- Computation took 0.015 seconds.
- Running additional preprocessing steps.
- Computation took 0.001 seconds.
- Running second step of backwash:
+ Initializing parameters for EM algorithm.
+ Computation took 0.182 seconds.
+ Running one round of parameter updates.
+ Computation took 0.03 seconds.
+ Estimating model parameters using EM.
+ Computation took 18.932 seconds.
+ Generating posterior statistics.
+ Computation took 0.003 seconds.
- Second step took 20.758 seconds.
- Generating final backwash outputs.
- Computation took 0.013 seconds.
bout$pi0
[1] 0.8159548
cate_cate = cate::cate.fit(X.primary = X[, 2, drop = FALSE], X.nuis = X[, -2, drop = FALSE],
Y = Y, r = num_sv, adj.method = "rr")
sva_sva <- sva::sva(dat = t(Y), mod = X, mod0 = X[, -2, drop = FALSE], n.sv = num_sv)
Number of significant surrogate variables is: 4
Iteration (out of 5 ):1 2 3 4 5
X.sva <- cbind(X, sva_sva$sv)
lmout <- limma::lmFit(object = t(Y), design = X.sva)
eout <- limma::eBayes(lmout)
svaout <- list()
svaout$betahat <- lmout$coefficients[, 2]
svaout$sebetahat <- lmout$stdev.unscaled[, 2] * sqrt(eout$s2.post)
svaout$pvalues <- eout$p.value[, 2]
which_null = c(1*(abs(thinout$coef) < 10^-6))
roc_out <- list(
pROC::roc(response = which_null, predictor = c(mout$result$lfdr)),
pROC::roc(response = which_null, predictor = c(bout$result$lfdr)),
pROC::roc(response = which_null, predictor = c(cate_cate$beta.p.value)),
pROC::roc(response = which_null, predictor = c(svaout$pvalues)))
name_vec <- c("MOUTHWASH", 'BACKWASH',"CATErr", "SVA")
names(roc_out) <- name_vec
sout <- lapply(roc_out, function(x) { data.frame(TPR = x$sensitivities, FPR = 1 - x$specificities)})
for (index in 1:length(sout)) {
sout[[index]]$Method <- name_vec[index]
}
longdat <- do.call(rbind, sout)
shortdat <- dplyr::filter(longdat, Method == "MOUTHWASH" | Method == "BACKWASH" |
Method == "CATErr" | Method == "SVA" | Method == "LEAPP")
ggplot(data = shortdat, mapping = aes(x = FPR, y = TPR, col = Method)) +
geom_path() + theme_bw() + ggtitle("ROC Curves")
Version | Author | Date |
---|---|---|
42c073c | Dongyue Xie | 2020-01-06 |
auc_vec <- sapply(roc_out, FUN = function(x) { x$auc })
knitr::kable(sort(auc_vec, decreasing = TRUE), col.names = "AUC", digits = 3)
AUC | |
---|---|
SVA | 0.974 |
MOUTHWASH | 0.953 |
CATErr | 0.951 |
BACKWASH | 0.950 |
method_list <- list()
method_list$CATErr <- list()
method_list$CATErr$betahat <- c(cate_cate$beta)
method_list$CATErr$sebetahat <- c(sqrt(cate_cate$beta.cov.row * c(cate_cate$beta.cov.col)) / sqrt(nrow(X)))
method_list$SVA <- list()
method_list$SVA$betahat <- c(svaout$betahat)
method_list$SVA$sebetahat <- c(svaout$sebetahat)
ashfit <- lapply(method_list, FUN = function(x) { ashr::ash(x$betahat, x$sebetahat)})
api0 <- sapply(ashfit, FUN = ashr::get_pi0)
api0 <- c(api0, MOUTHWASH = mout$pi0)
api0 <- c(api0, BACKWASH = bout$pi0)
knitr::kable(sort(api0, decreasing = TRUE), col.names = "Estimate of Pi0")
E | stimate of Pi0 |
---|---|
SVA | 0.8503571 |
BACKWASH | 0.8159548 |
MOUTHWASH | 0.7604766 |
CATErr | 0.5565965 |
Applying SVA on NULL data gives uniformly distributed p-values. Cate with robust regression didn’t make p-values uniformly distributed but considerabley more small p-values. MOUTHWASH outputs \(\hat\pi_0\) very close to 1.
Adding signals to NULL dataset.
What i didnn’t look at: two many 0’s in the dataset.
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] seqgendiff_1.1.1 cate_1.1
[3] vicar_0.1-10 MultiAssayExperiment_1.10.4
[5] SummarizedExperiment_1.14.1 DelayedArray_0.10.0
[7] BiocParallel_1.18.1 matrixStats_0.55.0
[9] Biobase_2.44.0 GenomicRanges_1.36.1
[11] GenomeInfoDb_1.20.0 IRanges_2.18.3
[13] S4Vectors_0.22.1 BiocGenerics_0.30.0
[15] ggplot2_3.2.1
loaded via a namespace (and not attached):
[1] nlme_3.1-141 svd_0.5 bitops_1.0-6
[4] fs_1.3.1 bit64_0.9-7 doParallel_1.0.15
[7] rprojroot_1.3-2 tools_3.6.1 backports_1.1.5
[10] R6_2.4.0 DBI_1.0.0 lazyeval_0.2.2
[13] mgcv_1.8-29 colorspace_1.4-1 withr_2.1.2
[16] gridExtra_2.3 tidyselect_0.2.5 bit_1.1-14
[19] compiler_3.6.1 git2r_0.26.1 labeling_0.3
[22] scales_1.0.0 SQUAREM_2017.10-1 genefilter_1.66.0
[25] mixsqp_0.1-97 stringr_1.4.0 esaBcv_1.2.1
[28] digest_0.6.21 rmarkdown_1.16 XVector_0.24.0
[31] pscl_1.5.2 pkgconfig_2.0.3 htmltools_0.4.0
[34] highr_0.8 ruv_0.9.7.1 limma_3.40.6
[37] rlang_0.4.0 RSQLite_2.1.2 leapp_1.2
[40] dplyr_0.8.3 RCurl_1.95-4.12 magrittr_1.5
[43] GenomeInfoDbData_1.2.1 Matrix_1.2-17 Rcpp_1.0.2
[46] munsell_0.5.0 pROC_1.15.3 stringi_1.4.3
[49] whisker_0.4 yaml_2.2.0 MASS_7.3-51.4
[52] zlibbioc_1.30.0 plyr_1.8.4 grid_3.6.1
[55] blob_1.2.0 promises_1.1.0 crayon_1.3.4
[58] lattice_0.20-38 splines_3.6.1 annotate_1.62.0
[61] zeallot_0.1.0 knitr_1.25 pillar_1.4.2
[64] corpcor_1.6.9 codetools_0.2-16 XML_3.98-1.20
[67] glue_1.3.1 evaluate_0.14 vctrs_0.2.0
[70] httpuv_1.5.2 foreach_1.4.7 gtable_0.3.0
[73] purrr_0.3.2 assertthat_0.2.1 ashr_2.2-38
[76] xfun_0.10 xtable_1.8-4 later_1.0.0
[79] survival_2.44-1.1 truncnorm_1.0-8 tibble_2.1.3
[82] iterators_1.0.12 AnnotationDbi_1.46.1 memoise_1.1.0
[85] workflowr_1.5.0 sva_3.32.1