Last updated: 2019-06-19
Checks: 6 1
Knit directory: Comparative_eQTL/analysis/
This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish
to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20190319)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: analysis/20190521_eQTL_CrossSpeciesEnrichment_cache/
Ignored: analysis/figure/
Ignored: analysis_temp/.DS_Store
Ignored: code/.DS_Store
Ignored: code/snakemake_workflow/.DS_Store
Ignored: data/.DS_Store
Ignored: data/PastAnalysesDataToKeep/.DS_Store
Ignored: docs/.DS_Store
Ignored: docs/assets/.DS_Store
Untracked files:
Untracked: analysis/20190617_ExpressionVarianceOfEgenes.Rmd
Untracked: docs/figure/20190613_PowerAnalysis.Rmd/
Unstaged changes:
Modified: analysis/20190613_PowerAnalysis.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | aa8067e | Benjmain Fair | 2019-06-15 | added power analysis rmd |
library(tidyverse)
library(knitr)
library("edgeR")
library(corrplot)
library(gplots)
library(pROC)
library(qvalue)
library(reshape2)
39 Chimp heart RNA-seq datasets (from this project as well as from Pavlovic et al 2018) as well as 10 human heart RNA-seq datasets (from Pavlovic et al) and 39 randomly selected GTEx left ventricle heart RNA-seq datasets were trimmed to same read length (single end, 75bp) and aligned to the respective genomes. Gene counts were obtained with subread software using gene annotations based only on orthologous exons (Pavlovic et al 2018). Here I will perform differential gene expression analysis to understand the relationship between read depth and number of individuals (samples) needed to identify cross-species differentially expressed genes.
CountTableChimp <- read.table(gzfile('../data/PowerAnalysisCountTable.Chimp.subread.txt.gz'), header=T, check.names=FALSE, skip=1)
colnames(CountTableChimp) <- paste0("C.", colnames(CountTableChimp))
kable(CountTableChimp[1:10,1:10])
C.Geneid | C.Chr | C.Start | C.End | C.Strand | C.Length | C.317 | C.295 | C.522 | C.495 |
---|---|---|---|---|---|---|---|---|---|
ENSG00000273443 | 1;1;1 | 175684;176538;176679 | 176285;176555;176765 | -;-;- | 707 | 1 | 4 | 1 | 1 |
ENSG00000217801 | 1;1;1;1;1;1;1 | 176957;177673;177953;178638;179598;179901;180611 | 177043;177864;178052;178718;179678;180035;180758 | +;+;+;+;+;+;+ | 824 | 6 | 12 | 6 | 15 |
ENSG00000237330 | 1 | 186042 | 186330 | - | 289 | 0 | 5 | 1 | 0 |
ENSG00000223823 | 1 | 313681 | 313774 | + | 94 | 0 | 0 | 0 | 0 |
ENSG00000207730 | 1 | 355180 | 355274 | + | 95 | 0 | 0 | 0 | 1 |
ENSG00000207607 | 1 | 355943 | 356032 | + | 90 | 0 | 0 | 0 | 3 |
ENSG00000198976 | 1 | 357158 | 357240 | + | 83 | 0 | 0 | 0 | 0 |
ENSG00000272141 | 1 | 357510 | 358497 | + | 988 | 0 | 1 | 0 | 23 |
ENSG00000205231 | 1;1;1;1 | 361286;362148;362543;362711 | 362104;362486;362644;362741 | -;-;-;- | 1291 | 0 | 2 | 3 | 11 |
ENSG00000186891 | 1;1;1;1;1 | 387554;388538;389740;389995;390258 | 387860;388847;389827;390192;390717 | +;+;+;+;+ | 1363 | 11 | 15 | 10 | 148 |
CountTableHuman <- read.table(gzfile('../data/PowerAnalysisCountTable.Human.subread.txt.gz'), header=T, check.names=FALSE, skip=1)
colnames(CountTableHuman) <- paste0("H.", colnames(CountTableHuman))
kable(CountTableHuman[1:10,1:10])
H.Geneid | H.Chr | H.Start | H.End | H.Strand | H.Length | H.62765 | H.SRR601645 | H.SRR612335 | H.SRR607313 |
---|---|---|---|---|---|---|---|---|---|
ENSG00000188976 | 1 | 959215 | 959309 | - | 95 | 22 | 25 | 37 | 33 |
ENSG00000188157 | 1 | 1041478 | 1041702 | + | 225 | 14 | 16 | 46 | 12 |
ENSG00000273443 | 1;1;1 | 1062208;1063061;1063202 | 1062808;1063078;1063288 | -;-;- | 706 | 0 | 1 | 1 | 1 |
ENSG00000217801 | 1;1;1;1;1;1;1 | 1063480;1064210;1064490;1066587;1067566;1067869;1068577 | 1063566;1064401;1064589;1066667;1067646;1068003;1068724 | +;+;+;+;+;+;+ | 824 | 1 | 2 | 2 | 1 |
ENSG00000237330 | 1 | 1074016 | 1074307 | - | 292 | 0 | 0 | 0 | 0 |
ENSG00000223823 | 1 | 1137017 | 1137110 | + | 94 | 0 | 0 | 0 | 0 |
ENSG00000207730 | 1 | 1167104 | 1167198 | + | 95 | 0 | 0 | 0 | 1 |
ENSG00000207607 | 1 | 1167863 | 1167952 | + | 90 | 0 | 0 | 0 | 0 |
ENSG00000198976 | 1 | 1169005 | 1169087 | + | 83 | 0 | 0 | 0 | 0 |
ENSG00000272141 | 1 | 1169357 | 1170343 | + | 987 | 0 | 0 | 0 | 2 |
CombinedTable <- inner_join(CountTableChimp[,c(1,7:length(CountTableChimp))], CountTableHuman[,c(1,7:length(CountTableHuman))], by=c("C.Geneid"="H.Geneid")) %>%
column_to_rownames("C.Geneid") %>% as.matrix()
#Plot depth per sample
CombinedTable %>% colSums() %>% as.data.frame() %>%
rownames_to_column("Sample") %>%
mutate(Species=substr(Sample, 1,1)) %>%
ggplot(aes(x=Sample, y=., fill=Species)) +
geom_col() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size=6))
cpm <- cpm(CombinedTable, log=TRUE, prior.count=0.5)
kable(cpm[1:10,1:10])
C.317 | C.295 | C.522 | C.495 | C.4X0550 | C.4x523 | C.4x0025 | C.476 | C.4X0267 | C.456 | |
---|---|---|---|---|---|---|---|---|---|---|
ENSG00000273443 | -4.496834 | -3.842369 | -4.653722 | -3.9703396 | -4.744091 | -5.884495 | -2.9860056 | -5.884495 | -4.0955499 | -3.581712 |
ENSG00000217801 | -2.465058 | -2.512162 | -2.701554 | -0.4739515 | -5.884495 | -2.189634 | -2.4670233 | -5.098787 | -4.0955499 | -4.027420 |
ENSG00000237330 | -5.884495 | -3.592258 | -4.653722 | -5.8844951 | -4.744091 | -5.884495 | -5.8844951 | -4.592964 | -5.8844951 | -4.027420 |
ENSG00000223823 | -5.884495 | -5.884495 | -5.884495 | -5.8844951 | -5.884495 | -5.884495 | -5.8844951 | -5.884495 | -5.8844951 | -5.884495 |
ENSG00000207730 | -5.884495 | -5.884495 | -5.884495 | -3.9703396 | -5.884495 | -5.884495 | -5.8844951 | -5.884495 | -5.8844951 | -5.884495 |
ENSG00000207607 | -5.884495 | -5.884495 | -5.884495 | -2.6662108 | -5.884495 | -5.884495 | -5.8844951 | -5.884495 | -5.8844951 | -5.884495 |
ENSG00000198976 | -5.884495 | -5.884495 | -5.884495 | -5.8844951 | -5.884495 | -5.884495 | -5.8844951 | -5.884495 | -5.8844951 | -5.884495 |
ENSG00000272141 | -5.884495 | -5.052918 | -5.884495 | 0.1308736 | -5.884495 | -5.884495 | -5.8844951 | -5.884495 | -4.7288685 | -5.884495 |
ENSG00000205231 | -5.884495 | -4.528769 | -3.550847 | -0.9091288 | -5.884495 | -3.793966 | -5.8844951 | -4.592964 | -5.8844951 | -5.884495 |
ENSG00000186891 | -1.653220 | -2.218369 | -2.029574 | 2.7978037 | -2.179150 | 1.491629 | -0.3548249 | -1.932277 | -0.6428715 | -2.206169 |
# Heatmap of correlation. Species segregate as expected.
SpeciesFactor <- colnames(cpm) %>% substr(1,1) %>% factor() %>% unclass() %>% as.character()
cor(cpm, method = c("spearman")) %>%
heatmap.2(trace="none", ColSideColors=SpeciesFactor)
Unsurprisingly, the samples with the lowest read depth in the human cohort are clear outliers. This might change once I filter out the more lowly expressed genes.
d0 <- DGEList(CombinedTable)
#Calculate normalization factors
d0 <- calcNormFactors(d0)
#Note: calcNormFactors doesn’t normalize the data, it just calculates normalization factors for use downstream.
#Filter low-expressed genes
cutoff <- 2
drop <- which(apply(cpm(d0), 1, max) < cutoff)
d <- d0[-drop,]
dim(d) # number of genes left
[1] 16885 89
plotMDS(d, col = as.numeric(SpeciesFactor))
plotMDS(d, col = as.numeric(SpeciesFactor), dim=c(3,4))
cor(cpm(d, log=T, prior.count=0.5), method = c("spearman")) %>%
heatmap.2(trace="none", ColSideColors=SpeciesFactor)
mm <- model.matrix(~0 + SpeciesFactor)
y <- voom(d, mm, plot = T, normalize.method="cyclicloess")
In fact some of these samples seemed to have gotten worse. I’ll just throw these out of future analysis.
Anway, now that voom calculated mean/variance relationship from count values, I will now normalize for gene length differences between species (Convert cpm to rpkm). As per the voom publication (Law et al 2014) log-cpm values output by voom can be converted to log-rpkm by subtracting the log2 gene length in kilobases. For this I will make a matrix of gene lengths based on chimp orthologous exon length and human orthologous exon lengths to subtract the correct length for each species from the count matrix output by voom.
rep.col<-function(x,n){
matrix(rep(x,each=n), ncol=n, byrow=TRUE)
}
GeneLengths <- inner_join(CountTableChimp[,c("C.Geneid", "C.Length")], CountTableHuman[,c("H.Geneid", "H.Length")], by=c("C.Geneid"="H.Geneid"))
head(kable(GeneLengths))
[1] "C.Geneid C.Length H.Length"
[2] "---------------- --------- ---------"
[3] "ENSG00000273443 707 706"
[4] "ENSG00000217801 824 824"
[5] "ENSG00000237330 289 292"
[6] "ENSG00000223823 94 94"
ggplot(GeneLengths, aes(x=log10(C.Length), y=log10(H.Length))) + geom_point()
# accounting for the length differences probably will have negligble effect on results anyway. # Will probably calculate DE genes both ways (cpm and rpkm) to verify
GeneLengthMatrix <- cbind(
rep.col(log2(GeneLengths$C.Length/1000), length(CountTableChimp)-6),
rep.col(log2(GeneLengths$H.Length/1000), length(CountTableHuman)-6))
rownames(GeneLengthMatrix) <- GeneLengths$C.Geneid
kable(GeneLengthMatrix[1:10,1:10])
ENSG00000273443 | -0.5002179 | -0.5002179 | -0.5002179 | -0.5002179 | -0.5002179 | -0.5002179 | -0.5002179 | -0.5002179 | -0.5002179 | -0.5002179 |
ENSG00000217801 | -0.2792838 | -0.2792838 | -0.2792838 | -0.2792838 | -0.2792838 | -0.2792838 | -0.2792838 | -0.2792838 | -0.2792838 | -0.2792838 |
ENSG00000237330 | -1.7908586 | -1.7908586 | -1.7908586 | -1.7908586 | -1.7908586 | -1.7908586 | -1.7908586 | -1.7908586 | -1.7908586 | -1.7908586 |
ENSG00000223823 | -3.4111954 | -3.4111954 | -3.4111954 | -3.4111954 | -3.4111954 | -3.4111954 | -3.4111954 | -3.4111954 | -3.4111954 | -3.4111954 |
ENSG00000207730 | -3.3959287 | -3.3959287 | -3.3959287 | -3.3959287 | -3.3959287 | -3.3959287 | -3.3959287 | -3.3959287 | -3.3959287 | -3.3959287 |
ENSG00000207607 | -3.4739312 | -3.4739312 | -3.4739312 | -3.4739312 | -3.4739312 | -3.4739312 | -3.4739312 | -3.4739312 | -3.4739312 | -3.4739312 |
ENSG00000198976 | -3.5907449 | -3.5907449 | -3.5907449 | -3.5907449 | -3.5907449 | -3.5907449 | -3.5907449 | -3.5907449 | -3.5907449 | -3.5907449 |
ENSG00000272141 | -0.0174171 | -0.0174171 | -0.0174171 | -0.0174171 | -0.0174171 | -0.0174171 | -0.0174171 | -0.0174171 | -0.0174171 | -0.0174171 |
ENSG00000205231 | 0.3684890 | 0.3684890 | 0.3684890 | 0.3684890 | 0.3684890 | 0.3684890 | 0.3684890 | 0.3684890 | 0.3684890 | 0.3684890 |
ENSG00000186891 | 0.4467856 | 0.4467856 | 0.4467856 | 0.4467856 | 0.4467856 | 0.4467856 | 0.4467856 | 0.4467856 | 0.4467856 | 0.4467856 |
#subtract gene log2(kb) from log2(cpm)
y$E <- y$E - GeneLengthMatrix[rownames(y$E),]
#Now do model fitting, significance testing
fit<- lmFit(y, mm)
plotSA(fit)
head(coef(fit))
SpeciesFactor1 SpeciesFactor2
ENSG00000186891 -0.7260522 -3.3031577
ENSG00000186827 2.2324152 1.5156598
ENSG00000078808 4.9302282 5.2145994
ENSG00000176022 3.0643080 2.8875023
ENSG00000184163 -0.1127561 -0.7211328
ENSG00000260179 -2.3125083 -3.2671324
head(y$E[,1])
ENSG00000186891 ENSG00000186827 ENSG00000078808 ENSG00000176022
-1.824917 2.322460 4.919776 2.812206
ENSG00000184163 ENSG00000260179
-1.423121 -2.764688
contr <- makeContrasts(DE=SpeciesFactor1-SpeciesFactor2, levels = mm)
tmp <- contrasts.fit(fit, contrasts=contr)
FC.NullInterval <- log2(1.0)
True.efit <- treat(tmp, lfc = FC.NullInterval)
summary(decideTests(True.efit))
DE
Down 5922
NotSig 5116
Up 5847
TrueResponse <- decideTests(True.efit)
plotMD(True.efit, column=1, zero.weights = F)
Ok, seems like the above workflow for identifying DE genes is set up reasonably… Now let’s repeat with less samples and see what changes. First, I will wrap all of the above analysis into a function with a sample size parameter:
DE.Subsampled <-function(ChimpCountTableFile, HumanCountTableFile, SubsampleSize, FC.NullInterval, drop)
#if SubsampleSize parameter == 0, use full table, otherwise, subsample from it
{
FullChimpData <- read.table(gzfile(ChimpCountTableFile), header=T, check.names=FALSE, skip=1)
FullHumanData <- read.table(gzfile(HumanCountTableFile), header=T, check.names=FALSE, skip=1)
if (SubsampleSize==0){
CountTableChimp <- FullChimpData
colnames(CountTableChimp) <- paste0("C.", colnames(CountTableChimp))
CountTableHuman <- FullHumanData
colnames(CountTableHuman) <- paste0("H.", colnames(CountTableHuman))
} else {
CountTableChimp <- FullChimpData %>% dplyr::select(c(1:6, sample(7:length(FullChimpData), SubsampleSize)))
colnames(CountTableChimp) <- paste0("C.", colnames(CountTableChimp))
CountTableHuman <- FullHumanData %>% dplyr::select(c(1:6, sample(7:length(FullHumanData), SubsampleSize)))
colnames(CountTableHuman) <- paste0("H.", colnames(CountTableHuman))
}
CombinedTable <- inner_join(CountTableChimp[,c(1,7:length(CountTableChimp))], CountTableHuman[,c(1,7:length(CountTableHuman))], by=c("C.Geneid"="H.Geneid")) %>%
column_to_rownames("C.Geneid") %>% as.matrix()
SpeciesFactor <- colnames(CombinedTable) %>% substr(1,1) %>% factor() %>% unclass() %>% as.character()
d0 <- DGEList(CombinedTable)
d0 <- calcNormFactors(d0)
d <- d0[-drop,]
mm <- model.matrix(~0 + SpeciesFactor)
y <- voom(d, mm, normalize.method="cyclicloess", plot=F)
GeneLengths <- inner_join(CountTableChimp[,c("C.Geneid", "C.Length")], CountTableHuman[,c("H.Geneid", "H.Length")], by=c("C.Geneid"="H.Geneid"))
GeneLengthMatrix <- cbind(
rep.col(log2(GeneLengths$C.Length/1000), length(CountTableChimp)-6),
rep.col(log2(GeneLengths$H.Length/1000), length(CountTableHuman)-6))
rownames(GeneLengthMatrix) <- GeneLengths$C.Geneid
y$E <- y$E - GeneLengthMatrix[rownames(y$E),]
fit<- lmFit(y, mm)
contr <- makeContrasts(DE=SpeciesFactor1-SpeciesFactor2, levels = mm)
tmp <- contrasts.fit(fit, contrasts=contr)
efit <- treat(tmp, lfc = FC.NullInterval)
return(efit)
}
Now that I have a function to do all of the DE gene analysis, use the function with a smaller sample size and check results:
#20 samples
Subsampled.DE.results <- DE.Subsampled('../data/PowerAnalysisCountTable.Chimp.subread.txt.gz',
'../data/PowerAnalysisCountTable.Human.subread.txt.gz',
20, 0, drop)
summary(decideTests(Subsampled.DE.results))
DE
Down 4101
NotSig 8565
Up 4219
RocCurveData <- coords(roc(response=as.vector(abs(TrueResponse)), predictor=Subsampled.DE.results$p.value, plot=F))
Setting levels: control = 0, case = 1
Warning in roc.default(response = as.vector(abs(TrueResponse)), predictor
= Subsampled.DE.results$p.value, : Deprecated use a matrix as predictor.
Unexpected results may be produced, please pass a numeric vector.
Setting direction: controls > cases
Warning in coords.roc(roc(response = as.vector(abs(TrueResponse)),
predictor = Subsampled.DE.results$p.value, : An upcoming version of pROC
will set the 'transpose' argument to FALSE by default. Set transpose =
TRUE explicitly to keep the current behavior, or transpose = FALSE to
adopt the new one and silence this warning. Type help(coords_transpose) for
additional information.
plot(1-RocCurveData["specificity",], RocCurveData["sensitivity",])
SubSampledResponse <- decideTests(Subsampled.DE.results)
# distribution of effect sizes for true positives
hist(abs(True.efit$coefficients[TrueResponse==SubSampledResponse & SubSampledResponse!=0]), main="|effect size| of true positives")
median(abs(True.efit$coefficients[TrueResponse==SubSampledResponse & SubSampledResponse!=0]))
[1] 0.9495753
#5 samples
Subsampled.DE.results <- DE.Subsampled('../data/PowerAnalysisCountTable.Chimp.subread.txt.gz',
'../data/PowerAnalysisCountTable.Human.subread.txt.gz',
5, 0, drop)
summary(decideTests(Subsampled.DE.results))
DE
Down 1650
NotSig 13471
Up 1764
RocCurveData <- coords(roc(response=as.vector(abs(TrueResponse)), predictor=Subsampled.DE.results$p.value, plot=F))
Setting levels: control = 0, case = 1
Warning in roc.default(response = as.vector(abs(TrueResponse)), predictor
= Subsampled.DE.results$p.value, : Deprecated use a matrix as predictor.
Unexpected results may be produced, please pass a numeric vector.
Setting direction: controls > cases
Warning in coords.roc(roc(response = as.vector(abs(TrueResponse)),
predictor = Subsampled.DE.results$p.value, : An upcoming version of pROC
will set the 'transpose' argument to FALSE by default. Set transpose =
TRUE explicitly to keep the current behavior, or transpose = FALSE to
adopt the new one and silence this warning. Type help(coords_transpose) for
additional information.
plot(1-RocCurveData["specificity",], RocCurveData["sensitivity",])
SubSampledResponse <- decideTests(Subsampled.DE.results)
# distribution of effect sizes for true positives
hist(abs(True.efit$coefficients[TrueResponse==SubSampledResponse & SubSampledResponse!=0]), main="|effect size| of true positives")
median(abs(True.efit$coefficients[TrueResponse==SubSampledResponse & SubSampledResponse!=0]))
[1] 1.488754
Now I will systematically do this for varying sample sizes and make some final plots
#Note there is some randomness in subsampling samples. So much so that sometimes it effects results wherein 4 samples might yield more DE genes than 2 if you get unluck and pick "bad" samples in the 4
set.seed(0)
#Interval for null hypothesis
FC.NullInterval <- log2(1.0)
#True results are those using all samples
True.efit <- DE.Subsampled('../data/PowerAnalysisCountTable.Chimp.subread.txt.gz',
'../data/PowerAnalysisCountTable.Human.subread.txt.gz',
0, FC.NullInterval, drop)
SampleSizes <- c(2,4,8,16,24,32,39)
FDRLevels <- c(0.01, 0.05, 0.1)
RocCurveDataToPlot <- data.frame()
DEGeneCountToPlot <- matrix(nrow=length(SampleSizes), ncol=length(FDRLevels))
rownames(DEGeneCountToPlot) <- SampleSizes
EffectSizesToPlot <- data.frame()
for (i in seq_along(SampleSizes)){
paste0("processing ", SampleSizes[i])
Results <- DE.Subsampled('../data/PowerAnalysisCountTable.Chimp.subread.txt.gz',
'../data/PowerAnalysisCountTable.Human.subread.txt.gz',
SampleSizes[i], FC.NullInterval, drop)
RocCurveData <- as.data.frame(coords(roc(response=as.vector(abs(TrueResponse)), quiet=T, predictor=as.numeric(Results$p.value), plot=F), transpose=F))
RocCurveData$samplesize <- SampleSizes[i]
RocCurveDataToPlot <- rbind(RocCurveDataToPlot, RocCurveData)
for (j in seq_along(FDRLevels)){
SubSampledResponse <- decideTests(Results, p.value=FDRLevels[j])
DEGeneCountToPlot[i,j] <- sum(table(SubSampledResponse)[c("-1","1")])
if (length(table(TrueResponse==SubSampledResponse & SubSampledResponse!=0)) > 1){
EffectSizes.df <-data.frame(EffectSizes=abs(True.efit$coefficients[TrueResponse==SubSampledResponse & SubSampledResponse!=0]), FDR=FDRLevels[j], SampleSize=SampleSizes[i])
EffectSizesToPlot <- rbind(EffectSizesToPlot, EffectSizes.df)
}
}
}
ggplot(RocCurveDataToPlot, aes(x=1-specificity, y=sensitivity, color=factor(samplesize))) +
geom_line() +
theme_bw()
DEGeneCountToPlot.df<-as.data.frame(DEGeneCountToPlot)
colnames(DEGeneCountToPlot.df) <- FDRLevels
DEGeneCountToPlot.df$SampleSize <- rownames(DEGeneCountToPlot.df)
DEGeneCountToPlot.df$SampleSize
[1] "2" "4" "8" "16" "24" "32" "39"
DEGeneCountToPlot.df[is.na(DEGeneCountToPlot.df)] <- 0
DEGeneCountToPlot.df %>% melt() %>%
dplyr::rename(Number.DE.genes=value, FDR=variable) %>%
ggplot(aes(x=as.numeric(SampleSize), y=Number.DE.genes, color=FDR)) +
geom_line() +
theme_bw()
Using SampleSize as id variables
ggplot(EffectSizesToPlot, aes(x=factor(SampleSize), y=EffectSizes, color=factor(FDR))) +
# geom_violin()
geom_boxplot() +
# scale_y_continuous(limits=c(0,5))
theme_bw()
# ggplot(EffectSizesToPlot, aes(x=as.numeric(SampleSize), y=median(EffectSizes), color=factor(FDR), group=factor(FDR))) +
# geom_line() +
# # scale_y_continuous(limits=c(0,5))
# theme_bw()
data.frame(coefficients=as.numeric(True.efit$coefficients), pval=as.numeric(True.efit$p.value), signif=decideTests(True.efit)) %>%
ggplot(aes(x=coefficients, y=-log10(pval), color=factor(DE))) +
geom_point() +
scale_x_continuous(limits=c(-5,5))
Warning: Removed 198 rows containing missing values (geom_point).
I probably also want to explore how read depth plays into this, especially considering some samples were much more sparse than others. Here I’ll try to write a function to subsample without replacement the read counts from the original count table.
sample_species <- function(counts,n) {
num_species <- length(counts)
total_count <- sum(counts)
samples <- sample(1:total_count,n,replace=F)
samples <- samples[order(samples)]
result <- array(0,num_species)
total <- 0
for (i in 1:num_species) {
result[i] <- length(which(samples > total & samples <= total+counts[i]))
total <- total+counts[i]
}
return(result)
# return(apply(t(counts),1,sample_species,1500) )
}
sample_count_table <- function(counts, n){
return(apply(t(counts),1,sample_species,n) )
}
A<- sample_count_table(CombinedTable,2200000)
This is still a work in progress… It seems that sampling without replcement from a large count table actually takes an unreasonably long time (and probably uses a lot of memory)… Sampling without replacement could be faster in theory, but I’m not sure if that is still reasonable. There are some third party packages out there that do this, maybe I will try them out. Though, I don’t see any that let you pick a number of reads to subsample as opposed to a proportion of reads to subsample (in other words, it wouldn’t be easy to normalize read depth across samples). Maybe it will actually just be easier to realign everything at various read depths and make new count table files for each read depth if I want to incorporate normalizing and adjusting read depths into this analysis.
Update: What I turned out doing is sampling from the original alignment bam file to varying depths, normalizing such that each sample has an equal number of aligned reads. Then count tables were generated (one table for chimp, one for human), and the jist of this differential expression pipeline in R was wrapped into an Rscript that takes as input the count table, and as output spits out figures for ROC curves and such for DE power after randomly subsampling the samples (columns) of the count table at various sample sizes. This script was then incorporated into the snakemake for this repo.
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reshape2_1.4.3 qvalue_2.14.1 pROC_1.15.0 gplots_3.0.1.1
[5] corrplot_0.84 edgeR_3.24.3 limma_3.38.3 knitr_1.23
[9] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.1 purrr_0.3.2
[13] readr_1.3.1 tidyr_0.8.3 tibble_2.1.3 ggplot2_3.1.1
[17] tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] gtools_3.8.1 tidyselect_0.2.5 locfit_1.5-9.1
[4] xfun_0.7 splines_3.5.1 haven_2.1.0
[7] lattice_0.20-38 colorspace_1.4-1 generics_0.0.2
[10] htmltools_0.3.6 yaml_2.2.0 rlang_0.3.4
[13] pillar_1.4.1 glue_1.3.1 withr_2.1.2
[16] modelr_0.1.4 readxl_1.3.1 plyr_1.8.4
[19] munsell_0.5.0 gtable_0.3.0 workflowr_1.4.0
[22] cellranger_1.1.0 rvest_0.3.4 caTools_1.17.1.2
[25] evaluate_0.14 labeling_0.3 highr_0.8
[28] broom_0.5.2 Rcpp_1.0.1 KernSmooth_2.23-15
[31] scales_1.0.0 backports_1.1.4 gdata_2.18.0
[34] jsonlite_1.6 fs_1.3.1 hms_0.4.2
[37] digest_0.6.19 stringi_1.4.3 grid_3.5.1
[40] rprojroot_1.3-2 bitops_1.0-6 cli_1.1.0
[43] tools_3.5.1 magrittr_1.5 lazyeval_0.2.2
[46] crayon_1.3.4 whisker_0.3-2 pkgconfig_2.0.2
[49] xml2_1.2.0 lubridate_1.7.4 assertthat_0.2.1
[52] rmarkdown_1.13 httr_1.4.0 rstudioapi_0.10
[55] R6_2.4.0 nlme_3.1-140 git2r_0.25.2
[58] compiler_3.5.1