Last updated: 2020-02-17
Checks: 7 0
Knit directory: 20170327_Psen2S4Ter_RNASeq/
This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200119)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 68033fe | Steve Ped | 2020-01-26 | Corrected chunk labels after publishing DE analysis |
Rmd | 25922d6 | Steve Ped | 2020-01-26 | Added first pass of enrichment |
Rmd | 96d8cc7 | Steve Ped | 2020-01-25 | Compiled after data export & added compression to output |
html | 96d8cc7 | Steve Ped | 2020-01-25 | Compiled after data export & added compression to output |
Rmd | 07a6053 | Steve Ped | 2020-01-25 | Corrected paths |
Rmd | 09180e5 | Steve Ped | 2020-01-25 | Added data export |
Rmd | fca554f | Steve Ped | 2020-01-25 | Tweaked explanations & labels |
Rmd | be95a60 | Steve Ped | 2020-01-25 | Finished first pass of DE Analysis |
html | be95a60 | Steve Ped | 2020-01-25 | Finished first pass of DE Analysis |
html | f04bb47 | Steve Ped | 2020-01-25 | Minor typos |
Rmd | 7251bd2 | Steve Ped | 2020-01-25 | Tweaks |
Rmd | 7fea156 | Steve Ped | 2020-01-25 | Added rRNA checks & heatmaps |
html | 9bff516 | Steve Ped | 2020-01-24 | Added analysis without CQN |
Rmd | 7b680b2 | Steve Ped | 2020-01-24 | Added analysis without CQN |
html | 95ccc90 | Steve Ped | 2020-01-22 | Compiled DE so far |
Rmd | 1858f49 | Steve Ped | 2020-01-22 | Started presenting actual DE genes |
html | 1858f49 | Steve Ped | 2020-01-22 | Started presenting actual DE genes |
Rmd | 10a285b | Steve Ped | 2020-01-22 | Expanded description in index and tried reusing an image |
html | 10a285b | Steve Ped | 2020-01-22 | Expanded description in index and tried reusing an image |
html | aabb570 | Steve Ped | 2020-01-21 | Rebuilt to get rid of warnings |
Rmd | 71b8832 | Steve Ped | 2020-01-21 | Added DE QC for GC bias |
html | 71b8832 | Steve Ped | 2020-01-21 | Added DE QC for GC bias |
Rmd | e825637 | Steve Ped | 2020-01-21 | Minor updates to DE plots |
html | 01512da | Steve Ped | 2020-01-21 | Added initial DE analysis to index |
Rmd | fbb6242 | Steve Ped | 2020-01-21 | Paused DE analysis |
Rmd | c560637 | Steve Ped | 2020-01-20 | Started DE analysis |
library(ngsReports)
library(tidyverse)
library(magrittr)
library(edgeR)
library(AnnotationHub)
library(ensembldb)
library(scales)
library(pander)
library(cqn)
library(ggrepel)
library(pheatmap)
library(RColorBrewer)
if (interactive()) setwd(here::here())
theme_set(theme_bw())
panderOptions("big.mark", ",")
panderOptions("table.split.table", Inf)
panderOptions("table.style", "rmarkdown")
ah <- AnnotationHub() %>%
subset(species == "Danio rerio") %>%
subset(rdataclass == "EnsDb")
ensDb <- ah[["AH74989"]]
grTrans <- transcripts(ensDb)
trLengths <- exonsBy(ensDb, "tx") %>%
width() %>%
vapply(sum, integer(1))
mcols(grTrans)$length <- trLengths[names(grTrans)]
gcGene <- grTrans %>%
mcols() %>%
as.data.frame() %>%
dplyr::select(gene_id, tx_id, gc_content, length) %>%
as_tibble() %>%
group_by(gene_id) %>%
summarise(
gc_content = sum(gc_content*length) / sum(length),
length = ceiling(median(length))
)
grGenes <- genes(ensDb)
mcols(grGenes) %<>%
as.data.frame() %>%
left_join(gcGene) %>%
as.data.frame() %>%
DataFrame()
Similarly to the Quality Assessment steps, GRanges
objects were formed at the gene and transcript levels, to enable estimation of GC content and length for each transcript and gene. GC content and transcript length are available for each transcript, and for gene-level estimates, GC content was taken as the sum of all GC bases divided by the sum of all transcript lengths, effectively averaging across all transcripts. Gene length was defined as the median transcript length.
samples <- read_csv("data/samples.csv") %>%
distinct(sampleName, .keep_all = TRUE) %>%
dplyr::select(sample = sampleName, sampleID, genotype) %>%
mutate(
genotype = factor(genotype, levels = c("WT", "Het", "Hom")),
mutant = genotype %in% c("Het", "Hom"),
homozygous = genotype == "Hom"
)
genoCols <- samples$genotype %>%
levels() %>%
length() %>%
brewer.pal("Set1") %>%
setNames(levels(samples$genotype))
Sample metadata was also loaded, with only the sampleID and genotype being retained. All other fields were considered irrelevant.
minCPM <- 1.5
minSamples <- 4
dgeList <- file.path("data", "2_alignedData", "featureCounts", "genes.out") %>%
read_delim(delim = "\t") %>%
set_names(basename(names(.))) %>%
as.data.frame() %>%
column_to_rownames("Geneid") %>%
as.matrix() %>%
set_colnames(str_remove(colnames(.), "Aligned.sortedByCoord.out.bam")) %>%
.[rowSums(cpm(.) >= minCPM) >= minCPM,] %>%
DGEList(
samples = tibble(sample = colnames(.)) %>%
left_join(samples),
genes = grGenes[rownames(.)] %>%
as.data.frame() %>%
dplyr::select(
chromosome = seqnames, start, end,
gene_id, gene_name, gene_biotype, description,
entrezid, gc_content, length
)
) %>%
.[!grepl("rRNA", .$genes$gene_biotype),] %>%
calcNormFactors()
Gene-level count data as output by featureCounts
, was loaded and formed into a DGEList
object. During this process, genes were removed if:
gene_biotype
was any type of rRNA
.These filtering steps returned gene-level counts for 16,640 genes, with total library sizes between 11,852,141 and 16,997,219 reads assigned to genes. It was noted that these library sizes were about 1.5-fold larger than the transcript-level counts used for the QA steps.
cpm(dgeList, log = TRUE) %>%
as.data.frame() %>%
pivot_longer(
cols = everything(),
names_to = "sample",
values_to = "logCPM"
) %>%
split(f = .$sample) %>%
lapply(function(x){
d <- density(x$logCPM)
tibble(
sample = unique(x$sample),
x = d$x,
y = d$y
)
}) %>%
bind_rows() %>%
left_join(samples) %>%
ggplot(aes(x, y, colour = genotype, group = sample)) +
geom_line() +
scale_colour_manual(
values = genoCols
) +
labs(
x = "logCPM",
y = "Density",
colour = "Genotype"
)
contLabeller <- as_labeller(
c(
HetVsWT = "S4Ter/+ Vs +/+",
HomVsWT = "S4Ter/S4Ter Vs +/+",
HomVsHet = "S4Ter/S4Ter Vs S4Ter/+",
Hom = "S4Ter/S4Ter",
Het = "S4Ter/+",
WT = "+/+",
mutant = "S4Ter Vs WT",
homozygous = "S4Ter/S4Ter Vs S4Ter/+"
)
)
geneLabeller <- structure(grGenes$gene_name, names = grGenes$gene_id) %>%
as_labeller()
Labeller functions for genotypes, contrasts and gene names were additionally defined for simpler plotting using ggplot2
.
In order to asses the impact of the potential rRNA contaminant, correlations between the initial estimate of contamination and each gene’s expression pattern were also investigated. Conceptually, genes which are highly correlated with rRNA may be physically associated with these complexes. Similary, genes with high negative correlations may be actively selected against in the presence of rRNA complexes.
rawFqc <- list.files(
path = "data/0_rawData/FastQC/",
pattern = "zip",
full.names = TRUE
) %>%
FastqcDataList()
gc <- getModule(rawFqc, "Per_sequence_GC") %>%
group_by(Filename) %>%
mutate(Freq = Count / sum(Count)) %>%
ungroup()
rawGC <- gc %>%
dplyr::filter(GC_Content > 70) %>%
group_by(Filename) %>%
summarise(Freq = sum(Freq)) %>%
arrange(desc(Freq)) %>%
mutate(sample = str_remove(Filename, "_R[12].fastq.gz")) %>%
group_by(sample) %>%
summarise(Freq = mean(Freq)) %>%
left_join(samples)
riboVec <- structure(rawGC$Freq, names = rawGC$sample)
riboCors <- cpm(dgeList, log = TRUE) %>%
apply(1, function(x){
cor(x, riboVec[names(x)])
})
pca <- dgeList %>%
cpm(log = TRUE) %>%
t() %>%
prcomp()
pcaVars <- percent_format(0.1)(summary(pca)$importance["Proportion of Variance",])
pca$x %>%
as.data.frame() %>%
rownames_to_column("sample") %>%
left_join(samples) %>%
as_tibble() %>%
ggplot(aes(PC1, PC2, colour = genotype, fill = genotype)) +
geom_point() +
geom_text_repel(aes(label = sampleID), show.legend = FALSE) +
stat_ellipse(geom = "polygon", alpha = 0.05, show.legend = FALSE) +
guides(fill = FALSE) +
scale_colour_manual(
values = genoCols
) +
labs(
x = paste0("PC1 (", pcaVars[["PC1"]], ")"),
y = paste0("PC2 (", pcaVars[["PC2"]], ")"),
colour = "Genotype"
)
A Principal Component Analysis (PCA) was also performed using logCPM values from each sample. Both mutant genotypes appear to cluster together, however it has previously been noted that GC content appears to track closely with PC1, as a result of variable rRNA removal.
Version | Author | Date |
---|---|---|
10a285b | Steve Ped | 2020-01-22 |
The initial expectation was to perform three pairwise comparisons as described in the above figure. However, given there was a strong similarity between mutants, the model matrix was defined as containing an intercept (i.e. baseline expression in wild-type), with additional columns defining presence of any mutant alleles, and the final column capturing the difference between mutants. This avoids any issues with the use of hard cutoffs when combing lists, such as would occur when comparing separate results from both mutant genotypes against wild-type samples.
d <- model.matrix(~mutant + homozygous, data = dgeList$samples) %>%
set_colnames(str_remove(colnames(.), "TRUE"))
pheatmap(
d,
cluster_cols = FALSE,
cluster_rows = FALSE,
color = c("white", "grey50"),
annotation_row = dgeList$samples["genotype"],
annotation_colors = list(genotype = genoCols),
legend = FALSE
)
Version | Author | Date |
---|---|---|
7251bd2 | Steve Ped | 2020-01-25 |
As GC content and length was noted as being of concern for this dataset, conditional-quantile normalisation was performed using the cqn
package. This adds a gene and sample-level offset for each count which takes into account any systemic bias, such as that identified previously as an artefact of variable rRNA removal. The resultant glm.offset
values were added to the original DGEList
object, and all dispersion estimates were calculated.
gcCqn <- cqn(
counts = dgeList$counts,
x = dgeList$genes$gc_content,
lengths = dgeList$genes$length,
sizeFactors = dgeList$samples$lib.size
)
par(mfrow = c(1, 2))
cqnplot(gcCqn, n = 1, xlab = "GC Content", col = genoCols)
cqnplot(gcCqn, n = 2, xlab = "Length", col = genoCols)
legend("bottomright", legend = levels(samples$genotype), col = genoCols, lty = 1)
par(mfrow = c(1, 1))
dgeList$offset <- gcCqn$glm.offset
dgeList %<>% estimateDisp(design = d)
minLfc <- log2(2)
alpha <- 0.01
fit <- glmFit(dgeList)
topTables <- colnames(d)[2:3] %>%
sapply(function(x){
glmLRT(fit, coef = x) %>%
topTags(n = Inf) %>%
.[["table"]] %>%
as_tibble() %>%
arrange(PValue) %>%
dplyr::select(
gene_id, gene_name, logFC, logCPM, PValue, FDR, everything()
) %>%
mutate(
coef = x,
bonfP = p.adjust(PValue, "bonf"),
DE = case_when(
bonfP < alpha ~ TRUE,
FDR < alpha & abs(logFC) > minLfc ~ TRUE
),
DE = ifelse(is.na(DE), FALSE, DE)
)
}, simplify = FALSE)
Models were fit using the negative-binomial approaches of glmFit()
. Top Tables were then obtained using likelihood-ratio tests in glmLRT()
. These test the standard \(H_0\) that the true value of the estimated model coefficient is zero. These model coefficients effectively estimate:
For enrichment testing, genes were initially considered to be DE using:
As fewer genes were detected in the comparisons between homozygous and heterozygous mutants, a simple FDR of 0.05 was subsequently chosen for this comparison.
topTables$homozygous %<>%
mutate(DE = FDR < 0.05)
Using these criteria, the following initial DE gene-sets were defined:
topTables %>%
lapply(dplyr::filter, DE) %>%
vapply(nrow, integer(1)) %>%
set_names(
case_when(
names(.) == "mutant" ~ "psen2 mutant",
names(.) == "homozygous" ~ "HomVsHet"
)
) %>%
pander()
psen2 mutant | HomVsHet |
---|---|
615 | 7 |
deCols <- c(
`FALSE` = rgb(0.5, 0.5, 0.5, 0.4),
`TRUE` = rgb(1, 0, 0, 0.7)
)
Given the previously identified concerns about variable rRNA removal, correlations were calculated between each gene’s expression values and the proportion of the raw libraries with > 70% GC. Those with the strongest +ve correlation, and for which the FDR is < 0.01 are shown below.
Most notably, the most correlated RNA with these rRNA proportions was eif1b
which is the initiation factor for translation. As this is a protein coding gene with the protein product interacting with the ribosome, the reason for this correlation at the RNA level raises difficult questions which cannot be answered here.
topTables$mutant %>%
mutate(riboCors = riboCors[gene_id]) %>%
dplyr::filter(
FDR < alpha
) %>%
dplyr::select(gene_id, gene_name, logFC, logCPM, FDR, riboCors, DE) %>%
arrange(desc(riboCors)) %>%
mutate(FDR = case_when(
FDR >= 0.0001 ~ sprintf("%.4f", FDR),
FDR < 0.0001 ~ sprintf("%.2e", FDR)
)
) %>%
dplyr::slice(1:40) %>%
pander(
justify = "llrrrrl",
style = "rmarkdown",
caption = paste(
"The", nrow(.), "genes most correlated with the original high GC content.",
"Many failed the selection criteria for differential expression,",
"primarily due to the stringent logFC filter.",
"However, the unusually high number of ribosomal protein coding genes",
"is currently difficult to explain, but notable.",
"Two possibilities are these RNAs are somehow translated with physical",
"linkage to the ribosomal complex, or that there is genuinely a global",
"shift in the level of ribosomal activity.",
"Neither explanation is initially satisfying."
)
)
gene_id | gene_name | logFC | logCPM | FDR | riboCors | DE |
---|---|---|---|---|---|---|
ENSDARG00000012688 | eif1b | 0.5229 | 7.482 | 0.0024 | 0.9906 | FALSE |
ENSDARG00000012871 | npepl1 | 0.6049 | 4.485 | 0.0028 | 0.9823 | FALSE |
ENSDARG00000103994 | ppiab | 0.8891 | 7.372 | 6.18e-05 | 0.9814 | FALSE |
ENSDARG00000034897 | rps10 | 0.8809 | 5.898 | 0.0001 | 0.9795 | FALSE |
ENSDARG00000068995 | h2afx1 | 0.644 | 6.605 | 0.0025 | 0.9734 | FALSE |
ENSDARG00000077291 | rps2 | 1.305 | 7.52 | 1.58e-07 | 0.9733 | TRUE |
ENSDARG00000002240 | psmb6 | 0.6407 | 4.702 | 0.0056 | 0.9733 | FALSE |
ENSDARG00000036298 | rps13 | 0.6762 | 6.076 | 8.26e-05 | 0.9707 | FALSE |
ENSDARG00000037071 | rps26 | 0.651 | 5.323 | 0.0045 | 0.97 | FALSE |
ENSDARG00000057026 | ran | 0.6265 | 7.071 | 0.0002 | 0.9689 | FALSE |
ENSDARG00000111753 | hist1h4l | 0.9775 | 2.015 | 0.0061 | 0.9685 | FALSE |
ENSDARG00000038028 | ndufa6 | 1.051 | 4.693 | 2.57e-05 | 0.9667 | TRUE |
ENSDARG00000046157 | RPS17 | 0.7725 | 6.74 | 0.0002 | 0.9667 | FALSE |
ENSDARG00000078929 | abhd16a | 0.5358 | 4.906 | 0.0013 | 0.9664 | FALSE |
ENSDARG00000099226 | CABZ01076667.1 | 0.5407 | 3.973 | 0.0008 | 0.9652 | FALSE |
ENSDARG00000014690 | rps4x | 0.6511 | 7.368 | 0.0003 | 0.9649 | FALSE |
ENSDARG00000092553 | slc25a5 | 0.9012 | 8.952 | 2.64e-05 | 0.9648 | TRUE |
ENSDARG00000011665 | aldoaa | 0.6082 | 7.763 | 0.0029 | 0.9639 | FALSE |
ENSDARG00000045447 | slc35g2b | 0.766 | 4.873 | 0.0002 | 0.9627 | FALSE |
ENSDARG00000034534 | atp6v1aa | 0.6212 | 7.049 | 0.0003 | 0.9618 | FALSE |
ENSDARG00000099766 | myl12.1 | 0.3841 | 6.468 | 0.0058 | 0.9595 | FALSE |
ENSDARG00000037962 | psmb7 | 0.7979 | 4.104 | 3.30e-05 | 0.9592 | FALSE |
ENSDARG00000019230 | rpl7a | 0.9036 | 7.71 | 5.34e-05 | 0.9591 | FALSE |
ENSDARG00000099104 | rpl24 | 0.662 | 6.729 | 0.0004 | 0.9571 | FALSE |
ENSDARG00000055475 | rps27.2 | 0.6175 | 6.719 | 0.0019 | 0.9566 | FALSE |
ENSDARG00000026322 | dhrs13a.1 | 0.7388 | 2.804 | 0.0032 | 0.9563 | FALSE |
ENSDARG00000025581 | rpl10 | 1.096 | 6.381 | 0.0003 | 0.9562 | TRUE |
ENSDARG00000092807 | si:dkey-151g10.6 | 0.8303 | 6.491 | 9.37e-07 | 0.9551 | TRUE |
ENSDARG00000053365 | rpl31 | 1.215 | 6.638 | 8.99e-06 | 0.9549 | TRUE |
ENSDARG00000075445 | psmb5 | 0.8405 | 5.59 | 5.33e-05 | 0.9544 | FALSE |
ENSDARG00000030237 | pgrmc2 | 0.3734 | 6.21 | 0.0046 | 0.9538 | FALSE |
ENSDARG00000011405 | rps9 | 0.823 | 6.209 | 0.0006 | 0.9537 | FALSE |
ENSDARG00000035808 | clcn4 | 0.6887 | 5.678 | 0.0014 | 0.9528 | FALSE |
ENSDARG00000043561 | psmc1b | 0.6844 | 4.673 | 0.0004 | 0.9517 | FALSE |
ENSDARG00000017235 | eif5a | 0.7003 | 7.388 | 5.54e-06 | 0.9514 | TRUE |
ENSDARG00000034291 | rpl37 | 1.085 | 5.662 | 5.03e-06 | 0.9513 | TRUE |
ENSDARG00000104173 | tufm | 0.7123 | 5.051 | 0.0013 | 0.9511 | FALSE |
ENSDARG00000058451 | rpl6 | 0.7848 | 7.877 | 0.0004 | 0.9497 | FALSE |
ENSDARG00000014867 | rpl8 | 0.9139 | 6.716 | 1.50e-06 | 0.9493 | TRUE |
ENSDARG00000089976 | spcs3 | 0.7379 | 3.54 | 0.0034 | 0.949 | FALSE |
topTables$mutant %>%
mutate(riboCors = riboCors[gene_id]) %>%
dplyr::filter(
FDR < alpha
) %>%
dplyr::select(gene_id, gene_name, logFC, logCPM, FDR, riboCors, DE) %>%
arrange(riboCors) %>%
mutate(FDR = case_when(
FDR >= 0.0001 ~ sprintf("%.4f", FDR),
FDR < 0.0001 ~ sprintf("%.2e", FDR)
)
) %>%
dplyr::slice(1:20) %>%
pander(
justify = "llrrrrl",
style = "rmarkdown",
caption = paste(
"The", nrow(.), "genes most negatively correlated with the original high GC content.",
"Most failed the selection criteria for differential expression,",
"primarily due to the stringent logFC filter.",
"These genes may be being selected against in the presence of rRNA."
)
)
gene_id | gene_name | logFC | logCPM | FDR | riboCors | DE |
---|---|---|---|---|---|---|
ENSDARG00000001829 | zgc:112982 | -0.8984 | 6.26 | 0.0010 | -0.9826 | FALSE |
ENSDARG00000100915 | si:ch211-255f4.7 | -0.6696 | 3.31 | 0.0015 | -0.9652 | FALSE |
ENSDARG00000033889 | spty2d1 | -0.8317 | 5.462 | 0.0004 | -0.9611 | FALSE |
ENSDARG00000019223 | arglu1a | -0.8874 | 7.854 | 0.0028 | -0.9602 | FALSE |
ENSDARG00000059097 | nktr | -1.657 | 6.727 | 0.0002 | -0.9581 | TRUE |
ENSDARG00000059768 | gpatch8 | -0.9409 | 7.864 | 0.0011 | -0.9526 | FALSE |
ENSDARG00000075380 | kri1 | -0.778 | 4.336 | 0.0006 | -0.9496 | FALSE |
ENSDARG00000001162 | matr3l1.1 | -0.6702 | 6.855 | 0.0003 | -0.9495 | FALSE |
ENSDARG00000014825 | si:dkeyp-69b9.6 | -0.4038 | 6.53 | 0.0061 | -0.9441 | FALSE |
ENSDARG00000077060 | rbm6 | -0.5108 | 5.891 | 0.0037 | -0.9438 | FALSE |
ENSDARG00000104840 | si:dkey-28a3.2 | -0.472 | 4.533 | 0.0019 | -0.9435 | FALSE |
ENSDARG00000069855 | pnisr | -0.7195 | 7.192 | 0.0056 | -0.9434 | FALSE |
ENSDARG00000007444 | ift81 | -0.6065 | 3.6 | 0.0009 | -0.9396 | FALSE |
ENSDARG00000097086 | si:dkey-172k15.4 | -0.7701 | 3.038 | 0.0028 | -0.9391 | FALSE |
ENSDARG00000012763 | arl13b | -0.8101 | 4.787 | 2.80e-05 | -0.9383 | FALSE |
ENSDARG00000061490 | u2surp | -1.191 | 6.981 | 4.01e-06 | -0.933 | TRUE |
ENSDARG00000044625 | pcf11 | -0.879 | 7.04 | 3.29e-05 | -0.9312 | FALSE |
ENSDARG00000060361 | cep104 | -0.6092 | 4.35 | 0.0026 | -0.9281 | FALSE |
ENSDARG00000051953 | ythdc1 | -0.7477 | 7.073 | 0.0007 | -0.9263 | FALSE |
ENSDARG00000104682 | tpm2 | -0.4561 | 5.335 | 0.0028 | -0.9237 | FALSE |
topTables %>%
bind_rows() %>%
mutate(stat = -sign(logFC)*log10(PValue)) %>%
ggplot(aes(gc_content, stat)) +
geom_point(aes(colour = DE), alpha = 0.4) +
geom_smooth(se = FALSE) +
facet_wrap(~coef, labeller = contLabeller) +
labs(
x = "GC content (%)",
y = "Ranking Statistic"
) +
coord_cartesian(ylim = c(-10, 10)) +
scale_colour_manual(values = deCols) +
theme(legend.position = "none")
topTables %>%
bind_rows() %>%
mutate(stat = -sign(logFC)*log10(PValue)) %>%
ggplot(aes(length, stat)) +
geom_point(aes(colour = DE), alpha = 0.4) +
geom_smooth(se = FALSE) +
facet_wrap(~coef, labeller = contLabeller) +
labs(
x = "Gene Length (bp)",
y = "Ranking Statistic"
) +
coord_cartesian(ylim = c(-10, 10)) +
scale_x_log10(labels = comma) +
scale_colour_manual(values = deCols) +
theme(legend.position = "none")
Checks for both GC and length bias on differential expression showed that a small bias remained evident, despite using conditional-quantile normalisation. However, an alternative analysis on the same dataset excluding the CQN steps revealed vastly different and exaggerated bias. As such, the impact of CQN normalisation was considered to be appropriate.
topTables %>%
bind_rows() %>%
arrange(DE) %>%
ggplot(aes(logCPM, logFC)) +
geom_point(aes(colour = DE)) +
geom_text_repel(
aes(label = gene_name, colour = DE),
data = . %>% dplyr::filter(DE & abs(logFC) > 3)
) +
geom_text_repel(
aes(label = gene_name, colour = DE),
data = . %>% dplyr::filter(FDR < 0.05 & coef == "homozygous")
) +
geom_smooth(se = FALSE) +
geom_hline(
yintercept = c(-1, 1)*minLfc,
linetype = 2,
colour = "red"
) +
facet_wrap(~coef, nrow = 2, labeller = contLabeller) +
scale_y_continuous(breaks = seq(-8, 8, by = 2)) +
scale_colour_manual(values = deCols) +
theme(legend.position = "none")
topTables %>%
bind_rows() %>%
ggplot(aes(PValue)) +
geom_histogram(
binwidth = 0.02,
colour = "black", fill = "grey90"
) +
facet_wrap(~coef, labeller = contLabeller)
topTables %>%
bind_rows() %>%
ggplot(aes(logFC, -log10(PValue), colour = DE)) +
geom_point(alpha = 0.4) +
geom_text_repel(
aes(label = gene_name),
data = . %>% dplyr::filter(PValue < 1e-13 | abs(logFC) > 4)
) +
geom_text_repel(
aes(label = gene_name),
data = . %>% dplyr::filter(FDR < 0.05 & coef == "homozygous")
) +
geom_vline(
xintercept = c(-1, 1)*minLfc,
linetype = 2,
colour = "red"
) +
facet_wrap(~coef, nrow = 1, labeller = contLabeller) +
scale_colour_manual(values = deCols) +
scale_x_continuous(breaks = seq(-8, 8, by = 2)) +
theme(legend.position = "none")
deGenes <- topTables %>%
lapply(dplyr::filter, DE) %>%
lapply(magrittr::extract2, "gene_id")
n <- 40
A set of 615 genes was identified as DE in the presence of the mutant psen2S4Ter transcript. The most highly ranked 40 genes are shown in the following heatmap. The presence of SNORNA70
in this list is of some concern as this is an ncRNA which is known to interact with ribosomes. In particular the expression pattern follows very closely the patterns previosuly identified for rRNA depletion using the percentage of libraries with GC > 70. For example, WT_86 was the sample with most complete rRNA depletion whilst, Het_84 showed the largest proportion of the library as being derived from rRNA. These represent the samples with lowest and highest expression of SNORNA70 respectively.
genoCols <- RColorBrewer::brewer.pal(3, "Set1") %>%
setNames(levels(dgeList$samples$genotype))
fit$fitted.values %>%
cpm(log = TRUE) %>%
extract(deGenes$mutant[seq_len(n)],) %>%
as.data.frame() %>%
set_rownames(unlist(geneLabeller(rownames(.)))) %>%
pheatmap::pheatmap(
color = viridis_pal(option = "magma")(100),
labels_col = colnames(.) %>%
str_replace_all(".+(Het|Hom|WT).+F3(_[0-9]{2}).+", "\\1\\2"),
legend_breaks = c(seq(-2, 8, by = 2), max(.)),
legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
annotation_col = dgeList$samples %>%
dplyr::select(Genotype = genotype),
annotation_names_col = FALSE,
annotation_colors = list(Genotype = genoCols),
cutree_cols = 2,
cutree_rows = 6
)
When inspecting the genes showing differences between mutants, it was noted that the WT and Homozygous mutant samples clustered together, implying that there was a unique effect on these genes which was specific to the heterozygous condition.
fit$fitted.values %>%
cpm(log = TRUE) %>%
extract(deGenes$homozygous,) %>%
as.data.frame() %>%
set_rownames(unlist(geneLabeller(rownames(.)))) %>%
pheatmap::pheatmap(
color = viridis_pal(option = "magma")(100),
labels_col = colnames(.) %>%
str_replace_all(".+(Het|Hom|WT).+F3(_[0-9]{2}).+", "\\1\\2"),
legend_breaks = c(seq(-2, 8, by = 2), max(.)),
legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
annotation_col = dgeList$samples %>%
dplyr::select(Genotype = genotype),
annotation_names_col = FALSE,
annotation_colors = list(Genotype = genoCols),
cutree_cols = 2,
cutree_rows = 3
)
Final gene lists were exported as separate csv files, along with fit and dgeList objects.
write_csv(topTables$mutant, "output/psen2VsWT.csv")
write_csv(topTables$homozygous, "output/psen2HomVsHet.csv")
write_rds(fit, "data/fit.rds", compress = "gz")
write_rds(dgeList, "data/dgeList.rds", compress = "gz")
devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 3.6.2 (2019-12-12)
os Ubuntu 18.04.4 LTS
system x86_64, linux-gnu
ui X11
language en_AU:en
collate en_AU.UTF-8
ctype en_AU.UTF-8
tz Australia/Adelaide
date 2020-02-17
─ Packages ───────────────────────────────────────────────────────────────────
package * version date lib source
AnnotationDbi * 1.48.0 2019-10-29 [2] Bioconductor
AnnotationFilter * 1.10.0 2019-10-29 [2] Bioconductor
AnnotationHub * 2.18.0 2019-10-29 [2] Bioconductor
askpass 1.1 2019-01-13 [2] CRAN (R 3.6.0)
assertthat 0.2.1 2019-03-21 [2] CRAN (R 3.6.0)
backports 1.1.5 2019-10-02 [2] CRAN (R 3.6.1)
Biobase * 2.46.0 2019-10-29 [2] Bioconductor
BiocFileCache * 1.10.2 2019-11-08 [2] Bioconductor
BiocGenerics * 0.32.0 2019-10-29 [2] Bioconductor
BiocManager 1.30.10 2019-11-16 [2] CRAN (R 3.6.1)
BiocParallel 1.20.1 2019-12-21 [2] Bioconductor
BiocVersion 3.10.1 2019-06-06 [2] Bioconductor
biomaRt 2.42.0 2019-10-29 [2] Bioconductor
Biostrings 2.54.0 2019-10-29 [2] Bioconductor
bit 1.1-15.1 2020-01-14 [2] CRAN (R 3.6.2)
bit64 0.9-7 2017-05-08 [2] CRAN (R 3.6.0)
bitops 1.0-6 2013-08-17 [2] CRAN (R 3.6.0)
blob 1.2.1 2020-01-20 [2] CRAN (R 3.6.2)
broom 0.5.4 2020-01-27 [2] CRAN (R 3.6.2)
callr 3.4.1 2020-01-24 [2] CRAN (R 3.6.2)
cellranger 1.1.0 2016-07-27 [2] CRAN (R 3.6.0)
cli 2.0.1 2020-01-08 [2] CRAN (R 3.6.2)
cluster 2.1.0 2019-06-19 [2] CRAN (R 3.6.1)
colorspace 1.4-1 2019-03-18 [2] CRAN (R 3.6.0)
cqn * 1.32.0 2019-10-29 [2] Bioconductor
crayon 1.3.4 2017-09-16 [2] CRAN (R 3.6.0)
curl 4.3 2019-12-02 [2] CRAN (R 3.6.2)
data.table 1.12.8 2019-12-09 [2] CRAN (R 3.6.2)
DBI 1.1.0 2019-12-15 [2] CRAN (R 3.6.2)
dbplyr * 1.4.2 2019-06-17 [2] CRAN (R 3.6.0)
DelayedArray 0.12.2 2020-01-06 [2] Bioconductor
desc 1.2.0 2018-05-01 [2] CRAN (R 3.6.0)
devtools 2.2.1 2019-09-24 [2] CRAN (R 3.6.1)
digest 0.6.23 2019-11-23 [2] CRAN (R 3.6.1)
dplyr * 0.8.4 2020-01-31 [2] CRAN (R 3.6.2)
edgeR * 3.28.0 2019-10-29 [2] Bioconductor
ellipsis 0.3.0 2019-09-20 [2] CRAN (R 3.6.1)
ensembldb * 2.10.2 2019-11-20 [2] Bioconductor
evaluate 0.14 2019-05-28 [2] CRAN (R 3.6.0)
FactoMineR 2.1 2020-01-17 [2] CRAN (R 3.6.2)
fansi 0.4.1 2020-01-08 [2] CRAN (R 3.6.2)
farver 2.0.3 2020-01-16 [2] CRAN (R 3.6.2)
fastmap 1.0.1 2019-10-08 [2] CRAN (R 3.6.1)
flashClust 1.01-2 2012-08-21 [2] CRAN (R 3.6.1)
forcats * 0.4.0 2019-02-17 [2] CRAN (R 3.6.0)
fs 1.3.1 2019-05-06 [2] CRAN (R 3.6.0)
generics 0.0.2 2018-11-29 [2] CRAN (R 3.6.0)
GenomeInfoDb * 1.22.0 2019-10-29 [2] Bioconductor
GenomeInfoDbData 1.2.2 2019-11-21 [2] Bioconductor
GenomicAlignments 1.22.1 2019-11-12 [2] Bioconductor
GenomicFeatures * 1.38.1 2020-01-22 [2] Bioconductor
GenomicRanges * 1.38.0 2019-10-29 [2] Bioconductor
ggdendro 0.1-20 2016-04-27 [2] CRAN (R 3.6.0)
ggplot2 * 3.2.1 2019-08-10 [2] CRAN (R 3.6.1)
ggrepel * 0.8.1 2019-05-07 [2] CRAN (R 3.6.0)
git2r 0.26.1 2019-06-29 [2] CRAN (R 3.6.1)
glue 1.3.1 2019-03-12 [2] CRAN (R 3.6.0)
gtable 0.3.0 2019-03-25 [2] CRAN (R 3.6.0)
haven 2.2.0 2019-11-08 [2] CRAN (R 3.6.1)
highr 0.8 2019-03-20 [2] CRAN (R 3.6.0)
hms 0.5.3 2020-01-08 [2] CRAN (R 3.6.2)
htmltools 0.4.0 2019-10-04 [2] CRAN (R 3.6.1)
htmlwidgets 1.5.1 2019-10-08 [2] CRAN (R 3.6.1)
httpuv 1.5.2 2019-09-11 [2] CRAN (R 3.6.1)
httr 1.4.1 2019-08-05 [2] CRAN (R 3.6.1)
hwriter 1.3.2 2014-09-10 [2] CRAN (R 3.6.0)
interactiveDisplayBase 1.24.0 2019-10-29 [2] Bioconductor
IRanges * 2.20.2 2020-01-13 [2] Bioconductor
jpeg 0.1-8.1 2019-10-24 [2] CRAN (R 3.6.1)
jsonlite 1.6 2018-12-07 [2] CRAN (R 3.6.0)
kableExtra 1.1.0 2019-03-16 [2] CRAN (R 3.6.1)
knitr 1.27 2020-01-16 [2] CRAN (R 3.6.2)
labeling 0.3 2014-08-23 [2] CRAN (R 3.6.0)
later 1.0.0 2019-10-04 [2] CRAN (R 3.6.1)
lattice 0.20-38 2018-11-04 [4] CRAN (R 3.6.0)
latticeExtra 0.6-29 2019-12-19 [2] CRAN (R 3.6.2)
lazyeval 0.2.2 2019-03-15 [2] CRAN (R 3.6.0)
leaps 3.1 2020-01-16 [2] CRAN (R 3.6.2)
lifecycle 0.1.0 2019-08-01 [2] CRAN (R 3.6.1)
limma * 3.42.1 2020-01-29 [2] Bioconductor
locfit 1.5-9.1 2013-04-20 [2] CRAN (R 3.6.0)
lubridate 1.7.4 2018-04-11 [2] CRAN (R 3.6.0)
magrittr * 1.5 2014-11-22 [2] CRAN (R 3.6.0)
MASS 7.3-51.5 2019-12-20 [4] CRAN (R 3.6.2)
Matrix 1.2-18 2019-11-27 [2] CRAN (R 3.6.1)
MatrixModels 0.4-1 2015-08-22 [2] CRAN (R 3.6.1)
matrixStats 0.55.0 2019-09-07 [2] CRAN (R 3.6.1)
mclust * 5.4.5 2019-07-08 [2] CRAN (R 3.6.1)
memoise 1.1.0 2017-04-21 [2] CRAN (R 3.6.0)
mgcv 1.8-31 2019-11-09 [4] CRAN (R 3.6.1)
mime 0.8 2019-12-19 [2] CRAN (R 3.6.2)
modelr 0.1.5 2019-08-08 [2] CRAN (R 3.6.1)
munsell 0.5.0 2018-06-12 [2] CRAN (R 3.6.0)
ngsReports * 1.1.2 2019-10-16 [1] Bioconductor
nlme 3.1-144 2020-02-06 [4] CRAN (R 3.6.2)
nor1mix * 1.3-0 2019-06-13 [2] CRAN (R 3.6.1)
openssl 1.4.1 2019-07-18 [2] CRAN (R 3.6.1)
pander * 0.6.3 2018-11-06 [2] CRAN (R 3.6.0)
pheatmap * 1.0.12 2019-01-04 [2] CRAN (R 3.6.0)
pillar 1.4.3 2019-12-20 [2] CRAN (R 3.6.2)
pkgbuild 1.0.6 2019-10-09 [2] CRAN (R 3.6.1)
pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 3.6.1)
pkgload 1.0.2 2018-10-29 [2] CRAN (R 3.6.0)
plotly 4.9.1 2019-11-07 [2] CRAN (R 3.6.1)
plyr 1.8.5 2019-12-10 [2] CRAN (R 3.6.2)
png 0.1-7 2013-12-03 [2] CRAN (R 3.6.0)
preprocessCore * 1.48.0 2019-10-29 [2] Bioconductor
prettyunits 1.1.1 2020-01-24 [2] CRAN (R 3.6.2)
processx 3.4.1 2019-07-18 [2] CRAN (R 3.6.1)
progress 1.2.2 2019-05-16 [2] CRAN (R 3.6.0)
promises 1.1.0 2019-10-04 [2] CRAN (R 3.6.1)
ProtGenerics 1.18.0 2019-10-29 [2] Bioconductor
ps 1.3.0 2018-12-21 [2] CRAN (R 3.6.0)
purrr * 0.3.3 2019-10-18 [2] CRAN (R 3.6.1)
quantreg * 5.54 2019-12-13 [2] CRAN (R 3.6.2)
R6 2.4.1 2019-11-12 [2] CRAN (R 3.6.1)
rappdirs 0.3.1 2016-03-28 [2] CRAN (R 3.6.0)
RColorBrewer * 1.1-2 2014-12-07 [2] CRAN (R 3.6.0)
Rcpp 1.0.3 2019-11-08 [2] CRAN (R 3.6.1)
RCurl 1.98-1.1 2020-01-19 [2] CRAN (R 3.6.2)
readr * 1.3.1 2018-12-21 [2] CRAN (R 3.6.0)
readxl 1.3.1 2019-03-13 [2] CRAN (R 3.6.0)
remotes 2.1.0 2019-06-24 [2] CRAN (R 3.6.0)
reprex 0.3.0 2019-05-16 [2] CRAN (R 3.6.0)
reshape2 1.4.3 2017-12-11 [2] CRAN (R 3.6.0)
rlang 0.4.4 2020-01-28 [2] CRAN (R 3.6.2)
rmarkdown 2.1 2020-01-20 [2] CRAN (R 3.6.2)
rprojroot 1.3-2 2018-01-03 [2] CRAN (R 3.6.0)
Rsamtools 2.2.1 2019-11-06 [2] Bioconductor
RSQLite 2.2.0 2020-01-07 [2] CRAN (R 3.6.2)
rstudioapi 0.10 2019-03-19 [2] CRAN (R 3.6.0)
rtracklayer 1.46.0 2019-10-29 [2] Bioconductor
rvest 0.3.5 2019-11-08 [2] CRAN (R 3.6.1)
S4Vectors * 0.24.3 2020-01-18 [2] Bioconductor
scales * 1.1.0 2019-11-18 [2] CRAN (R 3.6.1)
scatterplot3d 0.3-41 2018-03-14 [2] CRAN (R 3.6.1)
sessioninfo 1.1.1 2018-11-05 [2] CRAN (R 3.6.0)
shiny 1.4.0 2019-10-10 [2] CRAN (R 3.6.1)
ShortRead 1.44.1 2019-12-19 [2] Bioconductor
SparseM * 1.78 2019-12-13 [2] CRAN (R 3.6.2)
stringi 1.4.5 2020-01-11 [2] CRAN (R 3.6.2)
stringr * 1.4.0 2019-02-10 [2] CRAN (R 3.6.0)
SummarizedExperiment 1.16.1 2019-12-19 [2] Bioconductor
testthat 2.3.1 2019-12-01 [2] CRAN (R 3.6.1)
tibble * 2.1.3 2019-06-06 [2] CRAN (R 3.6.0)
tidyr * 1.0.2 2020-01-24 [2] CRAN (R 3.6.2)
tidyselect 1.0.0 2020-01-27 [2] CRAN (R 3.6.2)
tidyverse * 1.3.0 2019-11-21 [2] CRAN (R 3.6.1)
truncnorm 1.0-8 2018-02-27 [2] CRAN (R 3.6.0)
usethis 1.5.1 2019-07-04 [2] CRAN (R 3.6.1)
vctrs 0.2.2 2020-01-24 [2] CRAN (R 3.6.2)
viridisLite 0.3.0 2018-02-01 [2] CRAN (R 3.6.0)
webshot 0.5.2 2019-11-22 [2] CRAN (R 3.6.1)
whisker 0.4 2019-08-28 [2] CRAN (R 3.6.1)
withr 2.1.2 2018-03-15 [2] CRAN (R 3.6.0)
workflowr * 1.6.0 2019-12-19 [2] CRAN (R 3.6.2)
xfun 0.12 2020-01-13 [2] CRAN (R 3.6.2)
XML 3.99-0.3 2020-01-20 [2] CRAN (R 3.6.2)
xml2 1.2.2 2019-08-09 [2] CRAN (R 3.6.1)
xtable 1.8-4 2019-04-21 [2] CRAN (R 3.6.0)
XVector 0.26.0 2019-10-29 [2] Bioconductor
yaml 2.2.1 2020-02-01 [2] CRAN (R 3.6.2)
zlibbioc 1.32.0 2019-10-29 [2] Bioconductor
zoo 1.8-7 2020-01-10 [2] CRAN (R 3.6.2)
[1] /home/steveped/R/x86_64-pc-linux-gnu-library/3.6
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library