Last updated: 2020-01-25

Checks: 7 0

Knit directory: 20170327_Psen2S4Ter_RNASeq/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200119) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/3_Enrichment.Rmd
    Untracked:  analysis/figure/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd 07a6053 Steve Ped 2020-01-25 Corrected paths
Rmd 09180e5 Steve Ped 2020-01-25 Added data export
Rmd fca554f Steve Ped 2020-01-25 Tweaked explanations & labels
Rmd be95a60 Steve Ped 2020-01-25 Finished first pass of DE Analysis
html be95a60 Steve Ped 2020-01-25 Finished first pass of DE Analysis
html f04bb47 Steve Ped 2020-01-25 Minor typos
Rmd 7251bd2 Steve Ped 2020-01-25 Tweaks
Rmd 7fea156 Steve Ped 2020-01-25 Added rRNA checks & heatmaps
html 9bff516 Steve Ped 2020-01-24 Added analysis without CQN
Rmd 7b680b2 Steve Ped 2020-01-24 Added analysis without CQN
html 95ccc90 Steve Ped 2020-01-22 Compiled DE so far
Rmd 1858f49 Steve Ped 2020-01-22 Started presenting actual DE genes
html 1858f49 Steve Ped 2020-01-22 Started presenting actual DE genes
Rmd 10a285b Steve Ped 2020-01-22 Expanded description in index and tried reusing an image
html 10a285b Steve Ped 2020-01-22 Expanded description in index and tried reusing an image
html aabb570 Steve Ped 2020-01-21 Rebuilt to get rid of warnings
Rmd 71b8832 Steve Ped 2020-01-21 Added DE QC for GC bias
html 71b8832 Steve Ped 2020-01-21 Added DE QC for GC bias
Rmd e825637 Steve Ped 2020-01-21 Minor updates to DE plots
html 01512da Steve Ped 2020-01-21 Added initial DE analysis to index
Rmd fbb6242 Steve Ped 2020-01-21 Paused DE analysis
Rmd c560637 Steve Ped 2020-01-20 Started DE analysis

Setup

library(ngsReports)
library(tidyverse)
library(magrittr)
library(edgeR)
library(AnnotationHub)
library(ensembldb)
library(scales)
library(pander)
library(cqn)
library(ggrepel)
library(pheatmap)
library(RColorBrewer)
if (interactive()) setwd(here::here())
theme_set(theme_bw())
panderOptions("big.mark", ",")
panderOptions("table.split.table", Inf)
panderOptions("table.style", "rmarkdown")

Annotations

ah <- AnnotationHub() %>%
    subset(species == "Danio rerio") %>%
    subset(rdataclass == "EnsDb")
ensDb <- ah[["AH74989"]]
grTrans <- transcripts(ensDb)
trLengths <- exonsBy(ensDb, "tx") %>%
    width() %>%
    vapply(sum, integer(1))
mcols(grTrans)$length <- trLengths[names(grTrans)]
gcGene <- grTrans %>%
    mcols() %>%
    as.data.frame() %>%
    dplyr::select(gene_id, tx_id, gc_content, length) %>%
    as_tibble() %>%
    group_by(gene_id) %>%
    summarise(
        gc_content = sum(gc_content*length) / sum(length),
        length = ceiling(median(length))
    )
grGenes <- genes(ensDb)
mcols(grGenes) %<>%
    as.data.frame() %>%
    left_join(gcGene) %>%
    as.data.frame() %>%
    DataFrame()

Similarly to the Quality Assessment steps, GRanges objects were formed at the gene and transcript levels, to enable estimation of GC content and length for each transcript and gene. GC content and transcript length are available for each transcript, and for gene-level estimates, GC content was taken as the sum of all GC bases divided by the sum of all transcript lengths, effectively averaging across all transcripts. Gene length was defined as the median transcript length.

samples <- read_csv("data/samples.csv") %>%
  distinct(sampleName, .keep_all = TRUE) %>%
  dplyr::select(sample = sampleName, sampleID, genotype) %>%
  mutate(
    genotype = factor(genotype, levels = c("WT", "Het", "Hom")),
    mutant = genotype %in% c("Het", "Hom"),
    homozygous = genotype == "Hom"
  )
genoCols <- samples$genotype %>%
  levels() %>%
  length() %>%
  brewer.pal("Set1") %>%
  setNames(levels(samples$genotype))

Sample metadata was also loaded, with only the sampleID and genotype being retained. All other fields were considered irrelevant.

Count Data

minCPM <- 1.5
minSamples <- 4
dgeList <- file.path("data", "2_alignedData", "featureCounts", "genes.out") %>%
    read_delim(delim = "\t") %>%
    set_names(basename(names(.))) %>%
    as.data.frame() %>%
    column_to_rownames("Geneid") %>%
    as.matrix() %>% 
    set_colnames(str_remove(colnames(.), "Aligned.sortedByCoord.out.bam")) %>%
    .[rowSums(cpm(.) >= minCPM) >= minCPM,] %>%
    DGEList(
        samples = tibble(sample = colnames(.)) %>%
            left_join(samples),
        genes = grGenes[rownames(.)] %>%
            as.data.frame() %>%
            dplyr::select(
                chromosome = seqnames, start, end, 
                gene_id, gene_name, gene_biotype, description, 
                entrezid, gc_content, length
            )
    ) %>%
    .[!grepl("rRNA", .$genes$gene_biotype),] %>%
    calcNormFactors()

Gene-level count data as output by featureCounts, was loaded and formed into a DGEList object. During this process, genes were removed if:

  • They were not considered as detectable (CPM < 1.5 in > 8 samples). This translates to > 18 reads assigned a gene in all samples from one or more of the genotype groups
  • The gene_biotype was any type of rRNA.

These filtering steps returned gene-level counts for 16,640 genes, with total library sizes between 11,852,141 and 16,997,219 reads assigned to genes. It was noted that these library sizes were about 1.5-fold larger than the transcript-level counts used for the QA steps.

cpm(dgeList, log = TRUE) %>%
  as.data.frame() %>%
  pivot_longer(
    cols = everything(),
    names_to = "sample",
    values_to = "logCPM"
  ) %>%
  split(f = .$sample) %>%
  lapply(function(x){
    d <- density(x$logCPM)
    tibble(
      sample = unique(x$sample),
      x = d$x,
      y = d$y
    )
  }) %>%
  bind_rows() %>%
  left_join(samples) %>%
  ggplot(aes(x, y, colour = genotype, group = sample)) +
  geom_line() +
  scale_colour_manual(
    values = genoCols
  ) +
  labs(
    x = "logCPM",
    y = "Density",
    colour = "Genotype"
  )
*Expression density plots for all samples after filtering, showing logCPM values.*

Expression density plots for all samples after filtering, showing logCPM values.

Version Author Date
7251bd2 Steve Ped 2020-01-25
1858f49 Steve Ped 2020-01-22
01512da Steve Ped 2020-01-21

Additional Functions

contLabeller <- as_labeller(
    c(
        HetVsWT = "S4Ter/+ Vs +/+",
        HomVsWT = "S4Ter/S4Ter Vs +/+",
        HomVsHet = "S4Ter/S4Ter Vs S4Ter/+",
        Hom = "S4Ter/S4Ter",
        Het = "S4Ter/+",
        WT = "+/+",
        mutant = "S4Ter Vs WT",
        homozygous = "S4Ter/S4Ter Vs S4Ter/+"
    )
)
geneLabeller <- structure(grGenes$gene_name, names = grGenes$gene_id) %>%
    as_labeller()

Labeller functions for genotypes, contrasts and gene names were additionally defined for simpler plotting using ggplot2.

Analysis

PCA

pca <- dgeList %>%
    cpm(log = TRUE) %>%
    t() %>%
    prcomp() 
pcaVars <- percent_format(0.1)(summary(pca)$importance["Proportion of Variance",])
pca$x %>%
  as.data.frame() %>%
  rownames_to_column("sample") %>%
  left_join(samples) %>%
  as_tibble() %>%
  ggplot(aes(PC1, PC2, colour = genotype, fill = genotype)) +
  geom_point() +
  geom_text_repel(aes(label = sampleID), show.legend = FALSE) +
  stat_ellipse(geom = "polygon", alpha = 0.05, show.legend = FALSE) +
  guides(fill = FALSE) +
  scale_colour_manual(
    values = genoCols
  ) +
  labs(
    x = paste0("PC1 (", pcaVars[["PC1"]], ")"),
    y = paste0("PC2 (", pcaVars[["PC2"]], ")"),
    colour = "Genotype"
  )
*PCA of gene-level counts.*

PCA of gene-level counts.

Version Author Date
7251bd2 Steve Ped 2020-01-25
1858f49 Steve Ped 2020-01-22
01512da Steve Ped 2020-01-21

A Principal Component Analysis (PCA) was also performed using logCPM values from each sample. Both mutant genotypes appear to cluster together, however it has previously been noted that GC content appears to track closely with PC1, as a result of variable rRNA removal.

Model Description

Version Author Date
10a285b Steve Ped 2020-01-22

The initial expectation was to perform three pairwise comparisons as described in the above figure. However, given there was a strong similarity between mutants, the model matrix was defined as containing an intercept (i.e. baseline expression in wild-type), with additional columns defining presence of any mutant alleles, and the final column capturing the difference between mutants. This avoids any issues with the use of hard cutoffs when combing lists, such as would occur when comparing separate results from both mutant genotypes against wild-type samples.

d <- model.matrix(~mutant + homozygous, data = dgeList$samples) %>%
  set_colnames(str_remove(colnames(.), "TRUE"))
pheatmap(
  d, 
  cluster_cols = FALSE, 
  cluster_rows = FALSE, 
  color = c("white", "grey50"),
  annotation_row = dgeList$samples["genotype"],
  annotation_colors = list(genotype = genoCols),
  legend = FALSE
)
*Visualisation of the design matrix*

Visualisation of the design matrix

Version Author Date
7251bd2 Steve Ped 2020-01-25

Normalisation

As GC content and length was noted as being of concern for this dataset, conditional-quantile normalisation was performed using the cqn package. This adds a gene and sample-level offset for each count which takes into account any systemic bias, such as that identified previously as an artefact of variable rRNA removal. The resultant glm.offset values were added to the original DGEList object, and all dispersion estimates were calculated.

gcCqn <- cqn(
    counts = dgeList$counts,
    x = dgeList$genes$gc_content,
    lengths = dgeList$genes$length,
    sizeFactors = dgeList$samples$lib.size
)
par(mfrow = c(1, 2))
cqnplot(gcCqn, n = 1, xlab = "GC Content", col = genoCols)
cqnplot(gcCqn, n = 2, xlab = "Length", col = genoCols)
legend("bottomright", legend = levels(samples$genotype), col = genoCols, lty = 1)
*Model fits for GC content and gene length under the CQN model. Genotype-specific effects are clearly visible.*

Model fits for GC content and gene length under the CQN model. Genotype-specific effects are clearly visible.

Version Author Date
7251bd2 Steve Ped 2020-01-25
01512da Steve Ped 2020-01-21
par(mfrow = c(1, 1))
dgeList$offset <- gcCqn$glm.offset 
dgeList %<>% estimateDisp(design = d)

Model Fitting

minLfc <- log2(2)
alpha <- 0.01
fit <- glmFit(dgeList)
topTables <- colnames(d)[2:3] %>%
  sapply(function(x){
    glmLRT(fit, coef = x) %>%
      topTags(n = Inf) %>%
      .[["table"]] %>%
      as_tibble() %>%
      arrange(PValue) %>%
      dplyr::select(
        gene_id, gene_name, logFC, logCPM, PValue, FDR, everything()  
      ) %>%
      mutate(
        coef = x,
        bonfP = p.adjust(PValue, "bonf"),
        DE = case_when(
          bonfP < alpha ~ TRUE,
          FDR < alpha & abs(logFC) > minLfc ~ TRUE
        ),
        DE = ifelse(is.na(DE), FALSE, DE)
      )
  }, simplify = FALSE)

Models were fit using the negative-binomial approaches of glmFit(). Top Tables were then obtained using likelihood-ratio tests in glmLRT(). These test the standard \(H_0\) that the true value of the estimated model coefficient is zero. These model coefficients effectively estimate:

  1. the effect of the presence of a mutation, and
  2. the difference in heterozygous and heterozygous mutants

For enrichment testing, genes were initially considered to be DE using:

  1. An Bonferroni-adjusted p-value < 0.01
  2. An FDR-adjusted p-value < 0.01 along with an estimated logFC outside of the range \(\pm \log_2(2)\).

As fewer genes were detected in the comparisons between homozygous and heterozygous mutants, a simple FDR of 0.05 was subsequently chosen for this comparison.

topTables$homozygous %<>%
  mutate(DE = FDR < 0.05)

Using these criteria, the following initial DE gene-sets were defined:

topTables %>%
  lapply(dplyr::filter, DE) %>% 
  vapply(nrow, integer(1)) %>%
  set_names(
    case_when(
      names(.) == "mutant" ~ "psen2 mutant",
      names(.) == "homozygous" ~ "HomVsHet"
    )
  ) %>%
  pander()
psen2 mutant HomVsHet
615 7
deCols <- c(
  `FALSE` = rgb(0.5, 0.5, 0.5, 0.4), 
  `TRUE` = rgb(1, 0, 0, 0.7)
)

Model Checking

GC and Length Bias

topTables %>%
    bind_rows() %>%
    mutate(stat = -sign(logFC)*log10(PValue)) %>%
    ggplot(aes(gc_content, stat)) +
    geom_point(aes(colour = DE), alpha = 0.4) +
    geom_smooth(se = FALSE) +
    facet_wrap(~coef, labeller = contLabeller)  +
    labs(
        x = "GC content (%)",
        y = "Ranking Statistic"
    ) +
    coord_cartesian(ylim = c(-10, 10)) +
    scale_colour_manual(values = deCols) +
    theme(legend.position = "none")
*Checks for GC bias in differential expression. GC content is shown against the ranking statistic, using -log10(p) multiplied by the sign of log fold-change. A small amount of bias was noted particularly in the comparison between mutants and wild-type.*

Checks for GC bias in differential expression. GC content is shown against the ranking statistic, using -log10(p) multiplied by the sign of log fold-change. A small amount of bias was noted particularly in the comparison between mutants and wild-type.

Version Author Date
be95a60 Steve Ped 2020-01-25
7251bd2 Steve Ped 2020-01-25
1858f49 Steve Ped 2020-01-22
71b8832 Steve Ped 2020-01-21
topTables %>%
    bind_rows() %>%
    mutate(stat = -sign(logFC)*log10(PValue)) %>%
    ggplot(aes(length, stat)) +
    geom_point(aes(colour = DE), alpha = 0.4) +
    geom_smooth(se = FALSE) +
    facet_wrap(~coef, labeller = contLabeller)  +
    labs(
        x = "Gene Length (bp)",
        y = "Ranking Statistic"
    ) +
    coord_cartesian(ylim = c(-10, 10)) +
    scale_x_log10(labels = comma) +
    scale_colour_manual(values = deCols) +
    theme(legend.position = "none")
*Checks for length bias in differential expression. Gene length is shown against the ranking statistic, using -log10(p) multiplied by the sign of log fold-change. Again, a small amount of bias was noted particularly in the comparison between mutants and wild-type.*

Checks for length bias in differential expression. Gene length is shown against the ranking statistic, using -log10(p) multiplied by the sign of log fold-change. Again, a small amount of bias was noted particularly in the comparison between mutants and wild-type.

Version Author Date
be95a60 Steve Ped 2020-01-25
7251bd2 Steve Ped 2020-01-25
1858f49 Steve Ped 2020-01-22
71b8832 Steve Ped 2020-01-21

Checks for both GC and length bias on differential expression showed that a small bias remained evident, despite using conditional-quantile normalisation. However, an alternative analysis on the same dataset excluding the CQN steps revealed vastly different and exaggerated bias. As such, the impact of CQN normalisation was considered to be appropriate.

Checks for rRNA impacts

rawFqc <- list.files(
  path = "data/0_rawData/FastQC/",
  pattern = "zip",
  full.names = TRUE
) %>%
  FastqcDataList()
rawGC <- getModule(rawFqc, "Per_sequence_GC") %>%
  group_by(Filename) %>% 
  mutate(Freq = Count / sum(Count)) %>%
  dplyr::filter(GC_Content > 70) %>% 
  summarise(Freq = sum(Freq)) %>% 
  arrange(desc(Freq)) %>%
  mutate(sample = str_remove(Filename, "_R[12].fastq.gz")) %>%
  group_by(sample) %>%
  summarise(Freq = mean(Freq)) %>%
  left_join(samples) 
riboVec <- structure(rawGC$Freq, names = rawGC$sample)
riboCors <- cpm(dgeList, log = TRUE)%>%
  apply(1, function(x){
    cor(x, riboVec[names(x)])
  })

Given the previously identified concerns about variable rRNA removal, correlations were calculated between each gene’s expression values and the proportion of the raw libraries with > 70% GC. Those with the strongest correlation, and for which the FDR is < 0.01 are shown below.

Most notably, the most correlated RNA with these rRNA proportions was eif1b which is the initiation factor for translation. As this is a protein coding gene with the protein product interacting with the ribosome, the reason for this correlation at the RNA level raises difficult questions which cannot be answered here.

topTables$mutant %>%
  mutate(riboCors = riboCors[gene_id]) %>%
  dplyr::filter(
    FDR < alpha
  ) %>%
  dplyr::select(gene_id, gene_name, logFC, logCPM, FDR, riboCors, DE) %>%
  arrange(desc(riboCors)) %>%
  mutate(FDR = case_when(
    FDR >= 0.0001 ~ sprintf("%.4f", FDR),
    FDR < 0.0001 ~ sprintf("%.2e", FDR)
    )
  ) %>%
  dplyr::slice(1:40) %>%
  pander(
    justify = "llrrrrl",
    style = "rmarkdown",
    caption = paste(
      "The", nrow(.), "genes most correlated with the original high GC content.",
      "Many failed the selection criteria for differential expression,",
      "primarily due to the stringent logFC filter.",
      "However, the unusually high number of ribosomal protein coding genes",
      "is currently difficult to explain, but notable.",
      "Two possibilities are these RNAs are somehow translated with physical",
      "linkage to the ribosomal complex, or that there is genuinely a global",
      "shift in the level of ribosomal activity.",
      "Neither explanation is initially satisfying."
    )
  )
The 40 genes most correlated with the original high GC content. Many failed the selection criteria for differential expression, primarily due to the stringent logFC filter. However, the unusually high number of ribosomal protein coding genes is currently difficult to explain, but notable. Two possibilities are these RNAs are somehow translated with physical linkage to the ribosomal complex, or that there is genuinely a global shift in the level of ribosomal activity. Neither explanation is initially satisfying.
gene_id gene_name logFC logCPM FDR riboCors DE
ENSDARG00000012688 eif1b 0.5229 7.482 0.0024 0.9906 FALSE
ENSDARG00000012871 npepl1 0.6049 4.485 0.0028 0.9823 FALSE
ENSDARG00000103994 ppiab 0.8891 7.372 6.18e-05 0.9814 FALSE
ENSDARG00000034897 rps10 0.8809 5.898 0.0001 0.9795 FALSE
ENSDARG00000068995 h2afx1 0.644 6.605 0.0025 0.9734 FALSE
ENSDARG00000077291 rps2 1.305 7.52 1.58e-07 0.9733 TRUE
ENSDARG00000002240 psmb6 0.6407 4.702 0.0056 0.9733 FALSE
ENSDARG00000036298 rps13 0.6762 6.076 8.26e-05 0.9707 FALSE
ENSDARG00000037071 rps26 0.651 5.323 0.0045 0.97 FALSE
ENSDARG00000057026 ran 0.6265 7.071 0.0002 0.9689 FALSE
ENSDARG00000111753 hist1h4l 0.9775 2.015 0.0061 0.9685 FALSE
ENSDARG00000038028 ndufa6 1.051 4.693 2.57e-05 0.9667 TRUE
ENSDARG00000046157 RPS17 0.7725 6.74 0.0002 0.9667 FALSE
ENSDARG00000078929 abhd16a 0.5358 4.906 0.0013 0.9664 FALSE
ENSDARG00000099226 CABZ01076667.1 0.5407 3.973 0.0008 0.9652 FALSE
ENSDARG00000014690 rps4x 0.6511 7.368 0.0003 0.9649 FALSE
ENSDARG00000092553 slc25a5 0.9012 8.952 2.64e-05 0.9648 TRUE
ENSDARG00000011665 aldoaa 0.6082 7.763 0.0029 0.9639 FALSE
ENSDARG00000045447 slc35g2b 0.766 4.873 0.0002 0.9627 FALSE
ENSDARG00000034534 atp6v1aa 0.6212 7.049 0.0003 0.9618 FALSE
ENSDARG00000099766 myl12.1 0.3841 6.468 0.0058 0.9595 FALSE
ENSDARG00000037962 psmb7 0.7979 4.104 3.30e-05 0.9592 FALSE
ENSDARG00000019230 rpl7a 0.9036 7.71 5.34e-05 0.9591 FALSE
ENSDARG00000099104 rpl24 0.662 6.729 0.0004 0.9571 FALSE
ENSDARG00000055475 rps27.2 0.6175 6.719 0.0019 0.9566 FALSE
ENSDARG00000026322 dhrs13a.1 0.7388 2.804 0.0032 0.9563 FALSE
ENSDARG00000025581 rpl10 1.096 6.381 0.0003 0.9562 TRUE
ENSDARG00000092807 si:dkey-151g10.6 0.8303 6.491 9.37e-07 0.9551 TRUE
ENSDARG00000053365 rpl31 1.215 6.638 8.99e-06 0.9549 TRUE
ENSDARG00000075445 psmb5 0.8405 5.59 5.33e-05 0.9544 FALSE
ENSDARG00000030237 pgrmc2 0.3734 6.21 0.0046 0.9538 FALSE
ENSDARG00000011405 rps9 0.823 6.209 0.0006 0.9537 FALSE
ENSDARG00000035808 clcn4 0.6887 5.678 0.0014 0.9528 FALSE
ENSDARG00000043561 psmc1b 0.6844 4.673 0.0004 0.9517 FALSE
ENSDARG00000017235 eif5a 0.7003 7.388 5.54e-06 0.9514 TRUE
ENSDARG00000034291 rpl37 1.085 5.662 5.03e-06 0.9513 TRUE
ENSDARG00000104173 tufm 0.7123 5.051 0.0013 0.9511 FALSE
ENSDARG00000058451 rpl6 0.7848 7.877 0.0004 0.9497 FALSE
ENSDARG00000014867 rpl8 0.9139 6.716 1.50e-06 0.9493 TRUE
ENSDARG00000089976 spcs3 0.7379 3.54 0.0034 0.949 FALSE

Other Artefacts

topTables %>%
  bind_rows() %>%
  arrange(DE) %>%
  ggplot(aes(logCPM, logFC)) +
  geom_point(aes(colour = DE)) +
  geom_text_repel(
    aes(label = gene_name, colour = DE),
    data = . %>% dplyr::filter(DE & abs(logFC) > 2.9)
  ) +
  geom_text_repel(
    aes(label = gene_name, colour = DE),
    data = . %>% dplyr::filter(FDR < 0.05 & coef == "homozygous")
  ) +
  geom_smooth(se = FALSE) +
  geom_hline(
    yintercept = c(-1, 1)*minLfc,
    linetype = 2,
    colour = "red"
  ) +
  facet_wrap(~coef, nrow = 2, labeller = contLabeller) +
  scale_y_continuous(breaks = seq(-8, 8, by = 2)) +
  scale_colour_manual(values = deCols) +
  theme(legend.position = "none")
*MA plots checking for any logFC bias across the range of expression values. The small curve in the average at the low end of expression values was considered to be an artefact of the sparse points at this end. Initial DE genes are shown in red, with select points labelled.*

MA plots checking for any logFC bias across the range of expression values. The small curve in the average at the low end of expression values was considered to be an artefact of the sparse points at this end. Initial DE genes are shown in red, with select points labelled.

Version Author Date
be95a60 Steve Ped 2020-01-25
7251bd2 Steve Ped 2020-01-25
1858f49 Steve Ped 2020-01-22
71b8832 Steve Ped 2020-01-21
01512da Steve Ped 2020-01-21
topTables %>%
  bind_rows() %>%
  ggplot(aes(PValue)) +
  geom_histogram(
    binwidth = 0.02,
    colour = "black", fill = "grey90"
    ) +
  facet_wrap(~coef, labeller = contLabeller)
*Histograms of p-values for both sets of coefficients. Values for the difference between mutants follow the expected distribution for when there are very few differences, whilst values for the presence of a mutant allele show the expected distribution for when there are many differences. In particular the spike near zero indicates many genes which are differentially expressed.*

Histograms of p-values for both sets of coefficients. Values for the difference between mutants follow the expected distribution for when there are very few differences, whilst values for the presence of a mutant allele show the expected distribution for when there are many differences. In particular the spike near zero indicates many genes which are differentially expressed.

Version Author Date
be95a60 Steve Ped 2020-01-25
7251bd2 Steve Ped 2020-01-25

Results

topTables %>%
    bind_rows() %>%
    ggplot(aes(logFC, -log10(PValue), colour = DE)) +
    geom_point(alpha = 0.4) +
    geom_text_repel(
        aes(label = gene_name),
        data = . %>% dplyr::filter(PValue < 1e-12 | abs(logFC) > 4)
    ) +
    geom_text_repel(
        aes(label = gene_name),
        data = . %>% dplyr::filter(FDR < 0.05 & coef == "homozygous")
    ) +
  geom_vline(
    xintercept = c(-1, 1)*minLfc,
    linetype = 2,
    colour = "red"
  ) +
  facet_wrap(~coef, nrow = 1, labeller = contLabeller) +
  scale_colour_manual(values = deCols) +
  scale_x_continuous(breaks = seq(-8, 8, by = 2)) +
  theme(legend.position = "none") 
*Volcano Plots showing DE genes against logFC.*

Volcano Plots showing DE genes against logFC.

Version Author Date
be95a60 Steve Ped 2020-01-25
7251bd2 Steve Ped 2020-01-25
1858f49 Steve Ped 2020-01-22
71b8832 Steve Ped 2020-01-21
01512da Steve Ped 2020-01-21
deGenes <- topTables %>%
  lapply(dplyr::filter, DE) %>%
  lapply(magrittr::extract2, "gene_id") 

Genes Tracking with psen2S4Ter

n <- 40

A set of 615 genes was identified as DE in the presence of the mutant psen2S4Ter transcript. The most highly ranked 40 genes are shown in the following heatmap. The presence of SNORNA70 in this list is of some concern as this is an ncRNA which is known to interact with ribosomes. In particular the expression pattern follows very closely the patterns previosuly identified for rRNA depletion using the percentage of libraries with GC > 70. For example, WT_86 was the sample with most complete rRNA depletion whilst, Het_84 showed the largest proportion of the library as being derived from rRNA. These represent the samples with lowest and highest expression of SNORNA70 respectively.

genoCols <- RColorBrewer::brewer.pal(3, "Set1") %>%
  setNames(levels(dgeList$samples$genotype))
dgeList %>%
  cpm(log = TRUE) %>% 
  extract(deGenes$mutant[seq_len(n)],) %>%
  as.data.frame() %>%
  set_rownames(unlist(geneLabeller(rownames(.)))) %>%
  pheatmap::pheatmap(
    color = viridis_pal(option = "magma")(100),
    labels_col = colnames(.) %>%
      str_replace_all(".+(Het|Hom|WT).+F3(_[0-9]{2}).+", "\\1\\2"),
    legend_breaks = c(seq(-2, 8, by = 2), max(.)),
    legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
    annotation_col = dgeList$samples %>%
      dplyr::select(Genotype = genotype),
    annotation_names_col = FALSE,
    annotation_colors = list(Genotype = genoCols),
    cutree_cols = 2,
    cutree_rows = 1#6
  )
*The 40 most highly-ranked genes by FDR which are commonly considered DE between mutants and WT samples*

The 40 most highly-ranked genes by FDR which are commonly considered DE between mutants and WT samples

Version Author Date
7251bd2 Steve Ped 2020-01-25
1858f49 Steve Ped 2020-01-22

Genes showing differences between mutants

When inspecting the genes showing differences between mutants, it was noted that the WT and Homozygous mutant samples clustered together, implying that there was a unique effect on these genes which was specific to the heterozygous condition.

dgeList %>%
  cpm(log = TRUE) %>% 
  extract(deGenes$homozygous,) %>%
  as.data.frame() %>%
  set_rownames(unlist(geneLabeller(rownames(.)))) %>%
  pheatmap::pheatmap(
    color = viridis_pal(option = "magma")(100),
    labels_col = colnames(.) %>%
      str_replace_all(".+(Het|Hom|WT).+F3(_[0-9]{2}).+", "\\1\\2"),
    legend_breaks = c(seq(-2, 8, by = 2), max(.)),
    legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
    annotation_col = dgeList$samples %>%
      dplyr::select(Genotype = genotype),
    annotation_names_col = FALSE,
    annotation_colors = list(Genotype = genoCols),
    cutree_cols = 2,
    cutree_rows = 3
  )
*The 7 most highly-ranked genes by FDR which are DE between between mutants*

The 7 most highly-ranked genes by FDR which are DE between between mutants

Version Author Date
7251bd2 Steve Ped 2020-01-25

Data Export

Final gene lists were exported as separate csv files, along with fit and dgeList objects.

write_csv(topTables$mutant, "output/psen2VsWT.csv")
write_csv(topTables$homozygous, "output/psen2HomVsHet.csv")
write_rds(fit, "data/fit.rds")
write_rds(dgeList, "data/dgeList.rds")

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.2 (2019-12-12)
 os       Ubuntu 18.04.3 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language en_AU:en                    
 collate  en_AU.UTF-8                 
 ctype    en_AU.UTF-8                 
 tz       Australia/Adelaide          
 date     2020-01-25                  

─ Packages ───────────────────────────────────────────────────────────────────
 package                * version  date       lib source        
 AnnotationDbi          * 1.48.0   2019-10-29 [2] Bioconductor  
 AnnotationFilter       * 1.10.0   2019-10-29 [2] Bioconductor  
 AnnotationHub          * 2.18.0   2019-10-29 [2] Bioconductor  
 askpass                  1.1      2019-01-13 [2] CRAN (R 3.6.0)
 assertthat               0.2.1    2019-03-21 [2] CRAN (R 3.6.0)
 backports                1.1.5    2019-10-02 [2] CRAN (R 3.6.1)
 Biobase                * 2.46.0   2019-10-29 [2] Bioconductor  
 BiocFileCache          * 1.10.2   2019-11-08 [2] Bioconductor  
 BiocGenerics           * 0.32.0   2019-10-29 [2] Bioconductor  
 BiocManager              1.30.10  2019-11-16 [2] CRAN (R 3.6.1)
 BiocParallel             1.20.1   2019-12-21 [2] Bioconductor  
 BiocVersion              3.10.1   2019-06-06 [2] Bioconductor  
 biomaRt                  2.42.0   2019-10-29 [2] Bioconductor  
 Biostrings               2.54.0   2019-10-29 [2] Bioconductor  
 bit                      1.1-15.1 2020-01-14 [2] CRAN (R 3.6.2)
 bit64                    0.9-7    2017-05-08 [2] CRAN (R 3.6.0)
 bitops                   1.0-6    2013-08-17 [2] CRAN (R 3.6.0)
 blob                     1.2.1    2020-01-20 [2] CRAN (R 3.6.2)
 broom                    0.5.3    2019-12-14 [2] CRAN (R 3.6.2)
 callr                    3.4.0    2019-12-09 [2] CRAN (R 3.6.2)
 cellranger               1.1.0    2016-07-27 [2] CRAN (R 3.6.0)
 cli                      2.0.1    2020-01-08 [2] CRAN (R 3.6.2)
 cluster                  2.1.0    2019-06-19 [2] CRAN (R 3.6.1)
 colorspace               1.4-1    2019-03-18 [2] CRAN (R 3.6.0)
 cqn                    * 1.32.0   2019-10-29 [2] Bioconductor  
 crayon                   1.3.4    2017-09-16 [2] CRAN (R 3.6.0)
 curl                     4.3      2019-12-02 [2] CRAN (R 3.6.2)
 data.table               1.12.8   2019-12-09 [2] CRAN (R 3.6.2)
 DBI                      1.1.0    2019-12-15 [2] CRAN (R 3.6.2)
 dbplyr                 * 1.4.2    2019-06-17 [2] CRAN (R 3.6.0)
 DelayedArray             0.12.2   2020-01-06 [2] Bioconductor  
 desc                     1.2.0    2018-05-01 [2] CRAN (R 3.6.0)
 devtools                 2.2.1    2019-09-24 [2] CRAN (R 3.6.1)
 digest                   0.6.23   2019-11-23 [2] CRAN (R 3.6.1)
 dplyr                  * 0.8.3    2019-07-04 [2] CRAN (R 3.6.1)
 edgeR                  * 3.28.0   2019-10-29 [2] Bioconductor  
 ellipsis                 0.3.0    2019-09-20 [2] CRAN (R 3.6.1)
 ensembldb              * 2.10.2   2019-11-20 [2] Bioconductor  
 evaluate                 0.14     2019-05-28 [2] CRAN (R 3.6.0)
 FactoMineR               2.1      2020-01-17 [2] CRAN (R 3.6.2)
 fansi                    0.4.1    2020-01-08 [2] CRAN (R 3.6.2)
 farver                   2.0.3    2020-01-16 [2] CRAN (R 3.6.2)
 fastmap                  1.0.1    2019-10-08 [2] CRAN (R 3.6.1)
 flashClust               1.01-2   2012-08-21 [2] CRAN (R 3.6.1)
 forcats                * 0.4.0    2019-02-17 [2] CRAN (R 3.6.0)
 fs                       1.3.1    2019-05-06 [2] CRAN (R 3.6.0)
 generics                 0.0.2    2018-11-29 [2] CRAN (R 3.6.0)
 GenomeInfoDb           * 1.22.0   2019-10-29 [2] Bioconductor  
 GenomeInfoDbData         1.2.2    2019-11-21 [2] Bioconductor  
 GenomicAlignments        1.22.1   2019-11-12 [2] Bioconductor  
 GenomicFeatures        * 1.38.1   2020-01-22 [2] Bioconductor  
 GenomicRanges          * 1.38.0   2019-10-29 [2] Bioconductor  
 ggdendro                 0.1-20   2016-04-27 [2] CRAN (R 3.6.0)
 ggplot2                * 3.2.1    2019-08-10 [2] CRAN (R 3.6.1)
 ggrepel                * 0.8.1    2019-05-07 [2] CRAN (R 3.6.0)
 git2r                    0.26.1   2019-06-29 [2] CRAN (R 3.6.1)
 glue                     1.3.1    2019-03-12 [2] CRAN (R 3.6.0)
 gtable                   0.3.0    2019-03-25 [2] CRAN (R 3.6.0)
 haven                    2.2.0    2019-11-08 [2] CRAN (R 3.6.1)
 highr                    0.8      2019-03-20 [2] CRAN (R 3.6.0)
 hms                      0.5.3    2020-01-08 [2] CRAN (R 3.6.2)
 htmltools                0.4.0    2019-10-04 [2] CRAN (R 3.6.1)
 htmlwidgets              1.5.1    2019-10-08 [2] CRAN (R 3.6.1)
 httpuv                   1.5.2    2019-09-11 [2] CRAN (R 3.6.1)
 httr                     1.4.1    2019-08-05 [2] CRAN (R 3.6.1)
 hwriter                  1.3.2    2014-09-10 [2] CRAN (R 3.6.0)
 interactiveDisplayBase   1.24.0   2019-10-29 [2] Bioconductor  
 IRanges                * 2.20.2   2020-01-13 [2] Bioconductor  
 jpeg                     0.1-8.1  2019-10-24 [2] CRAN (R 3.6.1)
 jsonlite                 1.6      2018-12-07 [2] CRAN (R 3.6.0)
 kableExtra               1.1.0    2019-03-16 [2] CRAN (R 3.6.1)
 knitr                    1.27     2020-01-16 [2] CRAN (R 3.6.2)
 labeling                 0.3      2014-08-23 [2] CRAN (R 3.6.0)
 later                    1.0.0    2019-10-04 [2] CRAN (R 3.6.1)
 lattice                  0.20-38  2018-11-04 [4] CRAN (R 3.6.0)
 latticeExtra             0.6-29   2019-12-19 [2] CRAN (R 3.6.2)
 lazyeval                 0.2.2    2019-03-15 [2] CRAN (R 3.6.0)
 leaps                    3.1      2020-01-16 [2] CRAN (R 3.6.2)
 lifecycle                0.1.0    2019-08-01 [2] CRAN (R 3.6.1)
 limma                  * 3.42.0   2019-10-29 [2] Bioconductor  
 locfit                   1.5-9.1  2013-04-20 [2] CRAN (R 3.6.0)
 lubridate                1.7.4    2018-04-11 [2] CRAN (R 3.6.0)
 magrittr               * 1.5      2014-11-22 [2] CRAN (R 3.6.0)
 MASS                     7.3-51.5 2019-12-20 [4] CRAN (R 3.6.2)
 Matrix                   1.2-18   2019-11-27 [2] CRAN (R 3.6.1)
 MatrixModels             0.4-1    2015-08-22 [2] CRAN (R 3.6.1)
 matrixStats              0.55.0   2019-09-07 [2] CRAN (R 3.6.1)
 mclust                 * 5.4.5    2019-07-08 [2] CRAN (R 3.6.1)
 memoise                  1.1.0    2017-04-21 [2] CRAN (R 3.6.0)
 mgcv                     1.8-31   2019-11-09 [4] CRAN (R 3.6.1)
 mime                     0.8      2019-12-19 [2] CRAN (R 3.6.2)
 modelr                   0.1.5    2019-08-08 [2] CRAN (R 3.6.1)
 munsell                  0.5.0    2018-06-12 [2] CRAN (R 3.6.0)
 ngsReports             * 1.1.2    2019-10-16 [1] Bioconductor  
 nlme                     3.1-143  2019-12-10 [4] CRAN (R 3.6.2)
 nor1mix                * 1.3-0    2019-06-13 [2] CRAN (R 3.6.1)
 openssl                  1.4.1    2019-07-18 [2] CRAN (R 3.6.1)
 pander                 * 0.6.3    2018-11-06 [2] CRAN (R 3.6.0)
 pheatmap               * 1.0.12   2019-01-04 [2] CRAN (R 3.6.0)
 pillar                   1.4.3    2019-12-20 [2] CRAN (R 3.6.2)
 pkgbuild                 1.0.6    2019-10-09 [2] CRAN (R 3.6.1)
 pkgconfig                2.0.3    2019-09-22 [2] CRAN (R 3.6.1)
 pkgload                  1.0.2    2018-10-29 [2] CRAN (R 3.6.0)
 plotly                   4.9.1    2019-11-07 [2] CRAN (R 3.6.1)
 plyr                     1.8.5    2019-12-10 [2] CRAN (R 3.6.2)
 png                      0.1-7    2013-12-03 [2] CRAN (R 3.6.0)
 preprocessCore         * 1.48.0   2019-10-29 [2] Bioconductor  
 prettyunits              1.1.1    2020-01-24 [2] CRAN (R 3.6.2)
 processx                 3.4.1    2019-07-18 [2] CRAN (R 3.6.1)
 progress                 1.2.2    2019-05-16 [2] CRAN (R 3.6.0)
 promises                 1.1.0    2019-10-04 [2] CRAN (R 3.6.1)
 ProtGenerics             1.18.0   2019-10-29 [2] Bioconductor  
 ps                       1.3.0    2018-12-21 [2] CRAN (R 3.6.0)
 purrr                  * 0.3.3    2019-10-18 [2] CRAN (R 3.6.1)
 quantreg               * 5.54     2019-12-13 [2] CRAN (R 3.6.2)
 R6                       2.4.1    2019-11-12 [2] CRAN (R 3.6.1)
 rappdirs                 0.3.1    2016-03-28 [2] CRAN (R 3.6.0)
 RColorBrewer           * 1.1-2    2014-12-07 [2] CRAN (R 3.6.0)
 Rcpp                     1.0.3    2019-11-08 [2] CRAN (R 3.6.1)
 RCurl                    1.98-1.1 2020-01-19 [2] CRAN (R 3.6.2)
 readr                  * 1.3.1    2018-12-21 [2] CRAN (R 3.6.0)
 readxl                   1.3.1    2019-03-13 [2] CRAN (R 3.6.0)
 remotes                  2.1.0    2019-06-24 [2] CRAN (R 3.6.0)
 reprex                   0.3.0    2019-05-16 [2] CRAN (R 3.6.0)
 reshape2                 1.4.3    2017-12-11 [2] CRAN (R 3.6.0)
 rlang                    0.4.2    2019-11-23 [2] CRAN (R 3.6.1)
 rmarkdown                2.1      2020-01-20 [2] CRAN (R 3.6.2)
 rprojroot                1.3-2    2018-01-03 [2] CRAN (R 3.6.0)
 Rsamtools                2.2.1    2019-11-06 [2] Bioconductor  
 RSQLite                  2.2.0    2020-01-07 [2] CRAN (R 3.6.2)
 rstudioapi               0.10     2019-03-19 [2] CRAN (R 3.6.0)
 rtracklayer              1.46.0   2019-10-29 [2] Bioconductor  
 rvest                    0.3.5    2019-11-08 [2] CRAN (R 3.6.1)
 S4Vectors              * 0.24.3   2020-01-18 [2] Bioconductor  
 scales                 * 1.1.0    2019-11-18 [2] CRAN (R 3.6.1)
 scatterplot3d            0.3-41   2018-03-14 [2] CRAN (R 3.6.1)
 sessioninfo              1.1.1    2018-11-05 [2] CRAN (R 3.6.0)
 shiny                    1.4.0    2019-10-10 [2] CRAN (R 3.6.1)
 ShortRead                1.44.1   2019-12-19 [2] Bioconductor  
 SparseM                * 1.78     2019-12-13 [2] CRAN (R 3.6.2)
 stringi                  1.4.5    2020-01-11 [2] CRAN (R 3.6.2)
 stringr                * 1.4.0    2019-02-10 [2] CRAN (R 3.6.0)
 SummarizedExperiment     1.16.1   2019-12-19 [2] Bioconductor  
 testthat                 2.3.1    2019-12-01 [2] CRAN (R 3.6.1)
 tibble                 * 2.1.3    2019-06-06 [2] CRAN (R 3.6.0)
 tidyr                  * 1.0.0    2019-09-11 [2] CRAN (R 3.6.1)
 tidyselect               0.2.5    2018-10-11 [2] CRAN (R 3.6.0)
 tidyverse              * 1.3.0    2019-11-21 [2] CRAN (R 3.6.1)
 truncnorm                1.0-8    2018-02-27 [2] CRAN (R 3.6.0)
 usethis                  1.5.1    2019-07-04 [2] CRAN (R 3.6.1)
 vctrs                    0.2.1    2019-12-17 [2] CRAN (R 3.6.2)
 viridisLite              0.3.0    2018-02-01 [2] CRAN (R 3.6.0)
 webshot                  0.5.2    2019-11-22 [2] CRAN (R 3.6.1)
 whisker                  0.4      2019-08-28 [2] CRAN (R 3.6.1)
 withr                    2.1.2    2018-03-15 [2] CRAN (R 3.6.0)
 workflowr              * 1.6.0    2019-12-19 [2] CRAN (R 3.6.2)
 xfun                     0.12     2020-01-13 [2] CRAN (R 3.6.2)
 XML                      3.99-0.3 2020-01-20 [2] CRAN (R 3.6.2)
 xml2                     1.2.2    2019-08-09 [2] CRAN (R 3.6.1)
 xtable                   1.8-4    2019-04-21 [2] CRAN (R 3.6.0)
 XVector                  0.26.0   2019-10-29 [2] Bioconductor  
 yaml                     2.2.0    2018-07-25 [2] CRAN (R 3.6.0)
 zeallot                  0.1.0    2018-01-28 [2] CRAN (R 3.6.0)
 zlibbioc                 1.32.0   2019-10-29 [2] Bioconductor  
 zoo                      1.8-7    2020-01-10 [2] CRAN (R 3.6.2)

[1] /home/steveped/R/x86_64-pc-linux-gnu-library/3.6
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library