Last updated: 2020-01-25

Checks: 6 1

Knit directory: 20170327_Psen2S4Ter_RNASeq/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200119)

The command set.seed(20200119) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: f04bb47

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Unstaged changes:
    Modified:   analysis/2_DifferentialExpression.Rmd
    Modified:   analysis/2a_NoCQN.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
html	f04bb47	Steve Ped	2020-01-25	Minor typos
Rmd	7251bd2	Steve Ped	2020-01-25	Tweaks
Rmd	7fea156	Steve Ped	2020-01-25	Added rRNA checks & heatmaps
html	9bff516	Steve Ped	2020-01-24	Added analysis without CQN
Rmd	7b680b2	Steve Ped	2020-01-24	Added analysis without CQN
html	95ccc90	Steve Ped	2020-01-22	Compiled DE so far
Rmd	1858f49	Steve Ped	2020-01-22	Started presenting actual DE genes
html	1858f49	Steve Ped	2020-01-22	Started presenting actual DE genes
Rmd	10a285b	Steve Ped	2020-01-22	Expanded description in index and tried reusing an image
html	10a285b	Steve Ped	2020-01-22	Expanded description in index and tried reusing an image
html	aabb570	Steve Ped	2020-01-21	Rebuilt to get rid of warnings
Rmd	71b8832	Steve Ped	2020-01-21	Added DE QC for GC bias
html	71b8832	Steve Ped	2020-01-21	Added DE QC for GC bias
Rmd	e825637	Steve Ped	2020-01-21	Minor updates to DE plots
html	01512da	Steve Ped	2020-01-21	Added initial DE analysis to index
Rmd	fbb6242	Steve Ped	2020-01-21	Paused DE analysis
Rmd	c560637	Steve Ped	2020-01-20	Started DE analysis

Setup

library(ngsReports)
library(tidyverse)
library(magrittr)
library(edgeR)
library(AnnotationHub)
library(ensembldb)
library(scales)
library(pander)
library(cowplot)
library(cqn)
library(ggrepel)
# library(UpSetR)
library(pheatmap)
library(RColorBrewer)

if (interactive()) setwd(here::here())
theme_set(theme_bw())
panderOptions("big.mark", ",")
panderOptions("table.split.table", Inf)
panderOptions("table.style", "rmarkdown")

Annotations

ah <- AnnotationHub() %>%
    subset(species == "Danio rerio") %>%
    subset(rdataclass == "EnsDb")
ensDb <- ah[["AH74989"]]

grTrans <- transcripts(ensDb)
trLengths <- exonsBy(ensDb, "tx") %>%
    width() %>%
    vapply(sum, integer(1))
mcols(grTrans)$length <- trLengths[names(grTrans)]

gcGene <- grTrans %>%
    mcols() %>%
    as.data.frame() %>%
    dplyr::select(gene_id, tx_id, gc_content, length) %>%
    as_tibble() %>%
    group_by(gene_id) %>%
    summarise(
        gc_content = sum(gc_content*length) / sum(length),
        length = ceiling(median(length))
    )
grGenes <- genes(ensDb)
mcols(grGenes) %<>%
    as.data.frame() %>%
    left_join(gcGene) %>%
    as.data.frame() %>%
    DataFrame()

Similarly to the Quality Assessment steps, GRanges objects were formed at the gene and transcript levels, to enable estimation of GC content and length for each transcript and gene. GC content and transcript length are available for each transcript, and for gene-level estimates, GC content was taken as the sum of all GC bases divided by the sum of all transcript lengths, effectively averaging across all transcripts. Gene length was defined as the median transcript length.

samples <- read_csv("data/samples.csv") %>%
  distinct(sampleName, .keep_all = TRUE) %>%
  dplyr::select(sample = sampleName, sampleID, genotype) %>%
  mutate(
    genotype = factor(genotype, levels = c("WT", "Het", "Hom")),
    mutant = genotype %in% c("Het", "Hom"),
    homozygous = genotype == "Hom"
  )
genoCols <- samples$genotype %>%
  levels() %>%
  length() %>%
  brewer.pal("Set1") %>%
  setNames(levels(samples$genotype))

Sample metadata was also loaded, with only the sampleID and genotype being retained. All other fields were considered irrelevant.

Count Data

minCPM <- 1.5
minSamples <- 4
dgeList <- file.path("data", "2_alignedData", "featureCounts", "genes.out") %>%
    read_delim(delim = "\t") %>%
    set_names(basename(names(.))) %>%
    as.data.frame() %>%
    column_to_rownames("Geneid") %>%
    as.matrix() %>% 
    set_colnames(str_remove(colnames(.), "Aligned.sortedByCoord.out.bam")) %>%
    .[rowSums(cpm(.) >= minCPM) >= minCPM,] %>%
    DGEList(
        samples = tibble(sample = colnames(.)) %>%
            left_join(samples),
        genes = grGenes[rownames(.)] %>%
            as.data.frame() %>%
            dplyr::select(
                chromosome = seqnames, start, end, 
                gene_id, gene_name, gene_biotype, description, 
                entrezid, gc_content, length
            )
    ) %>%
    .[!grepl("rRNA", .$genes$gene_biotype),] %>%
    calcNormFactors()

Gene-level count data as output by featureCounts, was loaded and formed into a DGEList object. During this process, genes were removed if:

They were not considered as detectable (CPM < 1.5 in > 8 samples). This translates to > 18 reads assigned a gene in all samples from one or more of the genotype groups
The gene_biotype was any type of rRNA.

These filtering steps returned gene-level counts for 16,640 genes, with total library sizes between 11,852,141 and 16,997,219 reads assigned to genes. It was noted that these library sizes were about 1.5-fold larger than the transcript-level counts used for the QA steps.

cpm(dgeList, log = TRUE) %>%
  as.data.frame() %>%
  pivot_longer(
    cols = everything(),
    names_to = "sample",
    values_to = "logCPM"
  ) %>%
  split(f = .$sample) %>%
  lapply(function(x){
    d <- density(x$logCPM)
    tibble(
      sample = unique(x$sample),
      x = d$x,
      y = d$y
    )
  }) %>%
  bind_rows() %>%
  left_join(samples) %>%
  ggplot(aes(x, y, colour = genotype, group = sample)) +
  geom_line() +
  scale_colour_manual(
    values = genoCols
  ) +
  labs(
    x = "logCPM",
    y = "Density",
    colour = "Genotype"
  )

Expression density plots for all samples after filtering, showing logCPM values.

Version	Author	Date
7251bd2	Steve Ped	2020-01-25
1858f49	Steve Ped	2020-01-22
01512da	Steve Ped	2020-01-21

Additional Functions

contLabeller <- as_labeller(
    c(
        HetVsWT = "S4Ter/+ Vs +/+",
        HomVsWT = "S4Ter/S4Ter Vs +/+",
        HomVsHet = "S4Ter/S4Ter Vs S4Ter/+",
        Hom = "S4Ter/S4Ter",
        Het = "S4Ter/+",
        WT = "+/+",
        mutant = "Mutant Vs Wild-type",
        homozygous = "Difference between mutants"
    )
)
geneLabeller <- structure(grGenes$gene_name, names = grGenes$gene_id) %>%
    as_labeller()

Labeller functions for genotypes, contrasts and gene names were additionally defined for simpler plotting using ggplot2.

Analysis

PCA

pca <- dgeList %>%
    cpm(log = TRUE) %>%
    t() %>%
    prcomp() 
pcaVars <- percent_format(0.1)(summary(pca)$importance["Proportion of Variance",])

pca$x %>%
  as.data.frame() %>%
  rownames_to_column("sample") %>%
  left_join(samples) %>%
  as_tibble() %>%
  ggplot(aes(PC1, PC2, colour = genotype, fill = genotype)) +
  geom_point() +
  geom_text_repel(aes(label = sampleID), show.legend = FALSE) +
  stat_ellipse(geom = "polygon", alpha = 0.05, show.legend = FALSE) +
  guides(fill = FALSE) +
  scale_colour_manual(
    values = genoCols
  ) +
  labs(
    x = paste0("PC1 (", pcaVars[["PC1"]], ")"),
    y = paste0("PC2 (", pcaVars[["PC2"]], ")"),
    colour = "Genotype"
  )

PCA of gene-level counts.

Version	Author	Date
7251bd2	Steve Ped	2020-01-25
1858f49	Steve Ped	2020-01-22
01512da	Steve Ped	2020-01-21

A Principal Component Analysis (PCA) was also performed using logCPM values from each sample. Both mutant genotypes appear to cluster together, however it has previously been noted that GC content appears to track closely with PC1, as a result of variable rRNA removal.

Model Description

Version	Author	Date
10a285b	Steve Ped	2020-01-22

Given that there was a strong similarity between mutants, the model matrix was defined as containing an intercept, with additional columns defining presence of any mutant alleles, and the final column capturing the difference between mutants.

d <- model.matrix(~mutant + homozygous, data = dgeList$samples) %>%
  set_colnames(str_remove(colnames(.), "TRUE"))

pheatmap(
  d, 
  cluster_cols = FALSE, 
  cluster_rows = FALSE, 
  color = c("white", "grey50"),
  annotation_row = dgeList$samples["genotype"],
  annotation_colors = list(genotype = genoCols),
  legend = FALSE
)

Visualisation of the design matrix

Version	Author	Date
7251bd2	Steve Ped	2020-01-25

Normalisation

As GC content and length was noted as being of concern for this dataset, conditional-quantile normalisation was performed using the cqn package. This adds a gene and sample-level offset for each count which takes into account any systemic bias, such as that identified previously as an artefact of variable rRNA removal. The resultant glm.offset values were added to the original DGEList object, and all dispersion estimates were calculated.

gcCqn <- cqn(
    counts = dgeList$counts,
    x = dgeList$genes$gc_content,
    lengths = dgeList$genes$length,
    sizeFactors = dgeList$samples$lib.size
)

par(mfrow = c(1, 2))
cqnplot(gcCqn, n = 1, xlab = "GC Content", col = genoCols)
cqnplot(gcCqn, n = 2, xlab = "Length", col = genoCols)
legend("bottomright", legend = levels(samples$genotype), col = genoCols, lty = 1)

Model fits for GC content and gene length under the CQN model. Genotype-specific effects are clearly visible.

Version	Author	Date
7251bd2	Steve Ped	2020-01-25
01512da	Steve Ped	2020-01-21

par(mfrow = c(1, 1))

dgeList$offset <- gcCqn$glm.offset 
dgeList %<>% estimateDisp(design = d)

Model Fitting

minLfc <- log2(2)
alpha <- 0.01
fit <- glmFit(dgeList)
topTables <- colnames(d)[2:3] %>%
  sapply(function(x){
    glmLRT(fit, coef = x) %>%
      topTags(n = Inf) %>%
      .[["table"]] %>%
      as_tibble() %>%
      arrange(PValue) %>%
      dplyr::select(
        gene_id, gene_name, logFC, logCPM, PValue, FDR, everything()  
      ) %>%
      mutate(
        coef = x,
        bonfP = p.adjust(PValue, "bonf"),
        DE = case_when(
          bonfP < alpha ~ TRUE,
          FDR < alpha & abs(logFC) > minLfc ~ TRUE
        ),
        DE = ifelse(is.na(DE), FALSE, DE)
      )
  }, simplify = FALSE)

Models were fit using the negative-binomial approaches of glmFit(). Top Tables were then obtained using likelihood-ratio tests in glmLRT(). These test the standard \(H_0\) that the true value of the estimated model coefficient is zero. These model coefficients effectively estimate:

the effect of the presence of a mutation, and
the difference in heterozygous and heterozygous mutants

For enrichment testing, genes were initially considered to be DE using:

An Bonferroni-adjusted p-value < 0.01
An FDR-adjusted p-value < 0.01 along with an estimated logFC outside of the range \(\pm \log_2(2)\).

As fewer genes were detected in the comparisons between mutants, a simple FDR of 0.05 was subsequently chosen.

topTables$homozygous %<>%
  mutate(DE = FDR < 0.05)

Using these criteria, the following initial DE gene-sets were defined:

topTables %>%
  lapply(dplyr::filter, DE) %>% 
  vapply(nrow, integer(1)) %>%
  set_names(
    case_when(
      names(.) == "mutant" ~ "psen2 mutant",
      names(.) == "homozygous" ~ "HomVsHet"
    )
  ) %>%
  pander()

psen2 mutant	HomVsHet
615	7

deCols <- c(
  `FALSE` = rgb(0.5, 0.5, 0.5, 0.4), 
  `TRUE` = rgb(1, 0, 0, 0.7)
)

Model Checking

GC and Length Bias

topTables %>%
    bind_rows() %>%
    mutate(stat = -sign(logFC)*log10(PValue)) %>%
    ggplot(aes(gc_content, stat)) +
    geom_point(aes(colour = DE), alpha = 0.4) +
    geom_smooth(se = FALSE) +
    facet_wrap(~coef, labeller = contLabeller)  +
    labs(
        x = "GC content (%)",
        y = "Ranking Statistic"
    ) +
    coord_cartesian(ylim = c(-10, 10)) +
    scale_colour_manual(values = deCols) +
    theme(legend.position = "none")

Checks for GC bias in differential expression. GC content is shown against the ranking statistic, using -log10(p) multiplied by the sign of log fold-change. A small amount of bias was noted particularly in the comparison between homozygous mutants and wild-type.

Version	Author	Date
7251bd2	Steve Ped	2020-01-25
1858f49	Steve Ped	2020-01-22
71b8832	Steve Ped	2020-01-21

topTables %>%
    bind_rows() %>%
    mutate(stat = -sign(logFC)*log10(PValue)) %>%
    ggplot(aes(length, stat)) +
    geom_point(aes(colour = DE), alpha = 0.4) +
    geom_smooth(se = FALSE) +
    facet_wrap(~coef, labeller = contLabeller)  +
    labs(
        x = "Gene Length (bp)",
        y = "Ranking Statistic"
    ) +
    coord_cartesian(ylim = c(-10, 10)) +
    scale_x_log10(labels = comma) +
    scale_colour_manual(values = deCols) +
    theme(legend.position = "none")

Checks for length bias in differential expression. Gene length is shown against the ranking statistic, using -log10(p) multiplied by the sign of log fold-change. Again, a small amount of bias was noted particularly in the comparison between homozygous mutants and wild-type.

Version	Author	Date
7251bd2	Steve Ped	2020-01-25
1858f49	Steve Ped	2020-01-22
71b8832	Steve Ped	2020-01-21

Checks for both GC and length bias on differential expression showed that a small bias remained evident, despite using conditional-quantile normalisation. However, an alternative analysis on the same dataset excluding the CQN steps revealed vastly different and exaggerated bias. As such, the impact of CQN normalisation was considered to be appropriate.

Checks for rRNA impacts

rawFqc <- list.files(
  path = "data/0_rawData/FastQC/",
  pattern = "zip",
  full.names = TRUE
) %>%
  FastqcDataList()
rawGC <- getModule(rawFqc, "Per_sequence_GC") %>%
  group_by(Filename) %>% 
  mutate(Freq = Count / sum(Count)) %>%
  dplyr::filter(GC_Content > 70) %>% 
  summarise(Freq = sum(Freq)) %>% 
  arrange(desc(Freq)) %>%
  mutate(sample = str_remove(Filename, "_R[12].fastq.gz")) %>%
  group_by(sample) %>%
  summarise(Freq = mean(Freq)) %>%
  left_join(samples) 
riboVec <- structure(rawGC$Freq, names = rawGC$sample)

riboCors <- cpm(dgeList, log = TRUE)%>%
  apply(1, function(x){
    cor(x, riboVec[names(x)])
  })

Given the previously identified concerns about variable rRNA removal, correlations were calculated between each gene’s expression values and the proportion of the raw libraries with > 70% GC. Those with the strongest correlation, and which the FDR is < 0.01 are shown below.

topTables$mutant %>%
  mutate(riboCors = riboCors[gene_id]) %>%
  dplyr::filter(
    FDR < alpha
  ) %>%
  dplyr::select(gene_id, gene_name, logFC, logCPM, FDR, riboCors, DE) %>%
  arrange(desc(riboCors)) %>%
  mutate(FDR = case_when(
    FDR >= 0.0001 ~ sprintf("%.4f", FDR),
    FDR < 0.0001 ~ sprintf("%.2e", FDR)
    )
  ) %>%
  dplyr::slice(1:40) %>%
  pander(
    justify = "llrrrrl",
    style = "rmarkdown",
    caption = paste(
      "The", nrow(.), "genes most correlated with the original high GC content.",
      "Many failed the selection criteria for differential expression,",
      "primarily due to the stringent logFC filter.",
      "However the unusually high number of ribosomal protein coding genes",
      "is currently inexplicable, but notable."
    )
  )

The 40 genes most correlated with the original high GC content. Many failed the selection criteria for differential expression, primarily due to the stringent logFC filter. However the unusually high number of ribosomal protein coding genes is currently inexplicable, but notable.
gene_id	gene_name	logFC	logCPM	FDR	riboCors	DE
ENSDARG00000012688	eif1b	0.5229	7.482	0.0024	0.9906	FALSE
ENSDARG00000012871	npepl1	0.6049	4.485	0.0028	0.9823	FALSE
ENSDARG00000103994	ppiab	0.8891	7.372	6.18e-05	0.9814	FALSE
ENSDARG00000034897	rps10	0.8809	5.898	0.0001	0.9795	FALSE
ENSDARG00000068995	h2afx1	0.644	6.605	0.0025	0.9734	FALSE
ENSDARG00000077291	rps2	1.305	7.52	1.58e-07	0.9733	TRUE
ENSDARG00000002240	psmb6	0.6407	4.702	0.0056	0.9733	FALSE
ENSDARG00000036298	rps13	0.6762	6.076	8.26e-05	0.9707	FALSE
ENSDARG00000037071	rps26	0.651	5.323	0.0045	0.97	FALSE
ENSDARG00000057026	ran	0.6265	7.071	0.0002	0.9689	FALSE
ENSDARG00000111753	hist1h4l	0.9775	2.015	0.0061	0.9685	FALSE
ENSDARG00000038028	ndufa6	1.051	4.693	2.57e-05	0.9667	TRUE
ENSDARG00000046157	RPS17	0.7725	6.74	0.0002	0.9667	FALSE
ENSDARG00000078929	abhd16a	0.5358	4.906	0.0013	0.9664	FALSE
ENSDARG00000099226	CABZ01076667.1	0.5407	3.973	0.0008	0.9652	FALSE
ENSDARG00000014690	rps4x	0.6511	7.368	0.0003	0.9649	FALSE
ENSDARG00000092553	slc25a5	0.9012	8.952	2.64e-05	0.9648	TRUE
ENSDARG00000011665	aldoaa	0.6082	7.763	0.0029	0.9639	FALSE
ENSDARG00000045447	slc35g2b	0.766	4.873	0.0002	0.9627	FALSE
ENSDARG00000034534	atp6v1aa	0.6212	7.049	0.0003	0.9618	FALSE
ENSDARG00000099766	myl12.1	0.3841	6.468	0.0058	0.9595	FALSE
ENSDARG00000037962	psmb7	0.7979	4.104	3.30e-05	0.9592	FALSE
ENSDARG00000019230	rpl7a	0.9036	7.71	5.34e-05	0.9591	FALSE
ENSDARG00000099104	rpl24	0.662	6.729	0.0004	0.9571	FALSE
ENSDARG00000055475	rps27.2	0.6175	6.719	0.0019	0.9566	FALSE
ENSDARG00000026322	dhrs13a.1	0.7388	2.804	0.0032	0.9563	FALSE
ENSDARG00000025581	rpl10	1.096	6.381	0.0003	0.9562	TRUE
ENSDARG00000092807	si:dkey-151g10.6	0.8303	6.491	9.37e-07	0.9551	TRUE
ENSDARG00000053365	rpl31	1.215	6.638	8.99e-06	0.9549	TRUE
ENSDARG00000075445	psmb5	0.8405	5.59	5.33e-05	0.9544	FALSE
ENSDARG00000030237	pgrmc2	0.3734	6.21	0.0046	0.9538	FALSE
ENSDARG00000011405	rps9	0.823	6.209	0.0006	0.9537	FALSE
ENSDARG00000035808	clcn4	0.6887	5.678	0.0014	0.9528	FALSE
ENSDARG00000043561	psmc1b	0.6844	4.673	0.0004	0.9517	FALSE
ENSDARG00000017235	eif5a	0.7003	7.388	5.54e-06	0.9514	TRUE
ENSDARG00000034291	rpl37	1.085	5.662	5.03e-06	0.9513	TRUE
ENSDARG00000104173	tufm	0.7123	5.051	0.0013	0.9511	FALSE
ENSDARG00000058451	rpl6	0.7848	7.877	0.0004	0.9497	FALSE
ENSDARG00000014867	rpl8	0.9139	6.716	1.50e-06	0.9493	TRUE
ENSDARG00000089976	spcs3	0.7379	3.54	0.0034	0.949	FALSE

Other Artefacts

topTables %>%
  bind_rows() %>%
  arrange(DE) %>%
  ggplot(aes(logCPM, logFC)) +
  geom_point(aes(colour = DE)) +
  geom_text_repel(
    aes(label = gene_name, colour = DE),
    data = . %>% dplyr::filter(DE & abs(logFC) > 2.9)
  ) +
  geom_text_repel(
    aes(label = gene_name, colour = DE),
    data = . %>% dplyr::filter(FDR < 0.05 & coef == "homozygous")
  ) +
  geom_smooth(se = FALSE) +
  geom_hline(
    yintercept = c(-1, 1)*minLfc,
    linetype = 2,
    colour = "red"
  ) +
  facet_wrap(~coef, nrow = 2, labeller = contLabeller) +
  scale_y_continuous(breaks = seq(-8, 8, by = 2)) +
  scale_colour_manual(values = deCols) +
  theme(legend.position = "none")

MA plots checking for any logFC bias across the range of expression values. The small curve in the average at the low end of expression values was considered to be an artefact of the sparse points at this end. Initial DE genes are shown in red, with select points labelled.

Version	Author	Date
7251bd2	Steve Ped	2020-01-25
1858f49	Steve Ped	2020-01-22
71b8832	Steve Ped	2020-01-21
01512da	Steve Ped	2020-01-21

topTables %>%
  bind_rows() %>%
  ggplot(aes(PValue)) +
  geom_histogram(
    binwidth = 0.02,
    colour = "black", fill = "grey90"
    ) +
  facet_wrap(~coef, labeller = contLabeller)

Histograms of p-values for both sets of coefficients. Values for the difference between mutants follow the expected distribution for when there are very few differences, whilst values for the presence of a mutant allele show the expected distribution for when there are many differences. In particular the spike near zero indicates many genes which are differentially expressed.

Version	Author	Date
7251bd2	Steve Ped	2020-01-25

Results

topTables %>%
    bind_rows() %>%
    ggplot(aes(logFC, -log10(PValue), colour = DE)) +
    geom_point(alpha = 0.4) +
    geom_text_repel(
        aes(label = gene_name),
        data = . %>% dplyr::filter(PValue < 1e-12 | abs(logFC) > 4)
    ) +
    geom_text_repel(
        aes(label = gene_name),
        data = . %>% dplyr::filter(FDR < 0.05 & coef == "homozygous")
    ) +
  geom_vline(
    xintercept = c(-1, 1)*minLfc,
    linetype = 2,
    colour = "red"
  ) +
  facet_wrap(~coef, nrow = 1, labeller = contLabeller) +
  scale_colour_manual(values = deCols) +
  scale_x_continuous(breaks = seq(-8, 8, by = 2)) +
  theme(legend.position = "none")

Volcano Plots showing DE genes against logFC.

Version	Author	Date
7251bd2	Steve Ped	2020-01-25
1858f49	Steve Ped	2020-01-22
71b8832	Steve Ped	2020-01-21
01512da	Steve Ped	2020-01-21

deGenes <- topTables %>%
  lapply(dplyr::filter, DE) %>%
  lapply(magrittr::extract2, "gene_id")

Genes Tracking with psen2^S4Ter

n <- 40

A set of 615 genes was identified as DE in the presence of the mutant psen2^S4Ter transcript. The most highly ranked 40 genes are shown in the following heatmap.

genoCols <- RColorBrewer::brewer.pal(3, "Set1") %>%
  setNames(levels(dgeList$samples$genotype))
dgeList %>%
  cpm(log = TRUE) %>% 
  extract(deGenes$mutant[seq_len(n)],) %>%
  as.data.frame() %>%
  set_rownames(unlist(geneLabeller(rownames(.)))) %>%
  pheatmap::pheatmap(
    color = viridis_pal(option = "magma")(100),
    labels_col = colnames(.) %>%
      str_replace_all(".+(Het|Hom|WT).+F3(_[0-9]{2}).+", "\\1\\2"),
    legend_breaks = c(seq(-2, 8, by = 2), max(.)),
    legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
    annotation_col = dgeList$samples %>%
      dplyr::select(Genotype = genotype),
    annotation_names_col = FALSE,
    annotation_colors = list(Genotype = genoCols),
    cutree_cols = 2,
    cutree_rows = 6
  )

The 40 most highly-ranked genes by FDR which are commonly considered DE between mutants and WT samples

Version	Author	Date
7251bd2	Steve Ped	2020-01-25
1858f49	Steve Ped	2020-01-22

Genes showing differences between mutants

When inspecting the genes showing differences between mutants, it was noted that the WT and Homozygous mutant samples clustered together, implying that there was a unique effect on these genes which was specific to the heterozygous condition.

dgeList %>%
  cpm(log = TRUE) %>% 
  extract(deGenes$homozygous,) %>%
  as.data.frame() %>%
  set_rownames(unlist(geneLabeller(rownames(.)))) %>%
  pheatmap::pheatmap(
    color = viridis_pal(option = "magma")(100),
    labels_col = colnames(.) %>%
      str_replace_all(".+(Het|Hom|WT).+F3(_[0-9]{2}).+", "\\1\\2"),
    legend_breaks = c(seq(-2, 8, by = 2), max(.)),
    legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
    annotation_col = dgeList$samples %>%
      dplyr::select(Genotype = genotype),
    annotation_names_col = FALSE,
    annotation_colors = list(Genotype = genoCols),
    cutree_cols = 2
  )

The 7 most highly-ranked genes by FDR which are DE between between mutants

Version	Author	Date
7251bd2	Steve Ped	2020-01-25

devtools::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.2 (2019-12-12)
 os       Ubuntu 18.04.3 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language en_AU:en                    
 collate  en_AU.UTF-8                 
 ctype    en_AU.UTF-8                 
 tz       Australia/Adelaide          
 date     2020-01-25                  

─ Packages ───────────────────────────────────────────────────────────────────
 package                * version  date       lib source        
 AnnotationDbi          * 1.48.0   2019-10-29 [2] Bioconductor  
 AnnotationFilter       * 1.10.0   2019-10-29 [2] Bioconductor  
 AnnotationHub          * 2.18.0   2019-10-29 [2] Bioconductor  
 askpass                  1.1      2019-01-13 [2] CRAN (R 3.6.0)
 assertthat               0.2.1    2019-03-21 [2] CRAN (R 3.6.0)
 backports                1.1.5    2019-10-02 [2] CRAN (R 3.6.1)
 Biobase                * 2.46.0   2019-10-29 [2] Bioconductor  
 BiocFileCache          * 1.10.2   2019-11-08 [2] Bioconductor  
 BiocGenerics           * 0.32.0   2019-10-29 [2] Bioconductor  
 BiocManager              1.30.10  2019-11-16 [2] CRAN (R 3.6.1)
 BiocParallel             1.20.1   2019-12-21 [2] Bioconductor  
 BiocVersion              3.10.1   2019-06-06 [2] Bioconductor  
 biomaRt                  2.42.0   2019-10-29 [2] Bioconductor  
 Biostrings               2.54.0   2019-10-29 [2] Bioconductor  
 bit                      1.1-15.1 2020-01-14 [2] CRAN (R 3.6.2)
 bit64                    0.9-7    2017-05-08 [2] CRAN (R 3.6.0)
 bitops                   1.0-6    2013-08-17 [2] CRAN (R 3.6.0)
 blob                     1.2.1    2020-01-20 [2] CRAN (R 3.6.2)
 broom                    0.5.3    2019-12-14 [2] CRAN (R 3.6.2)
 callr                    3.4.0    2019-12-09 [2] CRAN (R 3.6.2)
 cellranger               1.1.0    2016-07-27 [2] CRAN (R 3.6.0)
 cli                      2.0.1    2020-01-08 [2] CRAN (R 3.6.2)
 cluster                  2.1.0    2019-06-19 [2] CRAN (R 3.6.1)
 colorspace               1.4-1    2019-03-18 [2] CRAN (R 3.6.0)
 cowplot                * 1.0.0    2019-07-11 [2] CRAN (R 3.6.1)
 cqn                    * 1.32.0   2019-10-29 [2] Bioconductor  
 crayon                   1.3.4    2017-09-16 [2] CRAN (R 3.6.0)
 curl                     4.3      2019-12-02 [2] CRAN (R 3.6.2)
 data.table               1.12.8   2019-12-09 [2] CRAN (R 3.6.2)
 DBI                      1.1.0    2019-12-15 [2] CRAN (R 3.6.2)
 dbplyr                 * 1.4.2    2019-06-17 [2] CRAN (R 3.6.0)
 DelayedArray             0.12.2   2020-01-06 [2] Bioconductor  
 desc                     1.2.0    2018-05-01 [2] CRAN (R 3.6.0)
 devtools                 2.2.1    2019-09-24 [2] CRAN (R 3.6.1)
 digest                   0.6.23   2019-11-23 [2] CRAN (R 3.6.1)
 dplyr                  * 0.8.3    2019-07-04 [2] CRAN (R 3.6.1)
 edgeR                  * 3.28.0   2019-10-29 [2] Bioconductor  
 ellipsis                 0.3.0    2019-09-20 [2] CRAN (R 3.6.1)
 ensembldb              * 2.10.2   2019-11-20 [2] Bioconductor  
 evaluate                 0.14     2019-05-28 [2] CRAN (R 3.6.0)
 FactoMineR               2.1      2020-01-17 [2] CRAN (R 3.6.2)
 fansi                    0.4.1    2020-01-08 [2] CRAN (R 3.6.2)
 farver                   2.0.3    2020-01-16 [2] CRAN (R 3.6.2)
 fastmap                  1.0.1    2019-10-08 [2] CRAN (R 3.6.1)
 flashClust               1.01-2   2012-08-21 [2] CRAN (R 3.6.1)
 forcats                * 0.4.0    2019-02-17 [2] CRAN (R 3.6.0)
 fs                       1.3.1    2019-05-06 [2] CRAN (R 3.6.0)
 generics                 0.0.2    2018-11-29 [2] CRAN (R 3.6.0)
 GenomeInfoDb           * 1.22.0   2019-10-29 [2] Bioconductor  
 GenomeInfoDbData         1.2.2    2019-11-21 [2] Bioconductor  
 GenomicAlignments        1.22.1   2019-11-12 [2] Bioconductor  
 GenomicFeatures        * 1.38.1   2020-01-22 [2] Bioconductor  
 GenomicRanges          * 1.38.0   2019-10-29 [2] Bioconductor  
 ggdendro                 0.1-20   2016-04-27 [2] CRAN (R 3.6.0)
 ggplot2                * 3.2.1    2019-08-10 [2] CRAN (R 3.6.1)
 ggrepel                * 0.8.1    2019-05-07 [2] CRAN (R 3.6.0)
 git2r                    0.26.1   2019-06-29 [2] CRAN (R 3.6.1)
 glue                     1.3.1    2019-03-12 [2] CRAN (R 3.6.0)
 gtable                   0.3.0    2019-03-25 [2] CRAN (R 3.6.0)
 haven                    2.2.0    2019-11-08 [2] CRAN (R 3.6.1)
 highr                    0.8      2019-03-20 [2] CRAN (R 3.6.0)
 hms                      0.5.3    2020-01-08 [2] CRAN (R 3.6.2)
 htmltools                0.4.0    2019-10-04 [2] CRAN (R 3.6.1)
 htmlwidgets              1.5.1    2019-10-08 [2] CRAN (R 3.6.1)
 httpuv                   1.5.2    2019-09-11 [2] CRAN (R 3.6.1)
 httr                     1.4.1    2019-08-05 [2] CRAN (R 3.6.1)
 hwriter                  1.3.2    2014-09-10 [2] CRAN (R 3.6.0)
 interactiveDisplayBase   1.24.0   2019-10-29 [2] Bioconductor  
 IRanges                * 2.20.2   2020-01-13 [2] Bioconductor  
 jpeg                     0.1-8.1  2019-10-24 [2] CRAN (R 3.6.1)
 jsonlite                 1.6      2018-12-07 [2] CRAN (R 3.6.0)
 kableExtra               1.1.0    2019-03-16 [2] CRAN (R 3.6.1)
 knitr                    1.27     2020-01-16 [2] CRAN (R 3.6.2)
 labeling                 0.3      2014-08-23 [2] CRAN (R 3.6.0)
 later                    1.0.0    2019-10-04 [2] CRAN (R 3.6.1)
 lattice                  0.20-38  2018-11-04 [4] CRAN (R 3.6.0)
 latticeExtra             0.6-29   2019-12-19 [2] CRAN (R 3.6.2)
 lazyeval                 0.2.2    2019-03-15 [2] CRAN (R 3.6.0)
 leaps                    3.1      2020-01-16 [2] CRAN (R 3.6.2)
 lifecycle                0.1.0    2019-08-01 [2] CRAN (R 3.6.1)
 limma                  * 3.42.0   2019-10-29 [2] Bioconductor  
 locfit                   1.5-9.1  2013-04-20 [2] CRAN (R 3.6.0)
 lubridate                1.7.4    2018-04-11 [2] CRAN (R 3.6.0)
 magrittr               * 1.5      2014-11-22 [2] CRAN (R 3.6.0)
 MASS                     7.3-51.5 2019-12-20 [4] CRAN (R 3.6.2)
 Matrix                   1.2-18   2019-11-27 [2] CRAN (R 3.6.1)
 MatrixModels             0.4-1    2015-08-22 [2] CRAN (R 3.6.1)
 matrixStats              0.55.0   2019-09-07 [2] CRAN (R 3.6.1)
 mclust                 * 5.4.5    2019-07-08 [2] CRAN (R 3.6.1)
 memoise                  1.1.0    2017-04-21 [2] CRAN (R 3.6.0)
 mgcv                     1.8-31   2019-11-09 [4] CRAN (R 3.6.1)
 mime                     0.8      2019-12-19 [2] CRAN (R 3.6.2)
 modelr                   0.1.5    2019-08-08 [2] CRAN (R 3.6.1)
 munsell                  0.5.0    2018-06-12 [2] CRAN (R 3.6.0)
 ngsReports             * 1.1.2    2019-10-16 [1] Bioconductor  
 nlme                     3.1-143  2019-12-10 [4] CRAN (R 3.6.2)
 nor1mix                * 1.3-0    2019-06-13 [2] CRAN (R 3.6.1)
 openssl                  1.4.1    2019-07-18 [2] CRAN (R 3.6.1)
 pander                 * 0.6.3    2018-11-06 [2] CRAN (R 3.6.0)
 pheatmap               * 1.0.12   2019-01-04 [2] CRAN (R 3.6.0)
 pillar                   1.4.3    2019-12-20 [2] CRAN (R 3.6.2)
 pkgbuild                 1.0.6    2019-10-09 [2] CRAN (R 3.6.1)
 pkgconfig                2.0.3    2019-09-22 [2] CRAN (R 3.6.1)
 pkgload                  1.0.2    2018-10-29 [2] CRAN (R 3.6.0)
 plotly                   4.9.1    2019-11-07 [2] CRAN (R 3.6.1)
 plyr                     1.8.5    2019-12-10 [2] CRAN (R 3.6.2)
 png                      0.1-7    2013-12-03 [2] CRAN (R 3.6.0)
 preprocessCore         * 1.48.0   2019-10-29 [2] Bioconductor  
 prettyunits              1.1.1    2020-01-24 [2] CRAN (R 3.6.2)
 processx                 3.4.1    2019-07-18 [2] CRAN (R 3.6.1)
 progress                 1.2.2    2019-05-16 [2] CRAN (R 3.6.0)
 promises                 1.1.0    2019-10-04 [2] CRAN (R 3.6.1)
 ProtGenerics             1.18.0   2019-10-29 [2] Bioconductor  
 ps                       1.3.0    2018-12-21 [2] CRAN (R 3.6.0)
 purrr                  * 0.3.3    2019-10-18 [2] CRAN (R 3.6.1)
 quantreg               * 5.54     2019-12-13 [2] CRAN (R 3.6.2)
 R6                       2.4.1    2019-11-12 [2] CRAN (R 3.6.1)
 rappdirs                 0.3.1    2016-03-28 [2] CRAN (R 3.6.0)
 RColorBrewer           * 1.1-2    2014-12-07 [2] CRAN (R 3.6.0)
 Rcpp                     1.0.3    2019-11-08 [2] CRAN (R 3.6.1)
 RCurl                    1.98-1.1 2020-01-19 [2] CRAN (R 3.6.2)
 readr                  * 1.3.1    2018-12-21 [2] CRAN (R 3.6.0)
 readxl                   1.3.1    2019-03-13 [2] CRAN (R 3.6.0)
 remotes                  2.1.0    2019-06-24 [2] CRAN (R 3.6.0)
 reprex                   0.3.0    2019-05-16 [2] CRAN (R 3.6.0)
 reshape2                 1.4.3    2017-12-11 [2] CRAN (R 3.6.0)
 rlang                    0.4.2    2019-11-23 [2] CRAN (R 3.6.1)
 rmarkdown                2.1      2020-01-20 [2] CRAN (R 3.6.2)
 rprojroot                1.3-2    2018-01-03 [2] CRAN (R 3.6.0)
 Rsamtools                2.2.1    2019-11-06 [2] Bioconductor  
 RSQLite                  2.2.0    2020-01-07 [2] CRAN (R 3.6.2)
 rstudioapi               0.10     2019-03-19 [2] CRAN (R 3.6.0)
 rtracklayer              1.46.0   2019-10-29 [2] Bioconductor  
 rvest                    0.3.5    2019-11-08 [2] CRAN (R 3.6.1)
 S4Vectors              * 0.24.3   2020-01-18 [2] Bioconductor  
 scales                 * 1.1.0    2019-11-18 [2] CRAN (R 3.6.1)
 scatterplot3d            0.3-41   2018-03-14 [2] CRAN (R 3.6.1)
 sessioninfo              1.1.1    2018-11-05 [2] CRAN (R 3.6.0)
 shiny                    1.4.0    2019-10-10 [2] CRAN (R 3.6.1)
 ShortRead                1.44.1   2019-12-19 [2] Bioconductor  
 SparseM                * 1.78     2019-12-13 [2] CRAN (R 3.6.2)
 stringi                  1.4.5    2020-01-11 [2] CRAN (R 3.6.2)
 stringr                * 1.4.0    2019-02-10 [2] CRAN (R 3.6.0)
 SummarizedExperiment     1.16.1   2019-12-19 [2] Bioconductor  
 testthat                 2.3.1    2019-12-01 [2] CRAN (R 3.6.1)
 tibble                 * 2.1.3    2019-06-06 [2] CRAN (R 3.6.0)
 tidyr                  * 1.0.0    2019-09-11 [2] CRAN (R 3.6.1)
 tidyselect               0.2.5    2018-10-11 [2] CRAN (R 3.6.0)
 tidyverse              * 1.3.0    2019-11-21 [2] CRAN (R 3.6.1)
 truncnorm                1.0-8    2018-02-27 [2] CRAN (R 3.6.0)
 usethis                  1.5.1    2019-07-04 [2] CRAN (R 3.6.1)
 vctrs                    0.2.1    2019-12-17 [2] CRAN (R 3.6.2)
 viridisLite              0.3.0    2018-02-01 [2] CRAN (R 3.6.0)
 webshot                  0.5.2    2019-11-22 [2] CRAN (R 3.6.1)
 whisker                  0.4      2019-08-28 [2] CRAN (R 3.6.1)
 withr                    2.1.2    2018-03-15 [2] CRAN (R 3.6.0)
 workflowr              * 1.6.0    2019-12-19 [2] CRAN (R 3.6.2)
 xfun                     0.12     2020-01-13 [2] CRAN (R 3.6.2)
 XML                      3.99-0.3 2020-01-20 [2] CRAN (R 3.6.2)
 xml2                     1.2.2    2019-08-09 [2] CRAN (R 3.6.1)
 xtable                   1.8-4    2019-04-21 [2] CRAN (R 3.6.0)
 XVector                  0.26.0   2019-10-29 [2] Bioconductor  
 yaml                     2.2.0    2018-07-25 [2] CRAN (R 3.6.0)
 zeallot                  0.1.0    2018-01-28 [2] CRAN (R 3.6.0)
 zlibbioc                 1.32.0   2019-10-29 [2] Bioconductor  
 zoo                      1.8-7    2020-01-10 [2] CRAN (R 3.6.2)

[1] /home/steveped/R/x86_64-pc-linux-gnu-library/3.6
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library

Differential Expression

Steve Pederson

25 January, 2020

Setup

Annotations

Count Data

Additional Functions

Analysis

PCA

Model Description

Normalisation

Model Fitting

Model Checking

GC and Length Bias

Checks for rRNA impacts

Other Artefacts

Results

Genes Tracking with psen2^S4Ter

Genes showing differences between mutants

Differential Expression

Steve Pederson

25 January, 2020

Setup

Annotations

Count Data

Additional Functions

Analysis

PCA

Model Description

Normalisation

Model Fitting

Model Checking

GC and Length Bias

Checks for rRNA impacts

Other Artefacts

Results

Genes Tracking with psen2S4Ter

Genes showing differences between mutants

Genes Tracking with psen2^S4Ter