Last updated: 2022-07-07

Checks: 7 0

Knit directory: 20180328_Atkins_RatFracture/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20220705)

The command set.seed(20220705) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: defc17e

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version defc17e. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  data/gmt/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/dge.Rmd) and HTML (docs/dge.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	defc17e	Steve Pederson	2022-07-07	Finished primary analysis
Rmd	c4a6c6c	Steve Pederson	2022-07-06	Reanalysed using voom
Rmd	dd28879	Steve Pederson	2022-07-06	Setup initial DGE after restructure

library(tidyverse)
library(scales)
library(pander)
library(glue)
library(edgeR)
library(AnnotationHub)
library(ensembldb)
library(GenomicRanges)
library(magrittr)
library(cowplot)
library(matrixStats)
library(broom)
library(ggrepel)
library(statmod)
library(msigdbr)
library(fgsea)
library(reactable)
library(htmltools)

panderOptions("table.split.table", Inf)
panderOptions("big.mark", ",")
theme_set(theme_bw())
suffix <- "_L001"
pattern <- paste0("_CB2YGANXX_.+fastq.gz")
sp <- "Rnorvegicus"
with_tooltip <- function(value, width = 30) {
  tags$span(title = value, str_trunc(value, width))
}

samples <- "data/targets.csv" %>%
  here::here() %>%
  read_csv() %>% 
    mutate(
      Filename = paste0(File, suffix)
    )
dge <- read_rds(here::here("output/dge.rds"))

group_cols <- hcl.colors(
  n = length(unique(samples$group)), 
  palette = "Zissou 1"
  ) %>%
  setNames(unique(samples$group))

Setup

ah <- AnnotationHub() %>%
  subset(rdataclass == "EnsDb") %>%
  subset(species == "Rattus norvegicus") %>% 
  subset(str_detect(description, "96"))
ensDb <- ah[[1]]
genesGR <- read_rds(here::here("output/genesGR.rds"))
transGR <- transcripts(ensDb) %>% 
  subset(gene_id %in% names(genesGR))

Gene annotations were again loaded from Ensembl Release 96.

Count Data

Prior to filtering for undetectable genes, counts were loaded as a DGEList, incorporating both sample and gene metadata.

Data Processing and Analysis

min_cpm <- 2
genes2Keep <- cpm(dge) %>% 
  is_greater_than(min_cpm) %>% 
  rowSums() %>% 
  is_weakly_greater_than(3)

X <- model.matrix(~1, data = dge$samples)
voomData <- voomWithQualityWeights(dge[genes2Keep,,keep.lib.sizes = FALSE], design = X) 
X <- model.matrix(~group, data = voomData$targets)
results <- voomData %>%
    lmFit(design = X) %>%
  eBayes(robust = TRUE) %>% 
  topTable(n = Inf, coef = "groupDiabetic") %>% 
    dplyr::select(
      gene_id, gene_name,
      logFC, AveExpr, 
      t, P.Value, FDR = adj.P.Val
    ) %>% 
    arrange(P.Value) %>%
    as_tibble() %>% 
  mutate(DE = FDR < 0.05)

Taking the initial set of 23,706 genes, low expression genes were removed, retaining only the 13,160 genes where 2 or more reads per million (i.e. CPM) were detected in 3 or more samples.

Counts were then normalised using voom precision weights, allowing for individual sample-weights. For a conservative approach, sample-weights were estimated by considering each sample to be drawn from the same treatment group. Tests for differential expression were then performed voom precision weights, with genes being considered as Differentially Expressed (DE) if receiving an FDR-adjusted p-value < 0.05. The standard eBayes() methodology was used, setting robust = TRUE to protect against highly variable genes.

Density Plots

list(
  All = cpm(dge, log = TRUE) %>% 
    as.data.frame() %>% 
    mutate(which = "All Genes"),
  Detected = voomData$E %>% 
    as.data.frame() %>% 
    mutate(which = "Detected (Retained) Genes")
) %>% 
  bind_rows() %>% 
  pivot_longer(
    cols = all_of(samples$Rat), names_to = "Rat", values_to = "logCPM"
  )%>% 
  left_join(samples, by = "Rat") %>% 
  ggplot(
    aes(
      logCPM, colour = group, group = Rat
    )        
  ) +
  geom_density() +
  facet_wrap(~which) +
  geom_text(
    aes(x, y, label = lab),
    data = . %>% 
      group_by(which) %>% 
      summarise(
        x = 0.90*max(logCPM),
        y = 0.95*max(density(logCPM)$y),
        lab = glue("n = {comma(n() / ncol(dge))}"),
        .groups = "drop"
      ) %>% 
      mutate(y = max(y)),
    inherit.aes = FALSE
  ) +
  geom_vline(xintercept = log2(min_cpm), linetype = 2) +
  scale_colour_manual(values = group_cols) +
  scale_y_continuous(expand = expansion(c(0, 0.05))) +
  labs(
    y = "Density", colour = "Group"
  )

Distributions of logCPM values for all non-zero genes, and those retained as detected using the above CPM threshold. The number of genes in each category is also shown.

RLE

voomData$E %>%
  as.data.frame() %>% 
  pivot_longer(
    cols = everything(), names_to = "Rat", values_to = "logCPM"
  ) %>% 
  left_join(samples) %>% 
  group_by(Rat) %>% 
  mutate(RLE = logCPM - median(logCPM))%>% 
  ggplot(aes(Rat, RLE, fill = group)) +
  geom_boxplot(alpha = 0.9) +
  geom_hline(yintercept = 0, linetype = 2) +
  facet_wrap(~group, scales = "free_x") +
  scale_fill_manual(values = group_cols) +
  labs(
    x = "Sample", fill = "Group"
  )

Relative Log Expression. Any deviations from zero indicate potential batch effects, with none being evident.

PCA

pcaPost <- voomData$E %>%
  .[rowVars(.) > 0,] %>% 
  t() %>%
  prcomp() 
pcaPost %>%
  tidy() %>% 
  dplyr::rename(Rat = row) %>% 
  left_join(voomData$targets, by = "Rat") %>% 
  dplyr::filter(PC %in% 1:2) %>% 
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>% 
  ggplot(
    aes(PC1, PC2, colour = group, size = lib.size/1e6)
  ) +
  geom_point() +
  geom_text_repel(aes(label = Rat), show.legend = FALSE) +
  scale_colour_manual(values = group_cols) +
  scale_size_continuous(limits = c(5, 15), breaks = seq(5, 15, by = 5)) +
  labs(
    x = glue("PC1 ({percent(pcaPost$sdev[[1]]^2 / sum(pcaPost$sdev^2), 0.1)})"),
    y = glue("PC2 ({percent(pcaPost$sdev[[2]]^2 / sum(pcaPost$sdev^2), 0.1)})"),
    colour = "Group",
    size = "Library Size\n(millions)"
  )

PCA on logCPM values. The clear impact of the smaller libraries is seen with library size being the lragest source of variability. This supports the previous caution that the libraries below 10m reads are not as representative of the true RNA content of the source material. The small final library size after assigning reads to genes also highlights the poor performance of total RNA in this context, given that all raw sequencing libraries were >35 million reads.

Sample Weights

voomData$targets %>% 
  ggplot(aes(Rat, sample.weights, fill = group)) +
  geom_col() +
  facet_wrap(~group, scales = "free_x") +
  scale_fill_manual(values = group_cols)

Sample weights after fitting the voom model.

Results

Or the genes retained as confidently detected, 7 were formally considered as DE, using an FDR of 0.05. These were Pnpla2, LOC100909761, Rcor2, Cxcl1, Mmp3, Mmp10 and Il6.

MA Plot

results %>%
  ggplot(aes(AveExpr, logFC)) +
  geom_point(aes(colour = DE),alpha = 0.5) +
  geom_text_repel(
    aes(label = gene_name, colour = DE),
    data = . %>%
      arrange(desc(abs(logFC))) %>% 
      dplyr::filter(FDR < 0.1),
    show.legend = FALSE
  ) +
  geom_smooth(se = FALSE) +
  scale_colour_manual(values = c("grey50", "red")) +
  theme(legend.position = "none")

MA-plot showing all genes with an FDR < 0.1. Those passing the FDR threshold of 0.05 are highlighted in red.

Volcano Plot

results %>%
  ggplot(aes(logFC, -log10(P.Value))) +
  geom_point(aes(colour = DE),alpha = 0.5) +
  geom_text_repel(
    aes(label = gene_name, colour = DE),
    data = . %>%
      arrange(desc(abs(logFC))) %>% 
      dplyr::filter(FDR < 0.1),
    show.legend = FALSE
  ) +
  scale_colour_manual(values = c("grey50", "red")) +
  theme(legend.position = "none") +
  labs(y = expression(paste(-log[10], "p")))

Volcano plot with genes labelled to an FDR of 0.1. DE genes to an FDR threshold of 0.05 are highlighted in red.

Highly Ranked Genes

results %>% 
  dplyr::slice(1:12) %>% 
  mutate(
    gene_name = case_when(
      DE ~ paste0(gene_name, "*"),
      TRUE ~ gene_name
    )
  ) %>% 
  dplyr::select(gene_id, gene_name) %>% 
  bind_cols(
    voomData[.$gene_id,]$E
  ) %>% 
  pivot_longer(
    cols = all_of(samples$Rat),
    names_to = "Rat", values_to = "logCPM"
  ) %>% 
  left_join(
    dplyr::select(voomData$targets, Rat, group, sample.weights)
  ) %>% 
  mutate(gene_name = fct_inorder(gene_name)) %>% 
  ggplot(
    aes(group, logCPM, fill = group)
  ) +
  geom_boxplot() +
  geom_hline(yintercept = log2(min_cpm), linetype = 2, colour = "grey30") +
  facet_wrap(~gene_name) +
  labs(x = "Group", fill = "Group")

The most highly ranked genes for differential expression. Those formally passing the criteria for differential expression (FDR < 0.05) are marked with an asterisk.

Enrichment Analysis

rankedGenes <- results %>% 
  arrange(t) %>% 
  with(
    structure(t, names = gene_id)
  )
msigdb <- msigdbr(species = "Rattus norvegicus") %>% 
  dplyr::filter(
    gs_cat %in% c("H", "C5") | 
      gs_subcat %in% c("CP:KEGG", "CP:REACTOME", "CP:WIKIPATHWAYS", "IMMUNESIGDB")
  ) %>% 
  dplyr::filter(ensembl_gene %in% names(rankedGenes)) 
gsByPathway <- msigdb %>% 
  split(.$gs_name) %>% 
  lapply(pull, "ensembl_gene") %>% 
  .[vapply(., length, integer(1)) > 5]
id2Name <- structure(
  genesGR$gene_name,
  names = genesGR$gene_id
) %>% 
  .[!duplicated(names(.))]

Given the low number of differentially expressed genes, GSEA was used as implemented in the R package fgsea, with the ranked list of genes generated using the t-statistics as obtained above.

Enrichment analysis was performed on gene-sets obtained from MSigDB version 7.5. Gene sets were selected from HALLMARK, KEGG, REACTOME, WIkiPathways, ImmuneSigDB and the Gene Ontology Database. Only the 19,499 gene-sets with more than 5 genes detected in the dataset were retained. P-values obtained from GSEA were adjusted using Bonferroni’s method to ensure strong control of the Family-Wise Error Rate (FWER).

gseaResults <- fgsea(gsByPathway, rankedGenes) %>% 
  arrange(pval) %>% 
  mutate(padj = p.adjust(pval, "bonf")) %>% 
  dplyr::select(gs_name = pathway, pval, padj, NES, size, leadingEdge) %>% 
  left_join(
    distinct(msigdb, gs_name, gs_cat, gs_subcat)
  )

HALLMARK Gene Sets

Results

df <- gseaResults %>% 
  dplyr::filter(gs_cat == "H", padj < 0.05) %>% 
  mutate(
    leSize = vapply(leadingEdge, length, integer(1)),
   leadingEdge = vapply(
     leadingEdge,
     function(x) paste(id2Name[x], collapse = "; "),
     character(1)
   ) 
  ) %>% 
  dplyr::select(gs_name, pval, padj, NES, ends_with("size"), leadingEdge) 
htmltools::tags$caption(
  htmltools::em(
    glue(
      "
      All {nrow(df)} HALLMARK gene sets considered to be enriched in the ranked 
      list of genes. 
      All p-values are Bonferroni-adjusted. A Normalised Enrichment Score 
      (NES) > 0 indicates that the geneset was enriched amongst 
      up-regulated genes, whilst a negative NES indicates enrichment in the 
      down-regulated genes. Genes in the Leading Edge represent those which
      appear in the ranked list up until the point of the most extreme NES.
      "
    )
  )
)

All 15 HALLMARK gene sets considered to be enriched in the ranked list of genes. All p-values are Bonferroni-adjusted. A Normalised Enrichment Score (NES) > 0 indicates that the geneset was enriched amongst up-regulated genes, whilst a negative NES indicates enrichment in the down-regulated genes. Genes in the Leading Edge represent those which appear in the ranked list up until the point of the most extreme NES.

df %>% 
  reactable(
    searchable = TRUE, filterable = TRUE,
    columns = list(
      gs_name = colDef(
        name = "Gene Set",
        cell = function(value) str_replace_all(value, "_", " ")
      ),
      pval = colDef(show = FALSE),
      padj = colDef(
        name = "p<sub>adj</sub>", html = TRUE,
        cell = function(value) {
          fmt <- "%.2e"
          if (value > 0.001) fmt <- "%.3f"
          sprintf(fmt, value)
        },
        maxWidth = 100
      ),
      NES = colDef(
        format = colFormat(digits = 2),
        maxWidth = 80
      ),
      size = colDef(name = "Gene Set Size", maxWidth = 100),
      leSize = colDef(name = "Leading Edge Size", maxWidth = 100),
      leadingEdge = colDef(
        name = "Leading Edge",
        cell = function(value) with_tooltip(value, width = 50)
      )
    )
  )

Plots

p <- df %>% 
  dplyr::slice(1:9) %>% 
  pull("gs_name") %>% 
  lapply(
    function(x) {
      plotEnrichment(gsByPathway[[x]], rankedGenes) +
        ggtitle(
          str_replace_all(x, "_", " ") %>% 
            str_trunc(40)
        ) +
        ylim(0.62 * c(-1, 1)) +
        theme(plot.title = element_text(hjust = 0.5, size = 10))
    }
  )
plot_grid(plotlist = p)

Enrichment plots for the most highly ranked HALLMARK gene-sets, showing the approximate position within the ranked list where the maximal enrichment score is found. The most compelling results will always be associated with maximal enrichment scores near either extreme.

Reactome Gene Sets

Results

df <- gseaResults %>% 
  dplyr::filter(gs_subcat == "CP:REACTOME", padj < 0.05) %>% 
  mutate(
    leSize = vapply(leadingEdge, length, integer(1)),
   leadingEdge = vapply(
     leadingEdge,
     function(x) paste(id2Name[x], collapse = "; "),
     character(1)
   ) 
  ) %>% 
  dplyr::select(gs_name, pval, padj, NES, ends_with("size"), leadingEdge) 
htmltools::tags$caption(
  htmltools::em(
    glue(
      "
      All {nrow(df)} Reactome gene sets considered to be enriched in the ranked 
      list of genes. 
      All p-values are Bonferroni-adjusted. A Normalised Enrichment Score 
      (NES) > 0 indicates that the geneset was enriched amongst 
      up-regulated genes, whilst a negative NES indicates enrichment in the 
      down-regulated genes. Genes in the Leading Edge represent those which
      appear in the ranked list up until the point of the most extreme NES.
      "
    )
  )
)

All 46 Reactome gene sets considered to be enriched in the ranked list of genes. All p-values are Bonferroni-adjusted. A Normalised Enrichment Score (NES) > 0 indicates that the geneset was enriched amongst up-regulated genes, whilst a negative NES indicates enrichment in the down-regulated genes. Genes in the Leading Edge represent those which appear in the ranked list up until the point of the most extreme NES.

df %>% 
  reactable(
    searchable = TRUE, filterable = TRUE,
    columns = list(
      gs_name = colDef(
        name = "Gene Set",
        cell = function(value) str_replace_all(value, "_", " ")
      ),
      pval = colDef(show = FALSE),
      padj = colDef(
        name = "p<sub>adj</sub>", html = TRUE,
        cell = function(value) {
          fmt <- "%.2e"
          if (value > 0.001) fmt <- "%.3f"
          sprintf(fmt, value)
        },
        maxWidth = 100
      ),
      NES = colDef(
        format = colFormat(digits = 2),
        maxWidth = 80
      ),
      size = colDef(name = "Gene Set Size", maxWidth = 100),
      leSize = colDef(name = "Leading Edge Size", maxWidth = 100),
      leadingEdge = colDef(
        name = "Leading Edge",
        cell = function(value) with_tooltip(value, width = 50)
      )
    )
  )

Plots

p <- df %>% 
  dplyr::slice(1:9) %>% 
  pull("gs_name") %>% 
  lapply(
    function(x) {
      plotEnrichment(gsByPathway[[x]], rankedGenes) +
        ggtitle(
          str_replace_all(x, "_", " ") %>% 
            str_trunc(40)
        ) +
        ylim(0.6 * c(-1, 1)) +
        theme(plot.title = element_text(hjust = 0.5, size = 10))
    }
  )
plot_grid(plotlist = p)

Enrichment plots for the most highly ranked Reactome gene-sets, showing the approximate position within the ranked list where the maximal enrichment score is found. The most compelling results will always be associated with maximal enrichment scores near either extreme.

KEGG Gene Sets

Results

df <- gseaResults %>% 
  dplyr::filter(gs_subcat == "CP:KEGG", padj < 0.05) %>% 
  mutate(
    leSize = vapply(leadingEdge, length, integer(1)),
   leadingEdge = vapply(
     leadingEdge,
     function(x) paste(id2Name[x], collapse = "; "),
     character(1)
   ) 
  ) %>% 
  dplyr::select(gs_name, pval, padj, NES, ends_with("size"), leadingEdge) 
htmltools::tags$caption(
  htmltools::em(
    glue(
      "
      All {nrow(df)} KEGG gene sets considered to be enriched in the ranked 
      list of genes. 
      All p-values are Bonferroni-adjusted. A Normalised Enrichment Score 
      (NES) > 0 indicates that the geneset was enriched amongst 
      up-regulated genes, whilst a negative NES indicates enrichment in the 
      down-regulated genes. Genes in the Leading Edge represent those which
      appear in the ranked list up until the point of the most extreme NES.
      "
    )
  )
)

All 3 KEGG gene sets considered to be enriched in the ranked list of genes. All p-values are Bonferroni-adjusted. A Normalised Enrichment Score (NES) > 0 indicates that the geneset was enriched amongst up-regulated genes, whilst a negative NES indicates enrichment in the down-regulated genes. Genes in the Leading Edge represent those which appear in the ranked list up until the point of the most extreme NES.

df %>% 
  reactable(
    searchable = TRUE, filterable = TRUE,
    columns = list(
      gs_name = colDef(
        name = "Gene Set",
        cell = function(value) str_replace_all(value, "_", " ")
      ),
      pval = colDef(show = FALSE),
      padj = colDef(
        name = "p<sub>adj</sub>", html = TRUE,
        cell = function(value) {
          fmt <- "%.2e"
          if (value > 0.001) fmt <- "%.3f"
          sprintf(fmt, value)
        },
        maxWidth = 100
      ),
      NES = colDef(
        format = colFormat(digits = 2),
        maxWidth = 80
      ),
      size = colDef(name = "Gene Set Size", maxWidth = 100),
      leSize = colDef(name = "Leading Edge Size", maxWidth = 100),
      leadingEdge = colDef(
        name = "Leading Edge",
        cell = function(value) with_tooltip(value, width = 50)
      )
    )
  )

Plots

p <- df %>% 
  dplyr::slice(1:9) %>% 
  pull("gs_name") %>% 
  lapply(
    function(x) {
      plotEnrichment(gsByPathway[[x]], rankedGenes) +
        ggtitle(
          str_replace_all(x, "_", " ") %>% 
            str_trunc(40)
        ) +
        ylim(0.62 * c(-1, 1)) +
        theme(plot.title = element_text(hjust = 0.5, size = 10))
    }
  )
plot_grid(plotlist = p)

Enrichment plots for the most highly ranked KEGG gene-sets, showing the approximate position within the ranked list where the maximal enrichment score is found. The most compelling results will always be associated with maximal enrichment scores near either extreme.

WIKIPATHWAYS Gene Sets

Results

df <- gseaResults %>% 
  dplyr::filter(gs_subcat == "CP:WIKIPATHWAYS", padj < 0.05) %>% 
  mutate(
    leSize = vapply(leadingEdge, length, integer(1)),
   leadingEdge = vapply(
     leadingEdge,
     function(x) paste(id2Name[x], collapse = "; "),
     character(1)
   ) 
  ) %>% 
  dplyr::select(gs_name, pval, padj, NES, ends_with("size"), leadingEdge) 
htmltools::tags$caption(
  htmltools::em(
    glue(
      "
      All {nrow(df)} WIKIPATHWAYS gene sets considered to be enriched in the 
      ranked list of genes. 
      All p-values are Bonferroni-adjusted. A Normalised Enrichment Score 
      (NES) > 0 indicates that the geneset was enriched amongst 
      up-regulated genes, whilst a negative NES indicates enrichment in the 
      down-regulated genes. Genes in the Leading Edge represent those which
      appear in the ranked list up until the point of the most extreme NES.
      "
    )
  )
)

All 14 WIKIPATHWAYS gene sets considered to be enriched in the ranked list of genes. All p-values are Bonferroni-adjusted. A Normalised Enrichment Score (NES) > 0 indicates that the geneset was enriched amongst up-regulated genes, whilst a negative NES indicates enrichment in the down-regulated genes. Genes in the Leading Edge represent those which appear in the ranked list up until the point of the most extreme NES.

df %>% 
  reactable(
    searchable = TRUE, filterable = TRUE,
    columns = list(
      gs_name = colDef(
        name = "Gene Set",
        cell = function(value) str_replace_all(value, "_", " ")
      ),
      pval = colDef(show = FALSE),
      padj = colDef(
        name = "p<sub>adj</sub>", html = TRUE,
        cell = function(value) {
          fmt <- "%.2e"
          if (value > 0.001) fmt <- "%.3f"
          sprintf(fmt, value)
        },
        maxWidth = 100
      ),
      NES = colDef(
        format = colFormat(digits = 2),
        maxWidth = 80
      ),
      size = colDef(name = "Gene Set Size", maxWidth = 100),
      leSize = colDef(name = "Leading Edge Size", maxWidth = 100),
      leadingEdge = colDef(
        name = "Leading Edge",
        cell = function(value) with_tooltip(value, width = 50)
      )
    )
  )

Plots

p <- df %>% 
  dplyr::slice(1:9) %>% 
  pull("gs_name") %>% 
  lapply(
    function(x) {
      plotEnrichment(gsByPathway[[x]], rankedGenes) +
        ggtitle(
          str_replace_all(x, "_", " ") %>% 
            str_trunc(40)
        ) +
        ylim(0.9 * c(-1, 1)) +
        theme(plot.title = element_text(hjust = 0.5, size = 10))
    }
  )
plot_grid(plotlist = p)

Enrichment plots for the most highly ranked WIKIPATHWAYS gene-sets, showing the approximate position within the ranked list where the maximal enrichment score is found. The most compelling results will always be associated with maximal enrichment scores near either extreme.

ImmunSigDB Gene Sets

Results

df <- gseaResults %>% 
  dplyr::filter(gs_subcat == "IMMUNESIGDB", padj < 0.05) %>% 
  mutate(
    leSize = vapply(leadingEdge, length, integer(1)),
   leadingEdge = vapply(
     leadingEdge,
     function(x) paste(id2Name[x], collapse = "; "),
     character(1)
   ) 
  ) %>% 
  dplyr::select(gs_name, pval, padj, NES, ends_with("size"), leadingEdge) 
htmltools::tags$caption(
  htmltools::em(
    glue(
      "
      All {nrow(df)} ImmunSigDB gene sets considered to be enriched in the 
      ranked list of genes. 
      All p-values are Bonferroni-adjusted. A Normalised Enrichment Score 
      (NES) > 0 indicates that the geneset was enriched amongst 
      up-regulated genes, whilst a negative NES indicates enrichment in the 
      down-regulated genes. Genes in the Leading Edge represent those which
      appear in the ranked list up until the point of the most extreme NES.
      "
    )
  )
)

All 222 ImmunSigDB gene sets considered to be enriched in the ranked list of genes. All p-values are Bonferroni-adjusted. A Normalised Enrichment Score (NES) > 0 indicates that the geneset was enriched amongst up-regulated genes, whilst a negative NES indicates enrichment in the down-regulated genes. Genes in the Leading Edge represent those which appear in the ranked list up until the point of the most extreme NES.

df %>% 
  reactable(
    searchable = TRUE, filterable = TRUE,
    columns = list(
      gs_name = colDef(
        name = "Gene Set",
        cell = function(value) str_replace_all(value, "_", " ")
      ),
      pval = colDef(show = FALSE),
      padj = colDef(
        name = "p<sub>adj</sub>", html = TRUE,
        cell = function(value) {
          fmt <- "%.2e"
          if (value > 0.001) fmt <- "%.3f"
          sprintf(fmt, value)
        },
        maxWidth = 100
      ),
      NES = colDef(
        format = colFormat(digits = 2),
        maxWidth = 80
      ),
      size = colDef(name = "Gene Set Size", maxWidth = 100),
      leSize = colDef(name = "Leading Edge Size", maxWidth = 100),
      leadingEdge = colDef(
        name = "Leading Edge",
        cell = function(value) with_tooltip(value, width = 50)
      )
    )
  )

Plots

p <- df %>% 
  dplyr::slice(1:9) %>% 
  pull("gs_name") %>% 
  lapply(
    function(x) {
      plotEnrichment(gsByPathway[[x]], rankedGenes) +
        ggtitle(
          str_replace_all(x, "_", " ") %>% 
            str_trunc(40)
        ) +
        ylim(0.6 * c(-1, 1)) +
        theme(plot.title = element_text(hjust = 0.5, size = 10))
    }
  )
plot_grid(plotlist = p)

Enrichment plots for the most highly ranked ImmuneSigDB gene-sets, showing the approximate position within the ranked list where the maximal enrichment score is found. The most compelling results will always be associated with maximal enrichment scores near either extreme.

GO: Biological Process Gene Sets

Results

df <- gseaResults %>% 
  dplyr::filter(gs_subcat == "GO:BP", padj < 0.05) %>% 
  mutate(
    leSize = vapply(leadingEdge, length, integer(1)),
   leadingEdge = vapply(
     leadingEdge,
     function(x) paste(id2Name[x], collapse = "; "),
     character(1)
   ) 
  ) %>% 
  dplyr::select(gs_name, pval, padj, NES, ends_with("size"), leadingEdge) 
htmltools::tags$caption(
  htmltools::em(
    glue(
      "
      All {nrow(df)} GO: Biological Process gene sets considered to be enriched 
      in the ranked list of genes. 
      All p-values are Bonferroni-adjusted. A Normalised Enrichment Score 
      (NES) > 0 indicates that the geneset was enriched amongst 
      up-regulated genes, whilst a negative NES indicates enrichment in the 
      down-regulated genes. Genes in the Leading Edge represent those which
      appear in the ranked list up until the point of the most extreme NES.
      "
    )
  )
)

All 135 GO: Biological Process gene sets considered to be enriched in the ranked list of genes. All p-values are Bonferroni-adjusted. A Normalised Enrichment Score (NES) > 0 indicates that the geneset was enriched amongst up-regulated genes, whilst a negative NES indicates enrichment in the down-regulated genes. Genes in the Leading Edge represent those which appear in the ranked list up until the point of the most extreme NES.

df %>% 
  reactable(
    searchable = TRUE, filterable = TRUE,
    columns = list(
      gs_name = colDef(
        name = "Gene Set",
        cell = function(value) str_replace_all(value, "_", " ") %>% 
          str_remove("^GO(BP|CC|MF) ")
      ),
      pval = colDef(show = FALSE),
      padj = colDef(
        name = "p<sub>adj</sub>", html = TRUE,
        cell = function(value) {
          fmt <- "%.2e"
          if (value > 0.001) fmt <- "%.3f"
          sprintf(fmt, value)
        },
        maxWidth = 100
      ),
      NES = colDef(
        format = colFormat(digits = 2),
        maxWidth = 80
      ),
      size = colDef(name = "Gene Set Size", maxWidth = 100),
      leSize = colDef(name = "Leading Edge Size", maxWidth = 100),
      leadingEdge = colDef(
        name = "Leading Edge",
        cell = function(value) with_tooltip(value, width = 50)
      )
    )
  )

Plots

p <- df %>% 
  dplyr::slice(1:9) %>% 
  pull("gs_name") %>% 
  lapply(
    function(x) {
      plotEnrichment(gsByPathway[[x]], rankedGenes) +
        ggtitle(
          str_replace_all(x, "_", " ") %>% 
            str_remove("^GO(BP|CC|MF) ") %>% 
            str_trunc(40)
        ) +
        ylim(0.55 * c(-1, 1)) +
        theme(plot.title = element_text(hjust = 0.5, size = 10))
    }
  )
plot_grid(plotlist = p)

Enrichment plots for the most highly ranked GO: Biological Process gene-sets, showing the approximate position within the ranked list where the maximal enrichment score is found. The most compelling results will always be associated with maximal enrichment scores near either extreme.

GO: Cellular Component Gene Sets

Results

df <- gseaResults %>% 
  dplyr::filter(gs_subcat == "GO:MF", padj < 0.05) %>% 
  mutate(
    leSize = vapply(leadingEdge, length, integer(1)),
   leadingEdge = vapply(
     leadingEdge,
     function(x) paste(id2Name[x], collapse = "; "),
     character(1)
   ) 
  ) %>% 
  dplyr::select(gs_name, pval, padj, NES, ends_with("size"), leadingEdge) 
htmltools::tags$caption(
  htmltools::em(
    glue(
      "
      All {nrow(df)} GO: Molecular Function gene sets considered to be enriched 
      in the ranked list of genes. 
      All p-values are Bonferroni-adjusted. A Normalised Enrichment Score 
      (NES) > 0 indicates that the geneset was enriched amongst 
      up-regulated genes, whilst a negative NES indicates enrichment in the 
      down-regulated genes. Genes in the Leading Edge represent those which
      appear in the ranked list up until the point of the most extreme NES.
      "
    )
  )
)

All 31 GO: Molecular Function gene sets considered to be enriched in the ranked list of genes. All p-values are Bonferroni-adjusted. A Normalised Enrichment Score (NES) > 0 indicates that the geneset was enriched amongst up-regulated genes, whilst a negative NES indicates enrichment in the down-regulated genes. Genes in the Leading Edge represent those which appear in the ranked list up until the point of the most extreme NES.

df %>% 
  reactable(
    searchable = TRUE, filterable = TRUE,
    columns = list(
      gs_name = colDef(
        name = "Gene Set",
        cell = function(value) str_replace_all(value, "_", " ") %>% 
          str_remove("^GO(BP|CC|MF) ")
      ),
      pval = colDef(show = FALSE),
      padj = colDef(
        name = "p<sub>adj</sub>", html = TRUE,
        cell = function(value) {
          fmt <- "%.2e"
          if (value > 0.001) fmt <- "%.3f"
          sprintf(fmt, value)
        },
        maxWidth = 100
      ),
      NES = colDef(
        format = colFormat(digits = 2),
        maxWidth = 80
      ),
      size = colDef(name = "Gene Set Size", maxWidth = 100),
      leSize = colDef(name = "Leading Edge Size", maxWidth = 100),
      leadingEdge = colDef(
        name = "Leading Edge",
        cell = function(value) with_tooltip(value, width = 50)
      )
    )
  )

Plots

p <- df %>% 
  dplyr::slice(1:9) %>% 
  pull("gs_name") %>% 
  lapply(
    function(x) {
      plotEnrichment(gsByPathway[[x]], rankedGenes) +
        ggtitle(
          str_replace_all(x, "_", " ") %>% 
            str_remove("^GO(BP|CC|MF) ") %>% 
            str_trunc(40)
        ) +
        ylim(0.7 * c(-1, 1)) +
        theme(plot.title = element_text(hjust = 0.5, size = 10))
    }
  )
plot_grid(plotlist = p)

Enrichment plots for the most highly ranked GO: Molecular Function gene-sets, showing the approximate position within the ranked list where the maximal enrichment score is found. The most compelling results will always be associated with maximal enrichment scores near either extreme.

GO: Molecular Function Gene Sets

Results

df <- gseaResults %>% 
  dplyr::filter(gs_subcat == "GO:CC", padj < 0.05) %>% 
  mutate(
    leSize = vapply(leadingEdge, length, integer(1)),
   leadingEdge = vapply(
     leadingEdge,
     function(x) paste(id2Name[x], collapse = "; "),
     character(1)
   ) 
  ) %>% 
  dplyr::select(gs_name, pval, padj, NES, ends_with("size"), leadingEdge) 
htmltools::tags$caption(
  htmltools::em(
    glue(
      "
      All {nrow(df)} GO: Cellular Component gene sets considered to be enriched 
      in the ranked list of genes. 
      All p-values are Bonferroni-adjusted. A Normalised Enrichment Score 
      (NES) > 0 indicates that the geneset was enriched amongst 
      up-regulated genes, whilst a negative NES indicates enrichment in the 
      down-regulated genes. Genes in the Leading Edge represent those which
      appear in the ranked list up until the point of the most extreme NES.
      "
    )
  )
)

All 33 GO: Cellular Component gene sets considered to be enriched in the ranked list of genes. All p-values are Bonferroni-adjusted. A Normalised Enrichment Score (NES) > 0 indicates that the geneset was enriched amongst up-regulated genes, whilst a negative NES indicates enrichment in the down-regulated genes. Genes in the Leading Edge represent those which appear in the ranked list up until the point of the most extreme NES.

df %>% 
  reactable(
    searchable = TRUE, filterable = TRUE,
    columns = list(
      gs_name = colDef(
        name = "Gene Set",
        cell = function(value) str_replace_all(value, "_", " ") %>% 
          str_remove("^GO(BP|CC|MF) ")
      ),
      pval = colDef(show = FALSE),
      padj = colDef(
        name = "p<sub>adj</sub>", html = TRUE,
        cell = function(value) {
          fmt <- "%.2e"
          if (value > 0.001) fmt <- "%.3f"
          sprintf(fmt, value)
        },
        maxWidth = 100
      ),
      NES = colDef(
        format = colFormat(digits = 2),
        maxWidth = 80
      ),
      size = colDef(name = "Gene Set Size", maxWidth = 100),
      leSize = colDef(name = "Leading Edge Size", maxWidth = 100),
      leadingEdge = colDef(
        name = "Leading Edge",
        cell = function(value) with_tooltip(value, width = 50)
      )
    )
  )

Plots

p <- df %>% 
  dplyr::slice(1:9) %>% 
  pull("gs_name") %>% 
  lapply(
    function(x) {
      plotEnrichment(gsByPathway[[x]], rankedGenes) +
        ggtitle(
          str_replace_all(x, "_", " ") %>% 
            str_remove("^GO(BP|CC|MF) ") %>% 
            str_trunc(40)
        ) +
        ylim(0.55 * c(-1, 1)) +
        theme(plot.title = element_text(hjust = 0.5, size = 10))
    }
  )
plot_grid(plotlist = p)

Enrichment plots for the most highly ranked GO: Cellular Component gene-sets, showing the approximate position within the ranked list where the maximal enrichment score is found. The most compelling results will always be associated with maximal enrichment scores near either extreme.

sessionInfo()

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] htmltools_0.5.2         reactable_0.2.3         fgsea_1.22.0           
 [4] msigdbr_7.5.1           statmod_1.4.36          ggrepel_0.9.1          
 [7] broom_0.8.0             matrixStats_0.62.0      cowplot_1.1.1          
[10] magrittr_2.0.3          ensembldb_2.20.1        AnnotationFilter_1.20.0
[13] GenomicFeatures_1.48.0  AnnotationDbi_1.58.0    Biobase_2.56.0         
[16] GenomicRanges_1.48.0    GenomeInfoDb_1.32.1     IRanges_2.30.0         
[19] S4Vectors_0.34.0        AnnotationHub_3.4.0     BiocFileCache_2.4.0    
[22] dbplyr_2.1.1            BiocGenerics_0.42.0     edgeR_3.38.0           
[25] limma_3.52.0            glue_1.6.2              pander_0.6.5           
[28] scales_1.2.0            forcats_0.5.1           stringr_1.4.0          
[31] dplyr_1.0.9             purrr_0.3.4             readr_2.1.2            
[34] tidyr_1.2.0             tibble_3.1.7            ggplot2_3.3.6          
[37] tidyverse_1.3.1         workflowr_1.7.0        

loaded via a namespace (and not attached):
  [1] readxl_1.4.0                  backports_1.4.1              
  [3] fastmatch_1.1-3               lazyeval_0.2.2               
  [5] splines_4.2.0                 crosstalk_1.2.0              
  [7] BiocParallel_1.30.0           digest_0.6.29                
  [9] fansi_1.0.3                   memoise_2.0.1                
 [11] tzdb_0.3.0                    Biostrings_2.64.0            
 [13] modelr_0.1.8                  vroom_1.5.7                  
 [15] prettyunits_1.1.1             colorspace_2.0-3             
 [17] blob_1.2.3                    rvest_1.0.2                  
 [19] rappdirs_0.3.3                haven_2.5.0                  
 [21] xfun_0.30                     callr_3.7.0                  
 [23] crayon_1.5.1                  RCurl_1.98-1.6               
 [25] jsonlite_1.8.0                gtable_0.3.0                 
 [27] zlibbioc_1.42.0               XVector_0.36.0               
 [29] DelayedArray_0.22.0           DBI_1.1.2                    
 [31] Rcpp_1.0.8.3                  xtable_1.8-4                 
 [33] progress_1.2.2                bit_4.0.4                    
 [35] htmlwidgets_1.5.4             httr_1.4.3                   
 [37] ellipsis_0.3.2                farver_2.1.0                 
 [39] pkgconfig_2.0.3               XML_3.99-0.9                 
 [41] sass_0.4.1                    here_1.0.1                   
 [43] locfit_1.5-9.5                utf8_1.2.2                   
 [45] labeling_0.4.2                tidyselect_1.1.2             
 [47] rlang_1.0.2                   later_1.3.0                  
 [49] reactR_0.4.4                  munsell_0.5.0                
 [51] BiocVersion_3.15.2            cellranger_1.1.0             
 [53] tools_4.2.0                   cachem_1.0.6                 
 [55] cli_3.3.0                     generics_0.1.2               
 [57] RSQLite_2.2.13                evaluate_0.15                
 [59] fastmap_1.1.0                 yaml_2.3.5                   
 [61] processx_3.5.3                babelgene_22.3               
 [63] knitr_1.39                    bit64_4.0.5                  
 [65] fs_1.5.2                      KEGGREST_1.36.0              
 [67] nlme_3.1-157                  whisker_0.4                  
 [69] mime_0.12                     xml2_1.3.3                   
 [71] biomaRt_2.52.0                compiler_4.2.0               
 [73] rstudioapi_0.13               filelock_1.0.2               
 [75] curl_4.3.2                    png_0.1-7                    
 [77] interactiveDisplayBase_1.34.0 reprex_2.0.1                 
 [79] bslib_0.3.1                   stringi_1.7.6                
 [81] highr_0.9                     ps_1.7.0                     
 [83] lattice_0.20-45               ProtGenerics_1.28.0          
 [85] Matrix_1.4-1                  vctrs_0.4.1                  
 [87] pillar_1.7.0                  lifecycle_1.0.1              
 [89] BiocManager_1.30.17           jquerylib_0.1.4              
 [91] data.table_1.14.2             bitops_1.0-7                 
 [93] httpuv_1.6.5                  rtracklayer_1.56.0           
 [95] R6_2.5.1                      BiocIO_1.6.0                 
 [97] promises_1.2.0.1              gridExtra_2.3                
 [99] assertthat_0.2.1              SummarizedExperiment_1.26.1  
[101] rprojroot_2.0.3               rjson_0.2.21                 
[103] withr_2.5.0                   GenomicAlignments_1.32.0     
[105] Rsamtools_2.12.0              GenomeInfoDbData_1.2.8       
[107] mgcv_1.8-40                   parallel_4.2.0               
[109] hms_1.1.1                     grid_4.2.0                   
[111] rmarkdown_2.14                MatrixGenerics_1.8.0         
[113] git2r_0.30.1                  getPass_0.2-2                
[115] shiny_1.7.1                   lubridate_1.8.0              
[117] restfulr_0.0.13

Differential Gene Expression

Stephen Pederson

07 July, 2022

Setup

Count Data

Data Processing and Analysis

Density Plots

RLE

PCA

Sample Weights

Results

MA Plot

Volcano Plot

Highly Ranked Genes

Enrichment Analysis

HALLMARK Gene Sets

Results

Plots

Reactome Gene Sets

Results

Plots

KEGG Gene Sets

Results

Plots

WIKIPATHWAYS Gene Sets

Results

Plots

ImmunSigDB Gene Sets

Results

Plots

GO: Biological Process Gene Sets

Results

Plots

GO: Cellular Component Gene Sets

Results

Plots

GO: Molecular Function Gene Sets

Results

Plots