Differential Gene Expression Analysis

Last updated: 2021-07-02

Checks: 7 passed, 0 failed

Knit directory: Bulk_RNAseq/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20210629)

The command set.seed(20210629) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 76d837c

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 76d837c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    output/.DS_Store
    Ignored:    renv/library/
    Ignored:    renv/local/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  code/
    Untracked:  rlib/

Unstaged changes:
    Modified:   output/GO_network.pdf

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/de.Rmd) and HTML (docs/de.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	8a808a2	Nhi Hin	2021-07-02	Build site.
Rmd	ed2fe3c	Nhi Hin	2021-07-02	wflow_publish("analysis/*.Rmd")
html	66b2c36	Nhi Hin	2021-07-02	Build site.
Rmd	9b94e39	Nhi Hin	2021-07-02	wflow_publish("analysis/*.Rmd")
html	f62ae4f	Nhi Hin	2021-06-30	Build site.
Rmd	1d1160f	Nhi Hin	2021-06-30	wflow_publish("analysis/*.Rmd")

Summary

On this page, we will perform differential gene expression analysis to identify genes that show significant differential expression between Covid19 and Healthy samples. The steps involved are:

1. Importing the DGEList object prepared in the Setting Up page.
2. Filtering out low-expressed genes to increase our power to detect differentially expressed genes.
3. Identifying differentially expressed genes using the limma package.
4. Visualising differentially expressed genes using volcano plots and boxplots.
5. Running GO enrichment analysis on the differentially expressed genes to explore their biological significance, including network plot visualisation.

Load R Packages

The following code chunk loads the R packages required for the analysis on this page:

# Working with data:
library(dplyr)
library(magrittr)
library(readr)
library(tibble)
library(reshape2)

# Visualisation:
library(kableExtra)
library(ggplot2)
library(ggbiplot)
library(ggrepel)
library(grid)
library(cowplot)
# Set ggplot2 theme
theme_set(theme_bw())

# Other packages:
library(here)
library(export)

# Bioconductor packages:
library(AnnotationHub)
library(edgeR)
library(limma)
library(Glimma)
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)

Import Data

Below, we import an object called dge, which was prepared in the Setting Up page. This object contains gene counts, sample metadata, and gene annotations. Unfortunately we don’t have time to cover this in detail, but please at least skim the Setting Up page to understand how the data were imported and put into this format.

dge <- readRDS(here("data", "R", "dge.rds"))
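For readers unfamiliar with .rds files: each one stores a single R object, and readRDS() restores that object as it was saved. The following is a minimal round-trip sketch only; the toy object and the temporary file path are invented for illustration and are unrelated to the real dge.rds file.

```r
# Hypothetical toy object standing in for something like dge
toy_object <- list(counts = matrix(1:4, nrow = 2), note = "example")

# Write the object to a temporary .rds file, then read it back
tmp_path <- tempfile(fileext = ".rds")
saveRDS(toy_object, tmp_path)
restored <- readRDS(tmp_path)

# The restored object should be identical to the one that was saved
roundtrip_ok <- identical(toy_object, restored)

# Clean up the temporary file
unlink(tmp_path)
```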

To quickly summarise, the gene counts in the dge object can be accessed using dge$counts, and we can preview the first 5 rows and columns as follows. Each row represents one gene and the columns represent different samples.

dge$counts[1:5,1:5]

      Healthy_1 Healthy_2 Healthy_3 Healthy_4 Healthy_5
A1BG  133.14360 110.14223  94.68670  89.60004  85.81497
A1CF    9.00000   8.00000   1.00000  10.00020   3.00000
A2M    65.00001  31.72058  36.63567  19.22508  27.99998
A2ML1  29.28124  26.30470  20.57935  13.00000   5.00000
A2MP1 262.00034  83.00004  98.99997  55.99998  41.00000

The sample metadata in the dge object can be accessed using dge$samples, and we can preview the first 5 rows and columns as follows. Each row represents one sample and the columns represent information/characteristics about the samples.

dge$samples[1:5,1:5]

            group lib.size norm.factors patient_code age
Healthy_1 Healthy 40594184     1.139493        507-V  53
Healthy_2 Healthy 36139025     1.127571       1189-V  57
Healthy_3 Healthy 29466261     1.180684       1406-V  61
Healthy_4 Healthy 33691447     1.080104       1918-V  56
Healthy_5 Healthy 33097717     1.012251       1951-V  57

Lastly, the gene metadata in the dge object can be accessed using dge$genes, and we can preview the first 5 rows and columns as follows. Each row represents one gene and the columns represent information about that gene.

dge$genes[1:5,1:5]

       gene seqnames    start      end width
A1BG   A1BG       19 58345178 58353492  8315
A1CF   A1CF       10 50799409 50885675 86267
A2M     A2M       12  9067664  9116229 48566
A2ML1 A2ML1       12  8822621  8887001 64381
A2MP1 A2MP1       12  9228533  9275817 47285

To understand the contents of these objects better, try viewing them using the View() function, e.g. View(dge$genes) or View(dge$samples).
The nrow() and ncol() functions return the number of rows and columns in these objects, e.g.

nrow(dge$samples) # Number of samples

[1] 54
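The same counting functions work on the other components of the object. As a minimal sketch, the toy list below mimics the three components described above. It is a plain list rather than a real DGEList, and every name and value in it is invented for illustration.

```r
# Toy stand-in for a DGEList: a plain list with the same three components.
# All names and values are invented for illustration only.
toy_samples <- c("Healthy_1", "Healthy_2", "Covid19_1", "Covid19_2")
toy_genes <- c("geneA", "geneB", "geneC")

toy <- list(
  counts = matrix(
    c(10, 0, 5, 3,
      20, 1, 7, 0,
      15, 2, 9, 4),
    nrow = 3, byrow = TRUE,
    dimnames = list(toy_genes, toy_samples)
  ),
  samples = data.frame(
    group = factor(c("Healthy", "Healthy", "Covid19", "Covid19")),
    row.names = toy_samples
  ),
  genes = data.frame(gene = toy_genes, row.names = toy_genes)
)

n_genes <- nrow(toy$counts)              # one row per gene
n_samples <- ncol(toy$counts)            # one column per sample
group_sizes <- table(toy$samples$group)  # number of samples in each group

# The components line up: rows of `samples` match columns of `counts`,
# and rows of `genes` match rows of `counts`.
stopifnot(identical(rownames(toy$samples), colnames(toy$counts)))
stopifnot(identical(rownames(toy$genes), rownames(toy$counts)))
```

On the real object, the equivalent calls would be nrow(dge$counts), ncol(dge$counts), and table(dge$samples$group).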

Questions

How many samples in total are in this dataset? How many Healthy/Covid19 samples are there?
How many genes are in this dataset?

Filtering of low-expressed genes

Filtering out low-expressed genes is a standard step in differential gene expression analysis. It increases our power to detect differentially expressed genes because we no longer have to test genes whose expression is too low to be measured reliably. Please refer to this paper for more background reading.
The cutoff for filtering out low-expressed genes is somewhat arbitrary, and we can choose it by inspecting density plots like the ones below. Ideally, the filtering should remove the large peak of low-expressed genes towards the left of the density plot.
A common guideline is to retain genes expressed above 1 count per million (cpm) in at least as many samples as there are in the smallest group. Here, the smallest group has 10 samples (we have 10 Healthy samples and 44 Covid-19 samples).
With the cutoff set to more than 1 cpm in at least 10 samples, as below, the peak corresponding to low-expressed genes is largely removed from the data after filtering.
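To make the filtering rule concrete, here is a minimal base-R sketch on an invented count matrix. The library sizes are supplied by hand and all names are hypothetical. Note also that edgeR's cpm() on a DGEList divides by the effective library size (lib.size multiplied by norm.factors) by default, whereas this sketch uses plain library sizes, so it illustrates the idea rather than reproducing edgeR's calculation.

```r
# Toy counts: 4 genes (rows) x 6 samples (columns); all values are invented.
toy_counts <- matrix(
  c( 0,  1,  0,  2,  0,  1,   # geneA: barely expressed anywhere
    50, 60, 40, 55, 45, 65,   # geneB: expressed in every sample
     0,  0,  0, 30, 35, 40,   # geneC: expressed in one group only
     3,  0,  2,  0,  1,  0),  # geneD: barely expressed anywhere
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:4]), paste0("S", 1:6))
)

# Hypothetical library sizes (total reads per sample), set by hand
toy_lib_size <- rep(2e6, 6)

# Counts per million: divide each column by its library size, then scale
toy_cpm <- sweep(toy_counts, 2, toy_lib_size, "/") * 1e6

# Keep genes above 1 cpm in at least 3 samples
# (3 plays the role of the smallest group size in this toy example)
min_samples <- 3
toy_keep <- rowSums(toy_cpm > 1) >= min_samples

toy_keep
```

In this toy example geneC is retained even though it is absent from half of the samples. This is why the sample threshold is tied to the smallest group size: a gene expressed in only one group can still pass the filter.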

# Keep genes with more than 1 cpm in at least 10 samples
# (10 is the size of the smallest group, Healthy)
keepTheseGenes <- rowSums(cpm(dge) > 1) >= 10

# Per-sample density of logCPM values before filtering
beforeFiltering_plot <- dge %>% 
  cpm(log = TRUE) %>% 
  melt() %>% 
  dplyr::filter(is.finite(value)) %>% 
  ggplot(aes(x = value, colour = Var2)) +
  geom_density() + 
  guides(colour = "none") +  # hide the per-sample legend
  ggtitle("A. Before filtering", subtitle = paste0(nrow(dge), " genes")) +
  labs(x = "logCPM", y = "Density")

# Per-sample density of logCPM values for the retained genes only
afterFiltering_plot <- dge %>% 
  cpm(log = TRUE) %>% 
  magrittr::extract(keepTheseGenes, ) %>%
  melt() %>% 
  dplyr::filter(is.finite(value)) %>% 
  ggplot(aes(x = value, colour = Var2)) +
  geom_density() + 
  guides(colour = "none") +  # hide the per-sample legend
  ggtitle("B. After filtering", subtitle = paste0(sum(keepTheseGenes), " genes")) +
  labs(x = "logCPM", y = "Density")

# Show the two panels side by side
cowplot::plot_grid(beforeFiltering_plot, afterFiltering_plot)


Questions

What happens to the density plot and number of genes left after filtering if we change the filtering so that the expression cutoff is 0.1 cpm instead of 1 cpm? What about if the expression cutoff is 1 cpm in 30 samples instead of 10?

With the filter of more than 1 cpm in 10 or more samples, the step below removes 15097 of the original 29302 genes, leaving 14205 genes for the analysis.

dge <- dge[keepTheseGenes,,keep.lib.sizes = FALSE]
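The keep.lib.sizes = FALSE argument asks edgeR to recompute each sample's library size from the genes that remain, rather than carrying over the pre-filtering totals. A minimal base-R sketch of that recalculation follows; the matrix and all names are invented for illustration, and this is not edgeR's own code.

```r
# Toy counts (invented): 3 genes x 2 samples
small_counts <- matrix(
  c(100, 200,
      1,   0,
     50,  75),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), c("S1", "S2"))
)

# Library size before filtering: the column sums over all genes
lib_size_before <- colSums(small_counts)

# Hypothetical filter that drops the low-count gene g2
small_keep <- c(TRUE, FALSE, TRUE)
small_filtered <- small_counts[small_keep, , drop = FALSE]

# Library size after filtering: the column sums over the remaining genes
lib_size_after <- colSums(small_filtered)

lib_size_before
lib_size_after
```

Because the removed genes carry few reads, the recomputed library sizes usually change very little, but recalculating them keeps the totals consistent with the counts actually used downstream.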