Differential gene expression analysis with DESeq2

Installation

Example data

Size factors

Mean-variance relationship

A basic analysis

Exploring the results

Multifactorial designs and linear models

Noise and replicates

Last updated: 2024-08-06

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 665782b

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 665782b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.3.3/
    Ignored:    r_packages_4.4.0/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/deseq2.Rmd) and HTML (docs/deseq2.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	665782b	Dave Tang	2024-08-06	Noise and replicates
html	3a1d43a	Dave Tang	2024-08-06	Build site.
Rmd	3738c6f	Dave Tang	2024-08-06	Multifactorial design
html	f4da2ce	Dave Tang	2024-08-05	Build site.
Rmd	9293da5	Dave Tang	2024-08-05	Visualisations after a DGE analysis
html	365890e	Dave Tang	2024-08-05	Build site.
Rmd	3780bfa	Dave Tang	2024-08-05	A basic DESeq2 analysis
html	1383bf7	Dave Tang	2024-08-05	Build site.
Rmd	d5e67bc	Dave Tang	2024-08-05	Getting started with DESeq2

DESeq2 is used to:

Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution.

Installation

Install using BiocManager::install().

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

We will use data from the pasilla package so install it too.

BiocManager::install("pasilla")

Example data

Example dataset in the experiment data package {pasilla}.

fn <- system.file("extdata", "pasilla_gene_counts.tsv", package = "pasilla", mustWork = TRUE)
counts <- as.matrix(read.csv(fn, sep="\t", row.names = "gene_id"))
dim(counts)

[1] 14599     7

The matrix tallies the number of reads assigned for each gene in each sample.

tail(counts)

            untreated1 untreated2 untreated3 untreated4 treated1 treated2
FBgn0261570       3296       4910       2156       2060     5077     3069
FBgn0261571          0          0          0          0        1        0
FBgn0261572          4         13          4         11        7        3
FBgn0261573       2651       3653       1571       1612     3334     1848
FBgn0261574       6385       9318       3110       2819    10455     3508
FBgn0261575          6         53          1          3       42        3
            treated3
FBgn0261570     3022
FBgn0261571        0
FBgn0261572        3
FBgn0261573     1908
FBgn0261574     3047
FBgn0261575        4

Size factors

Estimate size factors.

DESeq2::estimateSizeFactorsForMatrix(counts)

untreated1 untreated2 untreated3 untreated4   treated1   treated2   treated3 
 1.1382630  1.7930004  0.6495470  0.7516892  1.6355751  0.7612698  0.8326526

Mean-variance relationship

Variance versus mean for the (size factor adjusted) counts data. The axes are logarithmic. Also shown are lines through the origin with slopes 1 (green) and 2 (red).

sf <- DESeq2::estimateSizeFactorsForMatrix(counts)

ncounts <- t(t(counts) / sf)

# untreated samples
uncounts <- ncounts[, grep("^untreated", colnames(ncounts)), drop = FALSE]

ggplot(
  tibble(
    mean = rowMeans(uncounts),
    var  = rowVars(uncounts)
  ),
  aes(x = log1p(mean), y = log1p(var))
) +
  geom_hex() +
  coord_fixed() +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_abline(slope = 1:2, color = c("forestgreen", "red"))

Version	Author	Date
1383bf7	Dave Tang	2024-08-05

The green line is what we expect if the variance equals the mean, as is the case for a Poisson-distributed random variable. This approximately fits the data in the lower range. The red line corresponds to the quadratic mean-variance relationship $v = m^2$ . We can see that in the upper range of the data, the quadratic relationship approximately fits the data.

A basic analysis

The {pasilla} data set is from an experiment on Drosophila melanogaster cell cultures that tested the effect of RNAi knockdown of the splicing factor pasilla on the cells’ transcriptome. There were two experimental conditions, termed untreated and treated in the column headers and they correspond to negative control and siRNA against pasilla. The experimental metadata of the seven samples in this dataset are loaded below.

annotationFile <- system.file("extdata", "pasilla_sample_annotation.csv", package = "pasilla", mustWork = TRUE)
pasillaSampleAnno <- readr::read_csv(annotationFile)

Rows: 7 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): file, condition, type, total number of reads
dbl (2): number of lanes, exon counts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pasillaSampleAnno

# A tibble: 7 × 6
  file    condition type  `number of lanes` total number of read…¹ `exon counts`
  <chr>   <chr>     <chr>             <dbl> <chr>                          <dbl>
1 treate… treated   sing…                 5 35158667                    15679615
2 treate… treated   pair…                 2 12242535 (x2)               15620018
3 treate… treated   pair…                 2 12443664 (x2)               12733865
4 untrea… untreated sing…                 2 17812866                    14924838
5 untrea… untreated sing…                 6 34284521                    20764558
6 untrea… untreated pair…                 2 10542625 (x2)               10283129
7 untrea… untreated pair…                 2 12214974 (x2)               11653031
# ℹ abbreviated name: ¹`total number of reads`

The dataset was produced in two batches, the first consisting of three sequencing libraries using single read sequencing and the second using using paired-end sequencing.

Create factors.

mutate(
  pasillaSampleAnno,
  condition = factor(condition, levels = c("untreated", "treated")),
  type = factor(sub("-.*", "", type), levels = c("single", "paired"))
) -> pasillaSampleAnno

Note that the design is approximately balanced between the factor of interest, condition, and the “nuisance factor”, type.

with(pasillaSampleAnno, table(condition, type))

           type
condition   single paired
  untreated      2      2
  treated        1      2

{DESeq2} uses DESeqDataSet objects and is an extension of the SummarizedExperiment class in Bioconductor. We can create a DESeqDataSet object using the constructor function DESeqDataSetFromMatrix.

mt <- match(colnames(counts), sub("fb$", "", pasillaSampleAnno$file))
stopifnot(!any(is.na(mt)))

pasilla <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = pasillaSampleAnno[mt, ],
  design    = ~ condition
)

class(pasilla)

[1] "DESeqDataSet"
attr(,"package")
[1] "DESeq2"

is(pasilla, "SummarizedExperiment")

[1] TRUE

After creating a DESeqDataSet we are ready to carry out a differential expression analysis. The aim is to identify genes that are differentially abundant between the treated and untreated cells. A test that is conceptually similar to the t-test is used. A choice of standard analysis steps are wrapped into a single function, DESeq().

pasilla <- DESeq(pasilla)

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

The DESeq() function is a wrapper that calls:

estimateSizeFactors for normalisation
estimateDispersions for dispersion estimation
nbinomWaldTest for hypothesis tests for differential abundance

The test is between the two levels untreated and treated of the factor condition, since this is what was specified in the design: design = ~ condition.

res <- results(pasilla)
res[order(res$padj), ] |> head()

log2 fold change (MLE): condition treated vs untreated 
Wald test p-value: condition treated vs untreated 
DataFrame with 6 rows and 6 columns
             baseMean log2FoldChange     lfcSE      stat       pvalue
            <numeric>      <numeric> <numeric> <numeric>    <numeric>
FBgn0039155   730.596       -4.61901 0.1687068  -27.3789 4.88599e-165
FBgn0025111  1501.411        2.89986 0.1269205   22.8479 1.53430e-115
FBgn0029167  3706.117       -2.19700 0.0969888  -22.6521 1.33042e-113
FBgn0003360  4343.035       -3.17967 0.1435264  -22.1539 9.56283e-109
FBgn0035085   638.233       -2.56041 0.1372952  -18.6490  1.28772e-77
FBgn0039827   261.916       -4.16252 0.2325888  -17.8965  1.25663e-71
                    padj
               <numeric>
FBgn0039155 4.06661e-161
FBgn0025111 6.38497e-112
FBgn0029167 3.69104e-110
FBgn0003360 1.98979e-105
FBgn0035085  2.14354e-74
FBgn0039827  1.74316e-68

Exploring the results

The first step after a differential expression analysis is to visualise the following plots:

the histogram of p-values
the MA plot
an ordination plot
a heatmap

Histogram of p-values of a differential expression analysis.”

ggplot(as(res, "data.frame"), aes(x = pvalue)) +
  geom_histogram(binwidth = 0.01, fill = "Royalblue", boundary = 0) +
  geom_hline(yintercept = 100, lty = 3) +
  theme_minimal()

Warning: Removed 2241 rows containing non-finite outside the scale range
(`stat_bin()`).

Version	Author	Date
f4da2ce	Dave Tang	2024-08-05

The p-value distribution shows a uniform background (non-DE genes) with values between 0 and 1 and a peak of small p-values at the left (DE genes). The ratio of the level of the background to the height of the peak provides us with a rough indication of the false discovery rate (FDR) that would be associated with calling the genes in the leftmost bin differentially expressed.

In this case, the leftmost bin contains 993 genes and the background level (dotted line) is around 100, so the FDR associated with calling all genes in the leftmost bin would be around 10%.

table(res$pvalue < 0.01)


FALSE  TRUE 
11365   993

A background distribution that is not uniform, but shows a tilted shape with an increase towards the right, tends to indicate batch effects, that is, underlying systematic variation that makes replicates look more different than expected.

The {DESeq2} package can produce the MA plot.

plotMA(pasilla, ylim = c(-2, 2))

Version	Author	Date
f4da2ce	Dave Tang	2024-08-05

{DESeq2} also has a function plotPCA() that can produce PCA plots. PC1 is mostly aligned with the experimental covariate of interest (untreated/treated), while PC2 is roughly aligned with the sequencing protocol (single/paired).

# transform using regularised logarithm
pas_rlog <- rlogTransformation(pasilla)
plotPCA(pas_rlog, intgroup=c("condition", "type")) + coord_fixed()

using ntop=500 top features by variance

Coordinate system already present. Adding new coordinate system, which will
replace the existing one.

Version	Author	Date
f4da2ce	Dave Tang	2024-08-05

Heatmap using the regularised logarithm (rlog) transformed data of the 30 most variable genes.

library("pheatmap")
wanted <- order(rowMeans(assay(pas_rlog)), decreasing = TRUE)[1:30]

pheatmap(
  assay(pas_rlog)[wanted, ],
  scale = "row",
  annotation_col = as.data.frame(
    colData(pas_rlog)[, c("condition", "type")]
  )
)

Version	Author	Date
f4da2ce	Dave Tang	2024-08-05

Note the clustering of the sequencing protocol and this highlights that our differential expression analysis was probably too naive and that we should adjust for the “nuisance” factor.

Multifactorial designs and linear models

Let’s assume that, in addition to the siRNA knockdown of the pasilla gene, we also want to test the effect of a certain drug. We would then envisage an experiment in which the experimenter treats the cells either with negative control, with the siRNA against pasilla, with the drug, or with both. To analyse this experiment, we can introduce the notation:

$\begin{equation} y = \beta_0 + x_1 \beta_1 + x_2 \beta_2 + x_1 x_2 \beta_{12} \end{equation}$

The left-hand size, $y$ , is the experimental measurement of interest, which in our case is the suitably transformed expression level of a gene. Since in an RNA-seq experiment there are lots of genes, we’ll have as many copies of the above equation, one for each.

The coefficient $\beta_0$ is the base level of the measurement in the negative control; often it is called the intercept.

The design factors $x_1$ and $x_2$ are binary indicator variables: $x_1$ takes the value 1 if the siRNA was transfected and 0 if not, and similarly, $x_2$ indicates whether the drug was administered.

In the experiment where only where only the siRNA is used, $x_1$ = 1 and $x_2$ = 0, and the third and fourth terms of the equation vanish. The equation then simplifies to $y = \beta_0 + \beta_1$ . This means that $\beta_1$ represents the difference between treatment and control. If our measurements are on a logarithmic scale, then:

$y = \beta_0 + \beta_1 \\ \beta_1 = y - \beta_0 = log_2(expression_{treated}) - log2(expression_{untreated}) \\ = log_2\frac{expression_{treated}}{expression_{untreated}}$

is the logarithmic fold change due to treatment with the siRNA. In exactly the same way, $\beta_2$ is the logarithmic fold change due to treatment with the drug.

If we treat the cells with both siRNA and drug, $x_1$ = $x_2$ = 1:

$y = \beta_0 + x_1 \beta_1 + x_2 \beta_2 + x_1 x_2 \beta_{12} \\ y = \beta_0 + \beta_1 + \beta_2 + \beta_{12} \\ \beta_{12} = y - (\beta_0 + \beta_1 + \beta_2)$

This means that $\beta_{12}$ is the difference between the observed outcome, $y$ , and the outcome expected from the individual treatments, obtained by adding to the baseline the effect of siRNA alone, $\beta_1$ , and of drug alone, $\beta_2$ .

We call $\beta_{12}$ the interaction effect of siRNA and drug. It has nothing to do with a physical interaction: the terminology indicates that the effects of these two different experimental factors do not simply add up, but combine in a more complicated fashion. For instance, if the target of the drug and of the siRNA were equivalent, leading to the same effect on the cells, then we would biologically expect that $\beta_1$ = $\beta_2$ . We would also expect that their combination has no further effect, so that $\beta_{12}$ = $-\beta_1$ .

If, on the other hand, the targets of the drug and of the siRNA are in parallel pathways that can buffer each other, we would expect that $\beta_1$ and $\beta_2$ are both relatively small, but the combined effect is synergistic, and $\beta_{12}$ is large.

We don’t always care about interactions. Many experiments are designed with multiple factors where we care most about each of their individual effects. In that case, the combinatorial treatment might not be present in the experimental design, and the model to use for the analysis is a version of the equation with the rightmost term removed.

We can succinctly encode the design of the experiment in the design matrix. The combinatorial experiment would be:

$x_0$	$x_1$	$x_2$
1	0	0
1	1	0
1	0	1
1	1	1

The columns of the design matrix correspond to the experimental factors, and its rows represent the different experimental conditions, four in this case. If the combinatorial treatment is not performed, then the design matrix is reduced to only the first three rows.

Noise and replicates

The equation:

$\begin{equation} y = \beta_0 + x_1 \beta_1 + x_2 \beta_2 + x_1 x_2 \beta_{12} \end{equation}$

provides a conceptual decomposition of the observed data into the effects caused by the different experimental variables. If our data (the $y$ s) were absolutely precise, we could set up a linear system of equations, one equation for each of the four possible experimental conditions represented by the $x$ s, and solve for the $\beta$ s.

However real data is usually affected by noise and we need replicates to estimate the levels of noise and assess the uncertainty of our estimated $\beta$ s. Only then can we empirically assess whether any of the observed changes between conditions are significantly larger than those occurring due just to experimental or natural variation. We extend the equation to:

$y_j = x_{j0} \beta_0 + x_{j1} \beta_1 + x_{j2} \beta_2 + x_{j1} x_{j2} \beta_{12} + \epsilon_j$

We have included the index $j$ and a new term $\epsilon_j$ . The index $j$ now explicitly counts over our individual replicate experiments; for example, if for each of the four conditions we perform three replicates, then $j$ counts from 1 to 12. The design matrix now has 12 rows, and $x_{jk}$ is the value of the matrix in its $j$ th row and $k$ th column.

The additional terms $\epsilon_j$ , which we call the residuals, are there to absorb differences between replicates. However, one additional modeling component is needed: the system of 12 equations would be underdetermined without further information, since it now has more variables (12 $\epsilon$ s and four $\beta$ s) than it has equations (12, one for each $j$ ). To address this, we require that the $\epsilon_j$ be small. One popular way to overcome this is to minimise the sum of squared residuals,

$\sum_j \epsilon^2_j \rightarrow min.$

It turns out that with this requirement satisfied, the $\beta$ s represent the average effects of each of the experimental factors, while the residuals $\epsilon_j$ reflect the experimental fluntuations around the mean between the replicates. This approach, which is called least sum of squares fitting, is mathematically convenient, since it can be achieved by straight-forward matrix algebra and it is what the lm() function does.

sessionInfo()

R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] pheatmap_1.0.12             hexbin_1.28.3              
 [3] pasilla_1.32.0              DEXSeq_1.50.0              
 [5] RColorBrewer_1.1-3          AnnotationDbi_1.66.0       
 [7] BiocParallel_1.38.0         DESeq2_1.44.0              
 [9] SummarizedExperiment_1.34.0 Biobase_2.64.0             
[11] MatrixGenerics_1.16.0       matrixStats_1.3.0          
[13] GenomicRanges_1.56.1        GenomeInfoDb_1.40.1        
[15] IRanges_2.38.1              S4Vectors_0.42.1           
[17] BiocGenerics_0.50.0         lubridate_1.9.3            
[19] forcats_1.0.0               stringr_1.5.1              
[21] dplyr_1.1.4                 purrr_1.0.2                
[23] readr_2.1.5                 tidyr_1.3.1                
[25] tibble_3.2.1                ggplot2_3.5.1              
[27] tidyverse_2.0.0             workflowr_1.7.1            

loaded via a namespace (and not attached):
  [1] rstudioapi_0.16.0       jsonlite_1.8.8          magrittr_2.0.3         
  [4] farver_2.1.2            rmarkdown_2.27          fs_1.6.4               
  [7] zlibbioc_1.50.0         vctrs_0.6.5             memoise_2.0.1          
 [10] Rsamtools_2.20.0        htmltools_0.5.8.1       S4Arrays_1.4.1         
 [13] progress_1.2.3          curl_5.2.1              SparseArray_1.4.8      
 [16] sass_0.4.9              bslib_0.7.0             httr2_1.0.1            
 [19] cachem_1.1.0            whisker_0.4.1           lifecycle_1.0.4        
 [22] pkgconfig_2.0.3         Matrix_1.7-0            R6_2.5.1               
 [25] fastmap_1.2.0           GenomeInfoDbData_1.2.12 digest_0.6.35          
 [28] colorspace_2.1-0        ps_1.7.6                rprojroot_2.0.4        
 [31] geneplotter_1.82.0      RSQLite_2.3.7           hwriter_1.3.2.1        
 [34] filelock_1.0.3          labeling_0.4.3          fansi_1.0.6            
 [37] timechange_0.3.0        httr_1.4.7              abind_1.4-5            
 [40] compiler_4.4.0          bit64_4.0.5             withr_3.0.0            
 [43] DBI_1.2.3               highr_0.11              biomaRt_2.60.1         
 [46] rappdirs_0.3.3          DelayedArray_0.30.1     tools_4.4.0            
 [49] httpuv_1.6.15           glue_1.7.0              callr_3.7.6            
 [52] promises_1.3.0          grid_4.4.0              getPass_0.2-4          
 [55] generics_0.1.3          gtable_0.3.5            tzdb_0.4.0             
 [58] hms_1.1.3               xml2_1.3.6              utf8_1.2.4             
 [61] XVector_0.44.0          pillar_1.9.0            vroom_1.6.5            
 [64] genefilter_1.86.0       later_1.3.2             splines_4.4.0          
 [67] BiocFileCache_2.12.0    lattice_0.22-6          survival_3.5-8         
 [70] bit_4.0.5               annotate_1.82.0         tidyselect_1.2.1       
 [73] locfit_1.5-9.9          Biostrings_2.72.1       knitr_1.47             
 [76] git2r_0.33.0            xfun_0.44               statmod_1.5.0          
 [79] stringi_1.8.4           UCSC.utils_1.0.0        yaml_2.3.8             
 [82] evaluate_0.24.0         codetools_0.2-20        cli_3.6.2              
 [85] xtable_1.8-4            munsell_0.5.1           processx_3.8.4         
 [88] jquerylib_0.1.4         Rcpp_1.0.12             dbplyr_2.5.0           
 [91] png_0.1-8               XML_3.99-0.16.1         parallel_4.4.0         
 [94] blob_1.2.4              prettyunits_1.2.0       bitops_1.0-7           
 [97] scales_1.3.0            crayon_1.5.2            rlang_1.1.4            
[100] KEGGREST_1.44.1