Differential expression analysis for studies with a 2x2 factorial design

Last updated: 2018-05-23

workflowr checks: (Click a bullet for more information)

✖ R Markdown file: uncommitted changes
The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed: set.seed(12345)

The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: cc6353a
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
```
Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    figure/.DS_Store
    Ignored:    figure/ch02/.DS_Store

Untracked files:
    Untracked:  _workflowr.yml

Unstaged changes:
    Modified:   analysis/_site.yml
    Modified:   analysis/about.Rmd
    Modified:   analysis/ch01.Rmd
    Modified:   analysis/ch02.Rmd
    Modified:   analysis/ch03.Rmd
    Modified:   analysis/ch04.Rmd
    Deleted:    analysis/chunks.R
    Modified:   analysis/index.Rmd
    Modified:   analysis/license.Rmd
```
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Expand here to see past versions:

File	Version	Author	Date	Message
html	67f13a4	John Blischak	2018-03-09	Build site.
Rmd	a2fc9a7	John Blischak	2018-03-09	Minor edits to ch04 solutions
html	b2dc653	John Blischak	2018-01-25	Build site.
Rmd	dc82382	John Blischak	2018-01-25	Minor updates to Ch 04 analysis.
html	7f4d46b	John Blischak	2018-01-11	Build site.
Rmd	5db0175	John Blischak	2018-01-11	Organize Ch 04

Analayze microarray data from Zhang et al., 2012, which measured gene expression in hearts from wild type and Top2b null mice treated with doxorubicin or a control.

library("Biobase")
library("ggplot2")
library("limma")
eset <- readRDS("../data/ch04.rds")

Studies with a 2x2 factorial design (Video)

Describe the scientific question, the experimental design, and the data collected for the 2x2 factorial study.

Expression level of Top2b

Create a boxplot of the expression level of Top2b to confirm the null mice have lower levels of Top2b expression. Use pData and table to access and explore the phenotype measurements to be used in the plot.

head(fData(eset))

           probe   chr     start       end symbol            ensembl
1  A_55_P2051983 chr18  62753954  62753895 Spink7 ENSMUST00000076194
2   A_52_P169082 chr11  76031955  76032014  Dbil5                   
3 A_30_P01028193 chr14  47982421  47982480                          
4   A_52_P237997  chr3 156834624 156834683  Negr1                   
5   A_51_P414243  chr9 121890800 121890741 C85492 ENSMUST00000084743
6  A_55_P2136348  chr9 120003293 120003352   Ccr8 ENSMUST00000048777
        refseq entrez                                                name
1 NM_001001803 408198 serine peptidase inhibitor, Kazal type 7 (putative)
2    NM_021294  13168                   diazepam binding inhibitor-like 5
3                  NA                                                    
4    NM_177274 320840                         neuronal growth regulator 1
5    NM_153540 215494                           expressed sequence C85492
6    NM_007720  12776                    chemokine (C-C motif) receptor 8

top2b <- exprs(eset)[fData(eset)$symbol == "Top2b", ]
head(pData(eset))

                                                  title genotype treatment
WT.PBS.r1         WT Cardiomyocytes 16hr after PBS rep1       WT       PBS
WT.PBS.r2         WT Cardiomyocytes 16hr after PBS rep2       WT       PBS
WT.PBS.r3         WT Cardiomyocytes 16hr after PBS rep3       WT       PBS
WT.Dox.r1 WT Cardiomyocytes 16hr after doxorubicin rep1       WT       Dox
WT.Dox.r2 WT Cardiomyocytes 16hr after doxorubicin rep2       WT       Dox
WT.Dox.r3 WT Cardiomyocytes 16hr after doxorubicin rep3       WT       Dox
          rep
WT.PBS.r1  r1
WT.PBS.r2  r2
WT.PBS.r3  r3
WT.Dox.r1  r1
WT.Dox.r2  r2
WT.Dox.r3  r3

d <- data.frame(pData(eset), top2b)
ggplot(d, aes(x = treatment, y = top2b, fill = genotype)) +
  geom_boxplot()

Expand here to see past versions of unnamed-chunk-1-1.png:

Version	Author	Date
67f13a4	John Blischak	2018-03-09
b2dc653	John Blischak	2018-01-25
7f4d46b	John Blischak	2018-01-11

Visualize gene expression distribution with boxplots

Use boxplots as an alternative to density plots for checking the distribution of gene expression levels. Note pro (easier to see which sample belongs to each distribution) and con (hides any multimodality).

boxplot(exprs(eset))

Expand here to see past versions of unnamed-chunk-2-1.png:

Version	Author	Date
67f13a4	John Blischak	2018-03-09
b2dc653	John Blischak	2018-01-25
7f4d46b	John Blischak	2018-01-11

boxplot(log(exprs(eset)))

Expand here to see past versions of unnamed-chunk-2-2.png:

Version	Author	Date
67f13a4	John Blischak	2018-03-09

boxplot(log(exprs(eset)), las = 2)

Expand here to see past versions of unnamed-chunk-2-3.png:

Version	Author	Date
67f13a4	John Blischak	2018-03-09

Perform PCA

Use prcomp to calculate PCs and plot PC2 vs. PC1 to confirm that the samples are separated by genotype and treatment. Note that Top2b mice all cluster with untreated WT samples since they are resistant to the treatment.

pca <- prcomp(t(exprs(eset)), scale. = TRUE)
plot(pca)

Expand here to see past versions of unnamed-chunk-3-1.png:

Version	Author	Date
67f13a4	John Blischak	2018-03-09
b2dc653	John Blischak	2018-01-25
7f4d46b	John Blischak	2018-01-11

d <- data.frame(pData(eset), pca$x)
ggplot(d, aes(x = PC1, y = PC2, color = genotype, shape = treatment)) +
  geom_point()

Expand here to see past versions of unnamed-chunk-3-2.png:

Version	Author	Date
67f13a4	John Blischak	2018-03-09
b2dc653	John Blischak	2018-01-25
7f4d46b	John Blischak	2018-01-11

ggplot(d, aes(x = treatment, y = PC1, fill = genotype)) +
  geom_boxplot()

Expand here to see past versions of unnamed-chunk-3-3.png:

Version	Author	Date
67f13a4	John Blischak	2018-03-09
b2dc653	John Blischak	2018-01-25
7f4d46b	John Blischak	2018-01-11

ggplot(d, aes(x = treatment, y = PC2, fill = genotype)) +
  geom_boxplot()

Expand here to see past versions of unnamed-chunk-3-4.png:

Version	Author	Date
67f13a4	John Blischak	2018-03-09
b2dc653	John Blischak	2018-01-25
7f4d46b	John Blischak	2018-01-11

Differential expression for 2x2 factorial designs (Video)

Discuss how to construct a design and contrasts matrix for a 2x2 factorial design. Note how to create a combined factor and how to write the contrast for the interaction term.

Create design matrix for 2x2 factorial design

Create a combined factor using paste. Use model.matrix to create a linear model with 4 binary variables (group-means parametrization).

f <- paste(pData(eset)$genotype, pData(eset)$treatment, sep = ".")
design <- model.matrix(~0 + f)
colnames(design) <- sub("f", "", colnames(design))
colSums(design)

Tob2b.Dox Tob2b.PBS    WT.Dox    WT.PBS 
        3         3         3         3

Create contrasts matrix for 2x2 factorial design

Use limma::makeContrasts to test the main effects and the interaction between genotype and treatment.

cont_mat <- makeContrasts(DoxWT = WT.Dox - WT.PBS,
                          DoxTop = Tob2b.Dox - Tob2b.PBS,
                          Interaction = (Tob2b.Dox - Tob2b.PBS) -
                                        (WT.Dox - WT.PBS),
                          levels = design)
cont_mat

           Contrasts
Levels      DoxWT DoxTop Interaction
  Tob2b.Dox     0      1           1
  Tob2b.PBS     0     -1          -1
  WT.Dox        1      0          -1
  WT.PBS       -1      0           1

Fit and test 2x2 factorial design

Use limma::lmFit, limma::contrasts.fit, limma::eBayes, and limma::decideTests to fit and test the model. Use limma::vennDiagram to visualize overlap of differentially expressed genes.

fit <- lmFit(eset, design)
head(fit$coefficients)

    Tob2b.Dox   Tob2b.PBS      WT.Dox      WT.PBS
1   0.2975107   0.2945810   0.3539563   0.3620450
2   4.2250150   4.6055087   3.6762460   3.1691957
3   0.3001070   0.3196167   0.6157077   0.3739970
4   0.3012637   0.2955227   0.3532647   0.3616193
5 264.7589623 245.4630817 294.2404533 274.2917363
6   0.3032463   0.2958423   0.3525713   0.3611247

fit2 <- contrasts.fit(fit, contrasts = cont_mat)
head(fit2$coefficients)

   Contrasts
           DoxWT       DoxTop Interaction
  1 -0.008088667  0.002929667  0.01101833
  2  0.507050333 -0.380493667 -0.88754400
  3  0.241710667 -0.019509667 -0.26122033
  4 -0.008354667  0.005741000  0.01409567
  5 19.948717000 19.295880667 -0.65283633
  6 -0.008553333  0.007404000  0.01595733

fit2 <- eBayes(fit2)
results <- decideTests(fit2)
summary(results)

   DoxWT DoxTop Interaction
-1  2311      0         320
0  55050  59304       58071
1   1944      1         914

vennDiagram(results)

Expand here to see past versions of unnamed-chunk-6-1.png:

Version	Author	Date
67f13a4	John Blischak	2018-03-09
b2dc653	John Blischak	2018-01-25
7f4d46b	John Blischak	2018-01-11

Enrichment analysis (Video)

Describe common approaches for testing for enrichment of gene sets. Discuss Gene Onotlogy categories and KEGG pathways. Introduce Fisher’s Exact Test. Stress the importance of choosing the correct background set of genes and caution against over-interpretation (especially for directional effect). Mention at the end that there are more sophisticated algorithms such as limma::camera, limma::roast, and goseq::goseq.

Enrichment of KEGG pathways

Use limma::kegga and limma::topKEGG to test for enrichment of differentially expressed genes in KEGG pathways.

entrez <- fit2$genes$entrez
enrich_kegg_wt <- kegga(fit2, coef = "DoxWT",
                        geneid = entrez, species = "Mm")
topKEGG(enrich_kegg_wt)

                                                             Pathway   N
path:mmu05322                           Systemic lupus erythematosus 110
path:mmu05034                                             Alcoholism 165
path:mmu03008                      Ribosome biogenesis in eukaryotes  77
path:mmu05203                                   Viral carcinogenesis 212
path:mmu05412 Arrhythmogenic right ventricular cardiomyopathy (ARVC)  72
path:mmu04137                                     Mitophagy - animal  63
path:mmu04261                 Adrenergic signaling in cardiomyocytes 145
path:mmu04022                             cGMP-PKG signaling pathway 166
path:mmu05414                           Dilated cardiomyopathy (DCM)  88
path:mmu04217                                            Necroptosis 158
path:mmu04971                                 Gastric acid secretion  74
path:mmu04144                                            Endocytosis 262
path:mmu05410                      Hypertrophic cardiomyopathy (HCM)  83
path:mmu05202                Transcriptional misregulation in cancer 174
path:mmu04012                                 ErbB signaling pathway  84
path:mmu03040                                            Spliceosome 129
path:mmu03013                                          RNA transport 166
path:mmu04066                                HIF-1 signaling pathway 103
path:mmu04115                                  p53 signaling pathway  67
path:mmu04919                      Thyroid hormone signaling pathway 115
              Up Down         P.Up       P.Down
path:mmu05322 25    1 1.275512e-10 9.997136e-01
path:mmu05034 29   12 2.754611e-09 5.089928e-01
path:mmu03008 19    2 5.184560e-09 9.767149e-01
path:mmu05203 33   11 5.356776e-09 8.957936e-01
path:mmu05412  0   20 1.000000e+00 7.142217e-08
path:mmu04137 15    7 3.628333e-07 1.572432e-01
path:mmu04261  5   27 8.642262e-01 2.843226e-06
path:mmu04022  4   29 9.717775e-01 4.646843e-06
path:mmu05414  0   19 1.000000e+00 9.524657e-06
path:mmu04217 21   14 4.219986e-05 2.306917e-01
path:mmu04971  2   16 8.948275e-01 4.706902e-05
path:mmu04144 29   20 5.284433e-05 4.005582e-01
path:mmu05410  0   17 1.000000e+00 5.716438e-05
path:mmu05202 22   10 6.030451e-05 7.996927e-01
path:mmu04012  5   17 4.206570e-01 6.704920e-05
path:mmu03040 18   11 7.835780e-05 3.079244e-01
path:mmu03013 21    8 8.782808e-05 9.111483e-01
path:mmu04066  8   19 1.491963e-01 9.619306e-05
path:mmu04115 12    2 1.127063e-04 9.565283e-01
path:mmu04919  7   20 3.627170e-01 1.484823e-04

enrich_kegg_inter <- kegga(fit2, coef = "Interaction",
                        geneid = entrez, species = "Mm")
topKEGG(enrich_kegg_inter)

                                                             Pathway   N
path:mmu03008                      Ribosome biogenesis in eukaryotes  77
path:mmu05322                           Systemic lupus erythematosus 110
path:mmu05034                                             Alcoholism 165
path:mmu05203                                   Viral carcinogenesis 212
path:mmu04919                      Thyroid hormone signaling pathway 115
path:mmu00230                                      Purine metabolism 175
path:mmu05414                           Dilated cardiomyopathy (DCM)  88
path:mmu05412 Arrhythmogenic right ventricular cardiomyopathy (ARVC)  72
path:mmu05212                                      Pancreatic cancer  75
path:mmu05220                               Chronic myeloid leukemia  76
path:mmu05205                                Proteoglycans in cancer 199
path:mmu04930                              Type II diabetes mellitus  48
path:mmu04211                           Longevity regulating pathway  90
path:mmu04261                 Adrenergic signaling in cardiomyocytes 145
path:mmu05222                                 Small cell lung cancer  92
path:mmu04210                                              Apoptosis 134
path:mmu05213                                     Endometrial cancer  58
path:mmu04933   AGE-RAGE signaling pathway in diabetic complications 100
path:mmu05217                                   Basal cell carcinoma  63
path:mmu01522                                   Endocrine resistance  93
              Up Down         P.Up       P.Down
path:mmu03008  0   10 1.0000000000 1.007965e-08
path:mmu05322  0   10 1.0000000000 3.200229e-07
path:mmu05034  5   11 0.5239903502 1.823095e-06
path:mmu05203  3   12 0.9487940322 3.363101e-06
path:mmu04919 11    2 0.0004774097 3.663623e-01
path:mmu00230 14    3 0.0005362122 3.077808e-01
path:mmu05414  9    0 0.0009610474 1.000000e+00
path:mmu05412  8    0 0.0010648869 1.000000e+00
path:mmu05212  3    5 0.3710324131 1.411325e-03
path:mmu05220  4    5 0.1779928540 1.497676e-03
path:mmu05205 14    5 0.0018741551 7.000390e-02
path:mmu04930  6    1 0.0024581255 4.159308e-01
path:mmu04211  3    5 0.4867180093 3.156205e-03
path:mmu04261 11    1 0.0031788883 8.049817e-01
path:mmu05222  1    5 0.9342324019 3.471129e-03
path:mmu04210  6    6 0.1928783824 3.643308e-03
path:mmu05213  3    4 0.2361147774 3.816237e-03
path:mmu04933  6    5 0.0698159306 4.959486e-03
path:mmu05217  4    4 0.1092379866 5.137883e-03
path:mmu01522  8    3 0.0053948301 8.431867e-02

Enrichment of Gene Ontology categories

Use limma::goana and limma::topGO to test for enrichment of differentially expressed genes in Gene Ontology categories.

enrich_go_wt <- goana(fit2, coef = "DoxWT",
                      geneid = entrez, species = "Mm")
topGO(enrich_go_wt, ontology = "BP")

                                                       Term Ont    N  Up
GO:0090304                   nucleic acid metabolic process  BP 4256 411
GO:0016070                            RNA metabolic process  BP 3810 373
GO:0044260         cellular macromolecule metabolic process  BP 7072 577
GO:0006139 nucleobase-containing compound metabolic process  BP 4786 426
GO:0034641     cellular nitrogen compound metabolic process  BP 5404 460
GO:0046483                    heterocycle metabolic process  BP 4882 427
GO:0043170                  macromolecule metabolic process  BP 7739 598
GO:0006725     cellular aromatic compound metabolic process  BP 4945 429
GO:0010467                                  gene expression  BP 4539 403
GO:0006807              nitrogen compound metabolic process  BP 5725 474
GO:1901360        organic cyclic compound metabolic process  BP 5109 432
GO:0006396                                   RNA processing  BP  742 113
GO:0044237                       cellular metabolic process  BP 8723 633
GO:0022613             ribonucleoprotein complex biogenesis  BP  366  71
GO:0044238                        primary metabolic process  BP 8743 626
GO:0034645      cellular macromolecule biosynthetic process  BP 4075 344
GO:0071704              organic substance metabolic process  BP 9212 645
GO:0051252              regulation of RNA metabolic process  BP 3121 280
GO:0032774                         RNA biosynthetic process  BP 3092 278
GO:0097659             nucleic acid-templated transcription  BP 3078 277
           Down         P.Up       P.Down
GO:0090304  331 1.504047e-37 8.626750e-03
GO:0016070  306 1.468665e-34 1.970377e-03
GO:0044260  589 2.147486e-34 9.800194e-09
GO:0006139  393 1.214879e-30 5.157099e-05
GO:0034641  453 4.594157e-29 8.906329e-07
GO:0046483  406 5.642472e-29 1.051022e-05
GO:0043170  630 7.077190e-29 8.377610e-08
GO:0006725  409 2.423666e-28 1.647989e-05
GO:0010467  378 2.692286e-28 2.196723e-05
GO:0006807  472 2.722933e-27 3.295379e-06
GO:1901360  420 4.358967e-26 2.178695e-05
GO:0006396   65 7.616058e-24 3.022888e-02
GO:0044237  731 3.723539e-23 1.744190e-12
GO:0022613   26 3.096042e-21 4.775639e-01
GO:0044238  725 4.779276e-21 2.905773e-11
GO:0034645  334 1.552262e-19 2.752467e-04
GO:0071704  765 2.262336e-19 1.992779e-12
GO:0051252  254 4.844028e-19 2.593424e-03
GO:0032774  244 5.051745e-19 1.343578e-02
GO:0097659  242 5.273962e-19 1.615435e-02

enrich_go_inter <- goana(fit2, coef = "Interaction",
                         geneid = entrez, species = "Mm")
topGO(enrich_go_inter, ontology = "BP")

                                                    Term Ont    N  Up Down
GO:0034470                              ncRNA processing  BP  300  10   20
GO:0042254                           ribosome biogenesis  BP  249   4   18
GO:0006334                           nucleosome assembly  BP   90   1   12
GO:0006396                                RNA processing  BP  742  28   30
GO:0034728                       nucleosome organization  BP  114   2   13
GO:0034660                       ncRNA metabolic process  BP  402  12   22
GO:0071103                       DNA conformation change  BP  174   3   15
GO:0031497                            chromatin assembly  BP  106   1   12
GO:0022613          ribonucleoprotein complex biogenesis  BP  366  11   20
GO:0016043               cellular component organization  BP 5401 230   67
GO:0006364                               rRNA processing  BP  169   3   14
GO:0060047                             heart contraction  BP  187  24    2
GO:0003015                                 heart process  BP  190  24    2
GO:0016072                        rRNA metabolic process  BP  174   3   14
GO:0090304                nucleic acid metabolic process  BP 4256 143   81
GO:0061337                            cardiac conduction  BP   45  12    0
GO:0086003               cardiac muscle cell contraction  BP   46  12    0
GO:0006333             chromatin assembly or disassembly  BP  127   2   12
GO:0071840 cellular component organization or biogenesis  BP 5552 232   78
GO:0030490                        maturation of SSU-rRNA  BP   47   0    8
                   P.Up       P.Down
GO:0034470 4.276026e-01 4.956310e-11
GO:0042254 9.468523e-01 1.280676e-10
GO:0006334 9.380859e-01 1.522812e-10
GO:0006396 1.404747e-01 1.829287e-10
GO:0034728 8.649106e-01 1.912084e-10
GO:0034660 5.655911e-01 2.325658e-10
GO:0071103 9.018931e-01 4.099080e-10
GO:0031497 9.622922e-01 1.058346e-09
GO:0022613 5.571261e-01 1.637937e-09
GO:0016043 2.509447e-09 5.204254e-02
GO:0006364 8.908148e-01 2.688905e-09
GO:0060047 2.742485e-09 5.797084e-01
GO:0003015 3.784836e-09 5.883840e-01
GO:0016072 9.018931e-01 3.928343e-09
GO:0090304 9.322263e-02 4.457810e-09
GO:0061337 6.378454e-09 1.000000e+00
GO:0086003 8.393293e-09 1.000000e+00
GO:0006333 9.016483e-01 8.547694e-09
GO:0071840 1.190080e-08 1.306441e-03
GO:0030490 1.000000e+00 2.600262e-08

Session information

sessionInfo()

R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] limma_3.30.13       ggplot2_2.2.1       Biobase_2.34.0     
[4] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14         pillar_1.2.2         git2r_0.21.0        
 [4] plyr_1.8.4           workflowr_1.0.1      R.methodsS3_1.7.1   
 [7] R.utils_2.6.0        tools_3.3.3          digest_0.6.13       
[10] bit_1.1-12           memoise_1.1.0        evaluate_0.10.1     
[13] RSQLite_2.0          tibble_1.4.2         gtable_0.2.0        
[16] pkgconfig_2.0.1      rlang_0.2.0          DBI_0.8             
[19] yaml_2.1.16          org.Mm.eg.db_3.4.0   stringr_1.3.0       
[22] knitr_1.20           IRanges_2.8.2        S4Vectors_0.12.2    
[25] stats4_3.3.3         rprojroot_1.3-2      bit64_0.9-7         
[28] grid_3.3.3           AnnotationDbi_1.36.2 rmarkdown_1.9.12    
[31] GO.db_3.4.0          blob_1.1.1           magrittr_1.5        
[34] whisker_0.3-2        backports_1.1.2      scales_0.5.0        
[37] htmltools_0.3.6      colorspace_1.3-2     labeling_0.3        
[40] stringi_1.1.7        lazyeval_0.2.1       munsell_0.4.3       
[43] R.oo_1.22.0

This reproducible R Markdown analysis was created with workflowr 1.0.1