Last updated: 2024-08-16
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 71fae59. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: r_packages_4.3.3/
Ignored: r_packages_4.4.0/
Untracked files:
Untracked: data/msigdb.v7.5.1.hs.EZID.rds
Untracked: data/msigdb.v7.5.1.hs.SYM.rds
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/msigdb.Rmd) and HTML (docs/msigdb.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 71fae59 | Dave Tang | 2024-08-16 | Export Entrez IDs |
html | 7f4de36 | Dave Tang | 2024-08-15 | Build site. |
Rmd | e8b87fb | Dave Tang | 2024-08-15 | Update notebook |
html | 4241975 | Dave Tang | 2023-12-13 | Build site. |
Rmd | 5b0e194 | Dave Tang | 2023-12-13 | Downloading molecular signatures in R |
This notebook follows the {msigdb} package vignette.
The Molecular Signatures Database (MSigDB) is one of the largest collections of molecular signatures, or gene expression signatures. It hosts a variety of gene expression signatures, including experimentally derived signatures and signatures representing pathways and ontologies from other curated databases. This rich collection of gene expression signatures (>25,000) can facilitate a wide variety of signature-based analyses, the most popular being gene set enrichment analyses. These signatures can be used to perform enrichment analysis in a differential expression (DE) experiment using tools such as {GSEA}, {fry} (from {limma}) and {camera} (from {limma}). Alternatively, they can be used to perform single-sample gene-set analysis of individual transcriptomic profiles using approaches such as {singscore}, {ssGSEA} and {GSVA}.
This package provides the gene sets in the MSigDB in the form of GeneSet objects. This data structure is specifically designed to store information about gene sets, including their member genes and metadata. Other packages, such as {msigdbr} and {EGSEAdata}, provide these gene sets too; however, they store them as lists or tibbles. These structures are not specific to gene sets and therefore do not allow storage of important metadata associated with each gene set, for example, their short and long descriptions. Additionally, the lack of structure allows creation of invalid gene sets. Accessory functions implemented in the {GSEABase} package provide a neat interface to interact with GeneSet objects.
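As a minimal sketch of the difference described above, the following builds a toy GeneSet with {GSEABase} and reads back its metadata. The set name and both descriptions are made up for illustration; only the gene symbols are borrowed from a set shown later on this page.

```r
# Toy comparison of a plain list and a GSEABase GeneSet.
# "TOY_SET" and its descriptions are hypothetical, for illustration only.
suppressPackageStartupMessages(library(GSEABase))

genes <- c("BEX4", "NAT8", "RBP4")

# A plain named list holds only the member genes.
as_list <- list(TOY_SET = genes)

# A GeneSet keeps the genes together with typed metadata.
gs_toy <- GeneSet(
  genes,
  setName = "TOY_SET",
  geneIdType = SymbolIdentifier(),
  shortDescription = "Toy gene set",
  longDescription = "A three-gene set used only to illustrate GeneSet metadata."
)

setName(gs_toy)          # name of the set
description(gs_toy)      # short description
longDescription(gs_toy)  # long description
geneIds(gs_toy)          # member genes
```

The list version has nowhere to put the descriptions short of inventing a convention, whereas the GeneSet accessors return them directly.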
Install {msigdb}. (Dependencies are listed in the Imports section in the DESCRIPTION file.)
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!require("msigdb", quietly = TRUE))
  BiocManager::install("msigdb")
Load package.
library(msigdb)
packageVersion("msigdb")
[1] '1.12.0'
In order to download the MSigDB database, we need to load {ExperimentHub} and {GSEABase}.
suppressPackageStartupMessages(library(ExperimentHub))
suppressPackageStartupMessages(library(GSEABase))
Query an ExperimentHub object.
eh <- ExperimentHub(ask = FALSE)
AnnotationHub::query(x = eh, pattern = 'msigdb')
ExperimentHub with 49 records
# snapshotDate(): 2024-04-29
# $dataprovider: Broad Institute, Emory University, EBI
# $species: Homo sapiens, Mus musculus
# $rdataclass: GSEABase::GeneSetCollection, list, data.frame
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH5421"]]'
title
EH5421 | msigdb.v7.2.hs.SYM
EH5422 | msigdb.v7.2.hs.EZID
EH5423 | msigdb.v7.2.mm.SYM
EH5424 | msigdb.v7.2.mm.EZID
EH6727 | MSigDB C8 MANNO MIDBRAIN
... ...
EH8296 | msigdb.v7.5.1.hs.SYM
EH8297 | msigdb.v7.5.1.mm.EZID
EH8298 | msigdb.v7.5.1.mm.idf
EH8299 | msigdb.v7.5.1.mm.SYM
EH8300 | imex_hsmm_0722
Use a more specific pattern to find only the human gene-symbol collections.
AnnotationHub::query(x = eh, pattern = 'msigdb.*hs.SYM')
ExperimentHub with 7 records
# snapshotDate(): 2024-04-29
# $dataprovider: Broad Institute
# $species: Homo sapiens
# $rdataclass: GSEABase::GeneSetCollection
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH5421"]]'
title
EH5421 | msigdb.v7.2.hs.SYM
EH6772 | msigdb.v7.3.hs.SYM
EH6778 | msigdb.v7.4.hs.SYM
EH7359 | msigdb.v7.5.hs.SYM
EH8284 | msigdb.v2022.1.hs.SYM
EH8290 | msigdb.v2023.1.hs.SYM
EH8296 | msigdb.v7.5.1.hs.SYM
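The pattern above is treated as a regular expression, so each unescaped `.` matches any character. The base R sketch below applies the same pattern with `grepl()` to a handful of titles copied from the listings on this page. It only looks at titles, which is a simplification: the hub query may also search other metadata columns.

```r
# Titles copied from the ExperimentHub listings shown on this page.
titles <- c(
  "msigdb.v7.2.hs.SYM", "msigdb.v7.2.hs.EZID", "msigdb.v7.2.mm.SYM",
  "msigdb.v7.5.1.hs.SYM", "msigdb.v7.5.1.mm.idf", "imex_hsmm_0722"
)

# The pattern as used in the query: '.' matches any single character.
loose <- grepl("msigdb.*hs.SYM", titles)

# A stricter variant with literal dots and anchors (an illustrative
# alternative, not something the original analysis uses).
strict <- grepl("^msigdb\\..*\\.hs\\.SYM$", titles)

titles[loose]
identical(loose, strict)
```

For these titles both patterns select the same two records, so the looser pattern is adequate here; the stricter one is only a safeguard against unexpected titles.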
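The records are listed in order of ExperimentHub ID, not MSigDB version: v7.5.1 (EH8296) comes last even though v2022.1 and v2023.1 appear before it. Taking the last record therefore selects v7.5.1, the most recently added entry.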
AnnotationHub::query(x = eh, pattern = 'msigdb.*hs.SYM') |>
  tail(1) -> msigdb_hs_latest
names(msigdb_hs_latest)
[1] "EH8296"
msigdb_hs_latest
ExperimentHub with 1 record
# snapshotDate(): 2024-04-29
# names(): EH8296
# package(): msigdb
# $dataprovider: Broad Institute
# $species: Homo sapiens
# $rdataclass: GSEABase::GeneSetCollection
# $rdatadateadded: 2023-07-03
# $title: msigdb.v7.5.1.hs.SYM
# $description: Gene expression signatures (Homo sapiens) from the Molecular...
# $taxonomyid: 9606
# $genome: NA
# $sourcetype: XML
# $sourceurl: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5...
# $sourcesize: NA
# $tags: c("Homo_sapiens_Data", "Mus_musculus_Data")
# retrieve record with 'object[["EH8296"]]'
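In the listing of human symbol collections above, msigdb.v7.5.1 is listed after msigdb.v2023.1, so the last record is not necessarily the one with the highest version number. If the highest version is what you want, one option is to parse the versions out of the titles and compare them with `numeric_version()`. This is a sketch in base R using the seven titles shown above; whether the highest-numbered release is the right choice for a given analysis is a separate decision.

```r
# Titles of the human symbol collections, in the order listed above.
titles <- c(
  "msigdb.v7.2.hs.SYM", "msigdb.v7.3.hs.SYM", "msigdb.v7.4.hs.SYM",
  "msigdb.v7.5.hs.SYM", "msigdb.v2022.1.hs.SYM",
  "msigdb.v2023.1.hs.SYM", "msigdb.v7.5.1.hs.SYM"
)

# Extract the version string between "msigdb.v" and ".hs.SYM".
ver <- sub("^msigdb\\.v(.*)\\.hs\\.SYM$", "\\1", titles)

# numeric_version() compares component-wise, so 2023.1 > 7.5.1.
ver_parsed <- numeric_version(ver)

last_listed <- ver[length(ver)]                  # what tail(1) would pick
highest <- ver[ver_parsed == max(ver_parsed)]    # highest version number

last_listed
highest
```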
Data can be downloaded using the unique ID.
eh[[names(msigdb_hs_latest)]]
Data can also be downloaded using msigdb::getMsigdb().
msigdb_ver <- sub(pattern = "^msigdb\\.v(.*)\\.hs\\.SYM$", replacement = "\\1", msigdb_hs_latest$title)
msigdb_hs_sym <- msigdb::getMsigdb(org = "hs", id = "SYM", version = msigdb_ver)
see ?msigdb and browseVignettes('msigdb') for documentation
loading from cache
msigdb_hs_ezid <- msigdb::getMsigdb(org = "hs", id = "EZID", version = msigdb_ver)
see ?msigdb and browseVignettes('msigdb') for documentation
loading from cache
msigdb_hs_sym
GeneSetCollection
names: chr1p11, chr1p12, ..., GOMF_STARCH_BINDING (45226 total)
unique identifiers: RPL22P6, NBPF8, ..., POM121L15P (41072 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: BroadCollection (1 total)
msigdb_hs_ezid
GeneSetCollection
names: chr1p11, chr1p12, ..., GOMF_STARCH_BINDING (45226 total)
unique identifiers: 100132047, 728841, ..., 100101629 (40905 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
Save as RDS.
sym_outfile <- paste0('data/msigdb.v', msigdb_ver, '.hs.SYM.rds')
ezid_outfile <- paste0('data/msigdb.v', msigdb_ver, '.hs.EZID.rds')
saveRDS(object = msigdb_hs_sym, file = sym_outfile)
saveRDS(object = msigdb_hs_ezid, file = ezid_outfile)
File size in bytes.
file.size(sym_outfile)
[1] 24207930
file.size(ezid_outfile)
[1] 19925429
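Byte counts of this size are easier to read in larger units. A small sketch that converts the two sizes reported above to mebibytes (1 MiB = 1024^2 bytes); the numbers are copied from the output rather than recomputed from the files.

```r
# File sizes in bytes, copied from the output above.
sizes_bytes <- c(SYM = 24207930, EZID = 19925429)

# Convert to mebibytes and round to two decimal places.
sizes_mib <- round(sizes_bytes / 1024^2, 2)
sizes_mib
```

Both files are roughly 20 MiB on disk, which may be worth keeping in mind before committing them to a Git repository.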
Load RDS and compare.
test <- readRDS(sym_outfile)
identical(test, msigdb_hs_sym)
[1] TRUE
A GeneSetCollection object is effectively a list and therefore all list processing functions work.
str(msigdb_hs_sym, max.level = 2)
Formal class 'GeneSetCollection' [package "GSEABase"] with 1 slot
..@ .Data:List of 45226
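To illustrate list-style processing without downloading the full database, the sketch below builds a toy GeneSetCollection with two made-up sets (the names TOY_A and TOY_B are hypothetical) and applies ordinary list functions to it.

```r
# Toy GeneSetCollection; set names are hypothetical, for illustration only.
suppressPackageStartupMessages(library(GSEABase))

gsc_toy <- GeneSetCollection(list(
  GeneSet(c("JUNB", "CXCL2", "ATF3"), setName = "TOY_A",
          geneIdType = SymbolIdentifier()),
  GeneSet(c("BEX4", "NAT8"), setName = "TOY_B",
          geneIdType = SymbolIdentifier())
))

length(gsc_toy)   # number of gene sets
names(gsc_toy)    # set names

# lapply() visits each GeneSet; lengths() then counts genes per set.
set_sizes <- lengths(lapply(gsc_toy, geneIds))
set_sizes

# Logical and positional indexing work as for a list.
gsc_toy[set_sizes >= 3]
gsc_toy[[2]]
```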
Each signature is stored in a GeneSet object and can be processed using functions from the {GSEABase} package.
gs <- msigdb_hs_sym[[1984]]
gs
setName: BAUS_TFF2_TARGETS_DN
geneIds: BEX4, NAT8, ..., THRSP (total: 12)
geneIdType: Symbol
collectionType: Broad
bcCategory: c2 (Curated)
bcSubCategory: CGP
details: use 'details(object)'
Get gene IDs.
geneIds(gs)
[1] "BEX4" "NAT8" "RBP4" "ART3" "DDX6" "PRDX2" "HEBP1" "CTSC" "PCP4"
[10] "OR2H2" "BAG2" "THRSP"
Details of a gene set.
details(gs)
setName: BAUS_TFF2_TARGETS_DN
geneIds: BEX4, NAT8, ..., THRSP (total: 12)
geneIdType: Symbol
collectionType: Broad
bcCategory: c2 (Curated)
bcSubCategory: CGP
setIdentifier: LVY1HGGWMJ7:35020:Fri May 26 12:20:46 2023:95704
description: Genes down-regulated in pyloric atrium with knockout of TFF2 [GeneID=7032].
(longDescription available)
organism: Mus musculus
pubMedIds: 16121031
urls: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5.1/msigdb_v7.5.1.xml
contributor: Arthur Liberzon
setVersion: 7.5.1
creationDate:
Tabulate the gene sets by MSigDB category.
table(sapply(lapply(msigdb_hs_sym, collectionType), bcCategory))
c1 c2 c3 c4 c5 c6 c7 c8 h
299 6180 3726 858 28005 189 5219 700 50
Create a logical vector to subset the hallmark gene sets.
wanted <- sapply(lapply(msigdb_hs_sym, collectionType), bcCategory) == "h"
table(wanted)
wanted
FALSE TRUE
45176 50
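The same category-based selection can be tried on a small scale. The sketch below tags two made-up gene sets with a BroadCollection category and then filters on it. The set names are hypothetical, and it assumes that the installed {GSEABase} accepts "h" and "c2" as categories, which matches the categories tabulated above.

```r
# Toy sets tagged with MSigDB-style categories; names are hypothetical.
suppressPackageStartupMessages(library(GSEABase))

gsc_toy <- GeneSetCollection(list(
  GeneSet(c("JUNB", "CXCL2"), setName = "TOY_HALLMARK",
          geneIdType = SymbolIdentifier(),
          collectionType = BroadCollection(category = "h")),
  GeneSet(c("BEX4", "NAT8"), setName = "TOY_CURATED",
          geneIdType = SymbolIdentifier(),
          collectionType = BroadCollection(category = "c2", subCategory = "CGP"))
))

# Pull the category out of each set's collection type.
categories <- vapply(
  gsc_toy,
  function(gs) bcCategory(collectionType(gs)),
  character(1)
)
table(categories)

# Keep only the hallmark-tagged set.
wanted <- categories == "h"
names(gsc_toy[wanted])
```

The {msigdb} vignette also describes helper functions for this kind of filtering (for example `subsetCollection()`); check `?msigdb` for the exact interface in your installed version.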
Hallmark gene sets.
hallmark_gs <- msigdb_hs_sym[wanted]
hallmark_gs
GeneSetCollection
names: HALLMARK_TNFA_SIGNALING_VIA_NFKB, HALLMARK_HYPOXIA, ..., HALLMARK_PANCREAS_BETA_CELLS (50 total)
unique identifiers: JUNB, CXCL2, ..., SRP14 (4383 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: BroadCollection (1 total)
Genes in the HALLMARK_TNFA_SIGNALING_VIA_NFKB gene set.
geneIds(hallmark_gs[[1]])
[1] "JUNB" "CXCL2" "ATF3" "NFKBIA" "TNFAIP3" "PTGS2"
[7] "CXCL1" "IER3" "CD83" "CCL20" "CXCL3" "MAFF"
[13] "NFKB2" "TNFAIP2" "HBEGF" "KLF6" "BIRC3" "PLAUR"
[19] "ZFP36" "ICAM1" "JUN" "EGR3" "IL1B" "BCL2A1"
[25] "PPP1R15A" "ZC3H12A" "SOD2" "NR4A2" "IL1A" "RELB"
[31] "TRAF1" "BTG2" "DUSP1" "MAP3K8" "ETS2" "F3"
[37] "SDC4" "EGR1" "IL6" "TNF" "KDM6B" "NFKB1"
[43] "LIF" "PTX3" "FOSL1" "NR4A1" "JAG1" "CCL4"
[49] "GCH1" "CCL2" "RCAN1" "DUSP2" "EHD1" "IER2"
[55] "REL" "CFLAR" "RIPK2" "NFKBIE" "NR4A3" "PHLDA1"
[61] "IER5" "TNFSF9" "GEM" "GADD45A" "CXCL10" "PLK2"
[67] "BHLHE40" "EGR2" "SOCS3" "SLC2A6" "PTGER4" "DUSP5"
[73] "SERPINB2" "NFIL3" "SERPINE1" "TRIB1" "TIPARP" "RELA"
[79] "BIRC2" "CXCL6" "LITAF" "TNFAIP6" "CD44" "INHBA"
[85] "PLAU" "MYC" "TNFRSF9" "SGK1" "TNIP1" "NAMPT"
[91] "FOSL2" "PNRC1" "ID2" "CD69" "IL7R" "EFNA1"
[97] "PHLDA2" "PFKFB3" "CCL5" "YRDC" "IFNGR2" "SQSTM1"
[103] "BTG3" "GADD45B" "KYNU" "G0S2" "BTG1" "MCL1"
[109] "VEGFA" "MAP2K3" "CDKN1A" "CCN1" "TANK" "IFIT2"
[115] "IL18" "TUBB2A" "IRF1" "FOS" "OLR1" "RHOB"
[121] "AREG" "NINJ1" "ZBTB10" "PLPP3" "KLF4" "CXCL11"
[127] "SAT1" "CSF1" "GPR183" "PMEPA1" "PTPRE" "TLR2"
[133] "ACKR3" "KLF10" "MARCKS" "LAMB3" "CEBPB" "TRIP10"
[139] "F2RL1" "KLF9" "LDLR" "TGIF1" "RNF19B" "DRAM1"
[145] "B4GALT1" "DNAJB4" "CSF2" "PDE4B" "SNN" "PLEK"
[151] "STAT5A" "DENND5A" "CCND1" "DDX58" "SPHK1" "CD80"
[157] "TNFAIP8" "CCNL1" "FUT4" "CCRL2" "SPSB1" "TSC22D1"
[163] "B4GALT5" "SIK1" "CLCF1" "NFE2L2" "FOSB" "PER1"
[169] "NFAT5" "ATP2B1" "IL12B" "IL6ST" "SLC16A6" "ABCA1"
[175] "HES1" "BCL6" "IRS2" "SLC2A3" "CEBPD" "IL23A"
[181] "SMAD3" "TAP1" "MSC" "IFIH1" "IL15RA" "TNIP2"
[187] "BCL3" "PANX1" "FJX1" "EDN1" "EIF1" "BMP2"
[193] "DUSP4" "PDLIM5" "ICOSLG" "GFPT2" "KLF2" "TNC"
[199] "SERPINB8" "MXD1"
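Many enrichment tools take gene sets as a named list of character vectors or as a two-column table rather than as GeneSet objects. A possible conversion is sketched below on a toy collection; the set names are hypothetical and the column names `gene` and `set` are an arbitrary choice, not something a particular tool requires.

```r
# Convert a (toy) GeneSetCollection to a named list and a long table.
suppressPackageStartupMessages(library(GSEABase))

gsc_toy <- GeneSetCollection(list(
  GeneSet(c("JUNB", "CXCL2", "ATF3"), setName = "TOY_A",
          geneIdType = SymbolIdentifier()),
  GeneSet(c("BEX4", "NAT8"), setName = "TOY_B",
          geneIdType = SymbolIdentifier())
))

# Named list: one character vector of gene IDs per set.
gene_list <- setNames(lapply(gsc_toy, geneIds), names(gsc_toy))
str(gene_list)

# Long format: one row per (gene, set) pair.
gene_table <- stack(gene_list)
colnames(gene_table) <- c("gene", "set")
gene_table
```

Note that this conversion drops the metadata (descriptions, category, source), so it is best done at the last step before handing the sets to another tool.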
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] GSEABase_1.66.0 graph_1.82.0 annotate_1.82.0
[4] XML_3.99-0.16.1 AnnotationDbi_1.66.0 IRanges_2.38.1
[7] S4Vectors_0.42.1 Biobase_2.64.0 ExperimentHub_2.12.0
[10] AnnotationHub_3.12.0 BiocFileCache_2.12.0 dbplyr_2.5.0
[13] BiocGenerics_0.50.0 msigdb_1.12.0 BiocManager_1.30.23
[16] workflowr_1.7.1
loaded via a namespace (and not attached):
[1] KEGGREST_1.44.1 xfun_0.44 bslib_0.7.0
[4] processx_3.8.4 callr_3.7.6 vctrs_0.6.5
[7] tools_4.4.0 ps_1.7.6 generics_0.1.3
[10] curl_5.2.1 tibble_3.2.1 fansi_1.0.6
[13] RSQLite_2.3.7 blob_1.2.4 pkgconfig_2.0.3
[16] GenomeInfoDbData_1.2.12 lifecycle_1.0.4 compiler_4.4.0
[19] stringr_1.5.1 git2r_0.33.0 Biostrings_2.72.1
[22] getPass_0.2-4 GenomeInfoDb_1.40.1 httpuv_1.6.15
[25] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.8
[28] later_1.3.2 pillar_1.9.0 crayon_1.5.2
[31] jquerylib_0.1.4 whisker_0.4.1 cachem_1.1.0
[34] mime_0.12 tidyselect_1.2.1 digest_0.6.35
[37] stringi_1.8.4 purrr_1.0.2 dplyr_1.1.4
[40] BiocVersion_3.19.1 rprojroot_2.0.4 fastmap_1.2.0
[43] cli_3.6.2 magrittr_2.0.3 utf8_1.2.4
[46] withr_3.0.0 UCSC.utils_1.0.0 filelock_1.0.3
[49] promises_1.3.0 rappdirs_0.3.3 bit64_4.0.5
[52] XVector_0.44.0 rmarkdown_2.27 httr_1.4.7
[55] bit_4.0.5 png_0.1-8 memoise_2.0.1
[58] evaluate_0.24.0 knitr_1.47 rlang_1.1.4
[61] Rcpp_1.0.12 xtable_1.8-4 glue_1.7.0
[64] DBI_1.2.3 rstudioapi_0.16.0 jsonlite_1.8.8
[67] R6_2.5.1 zlibbioc_1.50.0 fs_1.6.4