• Package
  • Downloading the MSigDB database
  • Accessing the GeneSet and GeneSetCollection objects

Last updated: 2024-08-16

This page follows the {msigdb} package vignette.

The Molecular Signatures Database (MSigDB) is one of the largest collections of molecular signatures, or gene expression signatures. It hosts a variety of signatures, including experimentally derived signatures and signatures representing pathways and ontologies from other curated databases. This rich collection (>25,000 signatures) can facilitate a wide variety of signature-based analyses, the most popular being gene set enrichment analysis. The signatures can be used to perform enrichment analysis in a differential expression (DE) experiment using tools such as {GSEA}, {fry} (from {limma}) and {camera} (from {limma}). Alternatively, they can be used for single-sample gene-set analysis of individual transcriptomic profiles using approaches such as {singscore}, {ssGSEA} and {GSVA}.

This package provides the gene sets in the MSigDB in the form of GeneSet objects. This data structure is specifically designed to store information about gene sets, including their member genes and metadata. Other packages, such as {msigdbr} and {EGSEAdata}, provide these gene sets too; however, they store them as lists or tibbles. These structures are not specific to gene sets and therefore do not allow storage of important metadata associated with each gene set, for example, their short and long descriptions. Additionally, the lack of structure allows the creation of invalid gene sets. Accessory functions implemented in the {GSEABase} package provide a neat interface for interacting with GeneSet objects.
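
As a minimal sketch of what this data structure holds, a small GeneSet can be built by hand with {GSEABase}. The set name and descriptions below are made up for illustration; the three gene symbols are borrowed from a gene set shown later on this page.

```r
library(GSEABase)

# A toy gene set; the name and descriptions are hypothetical.
toy_gs <- GeneSet(
  geneIds = c("BEX4", "NAT8", "RBP4"),
  geneIdType = SymbolIdentifier(),
  setName = "TOY_GENE_SET",
  shortDescription = "A toy gene set",
  longDescription = "A toy gene set used only to illustrate GeneSet metadata."
)

# Member genes and metadata are stored together in one object.
geneIds(toy_gs)
setName(toy_gs)
description(toy_gs)
longDescription(toy_gs)
```

A plain list or tibble could hold the same gene symbols, but the descriptions would have to be tracked separately.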


Install {msigdb}. (Dependencies are listed in the Imports section in the DESCRIPTION file.)

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!require("msigdb", quietly = TRUE))
  BiocManager::install("msigdb")

Load the package and check its version.

library(msigdb)
packageVersion("msigdb")
[1] '1.12.0'

Downloading the MSigDB database

To download the MSigDB database, we need to load {ExperimentHub} and {GSEABase}.

library(ExperimentHub)
library(GSEABase)

Create an ExperimentHub object and query it for MSigDB records.

eh <- ExperimentHub(ask = FALSE)
AnnotationHub::query(x = eh, pattern = 'msigdb')
ExperimentHub with 49 records
# snapshotDate(): 2024-04-29
# $dataprovider: Broad Institute, Emory University, EBI
# $species: Homo sapiens, Mus musculus
# $rdataclass: GSEABase::GeneSetCollection, list, data.frame
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["EH5421"]]' 

  EH5421 | msigdb.v7.2.hs.SYM      
  EH5422 | msigdb.v7.2.hs.EZID     
  EH5423 | msigdb.v7.2.mm.SYM      
  EH5424 | msigdb.v7.2.mm.EZID     
  ...      ...                     
  EH8296 | msigdb.v7.5.1.hs.SYM    
  EH8297 | msigdb.v7.5.1.mm.EZID   
  EH8298 | msigdb.v7.5.1.mm.idf    
  EH8299 | msigdb.v7.5.1.mm.SYM    
  EH8300 | imex_hsmm_0722          

Use a more specific pattern to return only the human collections that use gene symbols.

AnnotationHub::query(x = eh, pattern = 'msigdb.*hs.SYM')
ExperimentHub with 7 records
# snapshotDate(): 2024-04-29
# $dataprovider: Broad Institute
# $species: Homo sapiens
# $rdataclass: GSEABase::GeneSetCollection
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["EH5421"]]' 

  EH5421 | msigdb.v7.2.hs.SYM   
  EH6772 | msigdb.v7.3.hs.SYM   
  EH6778 | msigdb.v7.4.hs.SYM   
  EH7359 | msigdb.v7.5.hs.SYM   
  EH8284 | msigdb.v2022.1.hs.SYM
  EH8290 | msigdb.v2023.1.hs.SYM
  EH8296 | msigdb.v7.5.1.hs.SYM 

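The records appear to be ordered by ExperimentHub ID rather than by MSigDB version: the last record (EH8296) is v7.5.1, which is listed after v2022.1 and v2023.1. Taking the last record therefore selects the record with the highest ID, not necessarily the newest MSigDB release.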

AnnotationHub::query(x = eh, pattern = 'msigdb.*hs.SYM') |>
  tail(1) -> msigdb_hs_latest

[1] "EH8296"
ExperimentHub with 1 record
# snapshotDate(): 2024-04-29
# names(): EH8296
# package(): msigdb
# $dataprovider: Broad Institute
# $species: Homo sapiens
# $rdataclass: GSEABase::GeneSetCollection
# $rdatadateadded: 2023-07-03
# $title: msigdb.v7.5.1.hs.SYM
# $description: Gene expression signatures (Homo sapiens) from the Molecular...
# $taxonomyid: 9606
# $genome: NA
# $sourcetype: XML
# $sourceurl: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5...
# $sourcesize: NA
# $tags: c("Homo_sapiens_Data", "Mus_musculus_Data") 
# retrieve record with 'object[["EH8296"]]' 
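
The listing above ends with v7.5.1 even though v2023.1 is also present, so position alone may not identify the newest MSigDB release. A more defensive option is to parse the version out of each title and compare the versions. The sketch below works on the titles printed above, copied into a plain character vector so that it runs without a hub connection; the variable names are illustrative only.

```r
# Titles copied from the query output above.
titles <- c(
  "msigdb.v7.2.hs.SYM", "msigdb.v7.3.hs.SYM", "msigdb.v7.4.hs.SYM",
  "msigdb.v7.5.hs.SYM", "msigdb.v2022.1.hs.SYM", "msigdb.v2023.1.hs.SYM",
  "msigdb.v7.5.1.hs.SYM"
)

# Extract the version between "msigdb.v" and ".hs.SYM".
versions <- sub("^msigdb\\.v(.*)\\.hs\\.SYM$", "\\1", titles)

# numeric_version() compares versions component by component.
parsed <- numeric_version(versions)
latest_title <- titles[parsed == max(parsed)]
latest_title
```

With these titles, the comparison picks msigdb.v2023.1.hs.SYM, whereas tail(1) picks msigdb.v7.5.1.hs.SYM.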

Data can be downloaded using the unique ID.
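
The hub output above suggests the form of this call. A minimal sketch, assuming a working internet connection (or a populated local cache) and reusing the record ID EH8296 shown above; the variable name msigdb_by_id is only illustrative:

```r
library(ExperimentHub)

eh <- ExperimentHub(ask = FALSE)

# Retrieve the record by its ExperimentHub ID; the first call downloads the
# data and later calls load it from the local cache.
msigdb_by_id <- eh[["EH8296"]]
class(msigdb_by_id)
```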


Data can also be downloaded using msigdb::getMsigdb().

# Escape the literal dots and anchor the pattern so only the version is captured.
msigdb_ver <- sub(pattern = "^msigdb\\.v(.*)\\.hs\\.SYM$", replacement = "\\1", x = msigdb_hs_latest$title)

msigdb_hs_sym <- msigdb::getMsigdb(org = "hs", id = "SYM", version = msigdb_ver)
see ?msigdb and browseVignettes('msigdb') for documentation
loading from cache
msigdb_hs_ezid <- msigdb::getMsigdb(org = "hs", id = "EZID", version = msigdb_ver)
see ?msigdb and browseVignettes('msigdb') for documentation
loading from cache
  names: chr1p11, chr1p12, ..., GOMF_STARCH_BINDING (45226 total)
  unique identifiers: RPL22P6, NBPF8, ..., POM121L15P (41072 total)
  types in collection:
    geneIdType: SymbolIdentifier (1 total)
    collectionType: BroadCollection (1 total)
  names: chr1p11, chr1p12, ..., GOMF_STARCH_BINDING (45226 total)
  unique identifiers: 100132047, 728841, ..., 100101629 (40905 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)

Save as RDS. (The data/ directory must already exist, since saveRDS() does not create it.)

sym_outfile <- paste0('data/msigdb.v', msigdb_ver, '.hs.SYM.rds')
ezid_outfile <- paste0('data/msigdb.v', msigdb_ver, '.hs.EZID.rds')

saveRDS(object = msigdb_hs_sym, file = sym_outfile)
saveRDS(object = msigdb_hs_ezid, file = ezid_outfile)

File size in bytes.

file.size(sym_outfile)
file.size(ezid_outfile)
[1] 24207930
[1] 19925429

Load RDS and compare.

test <- readRDS(sym_outfile)
identical(test, msigdb_hs_sym)
[1] TRUE
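
The same save, size check and reload pattern can be tried on a small object first. This sketch uses a temporary file and a toy list in place of the gene set collection, so the names toy and tmp_rds are illustrative only.

```r
# A small stand-in object; the real workflow saves a GeneSetCollection.
toy <- list(set_a = c("BEX4", "NAT8"), set_b = c("JUNB", "CXCL2", "ATF3"))

tmp_rds <- tempfile(fileext = ".rds")
saveRDS(object = toy, file = tmp_rds)

# file.size() reports the size on disk in bytes.
rds_bytes <- file.size(tmp_rds)
rds_bytes

# Reading the file back should reproduce the original object exactly.
restored <- readRDS(tmp_rds)
identical(restored, toy)

# Clean up the temporary file.
unlink(tmp_rds)
```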

Accessing the GeneSet and GeneSetCollection objects

A GeneSetCollection object is effectively a list (its data are stored as a list of GeneSet objects), so standard list-processing functions such as length(), lapply() and sapply() work on it.

str(msigdb_hs_sym, max.level = 2)
Formal class 'GeneSetCollection' [package "GSEABase"] with 1 slot
  ..@ .Data:List of 45226
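
To make the list-like behaviour concrete, here is a small sketch on a hand-built collection. The two toy sets and their names are hypothetical; the same calls should apply to msigdb_hs_sym.

```r
library(GSEABase)

# Two toy gene sets (hypothetical names, symbols taken from this page).
gs_a <- GeneSet(geneIds = c("BEX4", "NAT8", "RBP4"), setName = "TOY_A",
                geneIdType = SymbolIdentifier())
gs_b <- GeneSet(geneIds = c("JUNB", "CXCL2"), setName = "TOY_B",
                geneIdType = SymbolIdentifier())
toy_gsc <- GeneSetCollection(list(gs_a, gs_b))

# Number of gene sets and their names.
length(toy_gsc)
names(toy_gsc)

# Extract a single GeneSet by name.
toy_gsc[["TOY_B"]]

# Apply a function to every gene set, as with an ordinary list.
set_sizes <- sapply(toy_gsc, function(gs) length(geneIds(gs)))
set_sizes
```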

Each signature is stored in a GeneSet object and can be processed using functions from the {GSEABase} package.

gs <- msigdb_hs_sym[[1984]]
geneIds: BEX4, NAT8, ..., THRSP (total: 12)
geneIdType: Symbol
collectionType: Broad
  bcCategory: c2 (Curated)
  bcSubCategory: CGP
details: use 'details(object)'

Get gene IDs.

geneIds(gs)
 [1] "BEX4"  "NAT8"  "RBP4"  "ART3"  "DDX6"  "PRDX2" "HEBP1" "CTSC"  "PCP4" 
[10] "OR2H2" "BAG2"  "THRSP"

Details of a gene set.

details(gs)
geneIds: BEX4, NAT8, ..., THRSP (total: 12)
geneIdType: Symbol
collectionType: Broad
  bcCategory: c2 (Curated)
  bcSubCategory: CGP
setIdentifier: LVY1HGGWMJ7:35020:Fri May 26 12:20:46 2023:95704
description: Genes down-regulated in pyloric atrium with knockout of TFF2 [GeneID=7032].
  (longDescription available)
organism: Mus musculus
pubMedIds: 16121031
urls: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5.1/msigdb_v7.5.1.xml
contributor: Arthur Liberzon
setVersion: 7.5.1

Human collection types:

  • H - hallmark gene sets are coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes.
  • C1 - positional gene sets corresponding to human chromosome cytogenetic bands.
  • C2 - curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts.
  • C3 - regulatory target gene sets based on gene target predictions for microRNA seed sequences and predicted transcription factor binding sites.
  • C4 - computational gene sets defined by mining large collections of cancer-oriented expression data.
  • C5 - ontology gene sets consist of genes annotated by the same ontology term.
  • C6 - oncogenic signature gene sets defined directly from microarray gene expression data from cancer gene perturbations.
  • C7 - immunologic signature gene sets represent cell states and perturbations within the immune system.
  • C8 - cell type signature gene sets curated from cluster markers identified in single-cell sequencing studies of human tissue.

table(sapply(lapply(msigdb_hs_sym, collectionType), bcCategory))

   c1    c2    c3    c4    c5    c6    c7    c8     h 
  299  6180  3726   858 28005   189  5219   700    50 

Create a logical vector to subset the hallmark gene sets.

wanted <- sapply(lapply(msigdb_hs_sym, collectionType), bcCategory) == "h"
table(wanted)
wanted
FALSE  TRUE 
45176    50 

Hallmark gene sets.

hallmark_gs <- msigdb_hs_sym[wanted]
  unique identifiers: JUNB, CXCL2, ..., SRP14 (4383 total)
  types in collection:
    geneIdType: SymbolIdentifier (1 total)
    collectionType: BroadCollection (1 total)
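
The category lookup used above can also be written as a single pass with vapply(), which additionally checks that every gene set yields exactly one category string. The sketch below builds a toy collection whose Broad categories are assigned by hand (an assumption made purely for illustration) and subsets it in the same way as with wanted.

```r
library(GSEABase)

# Toy sets with hand-assigned Broad categories (hypothetical data).
toy_h <- GeneSet(geneIds = c("JUNB", "CXCL2"), setName = "TOY_HALLMARK",
                 geneIdType = SymbolIdentifier(),
                 collectionType = BroadCollection(category = "h"))
toy_c2 <- GeneSet(geneIds = c("BEX4", "NAT8"), setName = "TOY_CURATED",
                  geneIdType = SymbolIdentifier(),
                  collectionType = BroadCollection(category = "c2"))
toy_gsc <- GeneSetCollection(list(toy_h, toy_c2))

# One category per gene set, returned as a character vector.
categories <- vapply(
  toy_gsc,
  function(gs) bcCategory(collectionType(gs)),
  character(1)
)

# Keep only the sets labelled as hallmark.
toy_hallmark <- toy_gsc[categories == "h"]
names(toy_hallmark)
```

The {msigdb} vignette also describes a subsetCollection() helper for this kind of filtering; see its help page for the exact arguments.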


  [1] "JUNB"     "CXCL2"    "ATF3"     "NFKBIA"   "TNFAIP3"  "PTGS2"   
  [7] "CXCL1"    "IER3"     "CD83"     "CCL20"    "CXCL3"    "MAFF"    
 [13] "NFKB2"    "TNFAIP2"  "HBEGF"    "KLF6"     "BIRC3"    "PLAUR"   
 [19] "ZFP36"    "ICAM1"    "JUN"      "EGR3"     "IL1B"     "BCL2A1"  
 [25] "PPP1R15A" "ZC3H12A"  "SOD2"     "NR4A2"    "IL1A"     "RELB"    
 [31] "TRAF1"    "BTG2"     "DUSP1"    "MAP3K8"   "ETS2"     "F3"      
 [37] "SDC4"     "EGR1"     "IL6"      "TNF"      "KDM6B"    "NFKB1"   
 [43] "LIF"      "PTX3"     "FOSL1"    "NR4A1"    "JAG1"     "CCL4"    
 [49] "GCH1"     "CCL2"     "RCAN1"    "DUSP2"    "EHD1"     "IER2"    
 [55] "REL"      "CFLAR"    "RIPK2"    "NFKBIE"   "NR4A3"    "PHLDA1"  
 [61] "IER5"     "TNFSF9"   "GEM"      "GADD45A"  "CXCL10"   "PLK2"    
 [67] "BHLHE40"  "EGR2"     "SOCS3"    "SLC2A6"   "PTGER4"   "DUSP5"   
 [73] "SERPINB2" "NFIL3"    "SERPINE1" "TRIB1"    "TIPARP"   "RELA"    
 [79] "BIRC2"    "CXCL6"    "LITAF"    "TNFAIP6"  "CD44"     "INHBA"   
 [85] "PLAU"     "MYC"      "TNFRSF9"  "SGK1"     "TNIP1"    "NAMPT"   
 [91] "FOSL2"    "PNRC1"    "ID2"      "CD69"     "IL7R"     "EFNA1"   
 [97] "PHLDA2"   "PFKFB3"   "CCL5"     "YRDC"     "IFNGR2"   "SQSTM1"  
[103] "BTG3"     "GADD45B"  "KYNU"     "G0S2"     "BTG1"     "MCL1"    
[109] "VEGFA"    "MAP2K3"   "CDKN1A"   "CCN1"     "TANK"     "IFIT2"   
[115] "IL18"     "TUBB2A"   "IRF1"     "FOS"      "OLR1"     "RHOB"    
[121] "AREG"     "NINJ1"    "ZBTB10"   "PLPP3"    "KLF4"     "CXCL11"  
[127] "SAT1"     "CSF1"     "GPR183"   "PMEPA1"   "PTPRE"    "TLR2"    
[133] "ACKR3"    "KLF10"    "MARCKS"   "LAMB3"    "CEBPB"    "TRIP10"  
[139] "F2RL1"    "KLF9"     "LDLR"     "TGIF1"    "RNF19B"   "DRAM1"   
[145] "B4GALT1"  "DNAJB4"   "CSF2"     "PDE4B"    "SNN"      "PLEK"    
[151] "STAT5A"   "DENND5A"  "CCND1"    "DDX58"    "SPHK1"    "CD80"    
[157] "TNFAIP8"  "CCNL1"    "FUT4"     "CCRL2"    "SPSB1"    "TSC22D1" 
[163] "B4GALT5"  "SIK1"     "CLCF1"    "NFE2L2"   "FOSB"     "PER1"    
[169] "NFAT5"    "ATP2B1"   "IL12B"    "IL6ST"    "SLC16A6"  "ABCA1"   
[175] "HES1"     "BCL6"     "IRS2"     "SLC2A3"   "CEBPD"    "IL23A"   
[181] "SMAD3"    "TAP1"     "MSC"      "IFIH1"    "IL15RA"   "TNIP2"   
[187] "BCL3"     "PANX1"    "FJX1"     "EDN1"     "EIF1"     "BMP2"    
[193] "DUSP4"    "PDLIM5"   "ICOSLG"   "GFPT2"    "KLF2"     "TNC"     
[199] "SERPINB8" "MXD1"    

R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] GSEABase_1.66.0      graph_1.82.0         annotate_1.82.0     
 [4] XML_3.99-0.16.1      AnnotationDbi_1.66.0 IRanges_2.38.1      
 [7] S4Vectors_0.42.1     Biobase_2.64.0       ExperimentHub_2.12.0
[10] AnnotationHub_3.12.0 BiocFileCache_2.12.0 dbplyr_2.5.0        
[13] BiocGenerics_0.50.0  msigdb_1.12.0        BiocManager_1.30.23 
[16] workflowr_1.7.1     

loaded via a namespace (and not attached):
 [1] KEGGREST_1.44.1         xfun_0.44               bslib_0.7.0            
 [4] processx_3.8.4          callr_3.7.6             vctrs_0.6.5            
 [7] tools_4.4.0             ps_1.7.6                generics_0.1.3         
[10] curl_5.2.1              tibble_3.2.1            fansi_1.0.6            
[13] RSQLite_2.3.7           blob_1.2.4              pkgconfig_2.0.3        
[16] GenomeInfoDbData_1.2.12 lifecycle_1.0.4         compiler_4.4.0         
[19] stringr_1.5.1           git2r_0.33.0            Biostrings_2.72.1      
[22] getPass_0.2-4           GenomeInfoDb_1.40.1     httpuv_1.6.15          
[25] htmltools_0.5.8.1       sass_0.4.9              yaml_2.3.8             
[28] later_1.3.2             pillar_1.9.0            crayon_1.5.2           
[31] jquerylib_0.1.4         whisker_0.4.1           cachem_1.1.0           
[34] mime_0.12               tidyselect_1.2.1        digest_0.6.35          
[37] stringi_1.8.4           purrr_1.0.2             dplyr_1.1.4            
[40] BiocVersion_3.19.1      rprojroot_2.0.4         fastmap_1.2.0          
[43] cli_3.6.2               magrittr_2.0.3          utf8_1.2.4             
[46] withr_3.0.0             UCSC.utils_1.0.0        filelock_1.0.3         
[49] promises_1.3.0          rappdirs_0.3.3          bit64_4.0.5            
[52] XVector_0.44.0          rmarkdown_2.27          httr_1.4.7             
[55] bit_4.0.5               png_0.1-8               memoise_2.0.1          
[58] evaluate_0.24.0         knitr_1.47              rlang_1.1.4            
[61] Rcpp_1.0.12             xtable_1.8-4            glue_1.7.0             
[64] DBI_1.2.3               rstudioapi_0.16.0       jsonlite_1.8.8         
[67] R6_2.5.1                zlibbioc_1.50.0         fs_1.6.4