Last updated: 2019-02-18

All 4,026 gene sets used in Zhu and Stephens (2018) are freely available at xiangzhu/rss-gsea, where the folder biological_pathway contains 3,913 biological pathways, and the folder tissue_set contains 113 GTEx tissue-based gene sets. These gene sets can be referenced in a journal’s “Data availability” section as DOI.

├── biological_pathway
│   ├── gene_37.3.mat
│   └── pathway.mat
└── tissue_set
    ├── de_genes
    ├── he_genes
    └── se_genes

5 directories, 3 files

Biological pathways

The 3,913 biological pathways used in Zhu and Stephens (2018) are available in the folder biological_pathway, where they are represented by two files: gene_37.3.mat and pathway.mat.

The file gene_37.3.mat contains basic information about the genes.

>> load gene_37.3.mat
>> gene
gene =
  struct with fields:
        id: [18732x1 double]
    symbol: {18732x1 cell}
       chr: [18732x1 double]
      desc: {18732x1 cell}
     start: [18732x1 double]
      stop: [18732x1 double]

>> [gene.id(10) gene.chr(10) gene.start(10) gene.stop(10)]
ans =
          18          16     8768444     8878432

>> gene.symbol(10)
ans =
  1x1 cell array
    {'ABAT'}

>> gene.desc(10)
ans =
  1x1 cell array
    {'4-aminobutyrate aminotransferase'}

Note that only the 18,313 genes mapped to the reference sequence were used in our analyses. Unmapped genes have start and stop positions coded as -1.

>> [min(gene.start) min(gene.stop)]
ans =
    -1    -1

>> inref_genes = ~(gene.start == -1 | gene.stop == -1);
>> sum(inref_genes)
ans =
       18313

The file pathway.mat contains basic information about the pathways.

>> load pathway.mat
>> pathway
pathway =
  struct with fields:
       label: {4076x1 cell}
    database: {4076x1 cell}
      source: {4076x1 cell}
       genes: [18732x4076 double]
    synonyms: {4076x1 cell}

>> pathway.label(100)
ans =
  1x1 cell array
    {'Activation of NOXA and translocation to mitochondria'}

>> pathway.database(100)
ans =
  1x1 cell array

>> pathway.source(100)
ans =
  1x1 cell array

The gene-pathway information is represented as a sparse zero-one matrix pathway.genes, where genes(i,j)==1 if gene i is a member of pathway j and genes(i,j)==0 otherwise.

>> genes = pathway.genes;
>> whos genes
  Name           Size                Bytes  Class     Attributes
  genes      18732x4076            3257512  double    sparse

>> genes(:,100)

ans =
      (1243,1)              1
      (3410,1)              1
      (4567,1)              1
      (4668,1)              1  
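
As a minimal sketch of how this membership matrix can be used, the snippet below looks up the symbols of the genes in one pathway and counts how many pathways contain each gene. The toy gene and pathway structs are hypothetical stand-ins that only mimic the field names shown above; with the real data they would come from load gene_37.3.mat and load pathway.mat.

```matlab
% Toy stand-ins for the real structs (hypothetical values, not taken from the data files).
gene.symbol   = {'G1'; 'G2'; 'G3'; 'G4'};
pathway.label = {'P1'; 'P2'};
% 4 genes x 2 pathways: P1 = {G1, G3}, P2 = {G2, G3, G4}.
pathway.genes = sparse([1 3 2 3 4], [1 1 2 2 2], 1, 4, 2);

j = 2;                                  % pathway of interest
members = find(pathway.genes(:, j));    % row indices of member genes
symbols = gene.symbol(members);         % cell array of member gene symbols
npath   = full(sum(pathway.genes, 2));  % number of pathways containing each gene
```

Because rows of pathway.genes follow the same order as the fields of gene, the row indices returned by find can be used directly to index gene.symbol, gene.chr, and so on.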

Finally, our analyses used only the 3,913 of 4,076 pathways that meet all of the following criteria:

  • they include 2-499 RefSeq-mapped genes;
  • they have clear database and source definitions;
  • they are not the pathway Viral RNP Complexes in the Host Cell Nucleus (PC, reactome), which was excluded because no HapMap3 SNP was mapped to it.

>> numgenes = pathway.genes' * inref_genes;
>> size(numgenes)
ans =
        4076           1

>> paths = find(numgenes > 1 & numgenes < 500);
>> size(paths)
ans =
        3916           1

>> database = pathway.database;
>> source = pathway.source;
>> database_na = find(not(cellfun('isempty', strfind(database, 'NA'))));
>> source_na = find(not(cellfun('isempty', strfind(source, 'NA'))));
>> length(union(database_na, source_na))
ans =

>> label = pathway.label;
>> pathway_exclude = 'Viral RNP Complexes in the Host Cell Nucleus';
>> label_include = find(cellfun('isempty', strfind(label, pathway_exclude)));
>> label_exclude = setdiff(1:4076, label_include);
>> label(label_exclude)
ans =
  1x1 cell array
    {'Viral RNP Complexes in the Host Cell Nucleus'}

>> database(label_exclude)
ans =
  1x1 cell array
    {'PC'}

>> source(label_exclude)
ans =
  1x1 cell array
    {'reactome'}
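
The three criteria can also be combined into a single index vector. The following is a sketch on hypothetical toy data, not the exact code used for the paper: the struct values are made up, and it assumes that undefined database or source entries are stored exactly as 'NA', so it uses strcmp for an exact match instead of the substring search with strfind shown above.

```matlab
% Toy stand-ins for the real structs (hypothetical values, not taken from the data files).
gene.start = [100; -1; 300; 400; 500];   % gene 2 is not mapped to the reference
gene.stop  = [200; -1; 350; 450; 550];
pathway.label    = {'A'; 'B'; 'Viral RNP Complexes in the Host Cell Nucleus'; 'D'; 'E'};
pathway.database = {'PC'; 'NA'; 'PC'; 'PC'; 'PC'};
pathway.source   = {'reactome'; 'kegg'; 'reactome'; 'reactome'; 'biocarta'};
rows = [1 3 1 3 4 3 4 1 2 4 5];
cols = [1 1 2 2 2 3 3 4 4 5 5];
pathway.genes = sparse(rows, cols, 1, 5, 5);   % genes x pathways membership

% Criterion 1: 2-499 reference-mapped genes per pathway.
inref_genes = ~(gene.start == -1 | gene.stop == -1);
numgenes    = full(pathway.genes' * double(inref_genes));
size_ok     = numgenes >= 2 & numgenes <= 499;

% Criterion 2: database and source are both defined (assumed coded as 'NA' otherwise).
defined = ~strcmp(pathway.database, 'NA') & ~strcmp(pathway.source, 'NA');

% Criterion 3: drop the one pathway without mapped HapMap3 SNPs.
pathway_exclude = 'Viral RNP Complexes in the Host Cell Nucleus';
not_excluded    = ~strcmp(pathway.label, pathway_exclude);

keep = find(size_ok & defined & not_excluded);  % indices of retained pathways
```

In this toy example only pathways 1 and 5 survive all three filters. Whether the same intersection reproduces exactly 3,913 pathways on the real files depends on how the undefined entries are actually coded, which this sketch does not verify.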

Tissue-based gene sets

The 113 GTEx tissue-based gene sets used in Zhu and Stephens (2018) are available in the folder tissue_set. There are 44 “highly expressed” (HE) gene sets, 49 “selectively expressed” (SE) gene sets and 20 “distinctively expressed” (DE) gene sets. The SE sets were created using a method described in Yang et al. (2018), and the DE sets using a method described in Dey et al. (2017).


Each of the tissue-based gene sets has the following format.

ensembl_gene_id  chromosome_name  start_position  end_position
ENSG00000002933  7                150497491       150502208
ENSG00000072778  17               7120444         7128592
ENSG00000075624  7                5566782         5603415
ENSG00000087086  19               49468558        49470135
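
A file in this format can be read with textscan, as in the sketch below. The file written here is a temporary, hypothetical example that reuses two of the rows shown above; the actual file names inside de_genes, he_genes and se_genes are not assumed. The chromosome column is read as text because it may not always be numeric, and the default whitespace delimiter of textscan handles both tabs and spaces.

```matlab
% Write a small example file in the same four-column format (hypothetical file).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'ensembl_gene_id\tchromosome_name\tstart_position\tend_position\n');
fprintf(fid, 'ENSG00000002933\t7\t150497491\t150502208\n');
fprintf(fid, 'ENSG00000072778\t17\t7120444\t7128592\n');
fclose(fid);

% Read it back: skip the header line, keep IDs and chromosomes as text.
fid  = fopen(fname, 'r');
cols = textscan(fid, '%s %s %f %f', 'HeaderLines', 1);
fclose(fid);
delete(fname);

tissue.id    = cols{1};   % Ensembl gene IDs (cell array of char)
tissue.chr   = cols{2};   % chromosome names (cell array of char)
tissue.start = cols{3};   % start positions (double)
tissue.stop  = cols{4};   % end positions (double)
```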

Note that the gene information for the tissue-based sets was provided by GTEx and may differ from that in gene_37.3.mat above.

