Last updated: 2018-10-19
workflowr checks:
✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20180626)
The command set.seed(20180626) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
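✔ Repository version: 0515710
Great! You are using Git for version control. The version displayed above was the version of the Git repository at the time these results were generated. Note that all relevant files for the analysis need to be committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated: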
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 0515710 | Xiang Zhu | 2018-10-19 | wflow_publish("analysis/gene_set.Rmd") |
| html | 1c85967 | Xiang Zhu | 2018-10-05 | Build site. |
| html | 0324740 | Xiang Zhu | 2018-09-18 | Build site. |
| Rmd | bcd0424 | Xiang Zhu | 2018-09-18 | wflow_publish("analysis/gene_set.Rmd") |
| html | ddd8480 | Xiang Zhu | 2018-09-17 | Build site. |
| Rmd | decce81 | Xiang Zhu | 2018-09-17 | wflow_publish("analysis/gene_set.Rmd") |
| html | 32d3a60 | Xiang Zhu | 2018-09-16 | Build site. |
| Rmd | 8572f1a | Xiang Zhu | 2018-09-16 | wflow_publish("analysis/gene_set.Rmd") |
All 4,026 gene sets used in Zhu and Stephens (2018) are freely available at xiangzhu/rss-gsea, where the folder biological_pathway contains 3,913 biological pathways and the folder tissue_set contains 113 GTEx tissue-based gene sets. These gene sets can be referenced in a journal’s “Data availability” section.
data/
├── README.md
├── biological_pathway
│   ├── gene_37.3.mat
│   └── pathway.mat
└── tissue_set
    ├── de_genes
    ├── he_genes
    └── se_genes

5 directories, 3 files
The 3,913 biological pathways used in Zhu and Stephens (2018) are available in the folder biological_pathway, where they are represented by two files, gene_37.3.mat and pathway.mat.
The file gene_37.3.mat contains basic information about the genes.
>> load gene_37.3.mat
>> gene
gene =
struct with fields:
id: [18732x1 double]
symbol: {18732x1 cell}
chr: [18732x1 double]
desc: {18732x1 cell}
start: [18732x1 double]
stop: [18732x1 double]
>> [gene.id(10) gene.chr(10) gene.start(10) gene.stop(10)]
ans =
18 16 8768444 8878432
>> gene.symbol(10)
ans =
1x1 cell array
{'ABAT'}
>> gene.desc(10)
ans =
1x1 cell array
{'4-aminobutyrate aminotransferase'}
Note that only the 18,313 genes mapped to the reference sequence were used in our analyses. Unmapped genes have a start or stop position of -1.
>> [min(gene.start) min(gene.stop)]
ans =
-1 -1
>> inref_genes = ~(gene.start == -1 | gene.stop == -1);
>> sum(inref_genes)
ans =
18313
The file pathway.mat contains basic information about the pathways.
>> load pathway.mat
>> pathway
pathway =
struct with fields:
label: {4076x1 cell}
database: {4076x1 cell}
source: {4076x1 cell}
genes: [18732x4076 double]
synonyms: {4076x1 cell}
>> pathway.label(100)
ans =
1x1 cell array
{'Activation of NOXA and translocation to mitochondria'}
>> pathway.database(100)
ans =
1x1 cell array
{'PC'}
>> pathway.source(100)
ans =
1x1 cell array
{'reactome'}
The gene-pathway information is represented as a sparse zero-one matrix pathway.genes, where genes(i,j)==1 if gene i is a member of pathway j and genes(i,j)==0 otherwise.
>> genes = pathway.genes;
>> whos genes
Name Size Bytes Class Attributes
genes 18732x4076 3257512 double sparse
>> genes(:,100)
ans =
(1243,1) 1
(3410,1) 1
(4567,1) 1
(4668,1) 1
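The sparse zero-one representation can also be illustrated outside MATLAB. The following is a minimal Python sketch, not the authors' code: it stores only the nonzero entries, keyed by pathway index, and uses 1-based indices to mirror the session above. The helper names is_member and members_of are hypothetical; the single column shown is pathway 100 from the output above.

```python
from typing import Dict, List, Set

# Sparse zero-one matrix stored as {pathway index j: set of gene indices i}.
# Only the nonzero entries are kept; indices are 1-based, as in MATLAB.
Membership = Dict[int, Set[int]]


def is_member(genes: Membership, i: int, j: int) -> int:
    """Return 1 if gene i is a member of pathway j, else 0 (like genes(i,j))."""
    return 1 if i in genes.get(j, set()) else 0


def members_of(genes: Membership, j: int) -> List[int]:
    """Return the sorted gene indices of pathway j (like find(genes(:,j)))."""
    return sorted(genes.get(j, set()))


# Column 100 as printed in the MATLAB session; all other columns are omitted.
genes: Membership = {100: {1243, 3410, 4567, 4668}}

print(is_member(genes, 1243, 100))
print(members_of(genes, 100))
```

Storing one set per pathway keeps membership lookups cheap and makes it easy to list the genes of a pathway, which is the typical access pattern in gene set analyses.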
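Finally, our analyses used only 3,913 of the 4,076 pathways, namely those that (i) contain more than 1 and fewer than 500 genes mapped to the reference sequence (3,916 pathways); (ii) have database and source definitions, i.e. entries not marked NA (2 pathways excluded); and (iii) are not Viral RNP Complexes in the Host Cell Nucleus (PC, reactome) (because no HapMap3 SNP was mapped to this pathway).
>> numgenes = pathway.genes' * inref_genes;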
>> size(numgenes)
ans =
4076 1
>> paths = find(numgenes > 1 & numgenes < 500);
>> size(paths)
ans =
3916 1
>> database = pathway.database;
>> source = pathway.source;
>> database_na = find(not(cellfun('isempty', strfind(database, 'NA'))));
>> source_na = find(not(cellfun('isempty', strfind(source, 'NA'))));
>> length(union(database_na, source_na))
ans =
2
>> label = pathway.label;
>> pathway_exclude = 'Viral RNP Complexes in the Host Cell Nucleus';
>> label_include = find(cellfun('isempty', strfind(label, pathway_exclude)));
>> label_exclude = setdiff(1:4076, label_include);
>> label(label_exclude)
ans =
1x1 cell array
{'Viral RNP Complexes in the Host Cell Nucleus'}
>> database(label_exclude)
ans =
1x1 cell array
{'PC'}
>> source(label_exclude)
ans =
1x1 cell array
{'reactome'}
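The selection steps in the session above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function select_pathways and all toy data are hypothetical, and database and source entries are compared to "NA" exactly, whereas the MATLAB session uses a substring search with strfind.

```python
from typing import Dict, List, Set


def select_pathways(
    members: Dict[int, Set[int]],   # pathway index -> gene indices
    inref: Dict[int, bool],         # gene index -> mapped to reference sequence?
    database: Dict[int, str],
    source: Dict[int, str],
    label: Dict[int, str],
    excluded_label: str,
) -> List[int]:
    """Return the indices of pathways passing all three selection steps."""
    kept = []
    for j in sorted(members):
        # Count mapped genes only (the analogue of pathway.genes' * inref_genes).
        n_mapped = sum(1 for i in members[j] if inref.get(i, False))
        if not (1 < n_mapped < 500):
            continue
        # Assumption: a missing definition is the exact string "NA".
        if database[j] == "NA" or source[j] == "NA":
            continue
        if label[j] == excluded_label:
            continue
        kept.append(j)
    return kept


# Toy data (made up for illustration); gene 4 is not mapped to the reference.
members = {1: {1, 2, 3}, 2: {1}, 3: {2, 3}, 4: {1, 3}, 5: {3, 4}}
inref = {1: True, 2: True, 3: True, 4: False}
database = {1: "PC", 2: "PC", 3: "NA", 4: "PC", 5: "PC"}
source = {1: "reactome", 2: "reactome", 3: "NA", 4: "reactome", 5: "reactome"}
label = {
    1: "Toy pathway A",
    2: "Toy pathway B",
    3: "Toy pathway C",
    4: "Viral RNP Complexes in the Host Cell Nucleus",
    5: "Toy pathway E",
}

kept = select_pathways(
    members, inref, database, source, label,
    excluded_label="Viral RNP Complexes in the Host Cell Nucleus",
)
print(kept)
```

In the toy data, pathway 2 fails the size filter, pathway 5 fails it once its unmapped gene is discounted, pathway 3 lacks database and source definitions, and pathway 4 is excluded by name, so only pathway 1 remains.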
The 113 GTEx tissue-based gene sets used in Zhu and Stephens (2018) are available in the folder tissue_set. There are 44 “highly expressed” (HE) gene sets, 49 “selectively expressed” (SE) gene sets and 20 “distinctively expressed” (DE) gene sets. The SE sets were created using a method described in Yang et al. (2018), and the DE sets using a method described in Dey et al. (2017).
Each of the tissue-based gene sets has the following format.
ensembl_gene_id chromosome_name start_position end_position
ENSG00000002933 7 150497491 150502208
ENSG00000072778 17 7120444 7128592
ENSG00000075624 7 5566782 5603415
ENSG00000087086 19 49468558 49470135
Note that the gene information for the tissue-based sets was provided by GTEx and may differ from the information in gene_37.3.mat above.
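A file in this format can be read with a few lines of Python. The sketch below assumes the columns are separated by whitespace and that the first line is the header shown above; the names GeneRecord and parse_gene_set are hypothetical. The chromosome is kept as a string, on the assumption that non-numeric chromosome names may occur.

```python
from typing import List, NamedTuple

EXPECTED_HEADER = [
    "ensembl_gene_id", "chromosome_name", "start_position", "end_position",
]


class GeneRecord(NamedTuple):
    ensembl_gene_id: str
    chromosome_name: str  # kept as text in case of non-numeric names
    start_position: int
    end_position: int


def parse_gene_set(text: str) -> List[GeneRecord]:
    """Parse a whitespace-delimited gene set with a one-line header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0].split() != EXPECTED_HEADER:
        raise ValueError("unexpected header in gene set file")
    records = []
    for ln in lines[1:]:
        gene_id, chrom, start, end = ln.split()
        records.append(GeneRecord(gene_id, chrom, int(start), int(end)))
    return records


# The example rows shown above.
sample = """\
ensembl_gene_id chromosome_name start_position end_position
ENSG00000002933 7 150497491 150502208
ENSG00000072778 17 7120444 7128592
ENSG00000075624 7 5566782 5603415
ENSG00000087086 19 49468558 49470135
"""

records = parse_gene_set(sample)
print(len(records), records[0])
```

To read an actual file from tissue_set, pass its contents to parse_gene_set, for example via open(path).read(), where path is whichever gene set file is of interest.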
Session info -------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
system x86_64, darwin15.6.0
ui X11
language (EN)
collate en_US.UTF-8
tz America/Los_Angeles
date 2018-10-19
Packages -----------------------------------------------------------------
package * version date source
backports 1.1.2 2017-12-13 CRAN (R 3.5.0)
base * 3.5.1 2018-07-05 local
compiler 3.5.1 2018-07-05 local
datasets * 3.5.1 2018-07-05 local
devtools 1.13.6 2018-06-27 CRAN (R 3.5.0)
digest 0.6.17 2018-09-12 CRAN (R 3.5.0)
evaluate 0.11 2018-07-17 CRAN (R 3.5.0)
git2r 0.23.0 2018-07-17 CRAN (R 3.5.0)
graphics * 3.5.1 2018-07-05 local
grDevices * 3.5.1 2018-07-05 local
htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0)
knitr 1.20 2018-02-20 CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
methods * 3.5.1 2018-07-05 local
R.methodsS3 1.7.1 2016-02-16 CRAN (R 3.5.0)
R.oo 1.22.0 2018-04-22 CRAN (R 3.5.0)
R.utils 2.7.0 2018-08-27 CRAN (R 3.5.0)
Rcpp 0.12.19 2018-10-01 CRAN (R 3.5.0)
rmarkdown 1.10 2018-06-11 CRAN (R 3.5.0)
rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0)
stats * 3.5.1 2018-07-05 local
stringi 1.2.4 2018-07-20 CRAN (R 3.5.0)
stringr 1.3.1 2018-05-10 CRAN (R 3.5.0)
tools 3.5.1 2018-07-05 local
utils * 3.5.1 2018-07-05 local
whisker 0.3-2 2013-04-28 CRAN (R 3.5.0)
withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
workflowr 1.1.1 2018-07-06 CRAN (R 3.5.0)
yaml 2.2.0 2018-07-25 CRAN (R 3.5.0)
This reproducible R Markdown analysis was created with workflowr 1.1.1