Last updated: 2018-09-18
workflowr checks:
✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20180626)
The command set.seed(20180626) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: bcd0424
Make sure that all files relevant to the analysis were committed to Git before the results were generated (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | bcd0424 | Xiang Zhu | 2018-09-18 | wflow_publish("analysis/gene_set.Rmd") |
html | ddd8480 | Xiang Zhu | 2018-09-17 | Build site. |
Rmd | decce81 | Xiang Zhu | 2018-09-17 | wflow_publish("analysis/gene_set.Rmd") |
html | 32d3a60 | Xiang Zhu | 2018-09-16 | Build site. |
Rmd | 8572f1a | Xiang Zhu | 2018-09-16 | wflow_publish("analysis/gene_set.Rmd") |
All 4,026 gene sets used in Zhu and Stephens (2017) are freely available at xiangzhu/rss-gsea, where the folder biological_pathway contains 3,913 biological pathways, and the folder tissue_set contains 113 GTEx tissue-based gene sets. These gene sets can be referenced in a journal’s “Data availability” section as .
data/
├── README.md
├── biological_pathway
│   ├── gene_37.3.mat
│   └── pathway.mat
└── tissue_set
    ├── de_genes
    ├── he_genes
    └── se_genes
5 directories, 3 files
The 3,913 biological pathways used in Zhu and Stephens (2017) are available in the folder biological_pathway, where they are represented by two files, gene_37.3.mat and pathway.mat.
The file gene_37.3.mat contains basic information about genes.
>> load gene_37.3.mat
>> gene
gene =
struct with fields:
id: [18732x1 double]
symbol: {18732x1 cell}
chr: [18732x1 double]
desc: {18732x1 cell}
start: [18732x1 double]
stop: [18732x1 double]
>> [gene.id(10) gene.chr(10) gene.start(10) gene.stop(10)]
ans =
18 16 8768444 8878432
>> gene.symbol(10)
ans =
1x1 cell array
{'ABAT'}
>> gene.desc(10)
ans =
1x1 cell array
{'4-aminobutyrate aminotransferase'}
Note that only the 18,313 genes mapped to the reference sequence were used in our analyses. Genes that are not mapped have start or stop set to -1.
>> [min(gene.start) min(gene.stop)]
ans =
-1 -1
>> inref_genes = ~(gene.start == -1 | gene.stop == -1);
>> sum(inref_genes)
ans =
18313
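The logical index inref_genes can also be used to subset every field of gene at once. The following is a minimal sketch on a toy struct with the same field layout as gene; the values are made up for illustration and are not taken from gene_37.3.mat. With the real data, one would load the file and skip the toy definition.

```matlab
% Toy stand-in for the gene struct (made-up values; same field layout).
gene.id     = [11; 12; 13; 14];
gene.symbol = {'G1'; 'G2'; 'G3'; 'G4'};
gene.chr    = [1; 2; 2; 7];
gene.desc   = {'toy gene 1'; 'toy gene 2'; 'toy gene 3'; 'toy gene 4'};
gene.start  = [100; -1; 300; 400];
gene.stop   = [200; -1; 350; -1];

% A start or stop of -1 marks a gene not mapped to the reference sequence.
inref_genes = ~(gene.start == -1 | gene.stop == -1);

% Apply the same logical index to every field of the struct.
mapped = structfun(@(f) f(inref_genes), gene, 'UniformOutput', false);

fprintf('%d of %d toy genes are mapped\n', numel(mapped.id), numel(gene.id));
```

The result mapped has the same fields as gene, restricted to the mapped genes, so its rows no longer line up with the rows of pathway.genes.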
The file pathway.mat contains basic information about pathways.
>> load pathway.mat
>> pathway
pathway =
struct with fields:
label: {4076x1 cell}
database: {4076x1 cell}
source: {4076x1 cell}
genes: [18732x4076 double]
synonyms: {4076x1 cell}
>> pathway.label(100)
ans =
1x1 cell array
{'Activation of NOXA and translocation to mitochondria'}
>> pathway.database(100)
ans =
1x1 cell array
{'PC'}
>> pathway.source(100)
ans =
1x1 cell array
{'reactome'}
The gene-pathway membership is represented as a sparse zero-one matrix pathway.genes, where genes(i,j)==1 if gene i is a member of pathway j and genes(i,j)==0 otherwise.
>> genes = pathway.genes;
>> whos genes
Name Size Bytes Class Attributes
genes 18732x4076 3257512 double sparse
>> genes(:,100)
ans =
(1243,1) 1
(3410,1) 1
(4567,1) 1
(4668,1) 1
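Because pathway.genes has one row per entry of gene (18,732 rows in both), the members of a pathway can be translated into gene symbols. A minimal sketch on toy data follows. The toy values are made up, and the row alignment between the two files is an assumption, although the matrix product pathway.genes' * inref_genes used on this page relies on it as well.

```matlab
% Toy stand-ins (made-up values) with the same layout as the real data.
gene.symbol   = {'G1'; 'G2'; 'G3'; 'G4'};
pathway.label = {'toy pathway A'; 'toy pathway B'};
pathway.genes = sparse([1 3 2 3 4], [1 1 2 2 2], 1, 4, 2);  % genes x pathways

j = 2;                                  % pathway of interest
members = find(pathway.genes(:, j));    % row indices of member genes
member_symbols = gene.symbol(members);  % assumes rows follow the order of gene

fprintf('%s has %d genes\n', pathway.label{j}, numel(members));
```

With the real files loaded, setting j = 100 would list the symbols of the four genes shown above.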
Finally, our analyses only used 3,913 of the 4,076 pathways, namely those that
- contain more than 1 and fewer than 500 genes mapped to the reference sequence;
- have database and source definitions;
- are not Viral RNP Complexes in the Host Cell Nucleus (PC, reactome) (because no HapMap3 SNP was mapped to this pathway).
>> numgenes = pathway.genes' * inref_genes;
>> size(numgenes)
ans =
4076 1
>> paths = find(numgenes > 1 & numgenes < 500);
>> size(paths)
ans =
3916 1
>> database = pathway.database;
>> source = pathway.source;
>> database_na = find(not(cellfun('isempty', strfind(database, 'NA'))));
>> source_na = find(not(cellfun('isempty', strfind(source, 'NA'))));
>> length(union(database_na, source_na))
ans =
2
>> label = pathway.label;
>> pathway_exclude = 'Viral RNP Complexes in the Host Cell Nucleus';
>> label_include = find(cellfun('isempty', strfind(label, pathway_exclude)));
>> label_exclude = setdiff(1:4076, label_include);
>> label(label_exclude)
ans =
1x1 cell array
{'Viral RNP Complexes in the Host Cell Nucleus'}
>> database(label_exclude)
ans =
1x1 cell array
{'PC'}
>> source(label_exclude)
ans =
1x1 cell array
{'reactome'}
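The filters shown above (pathway size, database and source definitions, and the one excluded pathway) can be combined into a single logical index. The sketch below does this on toy data with made-up values. The variable names size_ok, has_na, is_excluded and keep are illustrative and not from the original analysis. On the real data one would expect sum(keep) to equal 3,913, which is worth checking rather than assuming.

```matlab
% Toy stand-ins (made-up values) with the same layout as the real data.
pathway.label    = {'toy A'; 'toy B'; 'toy C'; 'toy D'; ...
                    'Viral RNP Complexes in the Host Cell Nucleus'};
pathway.database = {'PC'; 'NA'; 'PC'; 'PC'; 'PC'};
pathway.source   = {'reactome'; 'reactome'; 'reactome'; 'reactome'; 'reactome'};
rows = [1 2 1 2 3 1 2 3 4 1 2 3];
cols = [1 1 2 2 2 3 4 4 4 5 5 5];
pathway.genes = sparse(rows, cols, 1, 4, 5);   % 4 genes x 5 pathways
inref_genes   = logical([1; 1; 1; 0]);         % gene 4 is unmapped

% Filter 1: more than 1 and fewer than 500 mapped genes per pathway.
numgenes = full(pathway.genes' * double(inref_genes));
size_ok  = numgenes > 1 & numgenes < 500;

% Filter 2: drop pathways whose database or source contains 'NA'.
has_na = ~cellfun('isempty', strfind(pathway.database, 'NA')) | ...
         ~cellfun('isempty', strfind(pathway.source, 'NA'));

% Filter 3: drop the one pathway with no mapped HapMap3 SNP.
is_excluded = strcmp(pathway.label, 'Viral RNP Complexes in the Host Cell Nucleus');

keep = size_ok & ~has_na & ~is_excluded;
kept_labels = pathway.label(keep);
fprintf('%d of %d toy pathways kept\n', sum(keep), numel(keep));
```

Keeping each filter as its own logical vector makes it easy to see which criterion removed a given pathway.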
The 113 GTEx tissue-based gene sets used in Zhu and Stephens (2017) are available in the folder tissue_set. There are 44 “highly expressed” (HE) gene sets, 49 “selectively expressed” (SE) gene sets, and 20 “distinctively expressed” (DE) gene sets. The SE sets were created using a method described in Yang et al. (2018), and the DE sets using a method described in Dey et al. (2017).
Each of the tissue-based gene sets has the following format.
ensembl_gene_id chromosome_name start_position end_position
ENSG00000002933 7 150497491 150502208
ENSG00000072778 17 7120444 7128592
ENSG00000075624 7 5566782 5603415
ENSG00000087086 19 49468558 49470135
Note that the gene information of the tissue-based sets was provided by GTEx and may differ from that in gene_37.3.mat above.
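Since each tissue-based gene set is a plain-text table with a header row, it can be read with textscan. The sketch below writes two of the example rows shown above to a temporary file and reads them back, so that it runs without the data. For a real gene set, fname would point to a file under tissue_set (the exact file names are not listed here). Two assumptions are made: the columns are separated by whitespace, and the chromosome column is read as text in case non-numeric names occur.

```matlab
% Write example rows to a temporary file so the sketch is self-contained.
fname = [tempname() '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'ensembl_gene_id chromosome_name start_position end_position\n');
fprintf(fid, 'ENSG00000002933 7 150497491 150502208\n');
fprintf(fid, 'ENSG00000072778 17 7120444 7128592\n');
fclose(fid);

% Read the four columns, skipping the header line.
fid = fopen(fname, 'r');
cols = textscan(fid, '%s %s %f %f', 'HeaderLines', 1);
fclose(fid);
delete(fname);

ensembl_id = cols{1};            % cell array of Ensembl gene IDs
chrom      = cols{2};            % chromosome names, kept as text
span_bp    = cols{4} - cols{3};  % end_position minus start_position
```

The tissue-based sets identify genes by Ensembl ID, so they cannot be matched against gene.id or gene.symbol from gene_37.3.mat without an additional identifier mapping.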
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.1.1 Rcpp_0.12.18 digest_0.6.17
[4] rprojroot_1.3-2 R.methodsS3_1.7.1 backports_1.1.2
[7] git2r_0.23.0 magrittr_1.5 evaluate_0.11
[10] stringi_1.2.4 whisker_0.3-2 R.oo_1.22.0
[13] R.utils_2.7.0 rmarkdown_1.10 tools_3.5.1
[16] stringr_1.3.1 yaml_2.2.0 compiler_3.5.1
[19] htmltools_0.3.6 knitr_1.20
This reproducible R Markdown analysis was created with workflowr 1.1.1