Last updated: 2020-03-08
Checks: 7 0
Knit directory: rss-net/
This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20190823)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: examples/ibd2015_nkcell/ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_NK_out_66.mat
Ignored: examples/ibd2015_nkcell/randinit_results/
Ignored: examples/ibd2015_nkcell/results/
Ignored: examples/wtccc_bcell/results/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 5c5e8e1 | xiangzhu | 2020-03-08 | Build site. |
html | 31f9064 | xiangzhu | 2020-03-06 | Build site. |
Rmd | 9469f2d | xiangzhu | 2020-03-06 | wflow_publish(“rmd/wtccc_bcell.Rmd”) |
html | f2c772c | xiangzhu | 2020-03-06 | Build site. |
Rmd | 26bfe19 | xiangzhu | 2020-03-06 | wflow_publish(“rmd/wtccc_bcell.Rmd”) |
html | 991b323 | xiangzhu | 2020-03-06 | Build site. |
Rmd | e3daa54 | xiangzhu | 2020-03-06 | wflow_publish(“rmd/wtccc_bcell.Rmd”) |
html | 3703a3d | xiangzhu | 2020-03-06 | Build site. |
Rmd | 4de49b7 | xiangzhu | 2020-03-06 | wflow_publish(“rmd/wtccc_bcell.Rmd”) |
html | 5467783 | xiangzhu | 2020-03-06 | Build site. |
Rmd | 9729b44 | xiangzhu | 2020-03-06 | wflow_publish(“rmd/wtccc_bcell.Rmd”) |
html | c2ee73a | xiangzhu | 2020-03-06 | Build site. |
Rmd | b109ab5 | xiangzhu | 2020-03-06 | wflow_publish(“rmd/wtccc_bcell.Rmd”) |
Here we describe an end-to-end RSS-NET analysis of two synthetic datasets that are simulated from real genotypes of 348965 genome-wide SNPs from 1458 individuals in the UK Blood Service Control Group (Wellcome Trust Case Control Consortium, 2007) and a gene regulatory network inferred for B cells (ENCODE Project Consortium, 2012). This example gives a quick view of how RSS-NET works.
To reproduce results of this example, please use scripts in the directory wtccc_bcell/
, and follow the step-by-step guide below. Before running any script in wtccc_bcell/
, please install RSS-NET.
With the software installed and the input data downloaded, one should be able to run this example by simply typing the following line in shell:
If a different directory is used to store the software, input data and/or output data, please modify file paths in the given scripts accordingly.
All data files required to run this example are freely available at Zenodo . Please contact me (xiangzhu[at]stanford.edu
) if you have any trouble accessing these files. After a complete download, you should see the following files.
$ tree .
.
├── m0_data.mat
├── m0_true_gene.mat
├── m1_data.mat
├── m1_true_gene.mat
├── Primary_B_cells_from_peripheral_blood_gene2gene.mat
├── wtccc_bcell_data.md5
├── wtccc_bcell_results.zip
├── WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat
├── WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat
├── WTCCC_snp2gene.mat
└── wtccc_tidy_snp2gene_ncbi35togrch37.mat
0 directories, 11 files
To help readers confirm if they use the same files as we do, we report 128-bit MD5 hashes of all files in wtccc_bcell_data.md5
.
To help readers confirm if they can reproduce results of this example, we also provide the results (wtccc_bcell_results.zip
) in the same Zenodo deposit.
For simplicity and generality, we introduce the following short-hand notations.
m*_data.mat
: simulated GWAS summary statistics and reference LD estimatesHere we simulate two synthetic datasets from WTCCC genotypes and B cell network: one (m0_data.mat
) based on RSS-NET baseline model \(M_0\), and the other (m1_data.mat
) based on RSS-NET enrichment model \(M_1\). We match m*_data.mat
by the number of trait-associated SNPs and the proportion of phenotypic variation explained by all SNPs. By this design, m*_data.mat
have the same amount of genetic signals, and they only differ in how genetic signals are distributed: for the enrichment dataset m1_data.mat
genetic associations are enriched in the B cell network; for the baseline dataset m0_data.mat
genetic associations are randomly distributed across the genome.
$ md5sum m*_data.mat
5d652796b4cc358843920b1edf2819fc m0_data.mat
996eae1410f89607f1debefb31ec7811 m1_data.mat
$ du -sh m*_data.mat
3.0G m0_data.mat
3.0G m1_data.mat
Let’s look at the contents of m*_data.mat
.
>> sumstat=matfile('m0_data.mat');
>> sumstat
matlab.io.MatFile
Properties:
Properties.Source: 'm0_data.mat'
Properties.Writable: false
SiRiS: [22x1 cell]
betahat: [22x1 cell]
chr: [22x1 cell]
pos: [22x1 cell]
se: [22x1 cell]
GWAS summary statistics and LD estimates are stored as cell arrays. RSS-NET only uses the following variables:
betahat{j,1}
, single-SNP effect size estimates of all SNPs on chromosome j
;se{j,1}
, standard errors of betahat{j, 1}
;chr{j,1}
and pos{j, 1}
, physical positions of these SNPs (GRCh37 build);SiRiS{j,1}
, a sparse matrix, defined as repmat((1./se),1,p) .* R .* repmat((1./se)',p,1)
, where R
is the estimated LD matrix of these p
SNPs.${gwas}_snp2gene.mat
: physical distance between SNPs and genesThis file contains the physical distance between each GWAS SNP and each protein-coding gene, within 1 Mb. This file corresponds to \({\bf G}_j\) in the RSS-NET model.
$ md5sum WTCCC_snp2gene.mat
50da1887601df3a0914380e2ea9b1be7 WTCCC_snp2gene.mat
$ du -sh WTCCC_snp2gene.mat
645M WTCCC_snp2gene.mat
In this example, there are 18334 genes and 348965 SNPs.
>> snp2gene=matfile('WTCCC_snp2gene.mat');
>> snp2gene
matlab.io.MatFile
Properties:
Properties.Source: 'WTCCC_snp2gene.mat'
Properties.Writable: false
chr: [348965x1 double]
colid: [41897736x1 int32]
ncbi35_pos: [348965x1 int32]
numgene: [1x1 int32]
numsnp: [1x1 int32]
pos: [348965x1 int32]
rowid: [41897736x1 int32]
val: [41897736x1 double]
>> [snp2gene.numgene snp2gene.numsnp]
18334 348965
The SNP-to-gene distance information is captured by a three-column matrix [colid rowid val]
. For example, the distance between gene 7
and SNP 161
is 2855196
bps.
${net}_gene2gene.mat
: gene regulatory networkThis file contains information of gene-to-gene connections in a given regulatory network.
$ md5sum Primary_B_cells_from_peripheral_blood_gene2gene.mat
0098dd6d85762b18d5d594da6c3cfe93 Primary_B_cells_from_peripheral_blood_gene2gene.mat
$ du -sh Primary_B_cells_from_peripheral_blood_gene2gene.mat
1.8M Primary_B_cells_from_peripheral_blood_gene2gene.mat
>> gene2gene=matfile('Primary_B_cells_from_peripheral_blood_gene2gene.mat');
>> gene2gene
matlab.io.MatFile
Properties:
Properties.Source: 'Primary_B_cells_from_peripheral_blood_gene2gene.mat'
Properties.Writable: false
colid: [110062x1 int32]
numgene: [1x1 int32]
rowid: [110062x1 int32]
val: [110062x1 double]
For implementation convenience, this file contains the trivial case where each gene is mapped to itself with val=1
.
>> colid=gene2gene.colid; rowid=gene2gene.rowid; val=gene2gene.val;
>> [gene2gene.numgene sum(colid==rowid) unique(val(colid==rowid))]
18334 18334 1
For a given network, transcription factors (TFs) are stored in rowid
and target genes (TGs) are stored in colid
. In this example there are 3018 TGs and 436 TFs. Among these TFs and TGs, there are 91728 edges. The edge weights range from 0.61 to 1. These TF-to-TG connections and edge weights correspond to \(\{{\bf T}_g,v_{gt}\}\) in the RSS-NET model.
>> [length(unique(colid(colid ~= rowid))) length(unique(rowid(colid ~= rowid)))]
3018 436
>> [length(colid(colid ~= rowid)) length(rowid(colid ~= rowid))]
91728 91728
>> val_tftg = val(colid ~= rowid);
>> [min(val_tftg) quantile(val_tftg, 0.25) median(val_tftg) quantile(val_tftg, 0.75) max(val_tftg)]
0.6101 0.6278 0.6515 0.6885 1.0000
${gwas}_${net}_snp2net.mat
: SNP-to-network proximity annotationThis file contains binary annotations whether a SNP is “near” the given network, that is, within 100 kb of any network element (TF, TG or associated regulatory elements).
$ md5sum WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat
701264f811c0f5edd582ee3eec5cfd15 WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat
$ du -sh WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat
9.4M WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat
In this example, there are 348965 GWAS SNPs and 121308 of them are near the B cell network (i.e. val=1
).
>> snp2net=matfile('WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat');
>> snp2net
matlab.io.MatFile
Properties:
Properties.Source: 'WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat'
Properties.Writable: false
chr: [348965x1 double]
ncbi35_pos: [348965x1 int32]
pos: [348965x1 int32]
snpid: [348965x1 int32]
val: [348965x1 double]
window: [1x1 double]
>> [length(snp2net.val) sum(snp2net.val) snp2net.window]
348965 121308 100000
>> unique(snp2net.val)'
0 1
${gwas}_${net}_snp2gene_cis.mat
: SNP-to-gene cis regulationThis file contains the SNP-to-gene cis regulation scores. (For simplicity we do not use context-matching cis eQTL studies in our simulations.) This file corresponds to \((c_{jg}-1)\) in the RSS-NET model.
$ md5sum WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat
b77eebd576eee6d19ac3c8ed94681f00 WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat
$ du -sh WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat
4.5M WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat
>> snp2gene_cis = matfile('WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat');
>> snp2gene_cis
matlab.io.MatFile
Properties:
Properties.Source: 'WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat'
Properties.Writable: false
colid: [291181x1 int32]
l0: [1x1 double]
l1: [1x1 double]
numgene: [1x1 int32]
numsnp: [1x1 int32]
rowid: [291181x1 int32]
val: [291181x1 double]
window: [1x1 double]
In this example, there are 291181 SNP-gene pairs with cis regulation scores available, consisting of 111097 SNPs and 5345 genes. The cis regulation scores (val
) range from 0.02 to 0.90.
To facilitate running RSS-NET on real data, we provide a generic script simulation_template.m
. For the present example, the RSS-NET analysis of two synthetic datasets is implemented by run_simulation.m
and run_simulation.sbatch
.
Please see scripts in wtccc_bcell/
for full details. Below we only highlight the important parts.
First we need to tell RSS-NET where to find all required input files, by creating the following structure array data
. Please ensure that you save all downloaded files in dat_path
.
data.sumstats_file = strcat(dat_path,trial_name,'_data.mat');
data.snp2net_file = strcat(dat_path,gwas_name,'_',net_name,'_snp2net.mat');
data.snp2gene_file = strcat(dat_path,gwas_name,'_snp2gene.mat');
data.gene2gene_file = strcat(dat_path,net_name,'_gene2gene.mat');
data.snp2gene_cis_file = strcat(dat_path,gwas_name,'_',net_name,'_snp2gene_cis.mat');
Next we need feed a hyper-parameter grid to RSS-NET, by creating the following structure array hyper
. The total computational cost of RSS-NET is proportional to the grid size. Hence, if you want to finish running this example faster, you can always reduce the grid size.
We can provide initial values for variational parameters in RSS-NET as follows. If we do not provide initial values, RSS-NET will use a random initialization.
We use rss_net.m
to fit the RSS-NET model on a given dataset. The model fitting results consist of variational lower bounds logw
and variational parameter estimates [alpha mu s]
for the given hyper-parameter grid.
To assess whether a regulatory network is enriched for genetic associations with a trait, we evaluate a Bayes factor (BF) comparing the baseline model (\(M_0:\theta=0~\text{and}~\sigma^2=0\)) in RSS-NET with the enrichment model (\(M_1:\theta>0~\text{or}~\sigma^2>0\)). The BF computation is implemented in calc_bf.m
.
To summarize association between a locus and a trait, we compute \(P_1^{\sf net}\), the posterior probability that at least one SNP \(j\) in the locus is associated with the trait (\(\beta_j\neq 0\)): \[
P_1^{\sf net}=1-\Pr(\beta_j=0,~\forall j\in\text{locus}~|~\text{data},M_1).
\] As in Zhu et al (2020), a locus is defined as the transcribed region of a gene plus 100 kb upstream and downstream. The locus definition is provided in wtccc_tidy_snp2gene_ncbi35togrch37.mat
. The \(P_1^{\sf net}\) computation is implemented in compute_pip.m
.
After understanding the basic workflow of RSS-NET, we submit two jobs of running RSS-NET on the synthetic datasets m*_data.mat
. To this end, we write a simple sbatch script run_simulation.sbatch
, and submit it to a cluster with Slurm
available.
$ squeue -u xiangzhu
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
66258089_0 broadwl run_simu xiangzhu R 2:48:17 1 midway2-0009
66258089_1 broadwl run_simu xiangzhu R 2:48:17 1 midway2-0011
For each job, we request 1 node with 8 CPUs and 32 Gb total memory. The running time of analyzing m0_data.mat
and m1_data.mat
is around 4 hours each.
Running run_simulation.sbatch
yields two MAT-files m*_results.mat
.
As shown below, RSS-NET yields a small BF (\(0.13\)) for the baseline dataset m0_data.mat
and a large BF (\(5.85\times 10^{55}\)) for the enrichment dataset m1_data.mat
.
We also compare gene-level associations inferred from RSS-NET (m*_res.pp
) with ground truth (m*_true_gene.mat
). RSS-NET shows a higher statistical power in identifying gene-level associations on the enrichment dataset m1_data.mat
(\(\text{AUC}=0.78\)) than the baseline dataset m0_data.mat
(\(\text{AUC}=0.60\)). This is because RSS-NET automatically exploits the underlying network enrichment in m1_data.mat
when assessing genetic associations.
>> m1_gene=matfile('m1_true_gene.mat');
>> m1_mdl = fitglm(m1_res.pp,m1_gene.true_gene,'Distribution','binomial','Link','logit');
>> [X,Y,T,m1_auc] = perfcurve(m1_gene.true_gene,m1_mdl.Fitted.Probability,1);
>> m1_auc
0.7767
>> m0_gene=matfile('m0_true_gene.mat');
>> m0_mdl = fitglm(m0_res.pp,m0_gene.true_gene,'Distribution','binomial','Link','logit');
>> [X,Y,T,m0_auc] = perfcurve(m0_gene.true_gene,m0_mdl.Fitted.Probability,1);
>> m0_auc
0.5954
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 3.6.3 (2020-02-29)
os Ubuntu 18.04.4 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2020-03-08
─ Packages ───────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
callr 3.4.1 2020-01-24 [1] CRAN (R 3.6.2)
cli 2.0.1 2020-01-08 [1] CRAN (R 3.6.2)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.1)
digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.1)
ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
git2r 0.26.1 2019-06-29 [1] CRAN (R 3.6.0)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
httpuv 1.5.2 2019-09-11 [1] CRAN (R 3.6.1)
knitr 1.28 2020-02-06 [1] CRAN (R 3.6.2)
later 1.0.0 2019-10-04 [1] CRAN (R 3.6.1)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.1)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.1)
promises 1.1.0 2019-10-04 [1] CRAN (R 3.6.1)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.1)
Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.1)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0)
rlang 0.4.4 2020-01-28 [1] CRAN (R 3.6.2)
rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.2)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
stringi 1.4.5 2020-01-11 [1] CRAN (R 3.6.2)
stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.2)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
whisker 0.4 2019-08-28 [1] CRAN (R 3.6.1)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
workflowr * 1.6.0 2019-12-19 [1] CRAN (R 3.6.2)
xfun 0.12 2020-01-13 [1] CRAN (R 3.6.2)
yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2)
[1] /home/maimaizhu/R/x86_64-pc-linux-gnu-library/3.6
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library