Last updated: 2020-03-06

Checks: 7 0

Knit directory: rss-net/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20190823) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    examples/ibd2015_nkcell/ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_NK_out_66.mat
    Ignored:    examples/ibd2015_nkcell/randinit_results/
    Ignored:    examples/ibd2015_nkcell/results/
    Ignored:    examples/wtccc_bcell/results/

Unstaged changes:
    Modified:   rmd/index.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd 9469f2d xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html f2c772c xiangzhu 2020-03-06 Build site.
Rmd 26bfe19 xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html 991b323 xiangzhu 2020-03-06 Build site.
Rmd e3daa54 xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html 3703a3d xiangzhu 2020-03-06 Build site.
Rmd 4de49b7 xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html 5467783 xiangzhu 2020-03-06 Build site.
Rmd 9729b44 xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html c2ee73a xiangzhu 2020-03-06 Build site.
Rmd b109ab5 xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)

Overview

Here we describe an end-to-end RSS-NET analysis of two synthetic datasets that are simulated from real genotypes of 348965 genome-wide SNPs from 1458 individuals in the UK Blood Service Control Group (Wellcome Trust Case Control Consortium, 2007) and a gene regulatory network inferred for B cells (ENCODE Project Consortium, 2012). This example gives a quick view of how RSS-NET works.

To reproduce results of this example, please use scripts in the directory wtccc_bcell/, and follow the step-by-step guide below. Before running any script in wtccc_bcell/, please install RSS-NET.

With the software installed and the input data downloaded, one should be able to run this example by simply typing the following line in shell:

If a different directory is used to store the software, input data and/or output data, please modify file paths in the given scripts accordingly.

Step-by-step illustration

Download data files

All data files required to run this example are freely available at Zenodo DOI. Please contact me (xiangzhu[at]stanford.edu) if you have any trouble accessing these files. After a complete download, you should see the following files.

To help readers confirm if they use the same files as we do, we report 128-bit MD5 hashes of all files in wtccc_bcell_data.md5.

To help readers confirm if they can reproduce results of this example, we also provide the results (wtccc_bcell_results.zip) in the same Zenodo deposit.

For simplicity and generality, we introduce the following short-hand notations.

1. m*_data.mat: simulated GWAS summary statistics and reference LD estimates

Here we simulate two synthetic datasets from WTCCC genotypes and B cell network: one (m0_data.mat) based on RSS-NET baseline model \(M_0\), and the other (m1_data.mat) based on RSS-NET enrichment model \(M_1\). We match m*_data.mat by the number of trait-associated SNPs and the proportion of phenotypic variation explained by all SNPs. By this design, m*_data.mat have the same amount of genetic signals, and they only differ in how genetic signals are distributed: for the enrichment dataset m1_data.mat genetic associations are enriched in the B cell network; for the baseline dataset m0_data.mat genetic associations are randomly distributed across the genome.

Let’s look at the contents of m*_data.mat.

GWAS summary statistics and LD estimates are stored as cell arrays. RSS-NET only uses the following variables:

  • betahat{j,1}, single-SNP effect size estimates of all SNPs on chromosome j;
  • se{j,1}, standard errors of betahat{j, 1};
  • chr{j,1} and pos{j, 1}, physical positions of these SNPs (GRCh37 build);
  • SiRiS{j,1}, a sparse matrix, defined as repmat((1./se),1,p) .* R .* repmat((1./se)',p,1), where R is the estimated LD matrix of these p SNPs.

2. ${gwas}_snp2gene.mat: physical distance between SNPs and genes

This file contains the physical distance between each GWAS SNP and each protein-coding gene, within 1 Mb. This file corresponds to \({\bf G}_j\) in the RSS-NET model.

In this example, there are 18334 genes and 348965 SNPs.

The SNP-to-gene distance information is captured by a three-column matrix [colid rowid val]. For example, the distance between gene 7 and SNP 161 is 2855196 bps.

3. ${net}_gene2gene.mat: gene regulatory network

This file contains information of gene-to-gene connections in a given regulatory network.

For implementation convenience, this file contains the trivial case where each gene is mapped to itself with val=1.

For a given network, transcription factors (TFs) are stored in rowid and target genes (TGs) are stored in colid. In this example there are 3018 TGs and 436 TFs. Among these TFs and TGs, there are 91728 edges. The edge weights range from 0.61 to 1. These TF-to-TG connections and edge weights correspond to \(\{{\bf T}_g,v_{gt}\}\) in the RSS-NET model.

Run RSS-NET analysis

To facilitate running RSS-NET on real data, we provide a generic script simulation_template.m. For the present example, the RSS-NET analysis of two synthetic datasets is implemented by run_simulation.m and run_simulation.sbatch.

Please see scripts in wtccc_bcell/ for full details. Below we only highlight the important parts.

2. Specify a hyper-parameter grid

Next we need feed a hyper-parameter grid to RSS-NET, by creating the following structure array hyper. The total computational cost of RSS-NET is proportional to the grid size. Hence, if you want to finish running this example faster, you can always reduce the grid size.

3. Initialize variational parameters

We can provide initial values for variational parameters in RSS-NET as follows. If we do not provide initial values, RSS-NET will use a random initialization.

4. Fit the RSS-NET model

We use rss_net.m to fit the RSS-NET model on a given dataset. The model fitting results consist of variational lower bounds logw and variational parameter estimates [alpha mu s] for the given hyper-parameter grid.

5. Summarize network-level enrichments

To assess whether a regulatory network is enriched for genetic associations with a trait, we evaluate a Bayes factor (BF) comparing the baseline model (\(M_0:\theta=0~\text{and}~\sigma^2=0\)) in RSS-NET with the enrichment model (\(M_1:\theta>0~\text{or}~\sigma^2>0\)). The BF computation is implemented in calc_bf.m.

6. Summarize gene-level associations

To summarize association between a locus and a trait, we compute \(P_1^{\sf net}\), the posterior probability that at least one SNP \(j\) in the locus is associated with the trait (\(\beta_j\neq 0\)): \[ P_1^{\sf net}=1-\Pr(\beta_j=0,~\forall j\in\text{locus}~|~\text{data},M_1). \] As in Zhu et al (2020), a locus is defined as the transcribed region of a gene plus 100 kb upstream and downstream. The locus definition is provided in wtccc_tidy_snp2gene_ncbi35togrch37.mat. The \(P_1^{\sf net}\) computation is implemented in compute_pip.m.

7. Submit jobs

After understanding the basic workflow of RSS-NET, we submit two jobs of running RSS-NET on the synthetic datasets m*_data.mat. To this end, we write a simple sbatch script run_simulation.sbatch, and submit it to a cluster with Slurm available.

For each job, we request 1 node with 8 CPUs and 32 Gb total memory. The running time of analyzing m0_data.mat and m1_data.mat is around 4 hours each.

8. Understand analysis results

Running run_simulation.sbatch yields two MAT-files m*_results.mat.

As shown below, RSS-NET yields a small BF (\(0.13\)) for the baseline dataset m0_data.mat and a large BF (\(5.85\times 10^{55}\)) for the enrichment dataset m1_data.mat.

We also compare gene-level associations inferred from RSS-NET (m*_res.pp) with ground truth (m*_true_gene.mat). RSS-NET shows a higher statistical power in identifying gene-level associations on the enrichment dataset m1_data.mat (\(\text{AUC}=0.78\)) than the baseline dataset m0_data.mat (\(\text{AUC}=0.60\)). This is because RSS-NET automatically exploits the underlying network enrichment in m1_data.mat when assessing genetic associations.


─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.3 (2020-02-29)
 os       Ubuntu 18.04.4 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2020-03-06                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
 backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
 callr         3.4.1   2020-01-24 [1] CRAN (R 3.6.2)
 cli           2.0.1   2020-01-08 [1] CRAN (R 3.6.2)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
 devtools      2.2.1   2019-09-24 [1] CRAN (R 3.6.1)
 digest        0.6.23  2019-11-23 [1] CRAN (R 3.6.1)
 ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.1)
 evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
 fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.2)
 fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
 git2r         0.26.1  2019-06-29 [1] CRAN (R 3.6.0)
 glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
 htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.1)
 httpuv        1.5.2   2019-09-11 [1] CRAN (R 3.6.1)
 knitr         1.28    2020-02-06 [1] CRAN (R 3.6.2)
 later         1.0.0   2019-10-04 [1] CRAN (R 3.6.1)
 magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
 pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.1)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 3.6.2)
 processx      3.4.1   2019-07-18 [1] CRAN (R 3.6.1)
 promises      1.1.0   2019-10-04 [1] CRAN (R 3.6.1)
 ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
 R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.1)
 Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.1)
 remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.0)
 rlang         0.4.4   2020-01-28 [1] CRAN (R 3.6.2)
 rmarkdown     2.1     2020-01-20 [1] CRAN (R 3.6.2)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
 stringi       1.4.5   2020-01-11 [1] CRAN (R 3.6.2)
 stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
 testthat      2.3.1   2019-12-01 [1] CRAN (R 3.6.2)
 usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.1)
 whisker       0.4     2019-08-28 [1] CRAN (R 3.6.1)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
 workflowr   * 1.6.0   2019-12-19 [1] CRAN (R 3.6.2)
 xfun          0.12    2020-01-13 [1] CRAN (R 3.6.2)
 yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.2)

[1] /home/maimaizhu/R/x86_64-pc-linux-gnu-library/3.6
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library