Last updated: 2020-06-29

Checks: 7 0

Knit directory: rss-net/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20190823) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ddaf878. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    examples/ibd2015_nkcell/ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_NK_out_66.mat
    Ignored:    examples/ibd2015_nkcell/randinit_results/
    Ignored:    examples/ibd2015_nkcell/results/
    Ignored:    examples/wtccc_bcell/results/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (rmd/wtccc_bcell.Rmd) and HTML (docs/wtccc_bcell.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd ddaf878 xiangzhu 2020-06-29 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html 93ce892 xiangzhu 2020-06-29 Build site.
html 260f36c xiangzhu 2020-03-15 Build site.
Rmd baf07ec xiangzhu 2020-03-15 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html a424171 xiangzhu 2020-03-08 Build site.
html 5c5e8e1 xiangzhu 2020-03-08 Build site.
html 31f9064 xiangzhu 2020-03-06 Build site.
Rmd 9469f2d xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html f2c772c xiangzhu 2020-03-06 Build site.
Rmd 26bfe19 xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html 991b323 xiangzhu 2020-03-06 Build site.
Rmd e3daa54 xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html 3703a3d xiangzhu 2020-03-06 Build site.
Rmd 4de49b7 xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html 5467783 xiangzhu 2020-03-06 Build site.
Rmd 9729b44 xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)
html c2ee73a xiangzhu 2020-03-06 Build site.
Rmd b109ab5 xiangzhu 2020-03-06 wflow_publish(“rmd/wtccc_bcell.Rmd”)

Overview

Here we describe an end-to-end RSS-NET analysis of two synthetic datasets that are simulated from real genotypes of 348965 genome-wide SNPs from 1458 individuals in the UK Blood Service Control Group (Wellcome Trust Case Control Consortium, 2007) and a gene regulatory network inferred for B cells (ENCODE Project Consortium, 2012). This example gives a quick view of how RSS-NET works.

To reproduce results of this example, please use scripts in the directory wtccc_bcell/, and follow the step-by-step guide below. Before running any script in wtccc_bcell/, please install RSS-NET.

With the software installed and the input data downloaded, one should be able to run this example by simply typing the following line in shell:

$ sbatch run_simulation.sbatch

If a different directory is used to store the software, input data and/or output data, please modify file paths in the given scripts accordingly.

Step-by-step illustration

Download data files

All data files required to run this example are freely available at Zenodo DOI. Please contact me (xiangzhu[at]stanford.edu) if you have any trouble accessing these files. After a complete download, you should see the following files.

$ tree .
.
├── m0_data.mat
├── m0_true_gene.mat
├── m1_data.mat
├── m1_true_gene.mat
├── Primary_B_cells_from_peripheral_blood_gene2gene.mat
├── wtccc_bcell_data.md5
├── wtccc_bcell_results.zip
├── WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat
├── WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat
├── WTCCC_snp2gene.mat
└── wtccc_tidy_snp2gene_ncbi35togrch37.mat

0 directories, 11 files

To help readers confirm if they use the same files as we do, we report 128-bit MD5 hashes of all files in wtccc_bcell_data.md5.

To help readers confirm if they can reproduce results of this example, we also provide the results (wtccc_bcell_results.zip) in the same Zenodo deposit.

For simplicity and generality, we introduce the following short-hand notations.

gwas = 'WTCCC';
net  = 'Primary_B_cells_from_peripheral_blood'; 

1. m*_data.mat: simulated GWAS summary statistics and reference LD estimates

Here we simulate two synthetic datasets from WTCCC genotypes and B cell network: one (m0_data.mat) based on RSS-NET baseline model \(M_0\), and the other (m1_data.mat) based on RSS-NET enrichment model \(M_1\). We match m*_data.mat by the number of trait-associated SNPs and the proportion of phenotypic variation explained by all SNPs. By this design, m*_data.mat have the same amount of genetic signals, and they only differ in how genetic signals are distributed: for the enrichment dataset m1_data.mat genetic associations are enriched in the B cell network; for the baseline dataset m0_data.mat genetic associations are randomly distributed across the genome.

$ md5sum m*_data.mat
5d652796b4cc358843920b1edf2819fc  m0_data.mat
996eae1410f89607f1debefb31ec7811  m1_data.mat

$ du -sh m*_data.mat
3.0G    m0_data.mat
3.0G    m1_data.mat

Let’s look at the contents of m*_data.mat.

>> sumstat=matfile('m0_data.mat'); 
>> sumstat
  matlab.io.MatFile
  Properties:
      Properties.Source: 'm0_data.mat'
    Properties.Writable: false                                                  
                  SiRiS: [22x1 cell]                                            
                betahat: [22x1 cell]                                            
                    chr: [22x1 cell]                                            
                    pos: [22x1 cell]                                            
                     se: [22x1 cell]

GWAS summary statistics and LD estimates are stored as cell arrays. RSS-NET only uses the following variables:

  • betahat{j,1}, single-SNP effect size estimates of all SNPs on chromosome j;
  • se{j,1}, standard errors of betahat{j, 1};
  • chr{j,1} and pos{j, 1}, physical positions of these SNPs (GRCh37 build);
  • SiRiS{j,1}, a sparse matrix, defined as repmat((1./se),1,p) .* R .* repmat((1./se)',p,1), where R is the estimated LD matrix of these p SNPs.

2. ${gwas}_snp2gene.mat: physical distance between SNPs and genes

This file contains the physical distance between each GWAS SNP and each protein-coding gene, within 1 Mb. This file corresponds to \({\bf G}_j\) in the RSS-NET model.

$ md5sum WTCCC_snp2gene.mat 
50da1887601df3a0914380e2ea9b1be7  WTCCC_snp2gene.mat

$ du -sh WTCCC_snp2gene.mat
645M    WTCCC_snp2gene.mat

In this example, there are 18334 genes and 348965 SNPs.

>> snp2gene=matfile('WTCCC_snp2gene.mat');
>> snp2gene
  matlab.io.MatFile
  Properties:
      Properties.Source: 'WTCCC_snp2gene.mat'
    Properties.Writable: false                                                         
                    chr: [348965x1   double]                                           
                  colid: [41897736x1 int32]                                            
             ncbi35_pos: [348965x1   int32]                                            
                numgene: [1x1        int32]                                            
                 numsnp: [1x1        int32]                                            
                    pos: [348965x1   int32]                                            
                  rowid: [41897736x1 int32]                                            
                    val: [41897736x1 double]                                           

>> [snp2gene.numgene snp2gene.numsnp]
    18334   348965

The SNP-to-gene distance information is captured by a three-column matrix [colid rowid val]. For example, the distance between gene 7 and SNP 161 is 2855196 bps.

>> colid=snp2gene.colid; rowid=snp2gene.rowid; val=snp2gene.val;
>> [colid(6688) rowid(6688) val(6688)]
         7       161   2855196

3. ${net}_gene2gene.mat: gene regulatory network

This file contains information of gene-to-gene connections in a given regulatory network.

$ md5sum Primary_B_cells_from_peripheral_blood_gene2gene.mat
0098dd6d85762b18d5d594da6c3cfe93  Primary_B_cells_from_peripheral_blood_gene2gene.mat

$ du -sh Primary_B_cells_from_peripheral_blood_gene2gene.mat
1.8M    Primary_B_cells_from_peripheral_blood_gene2gene.mat
>> gene2gene=matfile('Primary_B_cells_from_peripheral_blood_gene2gene.mat');
>> gene2gene
  matlab.io.MatFile
  Properties:
      Properties.Source: 'Primary_B_cells_from_peripheral_blood_gene2gene.mat'
    Properties.Writable: false                                                                                          
                  colid: [110062x1 int32]                                                                               
                numgene: [1x1      int32]                                                                               
                  rowid: [110062x1 int32]                                                                               
                    val: [110062x1 double]

For implementation convenience, this file contains the trivial case where each gene is mapped to itself with val=1.

>> colid=gene2gene.colid; rowid=gene2gene.rowid; val=gene2gene.val;             
>> [gene2gene.numgene sum(colid==rowid) unique(val(colid==rowid))]
   18334   18334       1

For a given network, transcription factors (TFs) are stored in rowid and target genes (TGs) are stored in colid. In this example there are 3018 TGs and 436 TFs. Among these TFs and TGs, there are 91728 edges. The edge weights range from 0.61 to 1. These TF-to-TG connections and edge weights correspond to \(\{{\bf T}_g,v_{gt}\}\) in the RSS-NET model.

>> [length(unique(colid(colid ~= rowid))) length(unique(rowid(colid ~= rowid)))]
        3018         436

>> [length(colid(colid ~= rowid)) length(rowid(colid ~= rowid))]
       91728       91728

>> val_tftg = val(colid ~= rowid);
>> [min(val_tftg) quantile(val_tftg, 0.25) median(val_tftg) quantile(val_tftg, 0.75) max(val_tftg)]
    0.6101    0.6278    0.6515    0.6885    1.0000      

4. ${gwas}_${net}_snp2net.mat: SNP-to-network proximity annotation

This file contains binary annotations whether a SNP is “near” the given network, that is, within 100 kb of any network element (TF, TG or associated regulatory elements).

$ md5sum WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat
701264f811c0f5edd582ee3eec5cfd15  WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat

$ du -sh WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat
9.4M    WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat

In this example, there are 348965 GWAS SNPs and 121308 of them are near the B cell network (i.e. val=1).

>> snp2net=matfile('WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat');
>> snp2net
  matlab.io.MatFile
  Properties:
      Properties.Source: 'WTCCC_Primary_B_cells_from_peripheral_blood_snp2net.mat'
    Properties.Writable: false                                                                                              
                    chr: [348965x1 double]                                                                                  
             ncbi35_pos: [348965x1 int32]                                                                                   
                    pos: [348965x1 int32]                                                                                   
                  snpid: [348965x1 int32]                                                                                   
                    val: [348965x1 double]                                                                                  
                 window: [1x1      double]                                                                                  

>> [length(snp2net.val) sum(snp2net.val) snp2net.window]
      348965      121308      100000

>> unique(snp2net.val)'
     0     1

5. ${gwas}_${net}_snp2gene_cis.mat: SNP-to-gene cis regulation

This file contains the SNP-to-gene cis regulation scores. (For simplicity we do not use context-matching cis eQTL studies in our simulations.) This file corresponds to \((c_{jg}-1)\) in the RSS-NET model.

$ md5sum WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat
b77eebd576eee6d19ac3c8ed94681f00  WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat

$ du -sh WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat
4.5M    WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat
>> snp2gene_cis = matfile('WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat');
>> snp2gene_cis
  matlab.io.MatFile
  Properties:
      Properties.Source: 'WTCCC_Primary_B_cells_from_peripheral_blood_snp2gene_cis.mat'
    Properties.Writable: false                                                                                                   
                  colid: [291181x1 int32]                                                                                        
                     l0: [1x1      double]                                                                                       
                     l1: [1x1      double]                                                                                       
                numgene: [1x1      int32]                                                                                        
                 numsnp: [1x1      int32]                                                                                        
                  rowid: [291181x1 int32]                                                                                        
                    val: [291181x1 double]                                                                                       
                 window: [1x1      double]

In this example, there are 291181 SNP-gene pairs with cis regulation scores available, consisting of 111097 SNPs and 5345 genes. The cis regulation scores (val) range from 0.02 to 0.90.

>> [snp2gene_cis.numsnp length(unique(snp2gene_cis.rowid)) snp2gene_cis.numgene length(unique(snp2gene_cis.colid))]
   348965   111097    18334     5345

>> val=snp2gene_cis.val;
>> [min(val) quantile(val, 0.25) median(val) quantile(val, 0.75) max(val)]
    0.0202    0.1615    0.2661    0.3864    0.8963

Run RSS-NET analysis

To facilitate running RSS-NET on real data, we provide a generic script simulation_template.m. For the present example, the RSS-NET analysis of two synthetic datasets is implemented by run_simulation.m and run_simulation.sbatch.

Please see scripts in wtccc_bcell/ for full details. Below we only highlight the important parts.

1. Specify input data files

First we need to tell RSS-NET where to find all required input files, by creating the following structure array data. Please ensure that you save all downloaded files in dat_path.

data.sumstats_file     = strcat(dat_path,trial_name,'_data.mat');
data.snp2net_file      = strcat(dat_path,gwas_name,'_',net_name,'_snp2net.mat');
data.snp2gene_file     = strcat(dat_path,gwas_name,'_snp2gene.mat');
data.gene2gene_file    = strcat(dat_path,net_name,'_gene2gene.mat');
data.snp2gene_cis_file = strcat(dat_path,gwas_name,'_',net_name,'_snp2gene_cis.mat');

2. Specify a hyper-parameter grid

Next we need feed a hyper-parameter grid to RSS-NET, by creating the following structure array hyper. The total computational cost of RSS-NET is proportional to the grid size. Hence, if you want to finish running this example faster, you can always reduce the grid size.

hyper.theta0 = true_theta0;
hyper.sigma0 = true_sigma0;
hyper.theta  = [0 ((true_theta-0.5):0.25:(true_theta+0.5))];
hyper.sigma  = [0 ((true_sigma-0.5):0.25:(true_sigma+0.5))];

3. Initialize variational parameters

We can provide initial values for variational parameters in RSS-NET as follows. If we do not provide initial values, RSS-NET will use a random initialization.

options.alpha = alpha0;
options.mu    = mu0;

4. Fit the RSS-NET model

We use rss_net.m to fit the RSS-NET model on a given dataset. The model fitting results consist of variational lower bounds logw and variational parameter estimates [alpha mu s] for the given hyper-parameter grid.

[logw,alpha,mu,s,param] = rss_net(data,hyper,options);

5. Summarize network-level enrichments

To assess whether a regulatory network is enriched for genetic associations with a trait, we evaluate a Bayes factor (BF) comparing the baseline model (\(M_0:\theta=0~\text{and}~\sigma^2=0\)) in RSS-NET with the enrichment model (\(M_1:\theta>0~\text{or}~\sigma^2>0\)). The BF computation is implemented in calc_bf.m.

log10_bf = calc_bf(logw,param);

6. Summarize gene-level associations

To summarize association between a locus and a trait, we compute \(P_1^{\sf net}\), the posterior probability that at least one SNP \(j\) in the locus is associated with the trait (\(\beta_j\neq 0\)): \[ P_1^{\sf net}=1-\Pr(\beta_j=0,~\forall j\in\text{locus}~|~\text{data},M_1). \] As in Zhu et al (2020), a locus is defined as the transcribed region of a gene plus 100 kb upstream and downstream. The locus definition is provided in wtccc_tidy_snp2gene_ncbi35togrch37.mat. The \(P_1^{\sf net}\) computation is implemented in compute_pip.m.

seg_path = strcat(dat_path,'wtccc_tidy_snp2gene_ncbi35togrch37.mat');
p1_net   = compute_pip(seg_path,logw,alpha);

7. Submit jobs

After understanding the basic workflow of RSS-NET, we submit two jobs of running RSS-NET on the synthetic datasets m*_data.mat. To this end, we write a simple sbatch script run_simulation.sbatch, and submit it to a cluster with Slurm available.

$ squeue -u xiangzhu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        66258089_0   broadwl run_simu xiangzhu  R    2:48:17      1 midway2-0009
        66258089_1   broadwl run_simu xiangzhu  R    2:48:17      1 midway2-0011

For each job, we request 1 node with 8 CPUs and 32 Gb total memory. The memory utilized by these two jobs is around 16 Gb each. The running time of these two jobs is around 4 hours each.

8. Understand analysis results

Running run_simulation.sbatch yields two MAT-files m*_results.mat.

>> m0_res=matfile('m0_results.mat');
>> m1_res=matfile('m1_results.mat');

As shown below, RSS-NET yields a small BF (\(0.13\)) for the baseline dataset m0_data.mat and a large BF (\(5.85\times 10^{55}\)) for the enrichment dataset m1_data.mat.

>> [m0_res.log10_bf m1_res.log10_bf]
   -0.8917   55.7671

We also compare gene-level associations inferred from RSS-NET (m*_res.pp) with ground truth (m*_true_gene.mat). RSS-NET shows a higher statistical power in identifying gene-level associations on the enrichment dataset m1_data.mat (\(\text{AUC}=0.78\)) than the baseline dataset m0_data.mat (\(\text{AUC}=0.60\)). This is because RSS-NET automatically exploits the underlying network enrichment in m1_data.mat when assessing genetic associations.

>> m1_gene=matfile('m1_true_gene.mat');              
>> m1_mdl = fitglm(m1_res.pp,m1_gene.true_gene,'Distribution','binomial','Link','logit');
>> [X,Y,T,m1_auc] = perfcurve(m1_gene.true_gene,m1_mdl.Fitted.Probability,1);
>> m1_auc
    0.7767

>> m0_gene=matfile('m0_true_gene.mat');                                                  
>> m0_mdl = fitglm(m0_res.pp,m0_gene.true_gene,'Distribution','binomial','Link','logit');
>> [X,Y,T,m0_auc] = perfcurve(m0_gene.true_gene,m0_mdl.Fitted.Probability,1);
>> m0_auc
    0.5954

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       Ubuntu 20.04 LTS            
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2020-06-29                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [3] CRAN (R 4.0.0)
 backports     1.1.8   2020-06-17 [3] CRAN (R 4.0.1)
 callr         3.4.3   2020-03-28 [1] CRAN (R 4.0.2)
 cli           2.0.2   2020-02-28 [3] CRAN (R 4.0.0)
 crayon        1.3.4   2017-09-16 [3] CRAN (R 4.0.0)
 desc          1.2.0   2018-05-01 [3] CRAN (R 4.0.0)
 devtools      2.3.0   2020-04-10 [3] CRAN (R 4.0.0)
 digest        0.6.25  2020-02-23 [3] CRAN (R 4.0.0)
 ellipsis      0.3.1   2020-05-15 [3] CRAN (R 4.0.0)
 evaluate      0.14    2019-05-28 [3] CRAN (R 4.0.0)
 fansi         0.4.1   2020-01-08 [3] CRAN (R 4.0.0)
 fs            1.4.1   2020-04-04 [3] CRAN (R 4.0.0)
 git2r         0.27.1  2020-05-03 [3] CRAN (R 4.0.0)
 glue          1.4.1   2020-05-13 [3] CRAN (R 4.0.0)
 htmltools     0.5.0   2020-06-16 [3] CRAN (R 4.0.1)
 httpuv        1.5.4   2020-06-06 [3] CRAN (R 4.0.0)
 knitr         1.29    2020-06-23 [3] CRAN (R 4.0.2)
 later         1.1.0.1 2020-06-05 [3] CRAN (R 4.0.0)
 magrittr      1.5     2014-11-22 [3] CRAN (R 4.0.0)
 memoise       1.1.0   2017-04-21 [3] CRAN (R 4.0.0)
 pkgbuild      1.0.8   2020-05-07 [3] CRAN (R 4.0.0)
 pkgload       1.1.0   2020-05-29 [3] CRAN (R 4.0.0)
 prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.0)
 processx      3.4.2   2020-02-09 [3] CRAN (R 4.0.0)
 promises      1.1.1   2020-06-09 [3] CRAN (R 4.0.1)
 ps            1.3.3   2020-05-08 [3] CRAN (R 4.0.0)
 R6            2.4.1   2019-11-12 [3] CRAN (R 4.0.0)
 Rcpp          1.0.4.6 2020-04-09 [3] CRAN (R 4.0.0)
 remotes       2.1.1   2020-02-15 [3] CRAN (R 4.0.0)
 rlang         0.4.6   2020-05-02 [3] CRAN (R 4.0.0)
 rmarkdown     2.3     2020-06-18 [1] CRAN (R 4.0.2)
 rprojroot     1.3-2   2018-01-03 [3] CRAN (R 4.0.0)
 sessioninfo   1.1.1   2018-11-05 [3] CRAN (R 4.0.0)
 stringi       1.4.6   2020-02-17 [3] CRAN (R 4.0.0)
 stringr       1.4.0   2019-02-10 [3] CRAN (R 4.0.0)
 testthat      2.3.2   2020-03-02 [3] CRAN (R 4.0.0)
 usethis       1.6.1   2020-04-29 [3] CRAN (R 4.0.0)
 whisker       0.4     2019-08-28 [3] CRAN (R 4.0.0)
 withr         2.2.0   2020-04-20 [3] CRAN (R 4.0.0)
 workflowr   * 1.6.2   2020-04-30 [1] CRAN (R 4.0.0)
 xfun          0.15    2020-06-21 [3] CRAN (R 4.0.1)
 yaml          2.2.1   2020-02-01 [3] CRAN (R 4.0.0)

[1] /home/maimaizhu/R/x86_64-pc-linux-gnu-library/4.0
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library