RSS-NET analysis of IBD GWAS summary statistics and NK cell regulatory network

Last updated: 2020-03-03

Checks: 7 0

Knit directory: rss-net/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190823)

The command set.seed(20190823) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 6520dbc

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	6520dbc	Xiang Zhu	2020-03-03	wflow_publish(“rmd/ibd2015_nkcell.Rmd”)
html	be857e0	Xiang Zhu	2020-03-03	Build site.
Rmd	69e4772	Xiang Zhu	2020-03-03	wflow_publish(“rmd/ibd2015_nkcell.Rmd”)
html	5974d28	xiangzhu	2020-03-03	Build site.
Rmd	4a13f45	xiangzhu	2020-03-03	wflow_publish(“rmd/ibd2015_nkcell.Rmd”)
html	bf742e0	xiangzhu	2020-03-02	Build site.
Rmd	a16d453	xiangzhu	2020-03-02	wflow_publish(“rmd/ibd2015_nkcell.Rmd”)
html	785a423	xiangzhu	2020-03-01	Build site.
Rmd	bda7382	xiangzhu	2020-03-01	wflow_publish(“rmd/ibd2015_nkcell.Rmd”)
html	830b581	xiangzhu	2020-03-01	Build site.
Rmd	a5b592e	xiangzhu	2020-03-01	wflow_publish(“rmd/ibd2015_nkcell.Rmd”)
html	c3ca100	xiangzhu	2020-02-29	Build site.
Rmd	2e5247b	xiangzhu	2020-02-29	wflow_publish(“rmd/ibd2015_nkcell.Rmd”)

Overview

Here we describe an end-to-end RSS-NET analysis of inflammatory bowel disease (IBD) GWAS summary statistics (Liu et al, 2015) and a gene regulatory network inferred for natural killer (NK) cells. This example illustrates the actual data analyses performed in Zhu et al (2019).

To reproduce results of this example, please use scripts in the directory script_dir, and follow the step-by-step guide below. Before running any script in script_dir, please install RSS-NET.

Since a real genome-wide analysis is conducted here, this example is more complicated than the previous simulation example. It is advisable to go through the previous simulation example before diving into this real data example.

Note that the working directory here is assumed to be wdtba. Please modify scripts accordingly if a different directory is used.

Step-by-step illustration

Download input data files

1. `ibd2015_sumstat.mat`: processed GWAS summary statistics and LD matrix estimates

This file is large (43G) because it has a LD matrix of 1.1 million common SNPs. Please contact me (xiangzhu[at]stanford.edu) if you have trouble accessing this file.

$ md5sum ibd2015_sumstat.mat
ad1763079ee7e46b21722f74e037a230  ibd2015_sumstat.mat

$ du -sh ibd2015_sumstat.mat                                             
43G ibd2015_sumstat.mat

Let’s look at the contents of ibd2015_sumstat.mat.

>> sumstat = matfile('ibd2015_sumstat.mat');
>> sumstat
  matlab.io.MatFile
  Properties:
      Properties.Source: 'ibd2015_sumstat.mat'
    Properties.Writable: false                                                          
                     BR: [22x1 cell]                                                    
                      R: [22x1 cell]                                                    
                  SiRiS: [22x1 cell]                                                    
                betahat: [22x1 cell]                                                    
                    chr: [22x1 cell]                                                    
                    pos: [22x1 cell]                                                    
                     se: [22x1 cell]

GWAS summary statistics and LD estimates are stored as cell arrays. RSS-NET only uses the following variables:

betahat{j,1}, single-SNP effect size estimates of all SNPs on chromosome j;
se{j,1}, standard errors of betahat{j, 1};
chr{j,1} and pos{j, 1}, physical positions of these SNPs (GRCh37 build);
SiRiS{j,1}, a sparse matrix, defined as repmat((1./se),1,p) .* R .* repmat((1./se)',p,1), where R is the estimated LD matrix of these p SNPs.

2. `ibd2015_snp2gene.mat`: physical distance between SNPs and genes

This file contains the physical distance between each GWAS SNP and each protein-coding gene, within 1 Mb. This file corresponds to \({\bf G}_j\) in the RSS-NET model.

$ md5sum ibd2015_snp2gene.mat 
7832838e2675e4cf3b85f471fed95554  ibd2015_snp2gene.mat

$ du -sh ibd2015_snp2gene.mat 
224M    ibd2015_snp2gene.mat

In this example, there are 18334 SNPs and 1081481 genes.

>> snp2gene = matfile('ibd2015_snp2gene.mat');
>> snp2gene                                   
  matlab.io.MatFile
  Properties:
      Properties.Source: 'ibd2015_snp2gene.mat'
    Properties.Writable: false                                                                
                    chr: [1081481x1  int32]                                                   
                  colid: [14126805x1 int32]                                                   
                numgene: [1x1        int32]                                                   
                 numsnp: [1x1        int32]                                                   
                    pos: [1081481x1  int32]                                                   
                  rowid: [14126805x1 int32]                                                   
                    val: [14126805x1 double]

>> [snp2gene.numgene snp2gene.numsnp]                                                              
     18334   1081481

The SNP-to-gene distance information is captured by a three-column matrix [colid rowid val]. For example, the distance between gene 1 and SNP 6 is 978947 bps.

>> colid=snp2gene.colid; rowid=snp2gene.rowid; val=snp2gene.val;
>> [colid(6) rowid(6) val(6)]
        1        6   978947

3. `Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat`: gene regulatory network

This file contains information of gene-to-gene connections in a given regulatory network.

$ md5sum Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat 
35ac724b86f7777d87116cc48166caa2  Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat

$ du -sh Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat
1.7M    Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat

>> gene2gene = matfile('Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat');
>> gene2gene
  matlab.io.MatFile
  Properties:
      Properties.Source: 'Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat'
    Properties.Writable: false                                                                                                            
                  colid: [110733x1 int32]                                                                                                 
                numgene: [1x1      int32]                                                                                                 
                  rowid: [110733x1 int32]                                                                                                 
                    val: [110733x1 double]

For implementation convenience, this file contains the trivial case where each gene is mapped to itself with val=1.

>> colid=gene2gene.colid; rowid=gene2gene.rowid; val=gene2gene.val;             
>> [gene2gene.numgene sum(colid==rowid) unique(val(colid==rowid))]
   18334   18334       1

For a given network, transcription factors (TFs) are stored in rowid and target genes (TGs) are stored in colid. In this example there are 3105 TGs and 376 TFs. Among these TFs and TGs, there are 92399 edges. The edge weights range from 0.61 to 1. These TF-to-TG connections and edge weights correspond to \(\{{\bf T}_g,v_{gt}\}\) in the RSS-NET model.

>> [length(unique(colid(colid ~= rowid))) length(unique(rowid(colid ~= rowid)))]
        3105         376
>> [length(colid(colid ~= rowid)) length(rowid(colid ~= rowid))]
       92399       92399

>> val_tftg = val(colid ~= rowid);
>> [min(val_tftg) quantile(val_tftg, 0.25) median(val_tftg) quantile(val_tftg, 0.75) max(val_tftg)]
    0.6138    0.6324    0.6568    0.6949    1.0000

4. `ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat`: SNP-to-network proximity annotation

This file contains binary annotations whether a SNP is “near” the given network, that is, within 100 kb of any network element (TF, TG or associated regulatory elements).

$ md5sum ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat
d96cd9b32759f954cc37680dc6aeafd8  ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat

$ du -sh ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat
21M ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat

In this example, there are 1081481 GWAS SNPs and 382443 of them are near the NK cell network (i.e. val=1).

>> snp2net = matfile('ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat');
>> snp2net
  matlab.io.MatFile
  Properties:
      Properties.Source: 'ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat'
    Properties.Writable: false
                    chr: [1081481x1 int32]
                    pos: [1081481x1 int32]
                  snpid: [1081481x1 int32]
                    val: [1081481x1 double]
                 window: [1x1       double]
                 
>> [length(snp2net.val) sum(snp2net.val) snp2net.window]
     1081481      382443      100000
>> unique(snp2net.val)'
     0     1

5. `ibd2015_NK_snp2gene_cis.mat`: SNP-to-gene cis regulation

This file contains the SNP-to-gene cis regulation scores derived from context-matching cis eQTL studies. This file corresponds to \((c_{jg}-1)\) in the RSS-NET model.

$ md5sum ibd2015_NK_snp2gene_cis.mat
dedad8e25773fad69576dbce0f7d9f93  ibd2015_NK_snp2gene_cis.mat

$ du -sh ibd2015_NK_snp2gene_cis.mat
165M    ibd2015_NK_snp2gene_cis.mat

>> snp2gene_cis = matfile('ibd2015_NK_snp2gene_cis.mat');
>> snp2gene_cis
  matlab.io.MatFile
  Properties:
      Properties.Source: 'ibd2015_NK_snp2gene_cis.mat'
    Properties.Writable: false
                  colid: [10790012x1 int32]
                numgene: [1x1        int32]
                 numsnp: [1x1        int32]
                  rowid: [10790012x1 int32]
                    val: [10790012x1 double]

In this example, there are 10790012 SNP-gene pairs with cis regulation scores available, consiting of 829280 SNPs and 18230 genes. The cis regulation scores (val) range from 0 to 0.76.

>> [snp2gene_cis.numsnp length(unique(snp2gene_cis.rowid)) snp2gene_cis.numgene length(unique(snp2gene_cis.colid))]
   1081481    829280     18334     18230
   
>> val=snp2gene_cis.val;
>> [min(val) quantile(val, 0.25) median(val) quantile(val, 0.75) max(val)]
         0    0.0226    0.0636    0.1172    0.7622

Run RSS-NET analysis

$ pwd
/Users/xiangzhu/GitHub/rss-net/examples/ibd2015_nkcell

$ tree -f
.
├── ./analysis_template.m
├── ./ibd2015_nkcell.m
└── ./ibd2015_nkcell.sbatch

0 directories, 3 files

3. Submit job arrays

For a given GWAS and a given regulatory network, all RSS-NET analysis tasks are almost identical and they only differs in hyper-parameter values. To exploit this, we run one RSS-NET analysis as a job array with multiple tasks that run in parallel.

To this end, we write a simple sbatch script ibd2015_nkcell.sbatch, and submit it to a cluster with Slurm available.

$ sbatch ibd2015_nkcell.sbatch

After the submission, multiple jobs should run in different nodes simultaneously.

$ squeue -u xiangzhu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      62554249_107    owners ibd2015_ xiangzhu  R       0:32      1 sh02-17n12
      62554249_108    owners ibd2015_ xiangzhu  R       0:32      1 sh02-17n12
      62554249_109    owners ibd2015_ xiangzhu  R       0:32      1 sh01-28n08
      62554249_110    owners ibd2015_ xiangzhu  R       0:32      1 sh01-17n18
      62554249_111    owners ibd2015_ xiangzhu  R       0:32      1 sh01-26n33
      62554249_112    owners ibd2015_ xiangzhu  R       0:32      1 sh01-27n30

Start at Mar 2, 2020, 3:04 PM.

End at Mar 2, 2020, 11:37 PM.

Job ID: 62554249
Array Job ID: 62554249_125
Cluster: sherlock
User/Group: xiangzhu/whwong
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 1-03:08:31
CPU Efficiency: 74.54% of 1-12:24:40 core-walltime
Job Wall-clock time: 04:33:05
Memory Utilized: 26.58 GB
Memory Efficiency: 85.06% of 31.25 GB

More examples

devtools::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.2 (2019-12-12)
 os       macOS Catalina 10.15.3      
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2020-03-03                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
 backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.0)
 callr         3.4.2   2020-02-12 [1] CRAN (R 3.6.0)
 cli           2.0.2   2020-02-28 [1] CRAN (R 3.6.0)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
 devtools      2.2.2   2020-02-17 [1] CRAN (R 3.6.0)
 digest        0.6.25  2020-02-23 [1] CRAN (R 3.6.0)
 ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.0)
 evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
 fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.0)
 fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
 git2r         0.26.1  2019-06-29 [1] CRAN (R 3.6.0)
 glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
 htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.0)
 httpuv        1.5.2   2019-09-11 [1] CRAN (R 3.6.0)
 knitr         1.28    2020-02-06 [1] CRAN (R 3.6.0)
 later         1.0.0   2019-10-04 [1] CRAN (R 3.6.0)
 magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
 pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.0)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 3.6.0)
 processx      3.4.2   2020-02-09 [1] CRAN (R 3.6.0)
 promises      1.1.0   2019-10-04 [1] CRAN (R 3.6.0)
 ps            1.3.2   2020-02-13 [1] CRAN (R 3.6.0)
 R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.1)
 Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.0)
 remotes       2.1.1   2020-02-15 [1] CRAN (R 3.6.0)
 rlang         0.4.5   2020-03-01 [1] CRAN (R 3.6.0)
 rmarkdown     2.1     2020-01-20 [1] CRAN (R 3.6.0)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
 stringi       1.4.6   2020-02-17 [1] CRAN (R 3.6.0)
 stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
 testthat      2.3.2   2020-03-02 [1] CRAN (R 3.6.2)
 usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.0)
 whisker       0.4     2019-08-28 [1] CRAN (R 3.6.0)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
 workflowr   * 1.6.0   2019-12-19 [1] CRAN (R 3.6.0)
 xfun          0.12    2020-01-13 [1] CRAN (R 3.6.0)
 yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.0)

[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

RSS-NET analysis of IBD GWAS summary statistics and NK cell regulatory network

Xiang Zhu

2019-08-19

Overview

Step-by-step illustration

Download input data files

1. ibd2015_sumstat.mat: processed GWAS summary statistics and LD matrix estimates

2. ibd2015_snp2gene.mat: physical distance between SNPs and genes

3. Primary_Natural_Killer_cells_from_peripheral_blood_gene2gene.mat: gene regulatory network

4. ibd2015_Primary_Natural_Killer_cells_from_peripheral_blood_snp2net.mat: SNP-to-network proximity annotation

5. ibd2015_NK_snp2gene_cis.mat: SNP-to-gene cis regulation