Example 5: Enrichment and prioritization analysis of GWAS summary statistics using RSS (Part A)

Last updated: 2020-06-24

Checks: 2 0

Knit directory: rss/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Repository version: 1e806af

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 1e806af. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    .spelling
    Ignored:    examples/example5/.Rhistory
    Ignored:    examples/example5/Aseg_chr16.mat
    Ignored:    examples/example5/example5_simulated_data.mat
    Ignored:    examples/example5/example5_simulated_results.mat
    Ignored:    examples/example5/ibd2015_path2641_genes_results.mat

Untracked files:
    Untracked:  docs_old/

Unstaged changes:
    Modified:   rmd/_site.yml

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (rmd/example_5a.Rmd) and HTML (docs/example_5a.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	ae24b64	Xiang Zhu	2020-06-24	Build site.
Rmd	7014671	Xiang Zhu	2020-06-24	wflow_publish(“rmd/example_5a.Rmd”)

Overview

This is Part A of Example 5, which illustrates how to perform enrichment and prioritization analysis of GWAS summary statistics based on variational Bayes (VB) inference of RSS-BVSR model. This part describes an end-to-end analysis of a synthetic dataset used in simulation studies of Zhu and Stephens (2018). This part gives users a quick view of how RSS works in enrichment and prioritization analysis.

The input dataset here is simulated under the enrichment model of RSS, using real genotypes of 12,758 SNPs on chromosome 16 from 1458 individuals in the UK Blood Service Control Group (Wellcome Trust Case Control Consortium, 2007). Please see the caption of Supplementary Figure 1 in Zhu and Stephens (2018) for the simulation details.

To reproduce results of Example 5 Part A, please use the script example5_simulated.m, and follow the step-by-step guide below. Before running example5_simulated.m, please install the VB subroutines of RSS. Please find installation instructions here.

Once the software is installed and the input data is downloaded, one should be able to run this part by simply typing the following line in a MATLAB console:

>> run example5_simulated.m

Note that the working directory here is assumed to be rss/examples/example5. Please modify example5_simulated.m accordingly if a different directory is used.

Step-by-step illustration

Step 1: Download input data

Please download and save example5_simulated_data.mat in the working directory. Please contact me (xiangzhu[at]uchicago.edu) if you have trouble accessing this file.

The data file example5_simulated_data.mat contains the following elements that will be used by RSS.

>> example_data = matfile('example5_simulated_data.mat');
>> example_data

example_data =
  matlab.io.MatFile
  Properties:
      Properties.Source: 'example5_simulated_data.mat'
    Properties.Writable: false
                      R: [12758x12758 double]
                betahat: [12758x1     double]
                     se: [12758x1     double]
                   snps: [676x1       double]

R: 12758 by 12758 matrix, LD matrix estimated from a reference panel
betahat: 12758 by 1 vector, single-SNP effect size estimate for each SNP
se: 12758 by 1 vector, standard errors of the single-SNP effect size estimates
snps: 676 by 1 vector, indices of SNPs that are “inside” the target pathway

Typically RSS only requires these four input variables for enrichment and prioritization analysis. To further reduce computation, RSS uses the sparse matrix SiRiS instead of R:

p     = length(betahat);
Si    = 1 ./ se(:);
SiRiS = sparse(repmat(Si,1,p) .* R .* repmat(Si',p,1));
clear Si R;

Note that example5_simulated_data.mat also contains ground truth about this simulated dataset: {gamma, pve, thetatype}. These variables are NOT used by RSS in any step of analysis, and are ONLY used to verify results in the end.

Step 2: Specify a grid for hyper-parameters

The total computational cost of RSS is proportional to the grid size, and thus please consider reducing the size of theta0 and/or theta if you want to finish running example5_simulated.m faster. (It takes 4 hours to complete the analysis on a single CPU using the grid below.)

% specify hyper-parameters
theta0 = (-4.5:0.05:-3.5)';      % grid for the genome-wide log-odds (base 10)
theta  = (1.5:0.05:2.5)';        % grid for the log-fold enrichment (base 10)
sigb   = 1;                      % prior SD of genetic effects

Step 3: Initialize variational parameters

As in Zhu and Stephens (2018), here we use a simple random start to set initial values of variational parameters {alpha,mu} for the baseline model.

% initialize the variational parameters
myseed = 200;

rng(myseed, 'twister');
alpha0 = rand(p,1);
alpha0 = alpha0 ./ sum(alpha0);

rng(myseed+1, 'twister');
mu0 = randn(p,1);

n0         = length(theta0);
alpha0_rss = repmat(alpha0, [1 n0]);
mu0_rss    = repmat(mu0, [1 n0]);

Step 4: Fit the baseline model

Since we set sigb=1 in Step 2, we use a wrapper function of RSS null_wrapper_fixsb.m that fixes sigb=1 for all elements in theta0. (In Example 5 Part B we will use a different wrapper function that can give a different sigb value for each element in theta0.)

[b_logw,b_alpha,b_mu,b_s] = null_wrapper_fixsb('squarem',betahat,se,SiRiS,...
                                               sigb,theta0,alpha0_rss,mu0_rss);

Step 5: Fit the enrichment model

Since we set sigb=1 in Step 2, we use a wrapper function gsea_wrapper_fixsb.m that fixes the value of sigb for all elements in theta0 and theta. (In Example 5 Part B we will use a different wrapper function that can give a different sigb value for each combination of theta0 and theta.)

[log10_bf,e_logw,e_alpha,e_mu,e_s] = gsea_wrapper_fixsb('squarem',betahat,se,SiRiS,snps,...
                                                        sigb,theta0,theta,b_logw,b_alpha,b_mu);

Note that here we use the variational parameter estimates {b_alpha,b_mu} from the baseline model fitting (Step 4) to set initial values of variational parameters for the enrichment model (Step 5).

Step 6: Perform gene prioritization analysis

The output of baseline and enrichment model fitting can be further used to prioritize genes within an enriched gene set. We provide a wrapper function compute_pip.m for prioritization analysis.

% specify pre-defined genomic segments (genes)
segs_file = 'Aseg_chr16.mat';

% generate gene-level results under baseline model
[b_p1,b_p2] = compute_pip(segs_file,b_logw,b_alpha);

% generate gene-level results under enrichment model
[e_p1,e_p2] = compute_pip(segs_file,e_logw_vec,e_alpha_mat);

Here [b/e]_p_[1/2] denotes the posterior probability of each locus containing at lease 1/2 trait-associated SNPs under the baseline/enrichment model.

The locus information is available in the file Aseg_chr16.mat. Please download and save it in the working directory. Specifically, Aseg_chr16.mat contains the following elements:

>> load Aseg_chr16.mat
>> whos
  Name                Size              Bytes  Class     Attributes

  Aseg            12758x878            391080  double    sparse
  chr             12758x1              102064  double
  gene_chr          878x1                7024  double
  gene_id           878x1                7024  double
  gene_start        878x1                7024  double
  gene_stop         878x1                7024  double
  pos             12758x1              102064  double

>> unique(full(Aseg(:)))'
ans =
     0     1

As shown above, there are 12,758 SNPs and 878 genes (both based on NCBI Build 35) in this simulated dataset. Here we assign a SNP j to a gene g (i.e. Aseg(j,g)==1) if and only if SNP j is within 100 kb of transcribed region of gene g (i.e. [gene_start-100e3 gene_stop+100e3]).

Step 7: Save analysis results

At the end of running example5_simulated.m, all analysis results are saved as example5_simulated_results.mat in the working directory. To help verify that your own results are as expected, we provide our result file example5_simulated_results.mat. (There may be some differences between two result files, especially when a smaller grid was used in Step 2.)

>> load example5_simulated_results.mat
>> whos
  Name              Size                  Bytes  Class     Attributes

  b_alpha       12758x21                2143344  double
  b_logw           21x1                     168  double
  b_mu          12758x21                2143344  double
  b_p1            878x1                    7024  double
  b_p2            878x1                    7024  double
  b_s           12758x21                2143344  double
  e_alpha       12758x21x21            45010224  double
  e_logw           21x21                   3528  double
  e_mu          12758x21x21            45010224  double
  e_p1            878x1                    7024  double
  e_p2            878x1                    7024  double
  e_s           12758x21x21            45010224  double
  log10_bf          1x1                       8  double
  rss_time          1x1                       8  double

Step 8: Understand analysis results

There are four groups of output variables in the result file above.

The first group consists of estimated variational parameters b_alpha & b_mu & b_s under the baseline model, where b_alpha(:,i) & b_mu(:,i) & b_s(:,i) correspond to estimation under the hyper-parameter setting [sigb, theta0(i)].

The second group consists of estimated variational parameters e_alpha & e_mu & e_s under the enrichment model, where e_alpha(:,i,j) & e_mu(:,i,j) & e_s(:,i,j) correspond to estimation under the hyper-parameter setting [sigb, theta0(i), theta(j)].

The third group consists of estimated variational lower bounds under the baseline (b_logw) and enrichment (e_logw) model, and log 10 enrichment Bayes factor (log10_bf).

Since this dataset is simulated from the enrichment model, RSS yields a large log 10 enrichment Bayes factor as expected:

>> fprintf('Log 10 enrichment Bayes factor: %.4f ...\n', log10_bf);
Log 10 enrichment Bayes factor: 19.3335 ...

Further, by combining *_logw with *_alpha, we can estimate the number of SNPs with non-zero effect, and compare with the truth:

>> fprintf('Total number of SNPs with non-zero effect: %d ...\n', sum(example_data.gamma));
Total number of SNPs with non-zero effect: 5 ...

>> b_w   = exp(b_logw - max(b_logw(:)));
>> b_w   = b_w / sum(b_w(:));
>> b_ens = sum(b_alpha);
>> b_ens = dot(b_ens(:), b_w(:));

>> e_w   = exp(e_logw - max(e_logw(:)));
>> e_w   = e_w / sum(e_w(:));
>> e_ens = sum(e_alpha);
>> e_ens = dot(e_ens(:), e_w(:));

>> disp([sum(example_data.gamma) b_ens e_ens])
    5.0000    5.0025    6.1924

The fourth group consists of gene prioritization results under the baseline (b_p1 & b_p2) and enrichment model (e_p1 & e_p2). Following Zhu and Stephens (2018), here we define a gene as “trait-associated” (i.e. gene_cau(j)==1) if at least one SNP within 100 kb of the transcribed region of this gene has non-zero effect.

>> disp([b_p1(1:10) e_p1(1:10) gene_cau(1:10)]);
    0.0005    0.0002         0
    0.0033    0.0013         0
    0.0007    0.0244         0
    0.0044    1.0000    1.0000
    0.0146    0.2846    1.0000
    0.0018    0.0007         0
    0.0012    0.0230         0
    0.0004    0.0167         0
    0.0001    0.0009         0
    1.0000    1.0000    1.0000

We can see that the enrichment model has higher power to identify trait-associated genes than the baseline model, since it exploits the underlying enrichment signal.

>> disp([b_p1(gene_cau==1) e_p1(gene_cau==1)]);
    0.0044    1.0000
    0.0146    0.2846
    1.0000    1.0000
    1.0000    1.0000
    0.6437    0.9949
    0.6437    0.9949
    0.0131    0.2308
    0.0113    0.1893
    0.6437    0.9949
    0.6437    0.9949
    1.0000    1.0000
    0.6441    0.9949
    0.9970    1.0000
    0.9970    1.0000
    0.2892    0.9947
    0.9970    1.0000