Last updated: 2020-06-24
Checks: 2 0
Knit directory: rss/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 1e806af. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rproj.user/
Ignored: .spelling
Ignored: examples/example5/.Rhistory
Ignored: examples/example5/Aseg_chr16.mat
Ignored: examples/example5/example5_simulated_data.mat
Ignored: examples/example5/example5_simulated_results.mat
Ignored: examples/example5/ibd2015_path2641_genes_results.mat
Untracked files:
Untracked: docs_old/
Unstaged changes:
Modified: rmd/_site.yml
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (rmd/example_5a.Rmd
) and HTML (docs/example_5a.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | ae24b64 | Xiang Zhu | 2020-06-24 | Build site. |
Rmd | 7014671 | Xiang Zhu | 2020-06-24 | wflow_publish(“rmd/example_5a.Rmd”) |
This is Part A of Example 5, which illustrates how to perform enrichment and prioritization analysis of GWAS summary statistics based on variational Bayes (VB) inference of RSS-BVSR model. This part describes an end-to-end analysis of a synthetic dataset used in simulation studies of Zhu and Stephens (2018). This part gives users a quick view of how RSS works in enrichment and prioritization analysis.
The input dataset here is simulated under the enrichment model of RSS, using real genotypes of 12,758 SNPs on chromosome 16 from 1458 individuals in the UK Blood Service Control Group (Wellcome Trust Case Control Consortium, 2007). Please see the caption of Supplementary Figure 1 in Zhu and Stephens (2018) for the simulation details.
To reproduce results of Example 5 Part A, please use the script example5_simulated.m
, and follow the step-by-step guide below. Before running example5_simulated.m
, please install the VB subroutines of RSS. Please find installation instructions here.
Once the software is installed and the input data is downloaded, one should be able to run this part by simply typing the following line in a MATLAB console:
Note that the working directory here is assumed to be rss/examples/example5
. Please modify example5_simulated.m
accordingly if a different directory is used.
Please download and save example5_simulated_data.mat
in the working directory. Please contact me (xiangzhu[at]uchicago.edu
) if you have trouble accessing this file.
The data file example5_simulated_data.mat
contains the following elements that will be used by RSS.
>> example_data = matfile('example5_simulated_data.mat');
>> example_data
example_data =
matlab.io.MatFile
Properties:
Properties.Source: 'example5_simulated_data.mat'
Properties.Writable: false
R: [12758x12758 double]
betahat: [12758x1 double]
se: [12758x1 double]
snps: [676x1 double]
R
: 12758 by 12758 matrix, LD matrix estimated from a reference panelbetahat
: 12758 by 1 vector, single-SNP effect size estimate for each SNPse
: 12758 by 1 vector, standard errors of the single-SNP effect size estimatessnps
: 676 by 1 vector, indices of SNPs that are “inside” the target pathwayTypically RSS only requires these four input variables for enrichment and prioritization analysis. To further reduce computation, RSS uses the sparse matrix SiRiS
instead of R
:
p = length(betahat);
Si = 1 ./ se(:);
SiRiS = sparse(repmat(Si,1,p) .* R .* repmat(Si',p,1));
clear Si R;
Note that example5_simulated_data.mat
also contains ground truth about this simulated dataset: {gamma, pve, thetatype}
. These variables are NOT used by RSS in any step of analysis, and are ONLY used to verify results in the end.
The total computational cost of RSS is proportional to the grid size, and thus please consider reducing the size of theta0
and/or theta
if you want to finish running example5_simulated.m
faster. (It takes 4 hours to complete the analysis on a single CPU using the grid below.)
As in Zhu and Stephens (2018), here we use a simple random start to set initial values of variational parameters {alpha,mu}
for the baseline model.
Since we set sigb=1
in Step 2, we use a wrapper function of RSS null_wrapper_fixsb.m
that fixes sigb=1
for all elements in theta0
. (In Example 5 Part B we will use a different wrapper function that can give a different sigb
value for each element in theta0
.)
Since we set sigb=1
in Step 2, we use a wrapper function gsea_wrapper_fixsb.m
that fixes the value of sigb
for all elements in theta0
and theta
. (In Example 5 Part B we will use a different wrapper function that can give a different sigb
value for each combination of theta0
and theta
.)
[log10_bf,e_logw,e_alpha,e_mu,e_s] = gsea_wrapper_fixsb('squarem',betahat,se,SiRiS,snps,...
sigb,theta0,theta,b_logw,b_alpha,b_mu);
Note that here we use the variational parameter estimates {b_alpha,b_mu}
from the baseline model fitting (Step 4) to set initial values of variational parameters for the enrichment model (Step 5).
The output of baseline and enrichment model fitting can be further used to prioritize genes within an enriched gene set. We provide a wrapper function compute_pip.m
for prioritization analysis.
% specify pre-defined genomic segments (genes)
segs_file = 'Aseg_chr16.mat';
% generate gene-level results under baseline model
[b_p1,b_p2] = compute_pip(segs_file,b_logw,b_alpha);
% generate gene-level results under enrichment model
[e_p1,e_p2] = compute_pip(segs_file,e_logw_vec,e_alpha_mat);
Here [b/e]_p_[1/2]
denotes the posterior probability of each locus containing at lease 1/2 trait-associated SNPs under the baseline/enrichment model.
The locus information is available in the file Aseg_chr16.mat
. Please download and save it in the working directory. Specifically, Aseg_chr16.mat
contains the following elements:
>> load Aseg_chr16.mat
>> whos
Name Size Bytes Class Attributes
Aseg 12758x878 391080 double sparse
chr 12758x1 102064 double
gene_chr 878x1 7024 double
gene_id 878x1 7024 double
gene_start 878x1 7024 double
gene_stop 878x1 7024 double
pos 12758x1 102064 double
>> unique(full(Aseg(:)))'
ans =
0 1
As shown above, there are 12,758 SNPs and 878 genes (both based on NCBI Build 35) in this simulated dataset. Here we assign a SNP j
to a gene g
(i.e. Aseg(j,g)==1
) if and only if SNP j
is within 100 kb of transcribed region of gene g
(i.e. [gene_start-100e3 gene_stop+100e3]
).
At the end of running example5_simulated.m
, all analysis results are saved as example5_simulated_results.mat
in the working directory. To help verify that your own results are as expected, we provide our result file example5_simulated_results.mat
. (There may be some differences between two result files, especially when a smaller grid was used in Step 2.)
>> load example5_simulated_results.mat
>> whos
Name Size Bytes Class Attributes
b_alpha 12758x21 2143344 double
b_logw 21x1 168 double
b_mu 12758x21 2143344 double
b_p1 878x1 7024 double
b_p2 878x1 7024 double
b_s 12758x21 2143344 double
e_alpha 12758x21x21 45010224 double
e_logw 21x21 3528 double
e_mu 12758x21x21 45010224 double
e_p1 878x1 7024 double
e_p2 878x1 7024 double
e_s 12758x21x21 45010224 double
log10_bf 1x1 8 double
rss_time 1x1 8 double
There are four groups of output variables in the result file above.
The first group consists of estimated variational parameters b_alpha
& b_mu
& b_s
under the baseline model, where b_alpha(:,i)
& b_mu(:,i)
& b_s(:,i)
correspond to estimation under the hyper-parameter setting [sigb, theta0(i)]
.
The second group consists of estimated variational parameters e_alpha
& e_mu
& e_s
under the enrichment model, where e_alpha(:,i,j)
& e_mu(:,i,j)
& e_s(:,i,j)
correspond to estimation under the hyper-parameter setting [sigb, theta0(i), theta(j)]
.
The third group consists of estimated variational lower bounds under the baseline (b_logw
) and enrichment (e_logw
) model, and log 10 enrichment Bayes factor (log10_bf
).
Since this dataset is simulated from the enrichment model, RSS yields a large log 10 enrichment Bayes factor as expected:
>> fprintf('Log 10 enrichment Bayes factor: %.4f ...\n', log10_bf);
Log 10 enrichment Bayes factor: 19.3335 ...
Further, by combining *_logw
with *_alpha
, we can estimate the number of SNPs with non-zero effect, and compare with the truth:
>> fprintf('Total number of SNPs with non-zero effect: %d ...\n', sum(example_data.gamma));
Total number of SNPs with non-zero effect: 5 ...
>> b_w = exp(b_logw - max(b_logw(:)));
>> b_w = b_w / sum(b_w(:));
>> b_ens = sum(b_alpha);
>> b_ens = dot(b_ens(:), b_w(:));
>> e_w = exp(e_logw - max(e_logw(:)));
>> e_w = e_w / sum(e_w(:));
>> e_ens = sum(e_alpha);
>> e_ens = dot(e_ens(:), e_w(:));
>> disp([sum(example_data.gamma) b_ens e_ens])
5.0000 5.0025 6.1924
The fourth group consists of gene prioritization results under the baseline (b_p1
& b_p2
) and enrichment model (e_p1
& e_p2
). Following Zhu and Stephens (2018), here we define a gene as “trait-associated” (i.e. gene_cau(j)==1
) if at least one SNP within 100 kb of the transcribed region of this gene has non-zero effect.
>> disp([b_p1(1:10) e_p1(1:10) gene_cau(1:10)]);
0.0005 0.0002 0
0.0033 0.0013 0
0.0007 0.0244 0
0.0044 1.0000 1.0000
0.0146 0.2846 1.0000
0.0018 0.0007 0
0.0012 0.0230 0
0.0004 0.0167 0
0.0001 0.0009 0
1.0000 1.0000 1.0000
We can see that the enrichment model has higher power to identify trait-associated genes than the baseline model, since it exploits the underlying enrichment signal.