Last updated: 2024-09-16
Checks: 7 0
Knit directory: rss/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200623)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 941a146. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (rmd/example_5a.Rmd
) and HTML
(docs/example_5a.html
) files. If you’ve configured a remote
Git repository (see ?wflow_git_remote
), click on the
hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | d2a171c | Xiang Zhu | 2024-07-03 | Build site. |
Rmd | 990f276 | Xiang Zhu | 2024-07-03 | wflow_publish("rmd/example_5a.Rmd") |
html | bab3f58 | Xiang Zhu | 2020-06-24 | Build site. |
html | ae24b64 | Xiang Zhu | 2020-06-24 | Build site. |
Rmd | 7014671 | Xiang Zhu | 2020-06-24 | wflow_publish("rmd/example_5a.Rmd") |
This is Part A of Example 5, which illustrates how to perform enrichment and prioritization analysis of GWAS summary statistics based on variational Bayes (VB) inference of RSS-BVSR model. This part describes an end-to-end analysis of a synthetic dataset used in simulation studies of Zhu and Stephens (2018). This part gives users a quick view of how RSS works in enrichment and prioritization analysis.
The input dataset here is simulated under the enrichment model of RSS, using real genotypes of 12,758 SNPs on chromosome 16 from 1458 individuals in the UK Blood Service Control Group (Wellcome Trust Case Control Consortium, 2007). Please see the caption of Supplementary Figure 1 in Zhu and Stephens (2018) for the simulation details.
To reproduce results of Example 5 Part A, please use the script example5_simulated.m
,
and follow the step-by-step guide below. Before running example5_simulated.m
,
please install the VB
subroutines of RSS. Please find installation instructions here.
Once the software is installed and the input data is downloaded, one should be able to run this part by simply typing the following line in a MATLAB console:
Note that the working directory here is assumed to be
rss/examples/example5
. Please modify example5_simulated.m
accordingly if a different directory is used.
All data files required to run this example are freely available at Zenodo . Please contact me if you have trouble accessing this file. After a complete download, you should see the following files.
$ tree ./
./
├── Aseg_chr16.mat
├── example5_simulated_data.mat
├── example5_simulated_results.mat
├── ibd2015_gsea_seed_459_path2641_squarem.mat
├── ibd2015_null_h_30_theta0_285_seed_459_squarem_step1.mat
├── ibd2015_null_h_30_theta0_288_seed_459_squarem_step1.mat
├── ibd2015_null_h_30_theta0_290_seed_459_squarem_step1.mat
├── ibd2015_path2641_genes.mat
├── ibd2015_path2641_genes_results.mat
├── ibd2015_sumstat.mat
└── ibd2015_sumstat_path2641.mat
0 directories, 11 files
The data file example5_simulated_data.mat
contains the following elements that will be used by RSS.
>> example_data = matfile('example5_simulated_data.mat');
>> example_data
example_data =
matlab.io.MatFile
Properties:
Properties.Source: 'example5_simulated_data.mat'
Properties.Writable: false
R: [12758x12758 double]
betahat: [12758x1 double]
se: [12758x1 double]
snps: [676x1 double]
R
: 12758 by 12758 matrix, LD matrix estimated from a
reference panelbetahat
: 12758 by 1 vector, single-SNP effect size
estimate for each SNPse
: 12758 by 1 vector, standard errors of the
single-SNP effect size estimatessnps
: 676 by 1 vector, indices of SNPs that are
“inside” the target pathwayTypically RSS only requires these four input variables for enrichment
and prioritization analysis. To further reduce computation, RSS uses the
sparse matrix SiRiS
instead of R
:
p = length(betahat);
Si = 1 ./ se(:);
SiRiS = sparse(repmat(Si,1,p) .* R .* repmat(Si',p,1));
clear Si R;
Note that example5_simulated_data.mat
also contains
ground truth about this simulated dataset:
{gamma, pve, thetatype}
. These variables are NOT used by
RSS in any step of analysis, and are ONLY used to verify results in the
end.
The total computational cost of RSS is proportional to the grid size,
and thus please consider reducing the size of theta0
and/or
theta
if you want to finish running example5_simulated.m
faster. (It takes 4 hours to complete the analysis on a single CPU using
the grid below.)
As in Zhu and
Stephens (2018), here we use a simple random start to set initial
values of variational parameters {alpha,mu}
for the
baseline model.
Since we set sigb=1
in Step 2, we use a wrapper function
of RSS null_wrapper_fixsb.m
that fixes sigb=1
for all elements in theta0
.
(In Example 5 Part B we will use a
different wrapper function that can give a different sigb
value for each element in theta0
.)
Since we set sigb=1
in Step 2, we use a wrapper function
gsea_wrapper_fixsb.m
that fixes the value of sigb
for all elements in
theta0
and theta
. (In Example 5 Part B we will use a different
wrapper function that can give a different sigb
value for
each combination of theta0
and theta
.)
[log10_bf,e_logw,e_alpha,e_mu,e_s] = gsea_wrapper_fixsb('squarem',betahat,se,SiRiS,snps,...
sigb,theta0,theta,b_logw,b_alpha,b_mu);
Note that here we use the variational parameter estimates
{b_alpha,b_mu}
from the baseline model fitting (Step 4) to
set initial values of variational parameters for the enrichment model
(Step 5).
The output of baseline and enrichment model fitting can be further
used to prioritize genes within an enriched gene set. We provide a
wrapper function compute_pip.m
for prioritization analysis.
% specify pre-defined genomic segments (genes)
segs_file = 'Aseg_chr16.mat';
% generate gene-level results under baseline model
[b_p1,b_p2] = compute_pip(segs_file,b_logw,b_alpha);
% generate gene-level results under enrichment model
[e_p1,e_p2] = compute_pip(segs_file,e_logw_vec,e_alpha_mat);
Here [b/e]_p_[1/2]
denotes the posterior probability of
each locus containing at lease 1/2 trait-associated SNPs under the
baseline/enrichment model.
The locus information is available in the file Aseg_chr16.mat
.
Please download and save it in the working directory. Specifically, Aseg_chr16.mat
contains the following elements:
>> load Aseg_chr16.mat
>> whos
Name Size Bytes Class Attributes
Aseg 12758x878 391080 double sparse
chr 12758x1 102064 double
gene_chr 878x1 7024 double
gene_id 878x1 7024 double
gene_start 878x1 7024 double
gene_stop 878x1 7024 double
pos 12758x1 102064 double
>> unique(full(Aseg(:)))'
ans =
0 1
As shown above, there are 12,758 SNPs and 878 genes (both based on
NCBI Build 35) in this simulated dataset. Here we assign a SNP
j
to a gene g
(i.e. Aseg(j,g)==1
)
if and only if SNP j
is within 100 kb of transcribed region
of gene g
(i.e. [gene_start-100e3 gene_stop+100e3]
).
At the end of running example5_simulated.m
,
all analysis results are saved as
example5_simulated_results.mat
in the working directory. To
help verify that your own results are as expected, we provide our result
file example5_simulated_results.mat
.
(There may be some differences between two result files, especially when
a smaller grid was used in Step 2.)
>> load example5_simulated_results.mat
>> whos
Name Size Bytes Class Attributes
b_alpha 12758x21 2143344 double
b_logw 21x1 168 double
b_mu 12758x21 2143344 double
b_p1 878x1 7024 double
b_p2 878x1 7024 double
b_s 12758x21 2143344 double
e_alpha 12758x21x21 45010224 double
e_logw 21x21 3528 double
e_mu 12758x21x21 45010224 double
e_p1 878x1 7024 double
e_p2 878x1 7024 double
e_s 12758x21x21 45010224 double
log10_bf 1x1 8 double
rss_time 1x1 8 double
There are four groups of output variables in the result file above.
The first group consists of estimated variational parameters
b_alpha
& b_mu
& b_s
under the baseline model, where b_alpha(:,i)
&
b_mu(:,i)
& b_s(:,i)
correspond to
estimation under the hyper-parameter setting
[sigb, theta0(i)]
.
The second group consists of estimated variational parameters
e_alpha
& e_mu
& e_s
under the enrichment model, where e_alpha(:,i,j)
&
e_mu(:,i,j)
& e_s(:,i,j)
correspond to
estimation under the hyper-parameter setting
[sigb, theta0(i), theta(j)]
.
The third group consists of estimated variational lower bounds under
the baseline (b_logw
) and enrichment (e_logw
)
model, and log 10 enrichment Bayes factor (log10_bf
).
Since this dataset is simulated from the enrichment model, RSS yields a large log 10 enrichment Bayes factor as expected:
>> fprintf('Log 10 enrichment Bayes factor: %.4f ...\n', log10_bf);
Log 10 enrichment Bayes factor: 19.3335 ...
Further, by combining *_logw
with *_alpha
,
we can estimate the number of SNPs with non-zero effect, and compare
with the truth:
>> fprintf('Total number of SNPs with non-zero effect: %d ...\n', sum(example_data.gamma));
Total number of SNPs with non-zero effect: 5 ...
>> b_w = exp(b_logw - max(b_logw(:)));
>> b_w = b_w / sum(b_w(:));
>> b_ens = sum(b_alpha);
>> b_ens = dot(b_ens(:), b_w(:));
>> e_w = exp(e_logw - max(e_logw(:)));
>> e_w = e_w / sum(e_w(:));
>> e_ens = sum(e_alpha);
>> e_ens = dot(e_ens(:), e_w(:));
>> disp([sum(example_data.gamma) b_ens e_ens])
5.0000 5.0025 6.1924
The fourth group consists of gene prioritization results under the
baseline (b_p1
& b_p2
) and enrichment
model (e_p1
& e_p2
). Following Zhu and
Stephens (2018), here we define a gene as “trait-associated”
(i.e. gene_cau(j)==1
) if at least one SNP within 100 kb of
the transcribed region of this gene has non-zero effect.
>> disp([b_p1(1:10) e_p1(1:10) gene_cau(1:10)]);
0.0005 0.0002 0
0.0033 0.0013 0
0.0007 0.0244 0
0.0044 1.0000 1.0000
0.0146 0.2846 1.0000
0.0018 0.0007 0
0.0012 0.0230 0
0.0004 0.0167 0
0.0001 0.0009 0
1.0000 1.0000 1.0000
We can see that the enrichment model has higher power to identify trait-associated genes than the baseline model, since it exploits the underlying enrichment signal.
>> disp([b_p1(gene_cau==1) e_p1(gene_cau==1)]);
0.0044 1.0000
0.0146 0.2846
1.0000 1.0000
1.0000 1.0000
0.6437 0.9949
0.6437 0.9949
0.0131 0.2308
0.0113 0.1893
0.6437 0.9949
0.6437 0.9949
1.0000 1.0000
0.6441 0.9949
0.9970 1.0000
0.9970 1.0000
0.2892 0.9947
0.9970 1.0000
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.1 (2024-06-14)
os macOS Sonoma 14.6.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2024-09-16
pandoc 3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
bslib 0.8.0 2024-07-29 [1] CRAN (R 4.4.0)
cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.0)
callr 3.7.6 2024-03-25 [1] CRAN (R 4.4.0)
cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.4.0)
digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.4.0)
evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.0)
getPass 0.2-4 2023-12-10 [1] CRAN (R 4.4.0)
git2r 0.33.0 2023-11-26 [1] CRAN (R 4.4.0)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
httpuv 1.6.15 2024-03-26 [1] CRAN (R 4.4.0)
httr 1.4.7 2023-08-15 [1] CRAN (R 4.4.0)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.4.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)
later 1.3.2 2023-12-06 [1] CRAN (R 4.4.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.4.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.4.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
pkgbuild 1.4.4 2024-03-17 [1] CRAN (R 4.4.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
pkgload 1.4.0 2024-06-28 [1] CRAN (R 4.4.0)
processx 3.8.4 2024-03-16 [1] CRAN (R 4.4.0)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.4.0)
promises 1.3.0 2024-04-05 [1] CRAN (R 4.4.0)
ps 1.8.0 2024-09-12 [1] CRAN (R 4.4.1)
purrr 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
Rcpp 1.0.13 2024-07-17 [1] CRAN (R 4.4.0)
remotes 2.5.0 2024-03-17 [1] CRAN (R 4.4.0)
rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
rmarkdown 2.28 2024-08-17 [1] CRAN (R 4.4.0)
rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)
rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
sass 0.4.9.9000 2024-07-11 [1] Github (rstudio/sass@9228fcf)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
shiny 1.9.1 2024-08-01 [1] CRAN (R 4.4.0)
stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
stringr 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
tibble 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.4.0)
usethis 3.0.0 2024-07-29 [1] CRAN (R 4.4.0)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
whisker 0.4.1 2022-12-05 [1] CRAN (R 4.4.0)
workflowr * 1.7.1 2023-08-23 [1] CRAN (R 4.4.0)
xfun 0.47 2024-08-17 [1] CRAN (R 4.4.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.4.0)
yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)
[1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────