Last updated: 2019-07-09
Checks: 2 0
Knit directory: cause/
This reproducible R Markdown analysis was created with workflowr (version 1.4.0.9000). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Untracked files:
Untracked: (tumble-track's conflicted copy 2019-07-09).Rhistory
Untracked: DESCRIPTION (tumble-track's conflicted copy 2019-07-09)
Untracked: analysis/figure/
Untracked: analysis/gwas_pairs.Rmd
Untracked: analysis/ldl_cad_cache/
Untracked: docs/WNAR_2019.pdf
Untracked: example_data/chr22_AF0.05_0.1.RDS
Untracked: example_data/chr22_AF0.05_snpdata.RDS
Untracked: gwas_data/
Untracked: ll_v7_notes.Rmd
Untracked: src/RcppExports.o
Untracked: src/log_likelihood_functions.o
Untracked: temp.txt
Unstaged changes:
Modified: R/map_pi_rho.R
Deleted: test_mixSQP.R
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 6c171b2 | Jean Morrison | 2019-07-10 | wflow_publish(“analysis/simulations.Rmd”) |
Simulations are run using the causeSims
R package. The package can be downloaded using
devtools::install_github("jean997/causeSims")
This package relies on multiple R packages that are not on CRAN, so you will need to download these separately. For detailed installation instructions including installation of supporting R packages for other methods and a detailed description of the package see here. There you can also find a walk through of a single simulation and explanations of the functions contained in the package.
We conduct simulations using the Dynamical Statisical Comparisons software. You will need to have DSC installed and it will be helpful to take a look at the tutorial.
These instructions specifically generate the results presented in our paper. However, by modifying the dsc files, you can run your own experiments. We will describe the dsc files later on this page. To set up the analysis create a working directory that you will run the simulations in. Make sure all the software in the previous section is installed.
Download the LD data that we will use here. Put the two files into a subdirectory of the working directory named data/
. Download two DSC files: power and false positives. Put these into the working directory.
If you are running on a compute cluster (highly recomended) you will also need a config file. This file is set up to run on the University of Chicago research computing cluster. You will need to modify it to work with your cluster set up. Even if you are running on the University of Chicago cluster, you will need to modify the file to use your own account and home directory. Read about using DSC with a compute cluster here (We are using the “On Host” option). Put your modified config file in the working directory.
To generate the results in the paper, from your working directory use the following commands
nohup dsc --replicate 100 --host config.yml -c 4 pwr.dsc > pwr_dsc.out &
nohup dsc --replicate 100 --host config.yml -c 4 false_positives.dsc > fp_dsc.out &
Coming Soon!
We will now go in detail through one of the DSC files so you can see what is happening and how to make changes to run your own experiments. We will look at false_positives.dsc
but power.dsc
is very similar.
The DSC statement is at the top:
DSC:
define:
mr: ivw, egger, mrp, wm, twas, wm, gsmr
run: simulate*gw_sig,
simulate*LCV,
simulate*mr,
simulate*cause_params*cause
replicate: 100
output: fp
This statement tells DSC what to run. In define
we define a group of modules. These will turn up later in the file. In run
we tell DSC what to run. For example, simulate*gw_sig
means that DSC should run the simulate
module followed by the gw_sig
module which will use the results of the simulate
module. replicate
indicates the number of replicates to run (though this is over-ridden by the --replicate
flag in the DSC call above) and output
names the output directory.
The rest of the document defines modules. All of our modules use R code and most of them rely on the causeSims
R package.
The simulation module:
simulate: R(library(causeSims);
snps <- readRDS("data/chr19_snpdata_hm3only.RDS");
evd_list <- readRDS("data/evd_list_chr19_hm3.RDS");
dat <- sum_stats(snps, evd_list,
n_copies = 30,
n1 = n1, n2=n2,
h1=h1, h2=h1,
neffect1 = neffect1, neffect2 = neffect2,
tau = tau, omega = omega, q = q,
cores = 4, ld_prune_pval_thresh = 0.01,
r2_thresh = 0.1))
q: 0.1, 0.2, 0.3, 0.4, 0.5
omega: 0.02, 0.05
tau: 0
n1: 12000, 40000
n2: 12000, 40000
neffect1: 1000
neffect2: 1000
h1: 0.25
h2: 0.25
$sim_params: c(q = q, omega = omega, tau = tau, h1 = h1, h2 = h2, n1 = n1, n2 = n2, neffect1 = neffect1, neffect2 =neffect2)
$dat: dat
The R code contained in R()
is the guts of this module. A data object named dat
is generated using the sum_stats
function from the causeSims
R pacakge. Below this, we specify parameters used to simulate the data. Most important are q
, omega
, and tau
. q
is the proportion of variants acting through \(U\) while omega
and tau
are standardized versions of \(\gamma\) and \(\eta\) in the paper. tau
is the proportion of \(Y\) heritability explained by the causal effect of \(M\) on \(Y\) times the sign of \(\gamma\), \[
\tau = sign(\gamma)\frac{\gamma^2 h^2_{M}}{h^{2}_{Y}},
\] and omega
is the proportion of \(Y\) heritability explained by \(U\), divided by \(q\) and multiplied by the sign of \(\eta\). \[
\omega = sign(\eta)\frac{\eta^2 h^2_{M}}{h^{2}_{Y}}
\] This parameterization is sometimes more convenient for running simulations that using \(\eta\) and \(\gamma\). However, the sum_stats
function will accept either omega
and tau
arguments or eta
and gamma
. Note that here we have specified that the function can use 4 cores. This should match the n_cpu
parameter in the simulate
section of the config file. The sum_stats
function generates data both with and without LD. In this file we only analyze the “with LD” data but we will describe how to make modifications to analyze both or only the “no LD” data.
The gw_sig
module is a helper module. This module computes the GCP (a parameter defined by LCV (O’Connor and Price 2019)) and computes the number of genome-wide significant variants. It also stores all the simulation parameters which makes it easier to extract them. Extracting parameters from the simulate
module will require reading in the entire data set, objects created by the gw_sig
module are smaller and faster to read.
gw_sig: R(library(causeSims);
m_sig_nold <- with($(dat), sum(p_value_nold < thresh ));
y_sig_nold <- with($(dat), sum(2*pnorm(-abs(beta_hat_2_nold/seb2)) < thresh ));
m_sig <- with($(dat), sum(p_value < thresh & ld_prune==TRUE));
params <- $(sim_params);
tau <- as.numeric(params["tau"]);
omega <- as.numeric(params["omega"]);
q <- as.numeric(params["q"]);
h1 <- as.numeric(params["h1"]);
h2 <- as.numeric(params["h2"]);
neffect1 <- as.numeric(params["neffect1"]);
neffect2 <- as.numeric(params["neffect2"]);
gamma <- sqrt(tau*sum(h2)/sum(h1))*sign(tau);
eta <- sqrt(abs(omega)*sum(h2)/sum(h1))*sign(omega);
#LCV parameters
if(q == 0 & tau == 0){
q.1 <- q.2 <- 0;
gcp <- 0;
}else if(q == 0){
q.1 <- 1;
q.2 <- 0;
gcp <- 1;
}else{
q.1 <- sqrt(q);
q.2 <- eta * sqrt(q) * sqrt(h1) / sqrt(h2);
gcp = (log(q.2^2) - log(q.1^2)) / (log(q.2^2) + log(q.1^2));
};
gcp_mom <- gcp_moments($(dat), h1, h2)
)
thresh: 5e-8
$m_sig: m_sig
$m_sig_nold: m_sig_nold
$y_sig_nold: y_sig_nold
$q: q
$omega: omega
$tau: tau
$gamma: gamma
$eta: eta
$h1: h1
$h2: h2
$n1: as.numeric(params["n1"])
$n2: as.numeric(params["n2"])
$neffect1: as.numeric(params["neffect1"])
$neffect2: as.numeric(params["neffect2"])
$lcv_q1: q.1
$lcv_q2: q.2
$lcv_gcp: gcp
$lcv_gcp_mom: gcp_mom$gcp
We have divided CAUSE analysis into parameter estimation and model fitting steps. This makes it easier to experiment with changes only to the parameter estimation step.
cause_params: R(library(causeSims);
p1 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=FALSE);
)
null_wt: 10
$cause_params_ld: p1
cause: R(library(causeSims);
library(cause);
cause_res <- cause_sims($(dat), $(cause_params_ld), no_ld = FALSE);
z <- -1*summary(cause_res)$z;
p <- pnorm(-z);
quants <- summary(cause_res)$quants)
$cause_res: cause_res
$sigma_g: cause_res$sigma_g
$eta_med_2: quants[[1]][1,2]
$q_med_2: quants[[1]][1,3]
$eta_med_3: quants[[2]][1,2]
$gamma_med_3: quants[[2]][1,1]
$q_med_3: quants[[2]][1,3]
$z: z
$p: p
Both modules use wrapper functions from the causeSims
package. These are very simple wrappers that make it easy to run CAUSE on the simulated data format. Note that here we use no_ld = FALSE
in both modules. If you want to use the data that is simulated without LD instead you can change no_ld = TRUE
. You could get results for both using the following set of modules.
cause_params: R(library(causeSims);
p1 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=FALSE);
p2 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=TRUE);
)
null_wt: 10
$cause_params_ld: p1
$cause_params_nold: p2
cause: R(library(causeSims);
library(cause);
if(no_ld==FALSE){
cause_res <- cause_sims($(dat), $(cause_params_ld), no_ld = no_ld);
else{
cause_res <- cause_sims($(dat), $(cause_params_nold), no_ld = no_ld);
}
z <- -1*summary(cause_res)$z;
p <- pnorm(-z);
quants <- summary(cause_res)$quants)
$no_ld: TRUE, FALSE
$cause_res: cause_res
$sigma_g: cause_res$sigma_g
$eta_med_2: quants[[1]][1,2]
$q_med_2: quants[[1]][1,3]
$eta_med_3: quants[[2]][1,2]
$gamma_med_3: quants[[2]][1,1]
$q_med_3: quants[[2]][1,3]
$z: z
$p: p
There are many modules for alternative MR methods. Modules for IVW regression, Egger regression, MR-PRESSO, and the weighted median method are very similar and rely on wrappers from causeSims
. We can just look at one of these.
ivw: R(library(causeSims);
res <- ivw($(dat), p_val_thresh=thresh, no_ld = no_ld);
)
thresh: 5e-8
no_ld: FALSE
$z: res$z
$p: res$p
To run on the data without LD change the no_ld
parameter to no_ld: TRUE
or no_ld: TRUE, FALSE
to run both.
The GSMR module is a little more complicated
gsmr: R(library(causeSims);
evd_list <- readRDS("data/evd_list_chr19_hm3.RDS");
res <- gsmr_sims($(dat), evd_list, p_val_thresh = 5e-8, no_ld = FALSE);
if(!is.null(res)){
z <- res$bxy/res$bxy_se;
est <- res$bxy;
p <- res$bxy_pval;
}else{
z <- est <- p <- NA;
}
)
thresh: 5e-8
$z: z
$p: p
$est_gsmr: est
$gsmr: res
GSMR requires the LD structure and sometimes returns errors. In this case the gsmr_sims
wrapper returns NULL and we set results to missing. If no_ld = TRUE
the LD data in evd_list
is simply ignored.
Results from LCV are not compared with CAUSE in the paper because LCV is performing a different test than other MR methods. However, we have implemented wrappers and a module to run LCV.
LCV: R(library(causeSims);
res <- try(lcv_sims($(dat),no_ld = no_ld, sig.thresh = thresh));
if(class(res) == "try-error"){
res <- list(pval.gcpzero.2tailed=NA, gcp.pm = NA, gcp.pse = NA);
};
)
thresh: 30
no_ld: FALSE
$p: res$pval.gcpzero.2tailed
$gcp_mean: res$gcp.pm
$gcp_pse: res$gcp.pse
$gcp_obj: res
When LCV returns errors, we set results to missing.