Last updated: 2019-07-09

Checks: 2 0

Knit directory: cause/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0.9000). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:   (tumble-track's conflicted copy 2019-07-09).Rhistory
    Untracked:  DESCRIPTION (tumble-track's conflicted copy 2019-07-09)
    Untracked:  analysis/figure/
    Untracked:  analysis/gwas_pairs.Rmd
    Untracked:  analysis/ldl_cad_cache/
    Untracked:  docs/WNAR_2019.pdf
    Untracked:  example_data/chr22_AF0.05_0.1.RDS
    Untracked:  example_data/chr22_AF0.05_snpdata.RDS
    Untracked:  gwas_data/
    Untracked:  ll_v7_notes.Rmd
    Untracked:  src/RcppExports.o
    Untracked:  src/log_likelihood_functions.o
    Untracked:  temp.txt

Unstaged changes:
    Modified:   R/map_pi_rho.R
    Deleted:    test_mixSQP.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd 6c171b2 Jean Morrison 2019-07-10 wflow_publish(“analysis/simulations.Rmd”)

Software Installation

Simulations are run using the causeSims R package. The package can be downloaded using

devtools::install_github("jean997/causeSims")

This package relies on multiple R packages that are not on CRAN, so you will need to download these separately. For detailed installation instructions including installation of supporting R packages for other methods and a detailed description of the package see here. There you can also find a walk through of a single simulation and explanations of the functions contained in the package.

We conduct simulations using the Dynamical Statisical Comparisons software. You will need to have DSC installed and it will be helpful to take a look at the tutorial.

Set Up The Analysis Directory

These instructions specifically generate the results presented in our paper. However, by modifying the dsc files, you can run your own experiments. We will describe the dsc files later on this page. To set up the analysis create a working directory that you will run the simulations in. Make sure all the software in the previous section is installed.

Download the LD data that we will use here. Put the two files into a subdirectory of the working directory named data/. Download two DSC files: power and false positives. Put these into the working directory.

If you are running on a compute cluster (highly recomended) you will also need a config file. This file is set up to run on the University of Chicago research computing cluster. You will need to modify it to work with your cluster set up. Even if you are running on the University of Chicago cluster, you will need to modify the file to use your own account and home directory. Read about using DSC with a compute cluster here (We are using the “On Host” option). Put your modified config file in the working directory.

Run DSC

To generate the results in the paper, from your working directory use the following commands

nohup dsc --replicate 100 --host config.yml -c 4 pwr.dsc > pwr_dsc.out &
nohup dsc --replicate 100 --host config.yml -c 4 false_positives.dsc > fp_dsc.out &

Make Plots

Coming Soon!

What’s in the DSC Files?

We will now go in detail through one of the DSC files so you can see what is happening and how to make changes to run your own experiments. We will look at false_positives.dsc but power.dsc is very similar.

The DSC statement is at the top:

DSC:
    define:
        mr: ivw, egger, mrp, wm, twas, wm, gsmr
    run: simulate*gw_sig,
         simulate*LCV,
         simulate*mr,
         simulate*cause_params*cause
    replicate: 100
    output: fp

This statement tells DSC what to run. In define we define a group of modules. These will turn up later in the file. In run we tell DSC what to run. For example, simulate*gw_sig means that DSC should run the simulate module followed by the gw_sig module which will use the results of the simulate module. replicate indicates the number of replicates to run (though this is over-ridden by the --replicate flag in the DSC call above) and output names the output directory.

The rest of the document defines modules. All of our modules use R code and most of them rely on the causeSims R package.

Simulation Modules

The simulation module:

simulate: R(library(causeSims);
            snps <- readRDS("data/chr19_snpdata_hm3only.RDS");
            evd_list <- readRDS("data/evd_list_chr19_hm3.RDS");
            dat <- sum_stats(snps, evd_list,
                  n_copies = 30,
                  n1 = n1, n2=n2,
                  h1=h1, h2=h1,
                  neffect1 = neffect1, neffect2 = neffect2,
                  tau = tau, omega = omega, q = q,
                  cores = 4, ld_prune_pval_thresh = 0.01,
                  r2_thresh = 0.1))

     q: 0.1, 0.2, 0.3, 0.4, 0.5
     omega: 0.02, 0.05
     tau: 0
     n1: 12000, 40000
     n2: 12000, 40000
     neffect1: 1000
     neffect2: 1000
     h1: 0.25
     h2: 0.25
     $sim_params: c(q = q, omega = omega, tau = tau,  h1 = h1, h2 = h2, n1 = n1, n2 = n2, neffect1 = neffect1, neffect2 =neffect2)
     $dat: dat

The R code contained in R() is the guts of this module. A data object named dat is generated using the sum_stats function from the causeSims R pacakge. Below this, we specify parameters used to simulate the data. Most important are q, omega, and tau. q is the proportion of variants acting through \(U\) while omega and tau are standardized versions of \(\gamma\) and \(\eta\) in the paper. tau is the proportion of \(Y\) heritability explained by the causal effect of \(M\) on \(Y\) times the sign of \(\gamma\), \[ \tau = sign(\gamma)\frac{\gamma^2 h^2_{M}}{h^{2}_{Y}}, \] and omega is the proportion of \(Y\) heritability explained by \(U\), divided by \(q\) and multiplied by the sign of \(\eta\). \[ \omega = sign(\eta)\frac{\eta^2 h^2_{M}}{h^{2}_{Y}} \] This parameterization is sometimes more convenient for running simulations that using \(\eta\) and \(\gamma\). However, the sum_stats function will accept either omega and tau arguments or eta and gamma. Note that here we have specified that the function can use 4 cores. This should match the n_cpu parameter in the simulate section of the config file. The sum_stats function generates data both with and without LD. In this file we only analyze the “with LD” data but we will describe how to make modifications to analyze both or only the “no LD” data.

The gw_sig module is a helper module. This module computes the GCP (a parameter defined by LCV (O’Connor and Price 2019)) and computes the number of genome-wide significant variants. It also stores all the simulation parameters which makes it easier to extract them. Extracting parameters from the simulate module will require reading in the entire data set, objects created by the gw_sig module are smaller and faster to read.

gw_sig: R(library(causeSims);
           m_sig_nold <- with($(dat), sum(p_value_nold < thresh ));
           y_sig_nold <- with($(dat), sum(2*pnorm(-abs(beta_hat_2_nold/seb2)) < thresh ));
           m_sig <- with($(dat), sum(p_value < thresh & ld_prune==TRUE));

          params <- $(sim_params);
          tau <- as.numeric(params["tau"]);
          omega <- as.numeric(params["omega"]);
          q <- as.numeric(params["q"]);
          h1 <- as.numeric(params["h1"]);
          h2 <- as.numeric(params["h2"]);
          neffect1 <- as.numeric(params["neffect1"]);
          neffect2 <- as.numeric(params["neffect2"]);
          gamma <- sqrt(tau*sum(h2)/sum(h1))*sign(tau);
          eta <- sqrt(abs(omega)*sum(h2)/sum(h1))*sign(omega);

          #LCV parameters
          if(q == 0 & tau == 0){
            q.1 <- q.2 <- 0;
            gcp <- 0;
          }else if(q == 0){
            q.1 <- 1;
            q.2 <- 0;
            gcp <- 1;
          }else{
            q.1 <- sqrt(q);
            q.2 <- eta * sqrt(q) * sqrt(h1) / sqrt(h2);
            gcp = (log(q.2^2) - log(q.1^2)) / (log(q.2^2) + log(q.1^2));
          };
          gcp_mom <- gcp_moments($(dat), h1, h2)
          )
    thresh: 5e-8
    $m_sig: m_sig
    $m_sig_nold: m_sig_nold
    $y_sig_nold: y_sig_nold
    $q: q
    $omega: omega
    $tau: tau
    $gamma: gamma
    $eta: eta
    $h1: h1
    $h2: h2
    $n1: as.numeric(params["n1"])
    $n2: as.numeric(params["n2"])
    $neffect1: as.numeric(params["neffect1"])
    $neffect2: as.numeric(params["neffect2"])
    $lcv_q1: q.1
    $lcv_q2: q.2
    $lcv_gcp: gcp
    $lcv_gcp_mom: gcp_mom$gcp

CAUSE Modules

We have divided CAUSE analysis into parameter estimation and model fitting steps. This makes it easier to experiment with changes only to the parameter estimation step.

cause_params: R(library(causeSims);
                p1 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=FALSE);
                )
    null_wt: 10
    $cause_params_ld: p1

cause: R(library(causeSims);
         library(cause);
         cause_res <- cause_sims($(dat), $(cause_params_ld), no_ld = FALSE);
         z <- -1*summary(cause_res)$z;
         p <- pnorm(-z);
         quants <- summary(cause_res)$quants)
    $cause_res: cause_res
    $sigma_g: cause_res$sigma_g
    $eta_med_2: quants[[1]][1,2]
    $q_med_2: quants[[1]][1,3]
    $eta_med_3: quants[[2]][1,2]
    $gamma_med_3: quants[[2]][1,1]
    $q_med_3: quants[[2]][1,3]
    $z: z
    $p: p

Both modules use wrapper functions from the causeSims package. These are very simple wrappers that make it easy to run CAUSE on the simulated data format. Note that here we use no_ld = FALSE in both modules. If you want to use the data that is simulated without LD instead you can change no_ld = TRUE. You could get results for both using the following set of modules.

cause_params: R(library(causeSims);
                p1 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=FALSE);
                p2 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=TRUE);
                )
    null_wt: 10
    $cause_params_ld: p1
    $cause_params_nold: p2

cause: R(library(causeSims);
         library(cause);
         if(no_ld==FALSE){
          cause_res <- cause_sims($(dat), $(cause_params_ld), no_ld = no_ld);
         else{
          cause_res <- cause_sims($(dat), $(cause_params_nold), no_ld = no_ld);
         }
         z <- -1*summary(cause_res)$z;
         p <- pnorm(-z);
         quants <- summary(cause_res)$quants)
    $no_ld: TRUE, FALSE
    $cause_res: cause_res
    $sigma_g: cause_res$sigma_g
    $eta_med_2: quants[[1]][1,2]
    $q_med_2: quants[[1]][1,3]
    $eta_med_3: quants[[2]][1,2]
    $gamma_med_3: quants[[2]][1,1]
    $q_med_3: quants[[2]][1,3]
    $z: z
    $p: p

Alternative MR Modules

There are many modules for alternative MR methods. Modules for IVW regression, Egger regression, MR-PRESSO, and the weighted median method are very similar and rely on wrappers from causeSims. We can just look at one of these.

ivw: R(library(causeSims);
       res <- ivw($(dat), p_val_thresh=thresh, no_ld = no_ld);
       )
    thresh: 5e-8
    no_ld: FALSE
    $z: res$z
    $p: res$p

To run on the data without LD change the no_ld parameter to no_ld: TRUE or no_ld: TRUE, FALSE to run both.

The GSMR module is a little more complicated

gsmr: R(library(causeSims);
        evd_list <- readRDS("data/evd_list_chr19_hm3.RDS");
        res <- gsmr_sims($(dat), evd_list, p_val_thresh  = 5e-8, no_ld = FALSE);
        if(!is.null(res)){
           z <- res$bxy/res$bxy_se;
           est <- res$bxy;
           p <- res$bxy_pval;
        }else{
           z <- est <- p <-  NA;
        }
      )
    thresh: 5e-8
    $z: z
    $p: p
    $est_gsmr: est
    $gsmr: res

GSMR requires the LD structure and sometimes returns errors. In this case the gsmr_sims wrapper returns NULL and we set results to missing. If no_ld = TRUE the LD data in evd_list is simply ignored.

LCV Module

Results from LCV are not compared with CAUSE in the paper because LCV is performing a different test than other MR methods. However, we have implemented wrappers and a module to run LCV.

LCV: R(library(causeSims);
       res <- try(lcv_sims($(dat),no_ld = no_ld, sig.thresh = thresh));
       if(class(res) == "try-error"){
            res <- list(pval.gcpzero.2tailed=NA, gcp.pm = NA, gcp.pse = NA);
        };
       )
    thresh: 30
    no_ld: FALSE
    $p: res$pval.gcpzero.2tailed
    $gcp_mean: res$gcp.pm
    $gcp_pse: res$gcp.pse
    $gcp_obj: res

When LCV returns errors, we set results to missing.