Last updated: 2019-07-15

Checks: 6 1

Knit directory: cause/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0.9000). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

The global environment had objects present when the code in the R Markdown file was run. These objects can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment. Use wflow_publish or wflow_build to ensure that the code is always run in an empty environment.

The following objects were defined in the global environment when these results were created:

Name Class Size
data environment 56 bytes
env environment 56 bytes

The command set.seed(20181014) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/figure/
    Untracked:  cause.Rcheck/
    Untracked:  cause_0.2.0.0161.tar.gz
    Untracked:  docs/cause.bib
    Untracked:  docs/figure/cause_figure_1_standalone.pdf
    Untracked:  example_data/chr22_AF0.05_0.1.RDS
    Untracked:  example_data/chr22_AF0.05_snpdata.RDS
    Untracked:  gwas_data/
    Untracked:  ll_v7_notes.Rmd
    Untracked:  sim_results/
    Untracked:  src/RcppExports.o
    Untracked:  src/log_likelihood_functions.o
    Untracked:  temp.txt

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd 86ffddd Jean Morrison 2019-07-15 wflow_publish(c(“analysis/index.Rmd”, “analysis/simulations.Rmd”))
html 1074c61 Jean Morrison 2019-07-12 Build site.
Rmd 7e3c495 Jean Morrison 2019-07-11 more plotting details
Rmd aeeacea Jean Morrison 2019-07-11 lots of plotting details. still working
html d2f0f74 Jean Morrison 2019-07-10 Build site.
Rmd 6c171b2 Jean Morrison 2019-07-10 wflow_publish(“analysis/simulations.Rmd”)
html 165f2c8 Jean Morrison 2019-07-09 Revert “Build site.”
html 2666541 Jean Morrison 2019-07-09 Build site.
Rmd 16285d6 Jean Morrison 2019-07-09 wflow_publish(“analysis/simulations.Rmd”)
html a1b74c5 Jean Morrison 2019-07-09 Build site.
Rmd 31a7da8 Jean Morrison 2019-07-09 wflow_publish(“analysis/simulations.Rmd”)
html 0dd65cc Jean Morrison 2019-06-25 Build site.
Rmd b78e77f Jean Morrison 2019-06-25 wflow_publish(files = c(“analysis/simulations.Rmd”))
html 286f4e9 Jean Morrison 2019-06-25 Build site.
Rmd 8f3b82e Jean Morrison 2019-06-25 wflow_publish(files = c(“analysis/about.Rmd”, “analysis/index.Rmd”, “analysis/ldl_cad.Rmd”, “analysis/license.Rmd”,

Software Installation

Simulations are run using the causeSims R package. The package can be downloaded using

devtools::install_github("jean997/causeSims")

This package relies on multiple R packages that are not on CRAN, so you will need to download these separately. For detailed installation instructions including installation of supporting R packages for other methods and a detailed description of the package see here. There you can also find a walk through of a single simulation and explanations of the functions contained in the package.

We conduct simulations using the Dynamical Statisical Comparisons software. You will need to have DSC installed and it will be helpful to take a look at the tutorial.

Set Up The Analysis Directory

These instructions specifically generate the results presented in our paper. However, by modifying the dsc files, you can run your own experiments. We will describe the dsc files later on this page. To set up the analysis create a working directory that you will run the simulations in. Make sure all the software in the previous section is installed.

From inside this working directory use the R command

library(causeSims)
setup_sims()

This function downloads the following files:

  • LD data: This is used for simulating summary statistics and is placed into a sub-directory data. The LD data are downloaded from here and contain data for HapMap3 SNPs on chromosome 19 estimated from the CEU 1000 Genomes samples.

  • DSC files: These contain instructions for DSC on how to simulate data and analyze it. These are discussed later on this page. There are three DSC files downloaded. One is for evaluating false positives in the presence of different levels of correlated pleiotropy, one is for evaluating power when there is a true causal effect, and one is for evaluating the robustness of CAUSE to different prior distributions on \(\gamma\) and \(\eta\) (see supplementary note of the paper).

  • A configuration file for running DSC on a cluster. This file is set up for the University of Chicago Research Computing Cluster so you will need to change it to match your cluster set up. Even if you are using the U of C cluster, you will need to change the home directory and account lines of this file. Read about using DSC with a compute cluster here (We are using the “On Host” option) and modify config.yml as needed.

  • Some code for plotting and results summarization

Run DSC

To run DSC use the following commands at the command line

nohup dsc --replicate 100 --host config.yml -c 4 power.dsc > pwr_dsc.out &
nohup dsc --replicate 100 --host config.yml -c 4 false_positives.dsc > fp_dsc.out &
nohup dsc --replicate 100 --host config.yml -c 4 sigma_g.dsc > sg_dsc.out &

Depending on your computing resources, this could take several hours for each command. Errors can occur that stop DSC, for example, it may have errors communicating with the host. If this happens, simply restart using the same command. DSC will pick up where it left off. We have made the results available as R objects, so if you don’t want to run the simulations you can skip straight to plotting. Or you can modify the DSC files and run your own experiments!

Running these commands will generate lots of *.err and *.out files all of which can be safely deleted. Results are in the fp, pwr and sigma_g directories respectively.

Make Plots

False positive and power results

The script extract_results_main.R will use the dscrutils package to extract results from the false positive and power analyses. If you are modifying the analysis and have changed the names of the output directories, you will need to change them in this file also. Depending on the changes you have made, you may need to make other modifications. Read more about dscrutils here and in the dscrutils documentation. This script will save R objects containing CAUSE, LCV, and other MR method results for both analyses. It will append the date so it won’t over-write the results downloaded by setup_sims(). Run it at the command line using

Rscript extract_results_main.R

or from inside R using source(extract_results_main.R). It will take several minutes to extract all of the results. If you would like to use your own files to plot, change the file names in the code below.

We first summarize results grouping by sample size and simulation setting. Here we are leaving out results of LCV but these are computed in the simulations and available (see plot_results_main.R for code).

library(tidyverse)
── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.0     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(gridExtra)

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var
library(jcolors)
causedf <- readRDS("sim_results/main_causedf.RDS")

mrdf <- readRDS("sim_results/main_mrdf.RDS")
                                     
######### Summaries
cause_summ <- causedf %>% 
              group_by(q, tau, omega,  n1, n2) %>%
              summarize(n_sig = sum(cause.p < 0.05, na.rm=T),
                        missing = sum(is.na(cause.p))) %>% 
              mutate(analysis = "cause")

mr_summ <- mrdf %>%
           group_by(q, tau, omega,  n1, n2, mr) %>%
           summarize(n_sig = sum(mr.p < 0.05, na.rm=T),
                     missing = sum(is.na(mr.p))) %>%
           rename(analysis = "mr")


res <- bind_rows(cause_summ, mr_summ) %>%
        ungroup() %>%
        mutate(label = fct_recode(analysis, "CAUSE"="cause",
                                     "Egger" = "egger", "GSMR"="gsmr",
                                     "IVW"="ivw", 
                                     "MR-PRESSO"="mrp", "Wtd. Median" = "wm"),
                 n1 = factor(paste0("N[M]==", n1)),
                 n2 = factor(paste0("N[Y]==", n2)))

Plotting power:

#colors
cc <- as.character(jcolors("default"))
cols <- c(cc[3], cc[1:2], cc[4:5],  "#D590DA")
#plot
pwr <- res %>%
       filter(q == 0) %>%
       ggplot(.) + 
          geom_line(aes(x=tau, y=n_sig/100, color=label)) +  
          geom_point(aes(x=tau, y=n_sig/100, color=label, shape=label)) +
          scale_color_manual(values=cols) + 
          scale_shape_manual(values= c(16, 3, 4, 0, 2, 6:8)) +
          geom_hline(yintercept=0.05, linetype=2) +
          xlab(expression("Percentage of Y heritability explained by M (" ~ tau ~ "=" ~ gamma^2 ~ h[M]^2/h[Y]^2 ~ ")")) +
          ylab("Percent Significant") + 
          theme_bw() + 
          theme(legend.title = element_blank(), 
                strip.text = element_text(size=14),
                axis.title = element_text(size=16),
                legend.text = element_text(size=12),
                legend.position = "bottom") +  
          guides(color=guide_legend(nrow=1),  
                 linetype=guide_legend(nrow=1),
                 shape=guide_legend(nrow=1) )  + 
          facet_wrap(~n1*n2, nrow=2, labeller = label_parsed)

Plotting false positives. The false positive simulations include two values of \(\eta\) (encoded through the omega parameter) though only the larger one is presented in the paper. Here we make plots for both. For convenience, in extracting results, we copy the results for \(q=0\).

z1 <-  filter(res, omega == 0 & tau == 0 & q == 0) %>% mutate(omega = 0.05)
z2 <-  filter(res, omega == 0 & tau == 0 & q == 0) %>% mutate(omega = 0.02)
res2 <- bind_rows(res, z1, z2) %>%
        filter(!(q == 0 & omega == 0 & tau == 0))
fp <- res2 %>%
      filter(tau == 0) %>%
      ggplot(.) +
        geom_line(aes(x=q, y=n_sig/100, color=label)) +
        geom_point(aes(x=q, y=n_sig/100, color=label, shape=label)) +
        geom_hline(yintercept=0.05, linetype=2) +
        scale_color_manual(values=cols) +
        scale_shape_manual(values= c(16, 3, 4, 0, 2, 6:8)) +
        geom_hline(yintercept=0.05, linetype=2) +
        xlab("Proportion of variants acting through U (q)") +
        ylab("Percent Significant") +
        theme_bw() +
        theme(legend.title = element_blank(),
            strip.text = element_text(size=14),
            axis.title = element_text(size=16),
            legend.text = element_text(size=12),
            legend.position = "bottom") +
        guides(color=guide_legend(nrow=1),
             linetype=guide_legend(nrow=1),
             shape=guide_legend(nrow=1) )  +
        facet_wrap(~omega*n1*n2, nrow=1, labeller=label_parsed)

ROC curves are a little more complicated to make. We want to evaluate how well each method is able to distinguish simulations with a causal effect of size \(\gamma = \sqrt{0.05}\) from simulations with a shared factor with an equal effect acconting for 30% of the trait \(M\) vairants. Both of these parameters can be modified. The first step is to combine the CAUSE and alternative MR results into a single data frame

x1 <- causedf %>%
      mutate(analysis = "cause", stat = -log10(cause.p)) %>%
      select(tag, q, omega, tau, n1, n2, analysis, stat)

x2 <- mrdf %>%
      mutate(stat = -log10(mr.p)) %>%
      rename(analysis = mr) %>%
      select(tag, q, omega, tau, n1, n2, analysis, stat)

full_results <- bind_rows(x1, x2)

full_results <- full_results %>% 
                mutate(label = fct_recode(analysis, "CAUSE"="cause",
                                     "Egger" = "egger", "GSMR"="gsmr",
                                     "IVW"="ivw",
                                     "MR-PRESSO"="mrp", "Wtd. Median" = "wm"))

Next we use the pROC package.

eff <- 0.05
q <- 0.3
n1s <- c(12000, 12000, 40000, 40000)
n2s <- c(12000, 40000, 12000, 40000)

methods <- with(full_results, unique(label)) %>%
            as.vector(.)

cc <- as.character(jcolors("default"))
cols <- c(cc[3], cc[1:2], cc[4:5],  "#D590DA")


roc_df <-lapply(seq_along(n1s), function(i){
                    my_n1 <- n1s[i]
                    my_n2 <- n2s[i]
                    rocs <- lapply(methods, function(m){
                       cat(m, " ")
                       xx <- filter(full_results, n1 == my_n1 & n2 == my_n2 & label == m &
                                 ( tau == eff | (omega == eff & q == my_q & tau == 0)))
                       xx$stat[is.na(xx$stat)] <- 0
                       with(xx, roc(predictor = stat, response = tau !=0)) %>%
                            with(., data.frame(thresh = thresholds, spec = specificities,
                                            sens = sensitivities, method = m,
                                            n1 = my_n1, n2 = my_n2,
                                            stringsAsFactors=F)) %>% 
                            return()

                })
                do.call(rbind, rocs)
            })
head(roc_df)

# A data frame with points to add to the curve
pw <- filter(res, tau == eff ) 
f <- filter(res, omega == eff2 & q == my_q & tau ==0)
df <- select(pw, label, n_sig, n1, n2) %>%
      rename(power = n_sig) %>%
      full_join(., f, c("label", "n1", "n2")) %>%
      rename(fp = n_sig) %>%
      select(label, power, fp, n1, n2) %>%
      filter(label %in% methods) %>%
      rename(method=label)

## Plotting!
plt <- roc_df %>% 
       arrange(sens) %>%
       ggplot(.) +
          geom_line(aes(x=1-spec, y=sens, col=method, group=method, linetype=method)) +
          geom_point(aes(x=fp/100, y = power/100, col=method, shape=method), size =4 , data=df) +
          scale_color_manual(values=cols) +
          scale_linetype_manual(values = rep(1, 6)) +
          scale_shape_manual(values = c(rep(16, 6))) +
          xlab("1 - Specificity (False Positive Rate)") + ylab("Sensitivity (Power)") +
          theme_bw() +
          theme(legend.title = element_blank(),
                strip.text = element_text(size=14),
                axis.title = element_text(size=16),
                legend.text = element_text(size=12),
                legend.position = "bottom") +
          guides(color=guide_legend(nrow=1),
                 linetype=guide_legend(nrow=1),
                 shape=guide_legend(nrow=1))  +
          facet_wrap(~n1*n2, ncol=4, labeller= label_parsed)

Prior robustness results

The sigma_g anlaysis is intended to evaluate how sensitive CAUSE is to changes in the prior on \(\gamma\) and \(\eta\). These parameters have the same \(N(0, \sigma_{\gamma, \eta})\) prior distribution. By default, \(\sigma_{\gamma \eta}\) is chosen guided by the data. These results dmeonstrate that, within a wide range, CAUSE is insensitive to the value of \(\sigma_{\gamma \eta}\).

The R script extract_sigma_g.R will use dscrutils to extract simulation results and store them in a data frame. Run it at the command line with

Rscript extract_sigma_g.R

or from inside an R session using source.

This code will append the date to the file it saves so it won’t save over the results downloaded by setup_sims(). If you want to use your own data frame, change the file name in the commands below.

causedf <- readRDS("sim_results/sg_causedf.RDS")
head(causedf)
  DSC simulate.output.file  tau     gamma q omega eta   h1   h2    n1
1   1 simulate/simulate_10 0.00 0.0000000 0     0   0 0.25 0.25 40000
2   1 simulate/simulate_10 0.00 0.0000000 0     0   0 0.25 0.25 40000
3   1 simulate/simulate_10 0.00 0.0000000 0     0   0 0.25 0.25 40000
4   1 simulate/simulate_11 0.05 0.2236068 0     0   0 0.25 0.25 40000
5   1 simulate/simulate_11 0.05 0.2236068 0     0   0 0.25 0.25 40000
6   1 simulate/simulate_11 0.05 0.2236068 0     0   0 0.25 0.25 40000
     n2 neffect1 neffect2 lcv_q1 lcv_q2 gcp gcp_mom m_sig_nold y_sig_nold
1 40000     1000     1000      0      0   0    0.00        101         96
2 40000     1000     1000      0      0   0    0.00        101         96
3 40000     1000     1000      0      0   0    0.00        101         96
4 40000     1000     1000      1      0   1    0.96         94         85
5 40000     1000     1000      1      0   1    0.96         94         85
6 40000     1000     1000      1      0   1    0.96         94         85
  m_sig     params
1   102    (0,0,0)
2   102    (0,0,0)
3   102    (0,0,0)
4   108 (0.05,0,0)
5   108 (0.05,0,0)
6   108 (0.05,0,0)
                                 cause_sigma_g.output.file quant   sigma_g
1 cause_sigma_g/simulate_10_cause_params_1_cause_sigma_g_1  0.51 8.9197112
2 cause_sigma_g/simulate_10_cause_params_1_cause_sigma_g_2  0.65 0.5803009
3 cause_sigma_g/simulate_10_cause_params_1_cause_sigma_g_3  0.80 0.2656855
4 cause_sigma_g/simulate_11_cause_params_1_cause_sigma_g_1  0.51 8.9197112
5 cause_sigma_g/simulate_11_cause_params_1_cause_sigma_g_2  0.65 0.5803009
6 cause_sigma_g/simulate_11_cause_params_1_cause_sigma_g_3  0.80 0.2656855
          z eta_sharing  q_sharing   eta_causal   q_causal gamma_causal
1 -5.688023 -0.06270989 0.02009531 -0.053846753 0.02294318 -0.002114384
2 -6.222519 -0.04002148 0.02456070 -0.032508692 0.02780765 -0.002286300
3 -5.770971 -0.01853399 0.03351503 -0.012154733 0.03793337 -0.002571933
4  4.926419  0.21590336 0.67297501 -0.008234885 0.03065362  0.176649144
5  4.940660  0.21546277 0.67362048 -0.005822829 0.03414573  0.176438987
6  4.871813  0.21381720 0.67599577 -0.001372277 0.04059740  0.175640524
             p
1 1.000000e+00
2 1.000000e+00
3 1.000000e+00
4 4.187512e-07
5 3.892923e-07
6 5.528927e-07

The results data frame contains simulation parameters and file information (columns 1-22), the CAUSE \(z\)-score (column 25) and \(p\)-value (column 31), and the posterior median for each simulation for parameters \(\eta\) and \(q\) in the sharing model and parameters \(\eta\), \(q\), and \(\gamma\) in the causal model. Each simulation was analyzed by CAUSE three times using different values of \(\sigma_{\gamma \eta}\) (column sigma_g). These are large, medium and small values which are chosen so that \(\sqrt{0.05}\) is the 51st, 65th and 80th quantile of the \(N(0, \sigma_{\gamma \eta})\) distribution respectively.

There are three sets of parameters used in simulations. Below we calculate the correlation in \(p\)-value and posterior median for each parameter in each setting across the three values of \(\sigma_{\gamma \eta}\). The following code only reports the minimum correlation for each parameter-simulation setting combination and the correlation between the medium and small values of \(\sigma_{\gamma \eta}\).

vars <- c("z", "p", "eta_sharing", "q_sharing",
          "eta_causal", "q_causal", "gamma_causal")

min_cor <- expand.grid(variable = vars,
                       params = unique(causedf$params), stringsAsFactors=F)
min_cor$min_cor <- NA
min_cor$cor_med_small <- NA
for(i in seq(nrow(min_cor))){
    v <- min_cor$variable[i]
    X <- select(causedf, simulate.output.file, params, quant, one_of(v)) %>%
         spread(quant, get(v))  %>%
         filter(params==min_cor$params[i]) %>%
         select(one_of(c("0.51", "0.65", "0.8")))

    min_cor$min_cor[i] <- min(cor(X))
    min_cor$cor_med_small[i] <- cor(X[,"0.65"], X[,"0.8"])

}
min_cor
       variable       params   min_cor cor_med_small
1             z      (0,0,0) 0.9086311     0.9463102
2             p      (0,0,0) 0.9940151     0.9963328
3   eta_sharing      (0,0,0) 0.9418110     0.9758594
4     q_sharing      (0,0,0) 0.9727896     0.9907447
5    eta_causal      (0,0,0) 0.9341944     0.9808693
6      q_causal      (0,0,0) 0.8259576     0.9291362
7  gamma_causal      (0,0,0) 0.9996609     0.9997962
8             z   (0.05,0,0) 0.9799585     0.9820971
9             p   (0.05,0,0) 0.9639080     0.9700564
10  eta_sharing   (0.05,0,0) 0.9994588     0.9996662
11    q_sharing   (0.05,0,0) 0.9999743     0.9999841
12   eta_causal   (0.05,0,0) 0.8727614     0.9539053
13     q_causal   (0.05,0,0) 0.5074016     0.7999716
14 gamma_causal   (0.05,0,0) 0.9999445     0.9999735
15            z (0,0.05,0.3) 0.9849564     0.9849564
16            p (0,0.05,0.3) 0.9943423     0.9956324
17  eta_sharing (0,0.05,0.3) 0.9453076     0.9745643
18    q_sharing (0,0.05,0.3) 0.9992086     0.9995995
19   eta_causal (0,0.05,0.3) 0.9482481     0.9802321
20     q_causal (0,0.05,0.3) 0.9499426     0.9863485
21 gamma_causal (0,0.05,0.3) 0.9997150     0.9998202

This code is also contained in summarize_sigma_g.R.

What’s in the DSC Files?

We will now go in detail through one of the DSC files so you can see what is happening and how to make changes to run your own experiments. We will look at false_positives.dsc but power.dsc and sigma_g.dsc are similar.

The DSC statement is at the top:

DSC:
    define:
        mr: ivw, egger, mrp, wm, twas, wm, gsmr
    run: simulate*gw_sig,
         simulate*LCV,
         simulate*mr,
         simulate*cause_params*cause
    replicate: 100
    output: fp

This statement tells DSC what to run. In define we define a group of modules. These will turn up later in the file. In run we tell DSC what to run. For example, simulate*gw_sig means that DSC should run the simulate module followed by the gw_sig module which will use the results of the simulate module. replicate indicates the number of replicates to run (though this is over-ridden by the --replicate flag in the DSC call above) and output names the output directory.

The rest of the document defines modules. All of our modules use R code and most of them rely on the causeSims R package.

Simulation Modules

The simulation module:

simulate: R(library(causeSims);
            snps <- readRDS("data/chr19_snpdata_hm3only.RDS");
            evd_list <- readRDS("data/evd_list_chr19_hm3.RDS");
            dat <- sum_stats(snps, evd_list,
                  n_copies = 30,
                  n1 = n1, n2=n2,
                  h1=h1, h2=h1,
                  neffect1 = neffect1, neffect2 = neffect2,
                  tau = tau, omega = omega, q = q,
                  cores = 4, ld_prune_pval_thresh = 0.01,
                  r2_thresh = 0.1))

     q: 0.1, 0.2, 0.3, 0.4, 0.5
     omega: 0.02, 0.05
     tau: 0
     n1: 12000, 40000
     n2: 12000, 40000
     neffect1: 1000
     neffect2: 1000
     h1: 0.25
     h2: 0.25
     $sim_params: c(q = q, omega = omega, tau = tau,  h1 = h1, h2 = h2, n1 = n1, n2 = n2, neffect1 = neffect1, neffect2 =neffect2)
     $dat: dat

The R code contained in R() is the guts of this module. A data object named dat is generated using the sum_stats function from the causeSims R pacakge. Below this, we specify parameters used to simulate the data. Most important are q, omega, and tau. q is the proportion of variants acting through \(U\) while omega and tau are standardized versions of \(\gamma\) and \(\eta\) in the paper. tau is the proportion of \(Y\) heritability explained by the causal effect of \(M\) on \(Y\) times the sign of \(\gamma\), \[ \tau = sign(\gamma)\frac{\gamma^2 h^2_{M}}{h^{2}_{Y}}, \] and omega is the proportion of \(Y\) heritability explained by \(U\), divided by \(q\) and multiplied by the sign of \(\eta\). \[ \omega = sign(\eta)\frac{\eta^2 h^2_{M}}{h^{2}_{Y}} \] This parameterization is sometimes more convenient for running simulations that using \(\eta\) and \(\gamma\). However, the sum_stats function will accept either omega and tau arguments or eta and gamma. Note that here we have specified that the function can use 4 cores. This should match the n_cpu parameter in the simulate section of the config file. The sum_stats function generates data both with and without LD. In this file we only analyze the “with LD” data but we will describe how to make modifications to analyze both or only the “no LD” data.

The gw_sig module is a helper module. This module computes the GCP (a parameter defined by LCV (O’Connor and Price 2019)) and computes the number of genome-wide significant variants. It also stores all the simulation parameters which makes it easier to extract them. Extracting parameters from the simulate module will require reading in the entire data set, objects created by the gw_sig module are smaller and faster to read.

gw_sig: R(library(causeSims);
           m_sig_nold <- with($(dat), sum(p_value_nold < thresh ));
           y_sig_nold <- with($(dat), sum(2*pnorm(-abs(beta_hat_2_nold/seb2)) < thresh ));
           m_sig <- with($(dat), sum(p_value < thresh & ld_prune==TRUE));

          params <- $(sim_params);
          tau <- as.numeric(params["tau"]);
          omega <- as.numeric(params["omega"]);
          q <- as.numeric(params["q"]);
          h1 <- as.numeric(params["h1"]);
          h2 <- as.numeric(params["h2"]);
          neffect1 <- as.numeric(params["neffect1"]);
          neffect2 <- as.numeric(params["neffect2"]);
          gamma <- sqrt(tau*sum(h2)/sum(h1))*sign(tau);
          eta <- sqrt(abs(omega)*sum(h2)/sum(h1))*sign(omega);

          #LCV parameters
          if(q == 0 & tau == 0){
            q.1 <- q.2 <- 0;
            gcp <- 0;
          }else if(q == 0){
            q.1 <- 1;
            q.2 <- 0;
            gcp <- 1;
          }else{
            q.1 <- sqrt(q);
            q.2 <- eta * sqrt(q) * sqrt(h1) / sqrt(h2);
            gcp = (log(q.2^2) - log(q.1^2)) / (log(q.2^2) + log(q.1^2));
          };
          gcp_mom <- gcp_moments($(dat), h1, h2)
          )
    thresh: 5e-8
    $m_sig: m_sig
    $m_sig_nold: m_sig_nold
    $y_sig_nold: y_sig_nold
    $q: q
    $omega: omega
    $tau: tau
    $gamma: gamma
    $eta: eta
    $h1: h1
    $h2: h2
    $n1: as.numeric(params["n1"])
    $n2: as.numeric(params["n2"])
    $neffect1: as.numeric(params["neffect1"])
    $neffect2: as.numeric(params["neffect2"])
    $lcv_q1: q.1
    $lcv_q2: q.2
    $lcv_gcp: gcp
    $lcv_gcp_mom: gcp_mom$gcp

CAUSE Modules

We have divided CAUSE analysis into parameter estimation and model fitting steps. This makes it easier to experiment with changes only to the parameter estimation step.

cause_params: R(library(causeSims);
                p1 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=FALSE);
                )
    null_wt: 10
    $cause_params_ld: p1

cause: R(library(causeSims);
         library(cause);
         cause_res <- cause_sims($(dat), $(cause_params_ld), no_ld = FALSE);
         z <- -1*summary(cause_res)$z;
         p <- pnorm(-z);
         quants <- summary(cause_res)$quants)
    $cause_res: cause_res
    $sigma_g: cause_res$sigma_g
    $eta_med_2: quants[[1]][1,2]
    $q_med_2: quants[[1]][1,3]
    $eta_med_3: quants[[2]][1,2]
    $gamma_med_3: quants[[2]][1,1]
    $q_med_3: quants[[2]][1,3]
    $z: z
    $p: p

Both modules use wrapper functions from the causeSims package. These are very simple wrappers that make it easy to run CAUSE on the simulated data format. Note that here we use no_ld = FALSE in both modules. If you want to use the data that is simulated without LD instead you can change no_ld = TRUE. You could get results for both using the following set of modules.

cause_params: R(library(causeSims);
                p1 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=FALSE);
                p2 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=TRUE);
                )
    null_wt: 10
    $cause_params_ld: p1
    $cause_params_nold: p2

cause: R(library(causeSims);
         library(cause);
         if(no_ld==FALSE){
          cause_res <- cause_sims($(dat), $(cause_params_ld), no_ld = no_ld);
         else{
          cause_res <- cause_sims($(dat), $(cause_params_nold), no_ld = no_ld);
         }
         z <- -1*summary(cause_res)$z;
         p <- pnorm(-z);
         quants <- summary(cause_res)$quants)
    $no_ld: TRUE, FALSE
    $cause_res: cause_res
    $sigma_g: cause_res$sigma_g
    $eta_med_2: quants[[1]][1,2]
    $q_med_2: quants[[1]][1,3]
    $eta_med_3: quants[[2]][1,2]
    $gamma_med_3: quants[[2]][1,1]
    $q_med_3: quants[[2]][1,3]
    $z: z
    $p: p

Alternative MR Modules

There are many modules for alternative MR methods. Modules for IVW regression, Egger regression, MR-PRESSO, and the weighted median method are very similar and rely on wrappers from causeSims. We can just look at one of these.

ivw: R(library(causeSims);
       res <- ivw($(dat), p_val_thresh=thresh, no_ld = no_ld);
       )
    thresh: 5e-8
    no_ld: FALSE
    $z: res$z
    $p: res$p

To run on the data without LD change the no_ld parameter to no_ld: TRUE or no_ld: TRUE, FALSE to run both.

The GSMR module is a little more complicated

gsmr: R(library(causeSims);
        evd_list <- readRDS("data/evd_list_chr19_hm3.RDS");
        res <- gsmr_sims($(dat), evd_list, p_val_thresh  = 5e-8, no_ld = FALSE);
        if(!is.null(res)){
           z <- res$bxy/res$bxy_se;
           est <- res$bxy;
           p <- res$bxy_pval;
        }else{
           z <- est <- p <-  NA;
        }
      )
    thresh: 5e-8
    $z: z
    $p: p
    $est_gsmr: est
    $gsmr: res

GSMR requires the LD structure and sometimes returns errors. In this case the gsmr_sims wrapper returns NULL and we set results to missing. If no_ld = TRUE the LD data in evd_list is simply ignored.

LCV Module

Results from LCV are not compared with CAUSE in the paper because LCV is performing a different test than other MR methods. However, we have implemented wrappers and a module to run LCV.

LCV: R(library(causeSims);
       res <- try(lcv_sims($(dat),no_ld = no_ld, sig.thresh = thresh));
       if(class(res) == "try-error"){
            res <- list(pval.gcpzero.2tailed=NA, gcp.pm = NA, gcp.pse = NA);
        };
       )
    thresh: 30
    no_ld: FALSE
    $p: res$pval.gcpzero.2tailed
    $gcp_mean: res$gcp.pm
    $gcp_pse: res$gcp.pse
    $gcp_obj: res

When LCV returns errors, we set results to missing.


sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] jcolors_0.0.4   pROC_1.15.0     gridExtra_2.3   forcats_0.4.0  
 [5] stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2     readr_1.3.1    
 [9] tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5     xfun_0.8             haven_2.1.1         
 [4] lattice_0.20-38      colorspace_1.4-1     generics_0.0.2      
 [7] vctrs_0.2.0          htmltools_0.3.6      yaml_2.2.0          
[10] rlang_0.4.0          pillar_1.4.2         glue_1.3.1          
[13] withr_2.1.2          modelr_0.1.4         readxl_1.3.1        
[16] plyr_1.8.4           munsell_0.5.0        gtable_0.3.0        
[19] workflowr_1.4.0.9000 cellranger_1.1.0     rvest_0.3.4         
[22] evaluate_0.14        knitr_1.23           highr_0.8           
[25] broom_0.5.2          Rcpp_1.0.1           scales_1.0.0        
[28] backports_1.1.4      jsonlite_1.6         fs_1.3.1            
[31] hms_0.5.0            digest_0.6.20        stringi_1.4.3       
[34] grid_3.6.1           rprojroot_1.3-2      cli_1.1.0           
[37] tools_3.6.1          magrittr_1.5         lazyeval_0.2.2      
[40] crayon_1.3.4         whisker_0.3-2        pkgconfig_2.0.2     
[43] zeallot_0.1.0        xml2_1.2.0           lubridate_1.7.4     
[46] assertthat_0.2.1     rmarkdown_1.13       httr_1.4.0          
[49] rstudioapi_0.10      R6_2.4.0             nlme_3.1-140        
[52] git2r_0.26.1         compiler_3.6.1