Last updated: 2020-11-13

Checks: 7 passed, 0 failed

Knit directory: cause/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20181014) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 5f67e0a. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/example_data/
    Ignored:    pipeline_code/plink_reference/
    Ignored:    sim_results/
    Ignored:    src/RcppExports.o
    Ignored:    src/cause.so
    Ignored:    src/log_likelihood_functions.o

Untracked files:
    Untracked:  analysis/mrcieu.Rmd
    Untracked:  cause.Rcheck/
    Untracked:  gwas_data/
    Untracked:  ll_v7_notes.Rmd
    Untracked:  ll_v7_notes.html
    Untracked:  pipeline_code/.snakemake/
    Untracked:  pipeline_code/config.schema.yaml
    Untracked:  pipeline_code/ld/
    Untracked:  pipeline_code/raw_data/
    Untracked:  tests/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ldl_cad.Rmd) and HTML (docs/ldl_cad.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd c7a1cb4 Jean Morrison 2020-11-13 refactor website
html 81a1a48 Jean Morrison 2020-06-02 Build site.
Rmd f8dff24 Jean Morrison 2020-06-02 wflow_publish(“analysis/ldl_cad.Rmd”)
html 1cfcb0b Jean Morrison 2020-05-29 Build site.
Rmd 3c9571d Jean Morrison 2020-05-29 wflow_publish(“analysis/ldl_cad.Rmd”)
html 70e2b97 Jean Morrison 2019-12-04 Build site.
Rmd b1c1567 Jean Morrison 2019-08-07 update vignette to use new formatting code
html 1a90dd6 Jean Morrison 2019-07-15 Build site.
Rmd 02aa566 Jean Morrison 2019-07-15 wflow_publish(“analysis/ldl_cad.Rmd”)
html 1702acc Jean Morrison 2019-07-12 Build site.
html 02bc6d3 Jean Morrison 2019-07-09 Build site.
Rmd 28d211f Jean Morrison 2019-07-09 wflow_publish(“analysis/ldl_cad.Rmd”)
html bfa7d28 Jean Morrison 2019-07-09 Build site.
Rmd d8f89fc Jean Morrison 2019-07-09 wflow_publish(“analysis/ldl_cad.Rmd”)
html 3b5c7e2 Jean Morrison 2019-06-25 Build site.
Rmd a55827d Jean Morrison 2019-06-25 wflow_publish(files = c(“analysis/index.Rmd”, “analysis/ldl_cad.Rmd”))
html d8d1486 Jean Morrison 2019-06-25 Build site.
Rmd e0a6df4 Jean Morrison 2019-06-25 wflow_publish(files = c(“analysis/ldl_cad.Rmd”))
html 33b3732 Jean Morrison 2019-06-25 Build site.
Rmd 0e35268 Jean Morrison 2019-06-25 wflow_publish(files = c(“analysis/ldl_cad.Rmd”))
html 286f4e9 Jean Morrison 2019-06-25 Build site.
Rmd 8f3b82e Jean Morrison 2019-06-25 wflow_publish(files = c(“analysis/about.Rmd”, “analysis/index.Rmd”, “analysis/ldl_cad.Rmd”, “analysis/license.Rmd”,
html 1753b22 Jean Morrison 2018-11-09 update website
html 2eb09d8 Jean Morrison 2018-11-06 added new option for a truncated prior on q
html 4a8f76c Jean Morrison 2018-11-06 Build site.
Rmd 2652be1 Jean Morrison 2018-11-06 wflow_publish(“analysis/ldl_cad.Rmd”)
html 6ae7a60 Jean Morrison 2018-10-24 build website
Rmd 48bdf21 Jean Morrison 2018-10-24 fixing warnings
Rmd 8c57c8a Jean Morrison 2018-10-24 build website
html 8c57c8a Jean Morrison 2018-10-24 build website
Rmd a34393d Jean Morrison 2018-10-24 build website
html a34393d Jean Morrison 2018-10-24 build website
html 6354c35 Jean Morrison 2018-10-22 Build site.
Rmd 24510f1 Jean Morrison 2018-10-22 small language changes
Rmd d050da8 Jean Morrison 2018-10-22 wflow_publish(“analysis/ldl_cad.Rmd”)
html bbe4901 Jean Morrison 2018-10-17 Build site.
Rmd 558cd32 Jean Morrison 2018-10-17 wflow_publish(“analysis/ldl_cad.Rmd”)
html 73690eb Jean Morrison 2018-10-17 Build site.
Rmd 1a891e3 Jean Morrison 2018-10-17 wflow_publish(“analysis/ldl_cad.Rmd”)
Rmd d10191f Jean Morrison 2018-10-17 wflow_git_commit(“analysis/ldl_cad.Rmd”)

Introduction

This document will walk through a real genome-sized example of how to use CAUSE. Some of the steps will take 5-10 minutes. The LD pruning steps will benefit from access to a cluster or multiple cores. For steps that require long computation we also provide output files that can be downloaded to make it easier to run through the example.

We will be analyzing GWAS data for LDL cholesterol and for coronary artery disease to test for a causal relationship of LDL on CAD. The analysis will have the following steps:

  1. Format the data for use with CAUSE
  2. Calculate nuisance parameters
  3. LD pruning
  4. Fit CAUSE
  5. Look at results

Step 3 will require LD information estimated from the 1000 Genomes CEU population using LDshrink here. The LD data are about 11 GB. The GWAS data we will use are about 320 GB. However, in this tutorial you will be able to skip the large data steps and simply download the results.

Step 0: Install CAUSE

Follow the installation instructions here.

Step 1: Format Data for CAUSE

We will use read_tsv to read in summary statistics for a GWAS of LDL cholesterol and a GWAS of coronary artery disease from the internet. We will then combine these and format them for use with CAUSE. First, read in the data. For LDL cholesterol, we use summary statistics from Willer et al. (2013) [PMID: 24097068]. For CAD, we use summary statistics from van der Harst et al. (2017) [PMID: 29212778]. Downloading and formatting the data takes several minutes. If you want to skip this step, we provide a formatted data file that you can download below.

library(readr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(cause)
X1 <- read_tsv("http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_LDL.txt.gz")
X2 <- read_tsv("ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/vanderHarstP_29212778_GCST005194/CAD_META.gz")

CAUSE needs the following information from each data set: SNP (variant) ID, effect size, standard error, effect allele, and other allele. For convenience, we provide a simple function that will merge data sets and produce a cause_data object that can be used with later functions. This step and the rest of the analysis are done in R.

The function gwas_merge will merge two data sets and align effect sizes to correspond to the same allele. It will remove variants with ambiguous alleles (G/C or A/T) or with alleles that do not match between data sets (e.g. A/G in one data set and A/C in the other). It will also remove variants that are duplicated in either data set. It will not remove variants that are simply strand flipped between the two data sets (e.g. A/C in one data set, T/G in the other).
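
As an illustration of these matching rules, here is a simplified base-R sketch. This is not the package's actual implementation; it just shows the logic described above (returning the sign to apply to the second study's effect size, or NA when a variant should be dropped):

```r
# Simplified sketch of the allele-matching rules described above
# (illustrative only, not the gwas_merge implementation).
flip <- c(A = "T", T = "A", G = "C", C = "G")

match_alleles <- function(a1_x, a2_x, a1_y, a2_y) {
  if (flip[[a1_x]] == a2_x) return(NA_real_)                    # ambiguous (G/C or A/T): drop
  if (a1_x == a1_y && a2_x == a2_y) return(1)                   # same alleles: keep sign
  if (a1_x == a2_y && a2_x == a1_y) return(-1)                  # swapped: flip sign
  if (flip[[a1_x]] == a1_y && flip[[a2_x]] == a2_y) return(1)   # strand flipped: retained
  if (flip[[a1_x]] == a2_y && flip[[a2_x]] == a1_y) return(-1)  # flipped and swapped
  NA_real_                                                      # alleles do not match: drop
}

match_alleles("A", "G", "A", "G")  # 1: aligned
match_alleles("A", "G", "G", "A")  # -1: effect sign flipped
match_alleles("A", "T", "A", "T")  # NA: strand-ambiguous, removed
```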

LDL column headers:

  • SNP: rsid
  • Effect: beta
  • Standard Error: se
  • Effect Allele: A1
  • Other Allele: A2

CAD column headers:

  • SNP: oldID
  • Effect: Effect
  • Standard Error: StdErr
  • Effect Allele: Allele1
  • Other Allele: Allele2
X <- gwas_merge(X1, X2, snp_name_cols = c("rsid", "oldID"), 
                       beta_hat_cols = c("beta", "Effect"), 
                       se_cols = c("se", "StdErr"), 
                       A1_cols = c("A1", "Allele1"), 
                       A2_cols = c("A2", "Allele2"))

Alternatively, you can download already formatted data here and read it in using readRDS.

dir.create("example_data")
# Use the raw file URL and binary mode so the RDS file downloads correctly
download.file("https://github.com/jean997/cause/raw/master/example_data/LDL_CAD_merged.RDS",
              destfile = "example_data/LDL_CAD_merged.RDS", mode = "wb")
X <- readRDS("example_data/LDL_CAD_merged.RDS")
head(X)
        snp beta_hat_1   seb1 beta_hat_2   seb2 A1 A2
1 rs4747841     0.0037 0.0052     0.0106 0.0056  A  G
2 rs4749917    -0.0033 0.0052    -0.0108 0.0056  A  G
3  rs737656     0.0099 0.0054     0.0196 0.0058  A  G
4  rs737657     0.0084 0.0054     0.0195 0.0058  A  G
5 rs7086391    -0.0075 0.0067     0.0115 0.0072  A  G
6  rs878177    -0.0073 0.0055    -0.0225 0.0059  A  G

There are likely more efficient ways to do this merge. If you would like to process the data yourself, you can construct a cause_data object from a data frame using the constructor new_cause_data(X) where X is any data frame that includes the columns snp, beta_hat_1, seb1, beta_hat_2, and seb2.
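
For example, any data frame with those five columns will work; the toy values below are made up for illustration, and the constructor call itself requires the cause package:

```r
# Toy data frame with the columns new_cause_data() expects.
# The values are invented for illustration only.
df <- data.frame(
  snp        = c("rs111", "rs222", "rs333"),
  beta_hat_1 = c(0.012, -0.005, 0.020),
  seb1       = c(0.005,  0.005, 0.006),
  beta_hat_2 = c(0.024, -0.001, 0.031),
  seb2       = c(0.006,  0.005, 0.007)
)
stopifnot(all(c("snp", "beta_hat_1", "seb1", "beta_hat_2", "seb2") %in% names(df)))
# X <- new_cause_data(df)  # requires library(cause)
```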

Step 2: Calculate nuisance parameters

The next step is to estimate the parameters that define the prior distribution of \(\beta_{M}\) and \(\theta\), and to estimate \(\rho\), the correlation between summary statistics that is due to sample overlap or population structure. We will do this with a random subset of 1,000,000 variants since our data set is large. est_cause_params estimates the nuisance parameters by finding the maximum a posteriori estimate of \(\rho\) and the mixing parameters when \(\gamma = \eta = 0\). This step takes several minutes.

set.seed(100)
varlist <- with(X, sample(snp, size=1000000, replace=FALSE))
params <- est_cause_params(X, varlist)
Estimating CAUSE parameters with  1000000  variants.
1 0.2180944 
2 0.0006398994 
3 7.445625e-06 
4 4.878722e-08 

The object params is of class cause_params and contains information about the fit as well as the maximum a posteriori estimates of the mixing parameters (\(\pi\)) and \(\rho\). The object params$mix_grid is a data frame defining the distribution of summary statistics. The column S1 is the variance for trait 1 (\(M\)), the column S2 is the variance for trait 2 (\(Y\)) and the column pi is the mixture proportion assigned to that variance combination.

class(params)
[1] "cause_params"
names(params)
[1] "rho"       "pi"        "mix_grid"  "loglik"    "PIS"       "RHO"      
[7] "LLS"       "converged" "prior"    
params$rho
[1] 0.06465784
head(params$mix_grid)
           S1          S2           pi
1 0.000000000 0.000000000 0.3105286643
2 0.000000000 0.003440730 0.2381082227
3 0.000000000 0.004865928 0.1490899299
4 0.003533615 0.004865928 0.1660234953
5 0.004997287 0.009731855 0.1177553318
6 0.019989147 0.009731855 0.0006521846

So, for example, in this case we have estimated that about 31% of variants have trait 1 and trait 2 variance equal to 0, meaning that they have no association with either trait.
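
That figure can be read directly off the grid. Using the values printed above:

```r
# First rows of params$mix_grid, as printed above
grid <- data.frame(
  S1 = c(0, 0, 0, 0.003533615, 0.004997287, 0.019989147),
  S2 = c(0, 0.003440730, 0.004865928, 0.004865928, 0.009731855, 0.009731855),
  pi = c(0.3105286643, 0.2381082227, 0.1490899299,
         0.1660234953, 0.1177553318, 0.0006521846)
)
# Mixture weight on the component with zero variance for both traits,
# i.e. the estimated proportion of variants associated with neither trait
null_prop <- with(grid, sum(pi[S1 == 0 & S2 == 0]))
round(null_prop, 2)  # 0.31
```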

Tip: Do not try to estimate the nuisance parameters with substantially fewer than 100,000 variants. This can lead to poor estimates of the mixing parameters, which leads to bad model comparisons.

Step 3: LD Pruning

We estimate CAUSE posterior distributions using an LD pruned set of variants, prioritizing variants with low trait \(M\) (LDL) \(p\)-values.

There are two ways to do this step. The CAUSE R package contains a function that performs LD pruning given a file that contains pairwise estimates of \(r^2\). This is the method we used in our paper, coupled with LD estimates from 1000 Genomes European samples computed via LDshrink. This method, however, is very slow.

An alternative is to use Plink to perform LD clumping. The ieugwasr package provides a convenient R interface to Plink. This method is much faster but requires a reference sample. You can download reference data from here.
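
A sketch of the clumping approach using ieugwasr::ld_clump is below. The reference panel prefix, plink path, and thresholds are placeholders you would replace with your own choices; p-values are computed from the trait \(M\) (LDL) summary statistics so that clumping prioritizes variants with low LDL p-values:

```r
# Sketch of LD clumping via ieugwasr::ld_clump (assumes the ieugwasr
# package, a local plink binary, and reference panel files; paths and
# thresholds below are placeholders, not recommendations).
clump_on_trait1 <- function(X, bfile, plink_bin,
                            r2_thresh = 0.01, pval_thresh = 1e-3) {
  # Clump on trait 1 (LDL) p-values so retained variants have low LDL p-values
  df <- data.frame(rsid = X$snp,
                   pval = 2 * pnorm(-abs(X$beta_hat_1 / X$seb1)))
  kept <- ieugwasr::ld_clump(df,
                             clump_r2  = r2_thresh,
                             clump_p   = pval_thresh,
                             bfile     = bfile,      # reference panel prefix
                             plink_bin = plink_bin)  # path to plink executable
  X$snp[X$snp %in% kept$rsid]
}
# top_vars <- clump_on_trait1(X, bfile = "EUR_ref", plink_bin = "plink")
```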

Step 4: Fit CAUSE

Now that we have formatted the data, obtained an LD-pruned set of variants, and estimated the nuisance parameters, we can fit CAUSE! The function cause estimates posterior distributions under the sharing and causal models and calculates the ELPD for both models, as well as for the null model in which there is neither a causal nor a shared factor effect. This might take 5-10 minutes.

top_vars <- readRDS("example_data/top_ldl_pruned_vars.RDS")
res <- cause(X=X, variants = top_vars, param_ests = params)
Estimating CAUSE posteriors using  1215  variants.

Step 5: Look at Results

The resulting cause object contains an object for the partial sharing model fit (sharing), an object for the causal model fit (causal), and a table of ELPD results.

class(res)
[1] "cause"
names(res)
[1] "sharing" "causal"  "elpd"    "loos"    "data"    "sigma_g" "qalpha" 
[8] "qbeta"  
res$elpd
   model1  model2 delta_elpd se_delta_elpd         z
1    null sharing -66.887583      8.350653 -8.009862
2    null  causal -75.072952      9.497320 -7.904646
3 sharing  causal  -8.185368      1.246101 -6.568783
class(res$sharing)
[1] "cause_post"
class(res$causal)
[1] "cause_post"

The elpd table has the following columns:

  • model1, model2: The models being compared
  • delta_elpd: Estimated difference in elpd. If delta_elpd is negative, model 2 is a better fit
  • se_delta_elpd: Estimated standard error of delta_elpd
  • z: delta_elpd/se_delta_elpd. A z-score that can be compared to a standard normal distribution to test whether the difference in model fit is significant.
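
For example, the p-value for the sharing-versus-causal comparison reported by summary is the one-sided normal tail probability of its z-score:

```r
# One-sided p-value from the z-score in the third row of res$elpd
z <- -6.568783
pval <- pnorm(z)
signif(pval, 2)  # approximately 2.5e-11
```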

In this case, the third line of the table shows that the causal model is a significantly better fit than the sharing model. The \(z\)-score is -6.57, corresponding to a p-value of \(2.5 \times 10^{-11}\).

For each model (partial sharing and full) we can plot the posterior distributions of the parameters. Dotted lines show the prior distributions.

plot(res$sharing)

plot(res$causal)


The summary method will summarize the posterior medians and credible intervals.

summary(res, ci_size=0.95)
p-value testing that causal model is a better fit:  2.5e-11 
Posterior medians and  95 % credible intervals:
     model     gamma              eta                   q                  
[1,] "Sharing" NA                 "0.38 (0.31, 0.45)"   "0.78 (0.65, 0.89)"
[2,] "Causal"  "0.34 (0.28, 0.4)" "-0.01 (-0.66, 0.53)" "0.03 (0, 0.25)"   

The plot method applied to a cause object will arrange all of this information in a single display.

plot(res)


The plot method can also produce scatter plots of the data showing, for each model, the probability that each variant is acting through the shared factor (\(U\)) and the contribution of each variant to the ELPD test statistic.

plot(res, type="data")

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] cause_1.2.0     dplyr_1.0.2     readr_1.4.0     workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0   xfun_0.18          ashr_2.2-47        purrr_0.3.4       
 [5] lattice_0.20-41    colorspace_2.0-0   vctrs_0.3.4        generics_0.1.0    
 [9] htmltools_0.5.0    loo_2.3.0          yaml_2.2.1         utf8_1.1.4        
[13] rlang_0.4.8        mixsqp_0.3-43      later_1.1.0.1      pillar_1.4.6      
[17] glue_1.4.2         matrixStats_0.57.0 lifecycle_0.2.0    stringr_1.4.0     
[21] munsell_0.5.0      gtable_0.3.0       evaluate_0.14      labeling_0.4.2    
[25] knitr_1.30         httpuv_1.5.4       invgamma_1.1       irlba_2.3.3       
[29] parallel_4.0.3     fansi_0.4.1        Rcpp_1.0.5         promises_1.1.1    
[33] backports_1.2.0    scales_1.1.1       RcppParallel_5.0.2 truncnorm_1.0-8   
[37] farver_2.0.3       fs_1.5.0           gridExtra_2.3      ggplot2_3.3.2     
[41] hms_0.5.3          digest_0.6.27      stringi_1.5.3      rprojroot_1.3-2   
[45] grid_4.0.3         cli_2.1.0          tools_4.0.3        magrittr_1.5      
[49] tibble_3.0.4       crayon_1.3.4       whisker_0.4        tidyr_1.1.2       
[53] pkgconfig_2.0.3    ellipsis_0.3.1     Matrix_1.2-18      SQUAREM_2020.5    
[57] assertthat_0.2.1   rmarkdown_2.3      rstudioapi_0.12    R6_2.5.0          
[61] intervals_0.15.2   git2r_0.27.1       compiler_4.0.3    
