Processing math: 100%
  • Introduction
  • Set Up
  • How to Use the Pipeline
  • Download data and build spreadsheet
  • Example

Last updated: 2020-05-29

Checks: 7 0

Knit directory: cause/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0.9000). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20181014) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    sim_results/
    Ignored:    src/RcppExports.o
    Ignored:    src/cause.so
    Ignored:    src/log_likelihood_functions.o

Untracked files:
    Untracked:  analysis/figure/ldl_cad.Rmd/
    Untracked:  analysis/figure/little_test.Rmd/
    Untracked:  cause.Rcheck/
    Untracked:  docs/cause.bib
    Untracked:  docs/figure/cause_figure_1_standalone.pdf
    Untracked:  gwas_data/
    Untracked:  ll_v7_notes.Rmd
    Untracked:  test_A.RDS
    Untracked:  test_A_answer1.RDS
    Untracked:  test_A_answer3.RDS
    Untracked:  test_fit2.RDS
    Untracked:  test_mix_grid_new.RDS
    Untracked:  test_params3_warm.RDS
    Untracked:  test_params_orig.RDS
    Untracked:  test_w.RDS
    Untracked:  tests/

Unstaged changes:
    Modified:   R/est_cause_params.R
    Modified:   R/map_pi_rho.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd b1f8424 Jean Morrison 2020-05-29 wflow_publish(“analysis/gwas_pairs_2.Rmd”)
html 6c76002 Jean Morrison 2019-12-05 Build site.
Rmd 17dbaf7 Jean Morrison 2019-12-05 wflow_publish(files = c(“analysis/_site.yml“,”analysis/gwas_pairs_2.Rmd“))
html bf034aa Jean Morrison 2019-10-18 Build site.
Rmd 96d3d07 Jean Morrison 2019-10-18 wflow_publish(“analysis/gwas_pairs_2.Rmd”)
html 4cf5222 Jean Morrison 2019-10-18 Build site.
Rmd c2e7daf Jean Morrison 2019-10-18 wflow_publish(“analysis/gwas_pairs_2.Rmd”)
html 5659dec Jean Morrison 2019-10-18 Build site.
Rmd 4a59a3f Jean Morrison 2019-10-18 wflow_publish(“analysis/gwas_pairs_2.Rmd”)
html a1c2e09 Jean Morrison 2019-10-16 Build site.
Rmd 12192aa Jean Morrison 2019-10-16 wflow_publish(c(“analysis/index.Rmd”, “analysis/gwas_pairs_2.Rmd”))

Click here to explore results interactively

To reproduce these results or to learn how to use the CAUSE Snakemake pipeline, follow the tutorial below

Introduction

We have setup a Snakemake pipeline that will make it easy to run CAUSE (and a handful of other methods) on many pairs of traits. We will demonstrate the pipeline with an example that produces a subset of the results above.

Set Up

We use Snakemake and a conda environment to run this analysis. If you don’t have Miniconda or Anaconda installed you can either install one of them (recomended) or just make sure that you have Python3, pandas and Snakemake installed. We have chosen to not include R in the conda environment so you will need to have R installed outside the environment and also have the cause and tidyverse R packages installed and running. If you were able to work through the package tutorial you should be in good shape. For some alternative MR methods you will need the MendelianRandomization R package and the MR-PRESSO R package which can be installed here.

First create the conda environment

conda create -n cause_large python=3.6 snakemake

Next create a working directory that you would like to analyze the data in. Change to that directory using cd in Mac or Linux. For example

mkdir gwas_pairs
cd gwas_pairs

Finally, inside the working directory and using R, setup a CAUSE pipeline. The download_ld=TRUE argument causes the function to download correctly formatted ld data computed using 1k genomes CEU individuals. If you’ve already downloaded this or you want to use different LD data, use download_ld=FALSE. The download_eur_ld_scores=TRUE argument will download LD scores to use with LCV. If you have these already or don’t want to run LCV, set it to false.

cause::setup_cause_pipeline(download_ld=TRUE, download_eur_ld_scores=TRUE)

How to Use the Pipeline

Set up a pipeline analysis. From the directory you want to use, in R run

cause::setup_cause_pipeline()

The setup_cause_pipeline() function provides you with everything you need to run an analysis except for data and a .csv file describing that data. You can use this function to start any pipeline analysis, not just this example. The analysis is controlled by the config.yaml file which has four sections. You should edit the file to match your analysis desires.

The input section gives the location of a csv file that describes each set of GWAS summary statistics. More on this file later.

# Path to data spreadsheed
# csv format, columns include:
# name, raw_data_path, snp, A1, A2, beta_hat, se, p_value, sample_size, delimeter
input:
    sum_stats: "gwas_pairs.csv"

The analysis section gives instructions about what analysis to run.

# What analysis to do
# If all_pairs do all pairs of triats. Otherwise
# use traits in trait1 as M and traits in trait2 as Y
# methods should be a comma separated list with no spaces.
# Optional methods should be one of
# cause_*_*_*, mrpackage, lcv,  mrpresso
analysis:
    all_pairs: True
    trait1: 1,2
    trait2: 3,4,5
    methods: cause_1_2_1,cause_1_10_1,cause_1_100_1,mrpackage,mrpresso,lcv
    mr_pval: "5e-8"
    cause_seed: 100

If all_pairs is True then the pipeline will run the desired methods for all pairs of traits in the csv. Otherwise it will use the trait1 and trait2 fields to determine which pairs to run. Numbers in these fields refer to line numbers (beginnin at 1 after the header) in the csv. The methods line lists the methods you would like to run. Options are listed above in the comments. The numbers following cause designate the prior on q. If the method is given as cause_a_b_c then a Beta(a,b) prior truncated at c will but used. The mrpackage option runs a set of six different methods using the MendelianRandomization R package. These are IVW and Egger regression both with random effects, the weighted median and the weighted mode with with values of phi equal to 1, 0.5, and 0.25. The mr_pval field gives the minimum p-value for variants used in methods besides CAUSE and LCV which use all variants. The cause_seed provides a random seed to CAUSE and ensures that results can be reproduced exactly.

The ld section tells the pipeline where to find correctly formatted LD data and what files are named.

#Path to directory containing LD data
#ld_score_dir is used only by method lcv
ld:
    dir: "ld/"
    r2_file: "_AF0.05_0.1.RDS"
    info_file: "_AF0.05_snpdata.RDS"
    ld_score_dir: "ld_scores/eur_w_ld_chr/"

The pipeline will expect to find files in the directory given in dir: with names {chr}_{r2_file} and {chr}_{info_file} where {r2_file} and {info_file} are the file endings given in those respective fields and {chr} is a chromosme (e.g. chr1, chr2) for chromosomes 1-22. ld_score_dir is only necessary if you want to run LCV using method lcv.

The out section tells the pipeline where to store output files.

out:
    gwas_data_dir: "cause_standard_format/"
    other_data_dir: "data/"
    output_dir: "results/"

The gwas_data_dir field is a directory to store formatted summary statistics. It can be helpful to store these in a centralized location if you are running multiple pipelines ot save on work and storage. The other_data_dir lists a directory to store other data files. These include some temporary files that are removed at the end of the pipeline and lists of SNPs pruned for LD that are saved. output_dir is a directory to store analysis results.

The cluster.yaml file describes resources allocated for each kind of job. The default will work but if it requests more resources than are available on your cluster (e.g. memory) you may need to change it.

The Snakemake command for submitting the pipeline is in the run-snakemake.sh file. You will need to modify it to match your cluster.

Once you have downloaded summary statistics and created the csv file you are ready to run the pipeline. You can run with

nohup ./run-snakemake.sh & 

You may need to change the permissons of run-snakemake.sh to be excecutable with chmod a+x run-snakemake.sh. The nohup is optional but is nice because the pipeline can run for a long time. I generally prefer to run the pipeline from a compute node rather than the login node but this will depend on your setup.

Download data and build spreadsheet

The next step is to acquire some GWAS summary statistics and to describe them in a spreadsheet so that the pipeline is able to format them properly. The spreadsheet has the following mandatory column headers in any order:

name: a unique string naming the study

delimeter: Field delimeter. One of “tab”, “,”, “space”, or any symbol.

snp: Column name of SNP ID (generally rs number but anything that matches the other file). This will be the field that studies are joined on.

A1: Column name of effect allele

A2: Column name of other allele

beta_hat: Column name of effect estimate

se: Column name of standard error of effect estimate

p_value: Column name of p-value

sample_size: Column name of per-SNP sample size

The p_value and sample_size fields may contain NAs if some studies don’t have them. The rest are required. Most studies can be used exactly as downloaded but you may have to a little bit of formatting before you can use the pipeline. For example, some studies do not contain rs numbers or have atypical variant names.

Example

To run an example, from inside of R run

cause::cause_download_example_gwas_data()

This function will set up an example analysis of three traits LDL cholesterol (Willer et al 2013 PMID 24097068 ), coronary artery disease (van der Harst et al 2017 PMID 29212778), and asthma (Demenais et al 2018 PMID 29273806). The function will download summary statistics directly from their sources. In these cases they are ready to use without any modifications, but this isn’t always the case. For example, some studies do not have rs ids included or report odds ratio rather than log odds ratio (the coefficient estimate from logistic regression). In these cases, you will need to modify the data so they contain the five mandatory colums of snp name, effect allele, other allele, coefficient (effect) estimate, and standard error of effect estimate. The function also downloads a csv file called gwas_pairs.csv. Take a look at the csv if you are having trouble making your own. This file has some extra (non-required) columns that we find useful for keeping track of studies.

Now to run an analysis all you need to do is

  1. Edit the run-snakemake.sh file so that the cluster command is compatible with your cluster setup.

  2. Edit the config.yaml file. You may not need to make any changes but you might need to change the location of the LD files or change where you would like data stored. Leave all_pairs: True or, alternatively to only run LDL -> CAD and LDL -> Asthma change to

analysis:
  all_pairs: False
  trait1: 1
  trait2: 2,3

The numbers correspond to the index of each trait in gwas_pairs.csv. Leave the other fields in the analysis section as they are.

  1. Change the persmissions of run-snakemake.sh to make it an executable. Use chmod a+x run-snakemake.sh.

  2. Run (on a compute node if you like)

source activate cause_large
nohup ./run-snakemake.sh & 

A good thing to keep in mind is that Snakemake will pickup wherever it left off if a job fails or the analysis is interrupted. For example, suppose you find that you need to give one method more memory using the data you have. You can modify the cluster.yaml file and then simply re-run the command above. No work that has been completed will be repeated. You can use snakemake -n to do a “dry run” which tells you what commands will be run.

When the pipeline is done, in the results directory there will be a handful of files named results/df_{method}.RDS where {method} is one of the methods run above. Source these into R and take a look using

df <- readRDS("results/df_cause_1_10_1.RDS")
df

Additionally, the resuts folder will contain files named results/{name1}__{name2}_{method}.RDS containing results for each method. If the method is CAUSE, this will be a CAUSE object that you can look at using utilities in the cause package. Try

library(cause)
res <- readRDS("results/glg_ldl__vanderHarst_cad_cause_1_10_1.RDS")
summary(res)
plot(res, type="posteriors")
plot(res, type="data")

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.4.0.9000 Rcpp_1.0.4.6         digest_0.6.23       
 [4] rprojroot_1.3-2      backports_1.1.5      git2r_0.26.1        
 [7] magrittr_1.5         evaluate_0.14        stringi_1.4.3       
[10] fs_1.3.1             whisker_0.4          rmarkdown_1.15      
[13] tools_3.6.3          stringr_1.4.0        glue_1.3.1          
[16] xfun_0.9             yaml_2.2.0           compiler_3.6.3      
[19] htmltools_0.3.6      knitr_1.24