Last updated: 2019-07-29
Checks: 6 passed, 1 warning
Knit directory: cause/gwas_data/
This reproducible R Markdown analysis was created with workflowr (version 1.4.0.9000). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
The global environment had objects present when the code in the R Markdown file was run. These objects can affect the analysis in your R Markdown file in unknown ways. For reproducibility it's best to always run the code in an empty environment. Use `wflow_publish` or `wflow_build` to ensure that the code is always run in an empty environment.
The following objects were defined in the global environment when these results were created:
Name | Class | Size |
---|---|---|
data | environment | 56 bytes |
env | environment | 56 bytes |
The command `set.seed(20181014)` was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use `wflow_publish` or `wflow_git_commit`). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: sim_results/
Ignored: src/RcppExports.o
Ignored: src/cause.so
Ignored: src/log_likelihood_functions.o
Untracked files:
Untracked: analysis/figure/ldl_cad.Rmd/
Untracked: analysis/figure/little_test.Rmd/
Untracked: cause.Rcheck/
Untracked: docs/cause.bib
Untracked: docs/figure/cause_figure_1_standalone.pdf
Untracked: gwas_data/
Untracked: ll_v7_notes.Rmd
Untracked: tests/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you've configured a remote Git repository (see `?wflow_git_remote`), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | ded1693 | Jean Morrison | 2019-07-30 | wflow_publish(“analysis/gwas_pairs.Rmd”) |
Rmd | ff51b98 | Jean Morrison | 2019-07-29 | Working on GWAS pairs documenation |
Rmd | e8c10f6 | Jean Morrison | 2019-07-26 | Start writing instructions |
html | 3b42836 | Jean Morrison | 2019-07-10 | Build site. |
Rmd | e9682b2 | Jean Morrison | 2019-07-10 | wflow_publish(“analysis/gwas_pairs.Rmd”) |
html | 286f4e9 | Jean Morrison | 2019-06-25 | Build site. |
Rmd | 8f3b82e | Jean Morrison | 2019-06-25 | wflow_publish(files = c(“analysis/about.Rmd”, “analysis/index.Rmd”, “analysis/ldl_cad.Rmd”, “analysis/license.Rmd”, |
On this page we walk through an analysis of pairs of 16 traits with publicly available GWAS data using CAUSE. These results are also presented in the paper (Section 2.3), so this page serves two functions. First, it allows an interested reader to replicate those results. Second, it is an example of how to set up a larger scale analysis using CAUSE, and it explains some of the technical differences from an analysis of a single pair of traits. In the paper we present results for pairs of 20 traits; however, the blood pressure results from Ehret et al. (2011) (PMID: 21909115) must be obtained through dbGaP, so they are not included here.
We are using Snakemake and a conda environment to run this analysis. If you don't have Miniconda or Anaconda installed, you can either install one of them (recommended) or just make sure that you have Python 3, pandas, and Snakemake installed. We have chosen not to include R in the conda environment, so you will need to have R installed outside the environment and also have the `cause` and `tidyverse` R packages installed and running. If you were able to work through the package tutorial you should be in good shape.
Snakemake is a tool for creating scalable workflows. In our case we are using it to perform the same set of data processing and analysis steps for all pairs of traits. We will walk through exactly what the Snakemake analysis is doing in the next sections.
The first setup step is to create the conda environment:
conda create -n cause_large python=3.6 snakemake
Next, create a working directory in which you would like to analyze the data, and change to that directory using `cd` on Mac or Linux. For example:
mkdir gwas_pairs
cd gwas_pairs
The last step is to download the data and code that we will use. Inside of R, run
cause::setup_gwas_pairs()
This function sets up the analysis directory and is a shortcut for downloading all the data and code we will use in this analysis. Downloading the data might take up to half an hour depending on your network speed. Here is what the function downloads:
- GWAS summary statistics, cleaned and formatted. We will talk about these more later on. These files go in the `data/` subdirectory.
- LD data estimated from the 1000 Genomes CEU population. These files go in the `ld/` subdirectory.
- Some R scripts that will be used to run the analysis. These are in the `R/` subdirectory.
- A Snakemake file called `pairs_snakemake.py`.
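If you want to verify that the download completed, a quick check from R (assuming the directory layout just described) is:

length(list.files("data", pattern = "summary_statistics.tsv.gz"))  # one file per study
length(list.files("ld"))            # LD files for chromosomes 1-22
list.files("R")                     # analysis scripts called by the Snakemake rules
file.exists("pairs_snakemake.py")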
In the rest of this document we will go through the Snakemake file in detail and explore the scripts that it calls. The analysis is designed to be run on a cluster and can be executed with a single command. First activate the conda environment.
source activate cause_large
The Snakemake command below assumes you are using a cluster with a Slurm workload manager. If your cluster uses something else, you should edit the value given to the `--cluster` argument. You may need to edit this anyway to include necessary information. For example, if you are working on the University of Chicago Research Computing Center, you will need to add `--account` and `--partition` arguments.
nohup snakemake -s pairs_snakemake.py --keep-going --jobs 96 --cluster "sbatch --output={params.log}_%A.out --error={params.log}_%A.err --cpus-per-task={params.cpus} --ntasks=1 --mem-per-cpu={params.mem} --time=1:00:00 --job-name={params.jobname}" > pairs.out &
Currently there is no standard format for GWAS summary statistics, so summary statistics downloaded from different sources are likely to have different formats and may not have effect alleles oriented the same way. Many analysis steps are faster and simpler if all the data files are in the same format and have effect alleles oriented consistently.
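To make the orientation issue concrete, below is a small hypothetical sketch of the kind of harmonization this requires. The helper function is ours, for illustration only, and is not part of the pipeline: if a study reports its effect with the reference and alternative alleles swapped relative to a chosen standard, the sign of the effect estimate must be flipped, and SNPs whose alleles are incompatible with the standard are dropped.

library(dplyr)

# Hypothetical helper: orient a study's effects to a standard (ref, alt) assignment.
# `standard` has columns snp, ref_allele and alt_allele defining the chosen orientation.
harmonize_alleles <- function(dat, standard){
    dat %>%
      inner_join(standard, by = "snp", suffix = c("", "_std")) %>%
      mutate(flip  = ref_allele == alt_allele_std & alt_allele == ref_allele_std,
             match = ref_allele == ref_allele_std & alt_allele == alt_allele_std) %>%
      filter(match | flip) %>%   # drop SNPs with incompatible alleles
      mutate(beta_hat   = ifelse(flip, -beta_hat, beta_hat),  # flip sign when swapped
             ref_allele = ref_allele_std,
             alt_allele = alt_allele_std) %>%
      select(-flip, -match, -ref_allele_std, -alt_allele_std)
}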
The data files downloaded by `setup_gwas_pairs()` have been pre-formatted so they are ready to use. After running `setup_gwas_pairs()` you will find a file, `data/gwas_info.csv`, that gives information about each study and the original download source. Open up the file and take a look.
library(tidyverse)
library(cause)
info <- read_csv("data/gwas_info.csv")
Parsed with column specification:
cols(
consortium = col_character(),
trait = col_character(),
full_trait_name = col_character(),
PMID = col_double(),
pub_sample_size = col_double(),
pub_cases = col_double(),
pub_controls = col_double(),
first_author = col_character(),
pub_year = col_double(),
summary_statistics_link = col_character(),
snp = col_character(),
a1 = col_character(),
a2 = col_character(),
beta_hat = col_character(),
se = col_character(),
p_value = col_character(),
sample_size = col_character(),
regression_type = col_character()
)
head(info) %>% print(width = Inf)
# A tibble: 6 x 18
consortium trait full_trait_name PMID pub_sample_size pub_cases
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 glg tg triglycerides 24097068 188577 NA
2 glg ldl ldl 24097068 188577 NA
3 glg hdl hdl 24097068 188577 NA
4 glg tc total_cholesterol 24097068 188577 NA
5 giant height height 25282103 253288 NA
6 giant bmi body_mass_index 25673413 339224 NA
pub_controls first_author pub_year
<dbl> <chr> <dbl>
1 NA Willer 2013
2 NA Willer 2013
3 NA Willer 2013
4 NA Willer 2013
5 NA Wood 2014
6 NA Locke 2015
summary_statistics_link
<chr>
1 http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_TG.txt.gz
2 http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_LDL.txt.…
3 http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_HDL.txt.…
4 http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_TC.txt.gz
5 http://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT_…
6 http://portals.broadinstitute.org/collaboration/giant/images/1/15/SNP_gw…
snp a1 a2 beta_hat se p_value sample_size
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 rsid A1 A2 beta se P.value N
2 rsid A1 A2 beta se P-value N
3 rsid A1 A2 beta se P-value N
4 rsid A1 A2 beta se P-value N
5 MarkerName Allele1 Allele2 b SE P.value N
6 SNP A1 A2 b se P.value N
regression_type
<chr>
1 linear
2 linear
3 linear
4 linear
5 linear
6 linear
For each study we have recorded a short string indicating the consortium or data set used (e.g. `glg` = Global Lipids Genetics Consortium), a short string indicating the trait, the full trait name, information about the source publication (PMID, first author, and publication year), the total sample size from the publication and the number of cases and controls if relevant, and the original download link. The remaining columns give the column names of important fields in the original data; we don't need this information for our analysis here. All of the formatted data files that have been downloaded are named `data/{consortium}_{trait}_summary_statistics.tsv.gz`. To convert the data from their original formats to a standard format, we used a summary statistics processing tool created by Joseph Marcus. This tool is a little more complicated than is required for CAUSE and uses a large external data reference. We hope to include a simpler data formatting function in the `cause` R package soon. Let's take a look at some of the formatted data, using the following R commands:
dat <- read_tsv("data/ckdgen_egfrcrea_summary_statistics.tsv.gz",
col_types=list(col_character(),col_integer(),
col_character(),col_character(),
col_character(),col_double(),col_double(),
col_double(),col_double(),
col_double()))
head(dat) %>% print(width = Inf)
# A tibble: 6 x 10
chrom pos snp ref_allele alt_allele beta_hat se p_value
<chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 752566 rs3094315 G A -0.0013 0.0015 0.38
2 1 754192 rs3131968 A G -0.0013 0.0019 0.5
3 1 768448 rs12562034 G A -0.0014 0.0023 0.55
4 1 775659 rs2905035 A G -0.0022 0.0019 0.24
5 1 776546 rs12124819 A G -0.003 0.0018 0.092
6 1 779322 rs4040617 A G 0.0019 0.0019 0.33
sample_size f_eur
<dbl> <dbl>
1 114684 0.16
2 102660 0.128
3 106355 0.908
4 104963 0.128
5 109251 0.771
6 99471 0.882
In our analysis we will exploit the following features of the data format:

- All the files have the same information fields, in the same order and with the same column names.
- All of the alleles are oriented the same way. This means, for example, that if rs3094315 appears in another study, it will also have reference allele G and alternative allele A in that study.
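The second property is easy to spot-check with the downloaded files. For example (any pair of studies would do; this assumes `library(tidyverse)` has been loaded as above):

d1 <- read_tsv("data/glg_ldl_summary_statistics.tsv.gz")
d2 <- read_tsv("data/ckdgen_egfrcrea_summary_statistics.tsv.gz")
chk <- inner_join(d1, d2, by = "snp", suffix = c("_1", "_2"))
# Both comparisons should be TRUE for every shared SNP
all(chk$ref_allele_1 == chk$ref_allele_2,
    chk$alt_allele_1 == chk$alt_allele_2)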
In the future, we will update this section with instructions on how to achieve this using CAUSE functions. If your data are formatted with the same columns we have used (and also saved as a tsv), then you will be able to analyze them using our code, making only minimal changes to the `pairs_snakemake.py` file.
We will now walk through all the steps in the Snakemake file and explain what they do in detail. If your data are formatted in the same way as ours (see above), you should be able to use all of this code, changing only the preamble portion so that your file names match. It will be helpful to familiarize yourself with Snakemake syntax a little before you begin.
In the preamble section we set up the analysis we want to do. First we list some directories.
import pandas as pd
data_dir = "data/" #where the data is
ld_dir = "ld/" #where the ld data is
cause_dir = "cause/" #where CAUSE results will go
mr_dir = "mr/" #where other MR method results will go
Next we create a table of all the trait pairs we’d like to analyze.
consortia = ["giant", "giant", "lu",
"glg", "glg", "glg", "glg",
"ckdgen", "gefos",
"egg", "egg", "egg",
"vanderHarst", "diagram",
"megastroke", "magic"]
traits = ["height", "bmi", "bfp",
"tg", "ldl", "hdl", "tc",
"egfrcrea", "bone",
"bl", "bw", "hc",
"cad", "t2d",
"as", "fg"]
tags = [consortia[i] + "_" + traits[i] for i in range(len(traits))]
tag_pairs = [(tag1, tag2) for tag1 in tags for tag2 in tags if tag1!=tag2]
We will refer to the string `{consortium}_{trait}` as a "tag" in the rest of this analysis. With 16 tags, `tag_pairs` contains 16 × 15 = 240 ordered pairs. Finally, in Snakemake, the rule `all` lists all the files we should have at the end of the analysis. The rest of the file explains how to produce these.
rule all:
input: expand(mr_dir + '{tp[0]}__{tp[1]}_mr.RDS', tp = tag_pairs),
expand(mr_dir + '{tp[0]}__{tp[1]}_mrpresso.RDS', tp = tag_pairs),
expand(mr_dir + '{tp[0]}__{tp[1]}_mregger.RDS', tp = tag_pairs),
expand(cause_dir + '{tp[0]}__{tp[1]}_cause.RDS', tp = tag_pairs)
We use Awk to write out temporary files that will speed up some of the later analysis steps:
rule data_overlap:
input: file1 = data_dir + '{tag1}_summary_statistics.tsv.gz',
file2 = data_dir + '{tag2}_summary_statistics.tsv.gz'
output: out= temp(data_dir + "{tag1}__{tag2}_overlap.tsv.gz")
params: log="tempov", mem="20G", cpus="1",
jobname='dataov'
shell: """
snps() {{ gzip -cd "$@" | awk '{{ if ($7!="NA" && $7 > 0 && $6!="NA") print $3 }}' ;}}
full() {{ gzip -cd "$@" | awk '{{ if ($7!="NA" && $7 > 0 && $6!="NA") print $0 }}' ;}}
snps {input.file2} | awk 'NR==FNR{{F1[$0];next}}$3 in F1{{print}}' - <(full {input.file1}) | gzip > {output.out}
"""
This step takes as input two data files, one for tag1 (trait \(M\)) and one for tag2 (trait \(Y\)). It outputs a file that is simply the subset of the tag1 data for SNPs that are present in both the tag1 and tag2 GWAS. (In the Awk commands, `$3`, `$6`, and `$7` refer to the `snp`, `beta_hat`, and `se` columns of the formatted files.) The `temp()` in the `output:` line tells Snakemake to delete these files when we are done with them. This step also filters out SNPs that have missing effect estimates or standard errors, or whose standard errors are non-positive.
CAUSE estimates posteriors using a set of LD-pruned variants. To maximize power, we LD prune, preferentially choosing variants with low trait \(M\) \(p\)-values. The next two rules in the Snakemake file write a list of LD-pruned variants for each trait pair.
First there is a rule that LD prunes a single chromosome:
rule ld_prune_one_chrom:
input: data = data_dir + '{tag1}__{tag2}_overlap.tsv.gz',
ld1 = ld_dir + 'chr{chrom}_AF0.05_0.1.RDS',
ld2 = ld_dir + 'chr{chrom}_AF0.05_snpdata.RDS'
output: out=temp(data_dir + "snps_{tag1}__{tag2}.pruned.{chrom}.RDS")
params: log="ldprune", mem="10G", cpus="4",
pval_thresh = "1e-3", r2_thresh = 0.1 ,
jobname='ldprune_{chrom}', partition="broadwl"
shell: 'Rscript R/ld_prune_one_chrom.R {input.data} {wildcards.chrom} \
{params.pval_thresh} {params.r2_thresh} {input.ld1} {input.ld2} {output.out}'
This rule takes as input the overlap data set for trait \(M\) that we created previously, along with LD data. The output is a pruned list of SNPs on a given chromosome. The rule calls an R script, `R/ld_prune_one_chrom.R`, that reads in the data, removes duplicated SNPs, and then LD prunes using the function `cause::ld_prune`.
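The script itself is included in the download, but a minimal sketch of its logic is shown below. The argument order follows the `shell:` line above and the `ld_prune` call follows the package tutorial; treat the details as assumptions rather than the exact contents of `R/ld_prune_one_chrom.R`.

library(tidyverse)
library(cause)

args <- commandArgs(trailingOnly = TRUE)
data_file   <- args[1]
chr         <- args[2]
pval_thresh <- as.numeric(args[3])
r2_thresh   <- as.numeric(args[4])  # the downloaded LD files are already thresholded at r^2 = 0.1
ld          <- readRDS(args[5])     # pairwise LD estimates for this chromosome
snp_info    <- readRDS(args[6])     # variant information for the LD reference
out_file    <- args[7]

dat <- read_tsv(data_file) %>%
       filter(chrom == chr) %>%                     # keep this chromosome only
       distinct(snp, .keep_all = TRUE) %>%          # remove duplicated SNPs
       mutate(pval = 2*pnorm(-abs(beta_hat/se)))    # trait M p-values

# Prune, preferentially keeping variants with low trait M p-values
pruned <- ld_prune(variants = dat, ld = ld,
                   total_ld_variants = snp_info$SNP,
                   pval_cols = c("pval"),
                   pval_thresh = c(pval_thresh))
saveRDS(pruned, file = out_file)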
The next rule concatenates the pruned lists for all chromosomes for a single pair into one file:
rule ld_prune_combine:
input: fls = expand( data_dir + "snps_{{tag1}}__{{tag2}}.pruned.{chr}.RDS", chr = range(1, 23))
output: out1 = data_dir + "snps_{tag1}__{tag2}.pruned.txt"
params: log="ld_comb", mem="2G", cpus="1",
jobname='combine', partition="broadwl"
shell: "Rscript R/ld_cat.R {output.out1} {input.fls}"
The next step is to run CAUSE
rule cause:
input: file1 = data_dir + "{tag1}__{tag2}_overlap.tsv.gz",
file2 = data_dir + '{tag2}__{tag1}_overlap.tsv.gz',
snps = data_dir + 'snps_{tag1}__{tag2}.pruned.txt'
output: params = cause_dir + '{tag1}__{tag2}_params.RDS',
cause = cause_dir + '{tag1}__{tag2}_cause.RDS',
data = data_dir + '{tag1}__{tag2}_data.RDS'
params: log="cause", mem="5G", cpus="8",
jobname='cause', seed = 100
shell: 'Rscript R/cause.R {input.file1} {input.file2} \
{input.snps} {output.params} \
{output.cause} {output.data} {params.seed}'
This rule takes as input the overlap tag1 and tag2 data sets and the list of pruned SNPs. It uses the R script `R/cause.R`, which we will go through below. We could have used the `*_summary_statistics.tsv.gz` files in place of the `*_overlap.tsv.gz` files; using the overlap files saves reading some data into memory and can be especially helpful when one GWAS has many more variants measured than the other. The R script performs four steps: it reads in the data and removes duplicated SNPs, merges the two data sets, estimates the nuisance parameters, and runs CAUSE. First it loads the required packages and reads its command line arguments:
library(tidyverse)
library(cause)

args <- commandArgs(trailingOnly=TRUE)
#Input files
data_file_1 <- args[1]
data_file_2 <- args[2]
snp_file_asc <- args[3]
#Output files
params_out <- args[4]
cause_out <- args[5]
data_out <- args[6]
seed <- as.numeric(args[7])
#if(is.na(seed)) seed <- 100
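#Step 1: read in both data sets and remove duplicated SNPs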
d1 <- read_tsv(data_file_1, col_types=list(col_character(),col_integer(),
col_character(),col_character(),
col_character(),col_double(),col_double(),
col_double(),col_double(),
col_double()))
dup1 <- d1$snp[duplicated(d1$snp)]
d1 <- d1 %>% filter(!snp %in% dup1)
d2 <- read_tsv(data_file_2, col_types=list(col_character(),col_integer(),
col_character(),col_character(),
col_character(),col_double(),col_double(),
col_double(),col_double(),
col_double()))
dup2 <- d2$snp[duplicated(d2$snp)]
d2 <- d2 %>% filter(!snp %in% dup2)
The second step merges the two data sets. This could be done with `cause::gwas_format_cause`; however, that function performs steps to ensure that alleles are oriented the same way, which are unnecessary here because the data have already been formatted uniformly. Skipping these steps makes the merging procedure about eight times faster. We save the merged data to use with the other MR methods.
X <- d1 %>%
select(snp, beta_hat, se) %>%
rename(beta_hat_1 = beta_hat, seb1 = se) %>%
inner_join(., d2, by="snp") %>%
rename(beta_hat_2 = beta_hat, seb2 = se,
A1 = ref_allele, A2 = alt_allele) %>%
select(snp, beta_hat_1, seb1, beta_hat_2, seb2, A1, A2) %>%
new_cause_data(.)
saveRDS(X, file=data_out)
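#Step 3: estimate nuisance parameters, using a random subsample of
#1 million variants if more than 1 million are available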
set.seed(seed)
if(nrow(X) < 1e6){
snps_grid <- X$snp
}else{
snps_grid <- sample(X$snp, size=1e6, replace=FALSE)
}
params <- est_cause_params(X, snps_grid)
saveRDS(params, params_out)
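#Step 4: run CAUSE on the LD-pruned variants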
snps_asc <- read_lines(snp_file_asc)
res <- cause(X = X, param_ests = params, variants = snps_asc,
             qalpha = 1, qbeta = 10, force = TRUE)
saveRDS(res, cause_out)
The `qalpha` and `qbeta` arguments give the parameters of the Beta prior on the proportion of shared variants \(q\) (here the default Beta(1, 10)). The `force = TRUE` argument ensures that CAUSE runs even if the parameter estimates didn't converge; in our analysis this never happens.
The remaining rules in the Snakemake file run alternative MR methods: IVW regression, Egger regression, and MR-PRESSO. These rules all have the same format, each calling its own R script:
rule ivw:
input: data = data_dir + '{tag1}__{tag2}_data.RDS'
output: out = mr_dir + '{tag1}__{tag2}_mr.RDS',
params: log="mr", mem="1G", cpus="1",
jobname='mr_{tag1}__{tag2}'
shell: 'Rscript R/ivw.R {input.data} 5e-8 {output.out} '
rule egger:
input: data = data_dir + '{tag1}__{tag2}_data.RDS'
output: out = mr_dir + '{tag1}__{tag2}_mregger.RDS',
params: log="mr", mem="1G", cpus="1",
jobname='mr_{tag1}__{tag2}'
shell: 'Rscript R/mregger.R {input.data} 5e-8 {output.out} '
rule mrpresso:
input: data = data_dir + '{tag1}__{tag2}_data.RDS'
output: out = mr_dir + '{tag1}__{tag2}_mrpresso.RDS',
params: log="mrp", mem="1G", cpus="1",
jobname='mrp_{tag1}__{tag2}'
shell: 'Rscript R/mrpresso.R {input.data} 5e-8 {output.out} '
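The scripts these rules call are included in the download. As an illustration, here is a minimal sketch of what the IVW step could look like, assuming `R/ivw.R` simply thresholds on the trait \(M\) \(p\)-value and fits a weighted regression through the origin (the distributed script may differ, for example in how instrument selection is handled):

library(tidyverse)

# Usage: Rscript R/ivw.R <data.RDS> <p-value threshold> <out.RDS>
args <- commandArgs(trailingOnly = TRUE)
data_file <- args[1]
p_thresh  <- as.numeric(args[2])
out_file  <- args[3]

X <- readRDS(data_file) %>%
     mutate(p1 = 2*pnorm(-abs(beta_hat_1/seb1))) %>%  # trait M p-values
     filter(p1 < p_thresh)                            # keep genome-wide significant variants

# IVW estimate: regress trait Y effects on trait M effects with no intercept,
# weighting each variant by the inverse variance of its trait Y estimate
f <- lm(beta_hat_2 ~ beta_hat_1 - 1, data = X, weights = 1/seb2^2)
saveRDS(summary(f)$coefficients, file = out_file)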
Coming Soon!
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] cause_0.3.0.0178 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
[5] purrr_0.3.2 readr_1.3.1 tidyr_0.8.3 tibble_2.1.3
[9] ggplot2_3.2.0 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 lubridate_1.7.4 lattice_0.20-38
[4] utf8_1.1.4 assertthat_0.2.1 zeallot_0.1.0
[7] rprojroot_1.3-2 digest_0.6.20 foreach_1.4.4
[10] truncnorm_1.0-8 R6_2.4.0 cellranger_1.1.0
[13] backports_1.1.4 evaluate_0.14 httr_1.4.0
[16] highr_0.8 pillar_1.4.2 rlang_0.4.0
[19] lazyeval_0.2.2 pscl_1.5.2 readxl_1.3.1
[22] rstudioapi_0.10 whisker_0.3-2 Matrix_1.2-17
[25] rmarkdown_1.13 loo_2.1.0 munsell_0.5.0
[28] mixsqp_0.1-97 broom_0.5.2 numDeriv_2016.8-1.1
[31] compiler_3.6.1 modelr_0.1.4 xfun_0.8
[34] pkgconfig_2.0.2 SQUAREM_2017.10-1 htmltools_0.3.6
[37] tidyselect_0.2.5 gridExtra_2.3 workflowr_1.4.0.9000
[40] matrixStats_0.54.0 intervals_0.15.1 codetools_0.2-16
[43] fansi_0.4.0 crayon_1.3.4 withr_2.1.2
[46] MASS_7.3-51.4 grid_3.6.1 nlme_3.1-140
[49] jsonlite_1.6 gtable_0.3.0 git2r_0.26.1
[52] magrittr_1.5 scales_1.0.0 RcppParallel_4.4.3
[55] cli_1.1.0 stringi_1.4.3 fs_1.3.1
[58] doParallel_1.0.14 xml2_1.2.0 vctrs_0.2.0
[61] generics_0.0.2 iterators_1.0.10 tools_3.6.1
[64] glue_1.3.1 hms_0.5.0 parallel_3.6.1
[67] yaml_2.2.0 colorspace_1.4-1 ashr_2.2-32
[70] rvest_0.3.4 knitr_1.23 haven_2.1.1