Last updated: 2022-07-25
Checks: 7 0
Knit directory:
SISG2022_Association_Mapping/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20220530)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version bdaaf1a. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: analysis/.Rhistory
Ignored: code/.Rhistory
Ignored: data/.DS_Store
Ignored: lectures/.DS_Store
Unstaged changes:
Modified: analysis/_site.yml
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/SISGM15_prac6.Rmd
) and
HTML (docs/SISGM15_prac6.html
) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote
), click on
the hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | e54c1bf | Joelle Mbatchou | 2022-07-24 | Build site. |
html | 660617e | Joelle Mbatchou | 2022-07-22 | Build site. |
html | 7453eb7 | Joelle Mbatchou | 2022-07-22 | Build site. |
html | 2741a94 | Joelle Mbatchou | 2022-07-22 | Build site. |
html | 231ce3c | Joelle Mbatchou | 2022-07-22 | Build site. |
Rmd | 7de5878 | Joelle Mbatchou | 2022-07-22 | add loic’s practical exercises |
This practical runs on the SISG2022 server.
First login
ssh yourlogincredentials@si2022-m15.biostat.washington.edu
This practical proposes a detailed tutorial for running three popular polygenic score (PGS) methods using GWAS summary statistics as introduced in Lecture 6.
Method 1: C+PT
The first method is classically denoted as “Clumping + P-value
Thresholding (C+PT)”. This method is also abbreviated as P+T or C+T in
certain publications. In brief, the principle of this method is compare
various sets of uncorrelated SNPs (e.g., maximum squared pairwise
correlation between allele counts at SNPs in the selected set is \(r^2 \leq 0.1\)) that are associated with
the trait / disease as certain p-value threshold (e.g., p<0.01). We
then select the set of SNPs that yields the largest prediction accuracy
with the trait / disease of interest in a validation set. This method is
broadly used because of it simplicity but may not often yield the
largest accuracy. In this practical, we will use PLINK
and
R
to determine these optimal sets of SNPs. PGS will be
calculated using marginal/GWAS SNP effects as weights.
Method 2: SBLUP
The second method, the Summary-data based BLUP, is an approximation of the BLUP prediction method introduced in Lecture 5. In brief, SBLUP calculates PGS weights \(\mathbf{b}\) using the following equation
\[ \mathbf{b} = \left( \mathbf{R} +
\frac{M_{SNP}(1-h_{SNP}^2)}{N_{GWAS}h_{SNP}^2}\mathbf{I}_M \right)^{-1}
\boldsymbol{\beta} \] where \(\boldsymbol{\beta}\) is the vector of GWAS
SNP effects and \(\mathbf{R}\) a LD
correlation matrix between SNPs estimated from a reference sample. Note
that the SBLUP method is equivalent to the LDpred-Inf
method. A detailed tutorial for LDpred2 is available at https://privefl.github.io/bigsnpr/articles/LDpred2.html.
In this practical, we will calculate the SBLUP PGS weights “manually” in
R.
Method 3: SBayes C
The Summary-data based BayesC (SBayesC) method is an approximation of the Bayes C method. This method is built upon a more flexible assumption regarding the distribution of SNP effects that only allows a subset of all \(M\) SNPs to affect the trait/disease of interest. We denote \(\pi\) the proportion of SNPs with a non-zero effect on the trait / disease. (S)BayesC implements a prior mixture distribution for the joint SNP effects \(\mathbf{b} = \left(b_1,\ldots,b_M\right)\) defined as
\[ b_j \sim \pi \mathcal{N}\left(0,\sigma^2_b\right) + (1-\pi) \delta_0\]
Note that the SBayesC method is equivalent to the LDpred
(non-infinitesimal) method. In this practical, we will use the GCTB
software to fit SBayes C.
Overview of the data
We provide GWAS summary statistics for two (simulated) traits
hereafter denoted Trait1
and Trait2
. For
simplicity, we will focus summary statistics of \(M=32,260\) SNPs located on chromosome 20.
These two GWAS were conducted in \(N=348,501\) unrelated European ancestry
participants from the UK Biobank (data accessed under project 12505)
using the fastGWA
module of the GCTA
software
(Yang et al. 2011).
We will assess prediction \(n=2504\)
samples from the 1000 Genomes Project (1KG) with different ancestries.
Ancestries groups are European (EUR; N=503), East-Asian (EAS; N=504),
South-Asian (SAS; N=489), Admixed individuals from the Americas (AMR;
N=347) and African (AFR; N=661). We provide genotypes under binary PLINK
format (three files: 1kg_hm3.bed
, 1kg_hm3.bim
and 1kg_hm3.fam
) for these 2504 samples and phenotypes
(files named: 1kg.Trait1.phen
and
1kg.Trait2.phen
). For more details about this format please
visit PLINK website (https://www.cog-genomics.org/plink/1.9/formats#bed).
Create a folder for the practical and move in that folder
mkdir Practical6
cd Practical6
Copy the data needed for the practical.
cp /data/SISG2022M15/data/1kg_hm3.* .
fastGWA
cp /data/SISG2022M15/data/*.fastGWA .
cp /data/SISG2022M15/data/*.phen .
cp /data/SISG2022M15/data/1kg-sample-2504-phased.txt .
cp /data/SISG2022M15/data/EUR.id .
Create a sub-folder to store all intermediate files generated for this method.
mkdir C+PT
We can run a first example using Trait1
with following
clumping parameters for PLINK: p-value below 5e-8 and squared
correlation below r2_thres=0.1
to determine independence
between SNPs located within a window window_kb=1000
kb
(=1,000,000 base pairs). The command lines below allow you to that
window_kb=1000 # 1000 kb = 1 Mb window
r2_thresh=0.1 # LD threshold for clumping
trait=Trait1
pv_thresh=5e-8
PLINK=/data/SISG2022M15/exe/plink
${PLINK} --bfile 1kg_hm3 \
--keep EUR.id \
--clump ${trait}.fastGWA \
--clump-kb ${window_kb} \
--clump-p1 ${pv_thresh} \
--clump-p2 ${pv_thresh} \
--clump-r2 ${r2_thresh} \
--out C+PT/${trait}_rsq_${r2_thresh}_p_below_${pv_thresh}
Question 1. How many SNPs were selected using the command above?
We will now consider multiple significance thresholds and also analyse the two traits. You could use the code below
for pv_thresh in 5e-8 5e-7 5e-6 5e-5 5e-4 5e-3 5e-2
do
for trait in Trait1 Trait2
do
${PLINK} --bfile 1kg_hm3 \
--keep EUR.id \
--clump ${trait}.fastGWA \
--clump-kb ${window_kb} \
--clump-p1 ${pv_thresh} \
--clump-p2 ${pv_thresh} \
--clump-r2 ${r2_thresh} \
--out C+PT/${trait}_rsq_${r2_thresh}_p_below_${pv_thresh} --silent
done
done
Now we can calculate the PGS with PLINK using the
--score
command (help: https://www.cog-genomics.org/plink/1.9/score).
## Calculate PGS for each predictor
for pv_thresh in 5e-8 5e-7 5e-6 5e-5 5e-4 5e-3 5e-2
do
for trait in Trait1 Trait2
do
${PLINK} --bfile 1kg_hm3 \
--score ${trait}.fastGWA 2 4 8 sum center \
--extract C+PT/${trait}_rsq_${r2_thresh}_p_below_${pv_thresh}.clumped \
--out C+PT/${trait}_rsq_${r2_thresh}_p_below_${pv_thresh}.pred --silent
done
done
PGS calculated using the command above are stored in the
*.profile
files (last column named SCORESUM
).
We can now load them in R
and calculate prediction accuracy
in each population.
options(stringsAsFactors = FALSE)
r2_thr <- 0.1
pops <- read.table("1kg-sample-2504-phased.txt",h=T) ## This contains ancestry information
trait <- "Trait1"
phen <- read.table(paste0("1kg.",trait,".phen"),h=T)[,-1] ## Read phenotype
colnames(phen) <- c("IID",trait)
## Merge with all PGS
Threshs <- 8:2
for(thresh in Threshs){
filename <- paste0("C+PT/",trait,"_rsq_",r2_thr,"_p_below_5e-",thresh,".pred.profile")
tmp <- read.table(filename,h=T)[,c(2,6)]
colnames(tmp)[2] <- paste0("P",thresh)
phen <- merge(phen,tmp,by="IID")
}
## This is the merged data containing all PGS
## P8 denotes the PGS calculated using SNPs with p-value <5e-8
phen <- merge(phen,pops,by.x="IID",by.y="sample")
## Look at prediction accuracy
# across the sample
summary(lm(paste0(trait,"~P8"),phen))
## within European ancestry individuals
summary(lm(paste0(trait,"~P8"),phen[which(phen$super_pop=="EUR"),]))
## within African ancestry individuals
summary(lm(paste0(trait,"~P8"),phen[which(phen$super_pop=="AFR"),]))
## Select best threshold
## Across all samples
RsqOverall <- rep(NA,length(Threshs))
names(RsqOverall) <- paste0("P",Threshs)
for(thresh in Threshs){
idThresh <- paste0("P",thresh)
RsqOverall[idThresh] <- cor(phen[,trait],phen[,idThresh])^2
}
## Within each ancestry group
ancestries <- c("EUR","SAS","EAS","AMR","AFR")
RsqPerAnc <- matrix(NA,nrow=length(ancestries),ncol=length(Threshs))
colnames(RsqPerAnc) <- paste0("P",Threshs)
rownames(RsqPerAnc) <- ancestries
for(thresh in Threshs){
idThresh <- paste0("P",thresh)
for(ancestry in ancestries){
idAnc <- which(phen[,"super_pop"]==ancestry)
RsqPerAnc[ancestry,idThresh] <- cor(phen[idAnc,trait],phen[idAnc,idThresh])^2
}
}
Question 2. What is the best prediction accuracy in EUR
individuals across all significance thresholds? Is the best accuracy
reached using the same significance threshold consistently across all
ancestry groups? In which ancestry group is the prediction higher? Do
you see any difference between Trait1
and
Trait2
?
If you have time and answered the question above then you could split the EUR sample in 2 (random) subsets, select the threshold that yields the larger accuracy in subset1, then re-evaluate the prediction accuracy of the corresponding PGS in subset 2. Compare the resulting prediction accuracy with your answer to the first part of Question 2 above. This process is called cross-validation.
Create a folder to store all intermediate files generated for this method.
mkdir SBLUP
The SNP-based heritability of Trait1
and
Trait2
is \(h^2_{SNP} =
0.2\). As shown above, the SBLUP uses a LD reference. Here we
provide LD data calculated with PLINK in \(N=348,501\) unrelated European ancestry
participants from the UK Biobank using the following command.
## Calculate PGS for each predictor
${PLINK} \
--bfile UKBu_chrom20 \
--chr 20 \
--ld-window 999999999 \
--ld-window-kb 1000 \
--ld-window-r2 0.00 \
--r 'yes-really' 'gz' \
--out plinkLDMat_chrom20
We only show this command out of
general interest. For the sake of time, we do not advise running this
command during the practical but you could try it later at home. Note
that the data named UKBu_chrom20
has not been provided but
you could replace it another dataset, e.g., with
1kg_hm3
.
You will see below an R script to run SBLUP. This code is provided in
an external R
script named sblup.R
. You can
run it using the following command
## Calculate PGS for each predictor
Rscript /data/SISG2022M15/data/sblup.R Trait1
The command above will generate a file named
SBLUP/Trait1.sblupInR.res
.
Question 3. Using the examples above, 1) calculate the PGS
using SNP effects from SBLUP (for Trait1
and
Trait2
) 2) assess prediction accuracy in the test sample 3)
compare your results with the prediction accuracy in EUR and non-EUR
individuals obtained using the C+PT method.
## Solution
${PLINK} --bfile 1kg_hm3 --score SBLUP/Trait1.sblubInR.res 1 2 3 sum center --out SBLUP/Trait1.pred.R
It is expected that you try to answer these questions using the R functions introduced above. The below code is a partial solution to the above questions and uses some key functions that you may want to use to answer the above questions.
options(stringsAsFactors = FALSE)
pops <- read.table("1kg-sample-2504-phased.txt",h=T)
trait <- "Trait1"
phen <- read.table(paste0("1kg.",trait,".phen"),h=T)[,-1] ## Read phenotype
colnames(phen) <- c("IID",trait)
sblup <- read.table(paste0("SBLUP/",trait,".pred.R.profile"),h=T)[,c(2,6)]
colnames(sblup)[2] <- "SBLUP"
phen <- merge(phen,sblup,by="IID")
phen <- merge(phen,pops,by.x="IID",by.y="sample")
## Look at prediction accuracy
# across the entire sample
summary(lm(paste0(trait,"~SBLUP"),phen))
## within European ancestry individuals
summary(lm(paste0(trait,"~SBLUP"),phen[which(phen$super_pop=="EUR"),]))
## within African ancestry individuals
summary(lm(paste0(trait,"~SBLUP"),phen[which(phen$super_pop=="AFR"),]))
## Within each ancestry group
ancestries <- c("EUR","SAS","EAS","AMR","AFR")
RsqPerAnc <- rep(NA,length(ancestries))
names(RsqPerAnc) <- ancestries
for(ancestry in ancestries){
idAnc <- which(phen[,"super_pop"]==ancestry)
RsqPerAnc[ancestry] <- cor(phen[idAnc,trait],phen[idAnc,"SBLUP"])^2
}
It is also possible to run SBLUP using GCTA
. This will
require a different formatting of GWAS summary statistics, which we
provide (files named Trait1.ma
and Trait2.ma
).
Note that the same format can also be used with GCTB
to run
SBayesC (see Part 3 below). The implementation of SBLUP
in GCTA
requires genotype data as input (and not a
pre-calculated LD correlation matrix) as well as shrinkage parameter
\(\lambda =
M_{SNPs}\left(\frac{1}{h^2_{SNP}}-1\right)\). An example command
is given below
GCTA=/data/SISG2022M15/exe/gcta64_v1.94
lambda=135628
for trait in Trait1 Trait2
do
${GCTA} \
--bfile 1kg_hm3 \
--chr 20 \
--cojo-file /data/SISG2022M15/data/${trait}.ma \
--cojo-sblup ${lambda} \
--cojo-wind 1000 \
--thread-num 20 \
--out SBLUP/${trait}
done
This command will take approximately 5 min to run. In the interest of time, we recommend not running it and moving to Part 3 instead. More details can be found here: https://yanglab.westlake.edu.cn/software/gcta/#SBLUP.
Create a folder to store all intermediate files generated for this method.
mkdir SBC
SBayesC requires a different format for the LD matrix. We provide a LD matrix directly calculated with GCTB using a command like this
${GCTB}/gctb --bfile sampleForLD --make-full-ldm --snp 1-5000 --out SBC/BLOCK1618.CHROM20
sampleForLD
is a subset of 348,501 unrelated EUR
participants of the UKB. LD correlations on chromosome 20 were
calculated within 38 non-overlapping LD blocks identified in an EUR
sample (see: http://bitbucket.org/nygcresearch/ldetect-data).
Pre-calculated LD matrices can be downloaded from the GCTB
website. We provide these 38 LD submatrices in the folder named
ldm
and paths for each file are listed in
mldm.txt
(note the similarity with GCTA
: in
GCTB
--mldm
indicates that multiple LD
matrices will be used, while in GCTA
--mgrm
indicates that multiple GRMs will be used).
Set the path to GCTB
GCTB=/data/SISG2022M15/exe/gctb
Now, to run SBayes C, you could run the following command.
mldm=/data/SISG2022M15/data/mldm.txt
trait=Trait1
${GCTB} --sbayes C --mldm ${mldm} \
--gwas-summary /data/SISG2022M15/data/${trait}.ma \
--pi 0.0001 --hsq 0.001 \
--chain-length 10000 \
--burn-in 5000 \
--no-mcmc-bin \
--robust \
--out-freq 100 \
--thin 10 \
--out SBC/${trait}
This command will run 10,000 MCMC iterations
(--chain-length
) and output results every 100 iterations
(--out-freq
). Final statistics for parameters and
prediction accuracy will be calculated using 1/10 iterations
(--thin
) after a burn-in of 5000 iterations
(--burn-in
).
Question 4. What is the heritability of Trait1
?
What is the proportion/expected number of SNPs with a non-zero effect on
the trait? What s the genetic and residual variances?.
The command above will generate a file named
SBC/Trait1.snpRes
. We can use that file to calculated a PGS
using the following command
trait=Trait1
${PLINK} --bfile 1kg_hm3 \
--score SBC/${trait}.snpRes 2 5 8 sum center \
--out SBC/${trait}.pred
Assess the prediction accuracy of the SBayesC PGS in R
as done previously
options(stringsAsFactors = FALSE)
pops <- read.table("1kg-sample-2504-phased.txt",h=T)
trait <- "Trait1"
phen <- read.table(paste0("1kg.",trait,".phen"),h=T)[,-1] ## Read phenotype
colnames(phen) <- c("IID",trait)
sblup <- read.table(paste0("SBC/",trait,".pred.profile"),h=T)[,c(2,6)]
colnames(sblup)[2] <- "SBC"
phen <- merge(phen,sblup,by="IID")
phen <- merge(phen,pops,by.x="IID",by.y="sample")
## Look at prediction accuracy
# across the sample
summary(lm(paste0(trait,"~SBC"),phen))
## within European ancestry individuals
summary(lm(paste0(trait,"~SBC"),phen[which(phen$super_pop=="EUR"),]))
## within African ancestry individuals
summary(lm(paste0(trait,"~SBC"),phen[which(phen$super_pop=="AFR"),]))
## Within each ancestry group
ancestries <- c("EUR","SAS","EAS","AMR","AFR")
RsqPerAnc <- rep(NA,length(ancestries))
names(RsqPerAnc) <- ancestries
for(ancestry in ancestries){
idAnc <- which(phen[,"super_pop"]==ancestry)
RsqPerAnc[ancestry] <- cor(phen[idAnc,trait],phen[idAnc,"SBC"])^2
}
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] workflowr_1.7.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 bslib_0.3.1 jquerylib_0.1.4 compiler_3.6.1
[5] pillar_1.7.0 later_1.3.0 git2r_0.30.1 tools_3.6.1
[9] getPass_0.2-2 digest_0.6.25 jsonlite_1.7.2 evaluate_0.15
[13] tibble_3.1.6 lifecycle_1.0.1 pkgconfig_2.0.3 rlang_1.0.4
[17] cli_3.1.1 rstudioapi_0.13 yaml_2.2.1 xfun_0.31
[21] fastmap_1.1.0 httr_1.4.3 stringr_1.4.0 knitr_1.39
[25] sass_0.4.0 fs_1.5.2 vctrs_0.3.8 rprojroot_2.0.3
[29] glue_1.6.1 R6_2.4.1 processx_3.5.3 fansi_0.4.1
[33] rmarkdown_2.14 callr_3.7.0 magrittr_1.5 whisker_0.4
[37] ps_1.7.0 promises_1.2.0.1 htmltools_0.5.2 ellipsis_0.3.2
[41] httpuv_1.6.5 utf8_1.1.4 stringi_1.4.6 crayon_1.3.4