Solution for Practical 6: Prediction using GWAS summary statistics

Last updated: 2023-07-25

Checks: 7 0

Knit directory: SISG2023_Association_Mapping/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20230530)

The command set.seed(20230530) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 02dbf6c

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 02dbf6c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Untracked files:
    Untracked:  GWAS.ma
    Untracked:  analysis/SISGM15_prac9Solution.Rmd
    Untracked:  causals.snplist
    Untracked:  ldRef.bed
    Untracked:  ldRef.bim
    Untracked:  ldRef.fam
    Untracked:  ldRef.log
    Untracked:  ldRef.map
    Untracked:  ldRef.ped
    Untracked:  test1.cma.cojo
    Untracked:  test1.jma.cojo
    Untracked:  test1.ldr.cojo
    Untracked:  test1.log
    Untracked:  test2.cma.cojo
    Untracked:  test2.jma.cojo
    Untracked:  test2.ldr.cojo
    Untracked:  test2.log
    Untracked:  test3.cma.cojo
    Untracked:  test3.jma.cojo
    Untracked:  test3.ldr.cojo
    Untracked:  test3.log
    Untracked:  test4.cma.cojo
    Untracked:  test4.jma.cojo
    Untracked:  test4.ldr.cojo
    Untracked:  test4.log

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/SISGM15_prac6Solution.Rmd) and HTML (docs/SISGM15_prac6Solution.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	02dbf6c	Joelle Mbatchou	2023-07-25	edit

This practical runs on the SISG2023 server.

Preparation

This practical proposes a detailed tutorial for running three popular polygenic score (PGS) methods using GWAS summary statistics as introduced in Lecture 6.

Method 1: C+PT

The first method is classically denoted as “Clumping + P-value Thresholding (C+PT)”. This method is also abbreviated as P+T or C+T in certain publications. In brief, the principle of this method is compare various sets of uncorrelated SNPs (e.g., maximum squared pairwise correlation between allele counts at SNPs in the selected set is \(r^2 \leq 0.1\)) that are associated with the trait / disease as certain p-value threshold (e.g., p<0.01). We then select the set of SNPs that yields the largest prediction accuracy with the trait / disease of interest in a validation set. This method is broadly used because of it simplicity but may not often yield the largest accuracy. In this practical, we will use PLINK and R to determine these optimal sets of SNPs. PGS will be calculated using marginal/GWAS SNP effects as weights.

Method 2: SBLUP

The second method, the Summary-data based BLUP, is an approximation of the BLUP prediction method introduced in Lecture 5. In brief, SBLUP calculates PGS weights \(\mathbf{b}\) using the following equation

\[ \mathbf{b} = \left( \mathbf{R} + \frac{M_{SNP}(1-h_{SNP}^2)}{N_{GWAS}h_{SNP}^2}\mathbf{I}_M \right)^{-1} \boldsymbol{\beta} \] where \(\boldsymbol{\beta}\) is the vector of GWAS SNP effects and \(\mathbf{R}\) a LD correlation matrix between SNPs estimated from a reference sample. Note that the SBLUP method is equivalent to the LDpred-Inf method. A detailed tutorial for LDpred2 is available at https://privefl.github.io/bigsnpr/articles/LDpred2.html. In this practical, we will calculate the SBLUP PGS weights “manually” in R.

Method 3: SBayes C

The Summary-data based BayesC (SBayesC) method is an approximation of the Bayes C method. This method is built upon a more flexible assumption regarding the distribution of SNP effects that only allows a subset of all \(M\) SNPs to affect the trait/disease of interest. We denote \(\pi\) the proportion of SNPs with a non-zero effect on the trait / disease. (S)BayesC implements a prior mixture distribution for the joint SNP effects \(\mathbf{b} = \left(b_1,\ldots,b_M\right)\) defined as

\[ b_j \sim \pi \mathcal{N}\left(0,\sigma^2_b\right) + (1-\pi) \delta_0\]

Note that the SBayesC method is equivalent to the LDpred (non-infinitesimal) method. In this practical, we will use the GCTB software to fit SBayes C.

Overview of the data

We provide GWAS summary statistics for two (simulated) traits hereafter denoted Trait1 and Trait2. For simplicity, we will focus summary statistics of \(M=32,260\) SNPs located on chromosome 20. These two GWAS were conducted in \(N=348,501\) unrelated European ancestry participants from the UK Biobank (data accessed under project 12505) using the fastGWA module of the GCTA software (Yang et al. 2011).

We will assess prediction \(n=2504\) samples from the 1000 Genomes Project (1KG) with different ancestries. Ancestries groups are European (EUR; N=503), East-Asian (EAS; N=504), South-Asian (SAS; N=489), Admixed individuals from the Americas (AMR; N=347) and African (AFR; N=661). We provide genotypes under binary PLINK format (three files: 1kg_hm3.bed, 1kg_hm3.bim and 1kg_hm3.fam) for these 2504 samples and phenotypes (files named: 1kg.Trait1.phen and 1kg.Trait2.phen). For more details about this format please visit PLINK website (https://www.cog-genomics.org/plink/1.9/formats#bed).

Create a folder for the practical and move in that folder

mkdir Practical6
cd Practical6

Copy the data needed for the practical.

Copy the genotype files (data from 1000 Genomes Project; 1KG)

cp /data/SISG2023M15/data/1kg_hm3.* .

Copy GWAS summary statistics generated using fastGWA

cp /data/SISG2023M15/data/*.fastGWA .

Copy phenotype data for the validation sample used to assess prediction accuracy

cp /data/SISG2023M15/data/*.phen .

Copy ancestry information for the validation sample

cp /data/SISG2023M15/data/1kg-sample-2504-phased.txt .

Copy list of 1KG samples with European ancestries

cp /data/SISG2023M15/data/EUR.id .

Part 1. Prediction using C+PT.

Create a sub-folder to store all intermediate files generated for this method.

mkdir C+PT

We can run a first example using Trait1 with following clumping parameters for PLINK: p-value below 5e-8 and squared correlation below r2_thres=0.1 to determine independence between SNPs located within a window window_kb=1000 kb (=1,000,000 base pairs). The command lines below allow you to that

window_kb=1000 # 1000 kb = 1 Mb window
r2_thresh=0.1  # LD threshold for clumping
trait=Trait1
pv_thresh=5e-8

PLINK=/data/SISG2023M15/exe/plink
${PLINK} --bfile 1kg_hm3 \
      --keep EUR.id \
      --clump ${trait}.fastGWA \
      --clump-kb ${window_kb} \
      --clump-p1 ${pv_thresh} \
      --clump-p2 ${pv_thresh} \
      --clump-r2 ${r2_thresh} \
      --out C+PT/${trait}_rsq_${r2_thresh}_p_below_${pv_thresh}

Question 1. How many SNPs were selected using the command above?

The command above has selected 450 SNPs (in fact: clumps = groups of SNPs).

We will now consider multiple significance thresholds and also analyse the two traits. You could use the code below

for pv_thresh in 5e-8 5e-7 5e-6 5e-5 5e-4 5e-3 5e-2
do
  for trait in Trait1 Trait2
    do
    ${PLINK} --bfile 1kg_hm3 \
      --keep EUR.id \
      --clump ${trait}.fastGWA \
      --clump-kb ${window_kb} \
      --clump-p1 ${pv_thresh} \
      --clump-p2 ${pv_thresh} \
      --clump-r2 ${r2_thresh} \
      --out C+PT/${trait}_rsq_${r2_thresh}_p_below_${pv_thresh} --silent
  done
done

Now we can calculate the PGS with PLINK using the --score command (help: https://www.cog-genomics.org/plink/1.9/score).

## Calculate PGS for each predictor
for pv_thresh in 5e-8 5e-7 5e-6 5e-5 5e-4 5e-3 5e-2
  do
  for trait in Trait1 Trait2
    do
    ${PLINK} --bfile 1kg_hm3 \
      --score ${trait}.fastGWA 2 4 8 sum center \
      --extract C+PT/${trait}_rsq_${r2_thresh}_p_below_${pv_thresh}.clumped \
      --out C+PT/${trait}_rsq_${r2_thresh}_p_below_${pv_thresh}.pred --silent
  done
done

PGS calculated using the command above are stored in the *.profile files (last column named SCORESUM). We can now load them in R and calculate prediction accuracy in each population.

options(stringsAsFactors = FALSE)
r2_thr <- 0.1
pops   <- read.table("1kg-sample-2504-phased.txt",h=T) ## This contains ancestry information
trait  <- "Trait2"
phen   <- read.table(paste0("1kg.",trait,".phen"),h=T)[,-1] ## Read phenotype
colnames(phen) <- c("IID",trait)

## Merge with all PGS
Threshs <- 8:2
for(thresh in Threshs){
  filename <- paste0("C+PT/",trait,"_rsq_",r2_thr,"_p_below_5e-",thresh,".pred.profile")
  tmp <- read.table(filename,h=T)[,c(2,6)]
  colnames(tmp)[2] <- paste0("P",thresh)
  phen <- merge(phen,tmp,by="IID")
}

## This is the merged data containing all PGS
## P8 denotes the PGS calculated using SNPs with p-value <5e-8
phen <- merge(phen,pops,by.x="IID",by.y="sample")

## Look at prediction accuracy
# across the sample
summary(lm(paste0(trait,"~P8"),phen)) 

## within European ancestry individuals
summary(lm(paste0(trait,"~P8"),phen[which(phen$super_pop=="EUR"),])) 

## within African  ancestry individuals
summary(lm(paste0(trait,"~P8"),phen[which(phen$super_pop=="AFR"),])) 

## Select best threshold
## Across all samples
RsqOverall <- rep(NA,length(Threshs))
names(RsqOverall) <- paste0("P",Threshs)
for(thresh in Threshs){
  idThresh <- paste0("P",thresh)
  RsqOverall[idThresh] <- cor(phen[,trait],phen[,idThresh])^2
}

## Within each ancestry group
ancestries <- c("EUR","SAS","EAS","AMR","AFR")
RsqPerAnc  <- matrix(NA,nrow=length(ancestries),ncol=length(Threshs))
colnames(RsqPerAnc) <- paste0("P",Threshs)
rownames(RsqPerAnc) <- ancestries
for(thresh in Threshs){
  idThresh <- paste0("P",thresh)
  for(ancestry in ancestries){
    idAnc <- which(phen[,"super_pop"]==ancestry)
    RsqPerAnc[ancestry,idThresh] <- cor(phen[idAnc,trait],phen[idAnc,idThresh])^2
  }
}
RsqPerAnc

Question 2.

What is the best prediction accuracy in EUR individuals across all significance thresholds?

The best threshold is reached for P<5E-7 (see R code below)

bestIndex <- which.max(RsqPerAnc["EUR",])
paste0("5E-",Threshs[bestIndex])

Is the best accuracy reached using the same significance threshold consistently across all ancestry groups?

No. Most of the times, SNPs with P<5E-8 perform better across all ancestries but accuracy in maximized in EUR with P<5E-7 and in EAS with P<5E-5 (see R code below)

bestIndex <- apply(RsqPerAnc,1,which.max)
rbind(names(bestIndex),paste0("5E-",Threshs[bestIndex]))

In which ancestry group is the prediction higher?

The largest prediction accuracy is observed in Admixed individuals from the Americas (AMR group; \(R^2_{max} = 0.147\)) (see R code below)

bestAccuracies <- apply(RsqPerAnc,1,max)
which.max( bestAccuracies )

Do you see any difference between Trait1 and Trait2?

For Trait2, the best accuracy in EUR is reached for SNPs with P<5E-2; Consistent with Trait1, the best within-ancestry prediction accuracy is not ncessarily reached with the same significance threshold; The largest accuracy is observed in EUR this time (\(R^2_{max} = 0.12\)).

If you have time and answered the question above then you could split the EUR sample in 2 (random) subsets, select the threshold that yields the larger accuracy in subset1, then re-evaluate the prediction accuracy of the corresponding PGS in subset 2. Compare the resulting prediction accuracy with your answer to the first part of Question 2 above. This process is called cross-validation.

Part 2. Prediction using SBLUP.

Create a folder to store all intermediate files generated for this method.

(Back to the terminal!)

mkdir SBLUP

The SNP-based heritability of Trait1 and Trait2 is \(h^2_{SNP} = 0.2\). As shown above, the SBLUP uses a LD reference. Here we provide LD data calculated with PLINK in \(N=348,501\) unrelated European ancestry participants from the UK Biobank using the following command.

## Calculate PGS for each predictor
${PLINK} \
  --bfile UKBu_chrom20 \
  --chr 20 \
  --ld-window 999999999 \
  --ld-window-kb 1000 \
  --ld-window-r2 0.00 \
  --r 'yes-really' 'gz' \
  --out plinkLDMat_chrom20

We only show this command out of general interest. For the sake of time, we do not advise running this command during the practical but you could try it later at home. Note that the data named UKBu_chrom20 has not been provided but you could replace it another dataset, e.g., with 1kg_hm3.

You will see below an R script to run SBLUP. This code is provided in an external R script named sblup.R. You can run it using the following command

## Calculate PGS for each predictor
Rscript /data/SISG2023M15/data/sblup.R Trait1

The command above will generate a file named SBLUP/Trait1.sblupInR.res.

Question 3. Using the examples above,

1) calculate the PGS using SNP effects from SBLUP (for Trait1 and Trait2)

## Calculate PGS for each SBLUP predictor
for trait in Trait1 Trait2
  do
  ${PLINK} --bfile 1kg_hm3 \
    --score SBLUP/${trait}.sblubInR.res 1 2 3 sum center \
    --out SBLUP/${trait}.pred.R --silent
done

2) assess prediction accuracy in the test sample 3) compare your results with the prediction accuracy in EUR and non-EUR individuals obtained using the C+PT method.

We can note that SBLUP underperforms as compared to C+PT for Trait1 (in EUR: 4.5% vs 9%), while for Trait2 SBLUP yields the best accuracy in EUR (12.4%).

It is expected that you try to answer these questions using the R functions introduced above. The below code is a partial solution to the above questions and uses some key functions that you may want to use to answer the above questions.

options(stringsAsFactors = FALSE)
pops   <- read.table("1kg-sample-2504-phased.txt",h=T)
trait  <- "Trait2"
phen   <- read.table(paste0("1kg.",trait,".phen"),h=T)[,-1] ## Read phenotype
colnames(phen) <- c("IID",trait)
sblup  <- read.table(paste0("SBLUP/",trait,".pred.R.profile"),h=T)[,c(2,6)]

colnames(sblup)[2] <- "SBLUP"
phen <- merge(phen,sblup,by="IID")
phen <- merge(phen,pops,by.x="IID",by.y="sample")

## Look at prediction accuracy
# across the entire sample
summary(lm(paste0(trait,"~SBLUP"),phen)) 

## within European ancestry individuals
summary(lm(paste0(trait,"~SBLUP"),phen[which(phen$super_pop=="EUR"),])) 

## within African  ancestry individuals
summary(lm(paste0(trait,"~SBLUP"),phen[which(phen$super_pop=="AFR"),])) 

## Within each ancestry group
ancestries <- c("EUR","SAS","EAS","AMR","AFR")
RsqPerAnc  <- rep(NA,length(ancestries))
names(RsqPerAnc) <- ancestries
for(ancestry in ancestries){
  idAnc <- which(phen[,"super_pop"]==ancestry)
  RsqPerAnc[ancestry] <- cor(phen[idAnc,trait],phen[idAnc,"SBLUP"])^2
}

It is also possible to run SBLUP using GCTA. This will require a different formatting of GWAS summary statistics, which we provide (files named Trait1.ma and Trait2.ma). Note that the same format can also be used with GCTB to run SBayesC (see Part 3 below). The implementation of SBLUP in GCTA requires genotype data as input (and not a pre-calculated LD correlation matrix) as well as shrinkage parameter \(\lambda = M_{SNPs}\left(\frac{1}{h^2_{SNP}}-1\right)\). An example command is given below

GCTA=/data/SISG2023M15/exe/gcta64_v1.94
lambda=135628
for trait in Trait1 Trait2
  do
  ${GCTA} \
  --bfile 1kg_hm3 \
  --chr 20 \
  --cojo-file /data/SISG2023M15/data/${trait}.ma \
  --cojo-sblup ${lambda} \
  --cojo-wind 1000 \
  --thread-num 20 \
  --out SBLUP/${trait}
done

This command will take approximately 5 min to run. In the interest of time, we recommend not running it and moving to Part 3 instead. More details can be found here: https://yanglab.westlake.edu.cn/software/gcta/#SBLUP.

Part 3. Prediction using SBayes C

Create a folder to store all intermediate files generated for this method.

mkdir SBC

SBayesC requires a different format for the LD matrix. We provide a LD matrix directly calculated with GCTB using a command like this

${GCTB} --bfile sampleForLD --make-full-ldm --snp 1-5000 --out SBC/BLOCK1618.CHROM20

sampleForLD is a subset of 348,501 unrelated EUR participants of the UKB. LD correlations on chromosome 20 were calculated within 38 non-overlapping LD blocks identified in an EUR sample (see: http://bitbucket.org/nygcresearch/ldetect-data). Pre-calculated LD matrices can be downloaded from the GCTB website. We provide these 38 LD submatrices in the folder named ldm and paths for each file are listed in mldm.txt (note the similarity with GCTA: in GCTB --mldm indicates that multiple LD matrices will be used, while in GCTA --mgrm indicates that multiple GRMs will be used).

Set the path to GCTB

GCTB=/data/SISG2023M15/exe/gctb

Now, to run SBayes C, you could run the following command.

mldm=/data/SISG2023M15/data/mldm.txt
trait=Trait1

${GCTB} --sbayes C --mldm ${mldm} \
--gwas-summary /data/SISG2023M15/data/${trait}.ma \
--pi 0.0001 --hsq 0.001 \
--chain-length 10000 \
--burn-in 5000 \
--no-mcmc-bin \
--robust \
--out-freq 100 \
--thin 10 \
--out SBC/${trait}

This command will run 10,000 MCMC iterations (--chain-length) and output results every 100 iterations (--out-freq). Final statistics for parameters and prediction accuracy will be calculated using 1/10 iterations (--thin) after a burn-in of 5000 iterations (--burn-in).

Question 4. What is the heritability of Trait1? What is the proportion/expected number of SNPs with a non-zero effect on the trait? What is the genetic and residual variances?.

GCTB displays these results at the end of the run. For Trait1, the heritability is \(h^2 = 0.219\) (S.E. 0.001), the proportion of SNPs with a non-zero effect is \(\pi = 0.8\)% (S.E. 0.06%) and the expected number of SNPs with a non-zero effect is approximately equal to 241 (S.E. 11). The genetic variance is 0.22 (S.E. 0.001) and resiudal variance is 0.78 (S.E. 0.002).

The command above will generate a file named SBC/Trait1.snpRes. We can use that file to calculated a PGS using the following command

trait=Trait1
${PLINK} --bfile 1kg_hm3 \
--score SBC/${trait}.snpRes 2 5 8 sum center \
--out SBC/${trait}.pred

Assess the prediction accuracy of the SBayesC PGS in R as done previously

options(stringsAsFactors = FALSE)
pops   <- read.table("1kg-sample-2504-phased.txt",h=T)
trait  <- "Trait1"
phen   <- read.table(paste0("1kg.",trait,".phen"),h=T)[,-1] ## Read phenotype
colnames(phen) <- c("IID",trait)
sblup  <- read.table(paste0("SBC/",trait,".pred.profile"),h=T)[,c(2,6)]
colnames(sblup)[2] <- "SBC"
phen <- merge(phen,sblup,by="IID")
phen <- merge(phen,pops,by.x="IID",by.y="sample")

## Look at prediction accuracy
# across the sample
summary(lm(paste0(trait,"~SBC"),phen)) 

## within European ancestry individuals
summary(lm(paste0(trait,"~SBC"),phen[which(phen$super_pop=="EUR"),])) 

## within African  ancestry individuals
summary(lm(paste0(trait,"~SBC"),phen[which(phen$super_pop=="AFR"),])) 

## Within each ancestry group
ancestries <- c("EUR","SAS","EAS","AMR","AFR")
RsqPerAnc  <- rep(NA,length(ancestries))
names(RsqPerAnc) <- ancestries
for(ancestry in ancestries){
  idAnc <- which(phen[,"super_pop"]==ancestry)
  RsqPerAnc[ancestry] <- cor(phen[idAnc,trait],phen[idAnc,"SBC"])^2
}

For Trait1, SBayesC yields an accuracy of 12% in EUR (so simular to C+PT and SBLUP) but the accuracy in AMR is much larger (up to 18%).

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] vctrs_0.6.3       cli_3.6.1         knitr_1.43        rlang_1.1.1      
 [5] xfun_0.39         stringi_1.7.12    promises_1.2.0.1  jsonlite_1.8.7   
 [9] workflowr_1.7.0   glue_1.6.2        rprojroot_2.0.3   git2r_0.32.0     
[13] htmltools_0.5.5   httpuv_1.6.11     sass_0.4.6        fansi_1.0.4      
[17] rmarkdown_2.23    jquerylib_0.1.4   evaluate_0.21     tibble_3.2.1     
[21] fastmap_1.1.1     yaml_2.3.7        lifecycle_1.0.3   whisker_0.4.1    
[25] stringr_1.5.0     compiler_4.3.1    fs_1.6.2          Rcpp_1.0.11      
[29] pkgconfig_2.0.3   rstudioapi_0.15.0 later_1.3.1       digest_0.6.33    
[33] R6_2.5.1          utf8_1.2.3        pillar_1.9.0      magrittr_2.0.3   
[37] bslib_0.5.0       tools_4.3.1       cachem_1.0.8

Solution for Practical 6: Prediction using GWAS summary statistics

Summer Institute of Statical Genetics (Module 15)

2023-07-25

Preparation

Part 1. Prediction using C+PT.

Part 2. Prediction using SBLUP.

Part 3. Prediction using SBayes C