Last updated: 2023-06-29

Checks: 7 0

Knit directory: SISG2023_Association_Mapping/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20230530) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 8d74414. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    data/sim_rels_geno.bed
    Ignored:    exe/
    Ignored:    mk_website.R
    Ignored:    tmp/

Untracked files:
    Untracked:  .mk_website.R.swp
    Untracked:  analysis/SISGM15_prac4Solution.Rmd
    Untracked:  analysis/SISGM15_prac5Solution.Rmd
    Untracked:  analysis/SISGM15_prac6Solution.Rmd
    Untracked:  analysis/SISGM15_prac9Solution.Rmd
    Untracked:  analysis/Session01_practical_Key.Rmd
    Untracked:  analysis/Session02_practical_Key.Rmd
    Untracked:  analysis/Session03_practical_Key.Rmd
    Untracked:  analysis/Session07_practical_Key.Rmd
    Untracked:  analysis/Session08_practical_Key.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/SISGM15_prac4.Rmd) and HTML (docs/SISGM15_prac4.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 5163a3d Joelle Mbatchou 2023-06-29 update year
Rmd 1f3a4dc Joelle Mbatchou 2023-06-26 add files for website

This practical can be run in R on your own computer.

Part 1: Power of GWAS under single locus model

In Lecture 4, we saw proposed closed-form formula for the non-centrality parameter (and so the power) of an association test aiming at detecting allele frequency differences between cases and controls. We recall this formula below

\[\begin{equation} \text{NCP} = \frac{ 2N \pi (1-\pi)\left(p_1 - p_0 \right)^2 }{ \bar{p}(1-\bar{p}) } \end{equation}\]

where, \(N\) is the total sample size, \(\pi\) the proportion of cases in the sample, \(p_1\) and \(p_0\) the (expected) risk allele frequency in cases and controls, respectively; and \(\bar{p} = \pi p_1 + (1-\pi)p_0\).

Under the single locus disease model, \(p_1\) and \(p_0\) can be expressed

\[\begin{equation} p_1 = \frac{pR}{pR + 1 - p}\text{ and }p_0 = \frac{p}{1-K}\left[ 1 - \frac{KR}{1 + p(R-1)} \right], \end{equation}\]

where \(R\) is the relative risk for heterozygotes for the risk allele, \(K\) is the disease prevalence in the population and \(p\) the risk allele frequency in the population.

The R code below defines a function to calculate the expected statistical power using the set of equations written above. Copy this code into your R session to load the function.

PowerCC <- function(N=10000,p=0.2,pi=0.5,alpha=5e-8,R=1.2,K=0.01){
  p1  <- p * R  / (p * R + 1 - p)
  p0  <- ( p/(1-K) ) * (1 - K*R/(1 + p*(R-1) ))
  pb  <- p1*pi + (1-pi)*p0
  NCP <- 2*N*pi*(1-pi)*((p1-p0)^2) / (pb*(1-pb))
  Theshold <- qchisq(p=alpha,df=1,lower.tail = FALSE)
  Power    <- pchisq(q=Theshold,df=1,ncp=NCP,lower.tail = FALSE)
  c(p0=p0,p1=p1,NCP=NCP,Power=Power)
} 

Question 1. Use the online power calculator (https://zzz.bwh.harvard.edu/gpc/cc2.html) and verify that you get the same power calculations as when using the R function above. Try different input parameters.

Question 2. The online power calculator does not necessarily assume a multiplicative model. Now consider a dominant model, where the relative risk of heterozygotes (field named: “Genotype relative risk Aa”) is exactly the same as that of homozygotes (field named “Genotype relative risk AA”) for the risk allele. How does it impact your power calculations as compared to that predicted from using the function PowerCC defined above?

So, we’re back to using a simple multiplicative model and we want to verify empirically if our power calculations are correct. For that, we will

  1. define a statistical test, here a simple t-test comparing mean allele count between cases and controls

  2. simulate data under using various input parameters and count how many times does the p-value from our test is lower than the expected significance threshold.

The R function below (i) simulates data under a single locus disease model (i.e., uses expectations of \(p_0\) and \(p_1\) defined from knowing \(R\) and \(K\)), (ii) run a t-test and (iii) returns the p-value from the t-test.

Copy this code into your R session to load the function.

SimBasedPower <- function(N=10000,p=0.2,pi=0.5,R=1.2,K=0.01){
  ## Expected allele frequencies under single locus disease model
  p1  <- p * R  / (p * R + 1 - p)
  p0  <- ( p/(1-K) ) * (1 - K*R/(1 + p*(R-1) ))
  
  ## Expected counts
  n1 <- N*pi     # number of cases
  n0 <- N*(1-pi) # number of controls

  ## Phenotypes -- not used here but might be handy for the bonus question
  y1 <- rep(1,n1)  # vector of 1's - case status
  y0 <- rep(0,n0)  # vector of 0's - control status
  y  <- c(y1,y0)   # vector phenotypes (y=1: cases; y=0: controls)
  
  ## Simulate genotypes
  x1 <- rbinom(n=n1,size=2,prob=p1) # genotypes of cases
  x0 <- rbinom(n=n0,size=2,prob=p0) # genotypes of controls
  x  <- c(x1,x0)
  
  ## T-test
  tt <- t.test(x1,x0)
  return(tt$p.value)
} 

Question 3. Use the function above to simulate 10,000 replicates and count how many times p-values are below the input significance threshold. Is this consistent with our expectations? Bonus: modify the code above to replace the t-test with the p-value from a regression model of the phenotype on the allele counts (glm(y~x,family="binomial)). Is the empirical power still consistent with the expected statistical power?

B <- 10000
Pvals <- sapply(1:B, function(k) SimBasedPower(N=10000,p=0.2,pi=0.5,R=1.2,K=0.01) )
EmpiricalPower <- mean(Pvals<5e-8)
EmpiricalPower

Part 2. Power of a two-stage GWAS approach.

This part of the practical aims at evaluating the statistical power of a two-stage GWAS strategy. In the old days of GWAS, investigators would first run a GWAS on a relatively small number of individuals (\(N_d\), for discovery), select a fraction of the most associated SNPs (e.g., with a p-value lower than 0.05), genotype those top SNPs in a larger number of individuals (\(N_r\), for replication) and re-test those for association.

Deriving a closed-form formula for the statistical power of this approach is not straightforward. Therefore, we will use simulations. The following R code simulates such a strategy. In the first stage, M=500 SNPs are tested for association in Nd=2000 individuals in the discovery sample. Only one of those SNPs is actually associated with the trait and explains \(q^2=0.3\%\) of trait variance. The first stage consists of selecting SNPs with a p-value lower than 0.005.

Given that the trait of interest is quantitative, we can calculate the statistical power the first stage using the NCP formula for quantitative traits given in the \[ \text{NCP}({q^2},N) = \frac{Nq^2}{1-q^2} \] Using this formula and a significance threshold \(\alpha_1=0.005\) for the first stage, we can estimate the statistical power of the first stage to be about 36%. The second stage involves genotyping \(N_r=5,000\) individuals. If we could genotype all individuals in the discovery and replication samples for all SNPs, then the statistical power for a GWAS in \(N=N_d + N_r\), would have been about 71% (calculated using \(NCP(q^2=0.003,N=N_d+N_r)\)).

The R code below simulates the two-stage strategy described above and returns the smallest p-value obtained. Copy this code into your R session to load the function.

PowerTwoStages <- function(Nd=2000,
                           Nr=5000,
                           q2=0.003,
                           Md=500,
                           alpha1=5e-3,
                           alpha2=5e-8,
                           verbose=FALSE){
  ## Power stage 1
  NCP = Nd * q2 / (1 - q2)
  Theshold <- qchisq(p=alpha1,df=1,lower.tail = FALSE)
  Power1   <- pchisq(q=Theshold,df=1,ncp=NCP,lower.tail = FALSE)
  
  if(verbose){
    cat(paste0("Statistical Power of Stage 1 = ",round(100*Power1),"%.\n"))
  }
  
  ## Stage 1
  Xd <- matrix(rbinom(n=Nd*Md,size=2,prob=0.5),nrow=Nd,ncol=Md)
  colnames(Xd) <- paste0("SNP",1:Md)
  b  <- sqrt(2 * q2)
  Yd <- rnorm(n=Nd,mean=b * Xd[,"SNP1"],sd=sqrt(1-q2))
  
  GWAS1 <- do.call("rbind",lapply(1:Md, function(j){
    summary(lm(Yd~Xd[,j]))$coefficients[2,]
  }))
  rownames(GWAS1) <- colnames(Xd)
  topSNPs <- rownames(GWAS1[which(GWAS1[,4]<alpha1),])

  ## Power stage 1+2
  NCP = (Nr + Nd) * q2 / (1 - q2)
  Theshold <- qchisq(p=alpha2,df=1,lower.tail = FALSE)
  Power2   <- pchisq(q=Theshold,df=1,ncp=NCP,lower.tail = FALSE)
  Power2
  if(verbose){
    cat(paste0("Expected Statistical Power of Stage 1+2 = ",round(100*Power2),"%.\n"))
  }
  
  ## Stage 2
  Mr <- length(topSNPs)
  if(Mr>1){
    Xr <- matrix(rbinom(n=Nr*Mr,size=2,prob=0.5),nrow=Nr,ncol=Mr)
    colnames(Xr) <- topSNPs
    if("SNP1"%in%topSNPs){
      Yr <- rnorm(n=Nr,mean=b * Xr[,"SNP1"],sd=sqrt(1-q2))
    }else{
      Yr <- rnorm(n=Nr)
    }
    
    GWAS2 <- do.call("rbind",lapply(1:Mr, function(j){
      summary(lm(Yr~Xr[,j]))$coefficients[2,]
    }))
    rownames(GWAS2) <- colnames(Xr)
    
    ## Meta-analysis
    gwas1 <- GWAS1[topSNPs,]
    gwas2 <- GWAS2[topSNPs,]
    
    bm    <- (gwas1[,1] * Nd + gwas2[,1] * Nr) / (Nd + Nr)
    sm    <- sqrt( (gwas1[,2]^2 * Nd^2 + gwas2[,2]^2 * Nr^2) /( (Nd + Nr)^2 ) )
    pval  <- pchisq(q=(bm/sm)^2,df=1,lower.tail = F)
    return(min(pval))
  }else{
    return(NULL)
  }
}

## Example
PowerTwoStages(Nd=2000,Nr=5000,q2=0.003,Md=1000,alpha1=5e-3,alpha2=5e-8,verbose = TRUE)

Question 4. Run at least 100 replicates and quantify the empirical statistical power, i.e., proportion of p-values lower than \(\alpha_2=5\times 10^{-8}\). Can you improve power by varying \(N_d\) and \(N_r\) (try a few more combinations)?

To answer these questions, you could use the following R commands

nRep <- 100 ## Number of replicates
Pvals <- sapply(1:nRep,function(k){
  PowerTwoStages(Nd=2000,Nr=5000,q2=0.003,Md=500,alpha1=5e-3,alpha2=5e-8,verbose=FALSE)
})
PowerEmpirical <- mean(unlist(Pvals)<5e-8)
print(PowerEmpirical)

sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] vctrs_0.6.2      cli_3.6.1        knitr_1.43       rlang_1.1.1     
 [5] xfun_0.39        stringi_1.7.12   promises_1.2.0.1 jsonlite_1.8.5  
 [9] workflowr_1.7.0  glue_1.6.2       rprojroot_2.0.3  git2r_0.32.0    
[13] htmltools_0.5.5  httpuv_1.6.11    sass_0.4.6       fansi_1.0.4     
[17] rmarkdown_2.22   evaluate_0.21    jquerylib_0.1.4  tibble_3.2.1    
[21] fastmap_1.1.1    yaml_2.3.7       lifecycle_1.0.3  whisker_0.4.1   
[25] stringr_1.5.0    compiler_4.3.0   fs_1.6.2         Rcpp_1.0.10     
[29] pkgconfig_2.0.3  rstudioapi_0.14  later_1.3.1      digest_0.6.31   
[33] R6_2.5.1         utf8_1.2.3       pillar_1.9.0     magrittr_2.0.3  
[37] bslib_0.5.0      tools_4.3.0      cachem_1.0.8