Last updated: 2022-07-25
Checks: 7 0
Knit directory:
SISG2022_Association_Mapping/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20220530)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version bdaaf1a. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: analysis/.Rhistory
Ignored: code/.Rhistory
Ignored: data/.DS_Store
Ignored: lectures/.DS_Store
Unstaged changes:
Modified: analysis/_site.yml
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/SISGM15_prac4.Rmd
) and
HTML (docs/SISGM15_prac4.html
) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote
), click on
the hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | e54c1bf | Joelle Mbatchou | 2022-07-24 | Build site. |
Rmd | 072b219 | Joelle Mbatchou | 2022-07-24 | update exercises |
html | 660617e | Joelle Mbatchou | 2022-07-22 | Build site. |
html | 7453eb7 | Joelle Mbatchou | 2022-07-22 | Build site. |
html | 2741a94 | Joelle Mbatchou | 2022-07-22 | Build site. |
html | 231ce3c | Joelle Mbatchou | 2022-07-22 | Build site. |
Rmd | 7de5878 | Joelle Mbatchou | 2022-07-22 | add loic’s practical exercises |
This practical can be run in
R
on your own computer.
In Lecture 4, we saw proposed closed-form formula for the non-centrality parameter (and so the power) of an association test aiming at detecting allele frequency differences between cases and controls. We recall this formula below
\[\begin{equation} \text{NCP} = \frac{ 2N \pi (1-\pi)\left(p_1 - p_0 \right)^2 }{ \bar{p}(1-\bar{p}) } \end{equation}\]
where, \(N\) is the total sample size, \(\pi\) the proportion of cases in the sample, \(p_1\) and \(p_0\) the (expected) risk allele frequency in cases and controls, respectively; and \(\bar{p} = \pi p_1 + (1-\pi)p_0\).
Under the single locus disease model, \(p_1\) and \(p_0\) can be expressed
\[\begin{equation} p_1 = \frac{pR}{pR + 1 - p}\text{ and }p_0 = \frac{p}{1-K}\left[ 1 - \frac{KR}{1 + p(R-1)} \right], \end{equation}\]
where \(R\) is the relative risk for heterozygotes for the risk allele, \(K\) is the disease prevalence in the population and \(p\) the risk allele frequency in the population.
The R
code below defines a function to calculate the
expected statistical power using the set of equations written above.
Copy this code into your R
session to load the
function.
PowerCC <- function(N=10000,p=0.2,pi=0.5,alpha=5e-8,R=1.2,K=0.01){
p1 <- p * R / (p * R + 1 - p)
p0 <- ( p/(1-K) ) * (1 - K*R/(1 + p*(R-1) ))
pb <- p1*pi + (1-pi)*p0
NCP <- 2*N*pi*(1-pi)*((p1-p0)^2) / (pb*(1-pb))
Theshold <- qchisq(p=alpha,df=1,lower.tail = FALSE)
Power <- pchisq(q=Theshold,df=1,ncp=NCP,lower.tail = FALSE)
c(p0=p0,p1=p1,NCP=NCP,Power=Power)
}
Question 1. Use the online power calculator (https://zzz.bwh.harvard.edu/gpc/cc2.html) and verify
that you get the same power calculations as when using the
R
function above. Try different input
parameters.
Question 2. The online power calculator does not necessarily
assume a multiplicative model. Now consider a dominant model, where the
relative risk of heterozygotes (field named: “Genotype relative risk
Aa”) is exactly the same as that of homozygotes (field named “Genotype
relative risk AA”) for the risk allele. How does it impact your power
calculations as compared to that predicted from using the function
PowerCC
defined above?
So, we’re back to using a simple multiplicative model and we want to verify empirically if our power calculations are correct. For that, we will
define a statistical test, here a simple t-test comparing mean allele count between cases and controls
simulate data under using various input parameters and count how many times does the p-value from our test is lower than the expected significance threshold.
The R
function below (i) simulates data under a single
locus disease model (i.e., uses expectations of \(p_0\) and \(p_1\) defined from knowing \(R\) and \(K\)), (ii) run a t-test and (iii) returns
the p-value from the t-test.
Copy this code into your R
session to load the
function.
SimBasedPower <- function(N=10000,p=0.2,pi=0.5,R=1.2,K=0.01){
## Expected allele frequencies under single locus disease model
p1 <- p * R / (p * R + 1 - p)
p0 <- ( p/(1-K) ) * (1 - K*R/(1 + p*(R-1) ))
## Expected counts
n1 <- N*pi # number of cases
n0 <- N*(1-pi) # number of controls
## Phenotypes -- not used here but might be handy for the bonus question
y1 <- rep(1,n1) # vector of 1's - case status
y0 <- rep(0,n0) # vector of 0's - control status
y <- c(y1,y0) # vector phenotypes (y=1: cases; y=0: controls)
## Simulate genotypes
x1 <- rbinom(n=n1,size=2,prob=p1) # genotypes of cases
x0 <- rbinom(n=n0,size=2,prob=p0) # genotypes of controls
x <- c(x1,x0)
## T-test
tt <- t.test(x1,x0)
return(tt$p.value)
}
Question 3. Use the function above to simulate 10,000
replicates and count how many times p-values are below the input
significance threshold. Is this consistent with our expectations? Bonus:
modify the code above to replace the t-test with the p-value from a
regression model of the phenotype on the allele counts
(glm(y~x,family="binomial)
). Is the empirical power still
consistent with the expected statistical power?
B <- 10000
Pvals <- sapply(1:B, function(k) SimBasedPower(N=10000,p=0.2,pi=0.5,R=1.2,K=0.01) )
EmpiricalPower <- mean(Pvals<5e-8)
EmpiricalPower
This part of the practical aims at evaluating the statistical power of a two-stage GWAS strategy. In the old days of GWAS, investigators would first run a GWAS on a relatively small number of individuals (\(N_d\), for discovery), select a fraction of the most associated SNPs (e.g., with a p-value lower than 0.05), genotype those top SNPs in a larger number of individuals (\(N_r\), for replication) and re-test those for association.
Deriving a closed-form formula for the statistical power of this
approach is not straightforward. Therefore, we will use simulations. The
following R
code simulates such a strategy. In the first
stage, M=500
SNPs are tested for association in
Nd=2000
individuals in the discovery sample. Only one of
those SNPs is actually associated with the trait and explains \(q^2=0.3\%\) of trait variance. The first
stage consists of selecting SNPs with a p-value lower than 0.005.
Given that the trait of interest is quantitative, we can calculate the statistical power the first stage using the NCP formula for quantitative traits given in the \[ \text{NCP}({q^2},N) = \frac{Nq^2}{1-q^2} \] Using this formula and a significance threshold \(\alpha_1=0.005\) for the first stage, we can estimate the statistical power of the first stage to be about 36%. The second stage involves genotyping \(N_r=5,000\) individuals. If we could genotype all individuals in the discovery and replication samples for all SNPs, then the statistical power for a GWAS in \(N=N_d + N_r\), would have been about 71% (calculated using \(NCP(q^2=0.003,N=N_d+N_r)\)).
The R
code below simulates the two-stage strategy
described above and returns the smallest p-value obtained. Copy
this code into your R
session to load the
function.
PowerTwoStages <- function(Nd=2000,
Nr=5000,
q2=0.003,
Md=500,
alpha1=5e-3,
alpha2=5e-8,
verbose=FALSE){
## Power stage 1
NCP = Nd * q2 / (1 - q2)
Theshold <- qchisq(p=alpha1,df=1,lower.tail = FALSE)
Power1 <- pchisq(q=Theshold,df=1,ncp=NCP,lower.tail = FALSE)
if(verbose){
cat(paste0("Statistical Power of Stage 1 = ",round(100*Power1),"%.\n"))
}
## Stage 1
Xd <- matrix(rbinom(n=Nd*Md,size=2,prob=0.5),nrow=Nd,ncol=Md)
colnames(Xd) <- paste0("SNP",1:Md)
b <- sqrt(2 * q2)
Yd <- rnorm(n=Nd,mean=b * Xd[,"SNP1"],sd=sqrt(1-q2))
GWAS1 <- do.call("rbind",lapply(1:Md, function(j){
summary(lm(Yd~Xd[,j]))$coefficients[2,]
}))
rownames(GWAS1) <- colnames(Xd)
topSNPs <- rownames(GWAS1[which(GWAS1[,4]<alpha1),])
## Power stage 1+2
NCP = (Nr + Nd) * q2 / (1 - q2)
Theshold <- qchisq(p=alpha2,df=1,lower.tail = FALSE)
Power2 <- pchisq(q=Theshold,df=1,ncp=NCP,lower.tail = FALSE)
Power2
if(verbose){
cat(paste0("Expected Statistical Power of Stage 1+2 = ",round(100*Power2),"%.\n"))
}
## Stage 2
Mr <- length(topSNPs)
if(Mr>1){
Xr <- matrix(rbinom(n=Nr*Mr,size=2,prob=0.5),nrow=Nr,ncol=Mr)
colnames(Xr) <- topSNPs
if("SNP1"%in%topSNPs){
Yr <- rnorm(n=Nr,mean=b * Xr[,"SNP1"],sd=sqrt(1-q2))
}else{
Yr <- rnorm(n=Nr)
}
GWAS2 <- do.call("rbind",lapply(1:Mr, function(j){
summary(lm(Yr~Xr[,j]))$coefficients[2,]
}))
rownames(GWAS2) <- colnames(Xr)
## Meta-analysis
gwas1 <- GWAS1[topSNPs,]
gwas2 <- GWAS2[topSNPs,]
bm <- (gwas1[,1] * Nd + gwas2[,1] * Nr) / (Nd + Nr)
sm <- sqrt( (gwas1[,2]^2 * Nd^2 + gwas2[,2]^2 * Nr^2) /( (Nd + Nr)^2 ) )
pval <- pchisq(q=(bm/sm)^2,df=1,lower.tail = F)
return(min(pval))
}else{
return(NULL)
}
}
## Example
PowerTwoStages(Nd=2000,Nr=5000,q2=0.003,Md=1000,alpha1=5e-3,alpha2=5e-8,verbose = TRUE)
Question 4. Run at least 100 replicates and quantify the empirical statistical power, i.e., proportion of p-values lower than \(\alpha_2=5\times 10^{-8}\). Can you improve power by varying \(N_d\) and \(N_r\) (try a few more combinations)?
To answer these questions, you could use the following R
commands
nRep <- 100 ## Number of replicates
Pvals <- sapply(1:nRep,function(k){
PowerTwoStages(Nd=2000,Nr=5000,q2=0.003,Md=500,alpha1=5e-3,alpha2=5e-8,verbose=FALSE)
})
PowerEmpirical <- mean(unlist(Pvals)<5e-8)
print(PowerEmpirical)
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] workflowr_1.7.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 bslib_0.3.1 jquerylib_0.1.4 compiler_3.6.1
[5] pillar_1.7.0 later_1.3.0 git2r_0.30.1 tools_3.6.1
[9] getPass_0.2-2 digest_0.6.25 jsonlite_1.7.2 evaluate_0.15
[13] tibble_3.1.6 lifecycle_1.0.1 pkgconfig_2.0.3 rlang_1.0.4
[17] cli_3.1.1 rstudioapi_0.13 yaml_2.2.1 xfun_0.31
[21] fastmap_1.1.0 httr_1.4.3 stringr_1.4.0 knitr_1.39
[25] sass_0.4.0 fs_1.5.2 vctrs_0.3.8 rprojroot_2.0.3
[29] glue_1.6.1 R6_2.4.1 processx_3.5.3 fansi_0.4.1
[33] rmarkdown_2.14 callr_3.7.0 magrittr_1.5 whisker_0.4
[37] ps_1.7.0 promises_1.2.0.1 htmltools_0.5.2 ellipsis_0.3.2
[41] httpuv_1.6.5 utf8_1.1.4 stringi_1.4.6 crayon_1.3.4