Last updated: 2017-03-06
Code version: c7339fc
Note: the code here is designed to be readable by a beginner, rather than “efficient”. The idea is that you can use this code to learn about the basics of MCMC, but not as a model for how to program well in R!
Any MCMC scheme aims to produce (dependent) samples from a ``target" distribution. In this case we are going to use the exponential distribution with mean 1 as our target distribution. So we start by defining our target density:
target = function(x){
if(x<0){
return(0)}
else {
return( exp(-x))
}
}
Having defined the function, we can now use it to compute a couple of values (just to illustrate the idea of a function):
target(1)
[1] 0.3678794
target(-1)
[1] 0
Next, we will program a Metropolis–Hastings scheme to sample from a distribution proportional to the target
x = rep(0,1000)
x[1] = 3 #this is just a starting value, which I've set arbitrarily to 3
for(i in 2:1000){
currentx = x[i-1]
proposedx = currentx + rnorm(1,mean=0,sd=1)
A = target(proposedx)/target(currentx)
if(runif(1)<A){
x[i] = proposedx # accept move with probabily min(1,A)
} else {
x[i] = currentx # otherwise "reject" move, and stay where we are
}
}
Note that x is a realization of a Markov Chain. We can make a few plots of x:
plot(x)
hist(x)
We can wrap this up in a function to make things a bit neater, and make it easy to try changing starting values and proposal distributions
easyMCMC = function(niter, startval, proposalsd){
x = rep(0,niter)
x[1] = startval
for(i in 2:niter){
currentx = x[i-1]
proposedx = rnorm(1,mean=currentx,sd=proposalsd)
A = target(proposedx)/target(currentx)
if(runif(1)<A){
x[i] = proposedx # accept move with probabily min(1,A)
} else {
x[i] = currentx # otherwise "reject" move, and stay where we are
}
}
return(x)
}
Now we’ll run the MCMC scheme 3 times, and look to see how similar the results are:
z1=easyMCMC(1000,3,1)
z2=easyMCMC(1000,3,1)
z3=easyMCMC(1000,3,1)
plot(z1,type="l")
lines(z2,col=2)
lines(z3,col=3)
par(mfcol=c(3,1)) #rather odd command tells R to put 3 graphs on a single page
maxz=max(c(z1,z2,z3))
hist(z1,breaks=seq(0,maxz,length=20))
hist(z2,breaks=seq(0,maxz,length=20))
hist(z3,breaks=seq(0,maxz,length=20))
Use the function easyMCMC
to explore the following:
target = function(x){
return((x>0 & x <1) + (x>2 & x<3))
}
What does this target look like? What happens if the proposal sd is too small here? (try e.g. 1 and 0.1)
A standard assumption when modelling genotypes of bi-allelic loci (e.g. loci with alleles A and a) is that the population is “randomly mating”. From this assumption it follows that the population will be in “Hardy Weinberg Equilibrium” (HWE), which means that if p is the frequency of the allele A then the genotypes AA, Aa and aa will have frequencies p2,2p(1−p) and (1−p)2 respectively.
A simple prior for p is to assume it is uniform on [0,1]. Suppose that we sample n individuals, and observe nAA with genotype AA, nAa with genotype Aa and naa with genotype aa.
The following R code gives a short MCMC routine to sample from the posterior distribution of p. Try to go through the code to see how it works.
prior = function(p){
if((p<0) || (p>1)){ # || here means "or"
return(0)}
else{
return(1)}
}
likelihood = function(p, nAA, nAa, naa){
return(p^(2*nAA) * (2*p*(1-p))^nAa * (1-p)^(2*naa))
}
psampler = function(nAA, nAa, naa, niter, pstartval, pproposalsd){
p = rep(0,niter)
p[1] = pstartval
for(i in 2:niter){
currentp = p[i-1]
newp = currentp + rnorm(1,0,pproposalsd)
A = prior(newp)*likelihood(newp,nAA,nAa,naa)/(prior(currentp) * likelihood(currentp,nAA,nAa,naa))
if(runif(1)<A){
p[i] = newp # accept move with probabily min(1,A)
} else {
p[i] = currentp # otherwise "reject" move, and stay where we are
}
}
return(p)
}
Running this sample for nAA = 50, nAa = 21, naa=29.
z=psampler(50,21,29,10000,0.5,0.01)
Now some R code to compare the sample from the posterior with the theoretical posterior (which in this case is available analytically; since we observed 121 As, and 79 as, out of 200, the posterior for p is Beta(121+1,79+1).
x=seq(0,1,length=1000)
hist(z,prob=T)
lines(x,dbeta(x,122, 80)) # overlays beta density on histogram
You might also like to discard the first 5000 z’s as “burnin”. Here’s one way in R to select only the last 5000 z’s
hist(z[5001:10000])
Investigate how the starting point and proposal standard deviation affect the convergence of the algorithm.
A slightly more complex alternative than HWE is to assume that there is a tendency for people to mate with others who are slightly more closely-related than “random” (as might happen in a geographically-structured population, for example). This will result in an excess of homozygotes compared with HWE. A simple way to capture this is to introduce an extra parameter, the “inbreeding coefficient” f, and assume that the genotypes AA, Aa and aa have frequencies fp+(1−f)p∗p,(1−f)2p(1−p), and f(1−p)+(1−f)(1−p)(1−p).
In most cases it would be natural to treat f as a feature of the population, and therefore assume f is constant across loci. For simplicity we will consider just a single locus.
Note that both f and p are constrained to lie between 0 and 1 (inclusive). A simple prior for each of these two parameters is to assume that they are independent, uniform on [0,1]. Suppose that we sample n individuals, and observe nAA with genotype AA, nAa with genotype Aa and naa with genotype aa.
Hint: here is a start; you’ll need to fill in the …
fpsampler = function(nAA, nAa, naa, niter, fstartval, pstartval, fproposalsd, pproposalsd){
f = rep(0,niter)
p = rep(0,niter)
f[1] = fstartval
p[1] = pstartval
for(i in 2:niter){
currentf = f[i-1]
currentp = p[i-1]
newf = currentf + ...
newp = currentp + ...
...
}
return(list(f=f,p=p)) # return a "list" with two elements named f and p
}
You could also tackle this problem with a Gibbs Sampler (see vignettes here and here).
To do so you will want to use the following “latent variable” representation of the model: zi∼Bernoulli(f) p(gi=AA|zi=1)=p;p(gi=AA|zi=0)=p2 p(gi=Aa|zi=1)=0;p(gi=Aa|zi=0)=2p(1−p) p(gi=aa|zi=1)=(1−p);p(gi=aa|zi=0)=(1−p)2
Summing over zi gives the same model as above: p(gi=AA)=fp+(1−f)p2
Using the above, implement a Gibbs Sampler to sample from the joint distribution of z,f, and p given genotype data g.
Hint: this requires iterating the following steps
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MASS_7.3-45 expm_0.999-0 Matrix_1.2-8
[4] workflowr_0.4.0 rmarkdown_1.3.9004
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 lattice_0.20-34 gtools_3.5.0 digest_0.6.12
[5] rprojroot_1.2 grid_3.3.2 backports_1.0.5 git2r_0.18.0
[9] magrittr_1.5 evaluate_0.10 stringi_1.1.2 tools_3.3.2
[13] stringr_1.2.0 yaml_2.1.14 htmltools_0.3.5 knitr_1.15.1
This site was created with R Markdown