Last updated: 2022-03-18

Checks: 6 1

Knit directory: ~/Documents/Winter_Quarter_2022/Fundamentals/jaurbanChicago.github.io/bin/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(12345)

The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: a232a33

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version a232a33. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Untracked files:
    Untracked:  .DS_Store
    Untracked:  .Rhistory

Unstaged changes:
    Deleted:    New Folder With Items/.DS_Store
    Deleted:    New Folder With Items/Genetic_Drift_Markov_Chain.Rmd
    Deleted:    New Folder With Items/_site.yml
    Deleted:    New Folder With Items/include/footer.html
    Modified:   bin/Genetic_Drift_Markov.Rmd
    Deleted:    index.html

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (bin/Genetic_Drift_Markov.Rmd) and HTML (docs/Genetic_Drift_Markov.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	af0fd73	jaurbanChicago	2022-03-18	markov html
html	883962f	jaurbanChicago	2022-03-18	Updated markov html
html	943c0b9	jaurbanChicago	2022-03-18	Updated markov html
Rmd	b65811a	jaurbanChicago	2022-03-18	Updated markov rmd
html	af0f016	jaurbanChicago	2022-03-17	Added Markov html
html	a311740	jaurbanChicago	2022-03-17	Removed all htmls
Rmd	0d8c343	jaurbanChicago	2022-03-17	Updates both rmd and html
html	0d8c343	jaurbanChicago	2022-03-17	Updates both rmd and html
html	112890c	jaurbanChicago	2022-03-17	Added Genetic_Drift_Markov.html
Rmd	6b81ff1	jaurbanChicago	2022-03-17	Added Genetic_Drift_Markov.Rmd

Pre-Requisites

Wright-Fisher Model
Discrete Markov Chains
Basic notions of probability theory
Basic notions of population genetics

Introduction to Discrete Markov Chains

Before getting into the specific details of the relationship between Markov chains and genetic drift, it is necessary to briefly explain what a discrete, finite Markov chain is.

Let´s consider a discrete random variable \(X\), which at time points \(0,1,2,3,...\), takes values \(0,1,2,...,M\). Therefore, it is precise to say that each state \(E_i\), \(X\) takes the value \(i\) [2]. This means that at each state \(E_i\) at some time \(t\), \(X\) is in some state \(E_i\). A variable is said to be Markovian if the probability of a certain outcome in the next time step \(t+1\) depends only on the present state at time \(t\) and is memoryless with regard to any previous time step \(t-1,t-2,...\). Therefore, we can now define a discrete Markov chain as a sequence of discrete random variables in which the probability distribution of the different states at each time \(t\) depends only on the state at time \(t-1\) [1,2].

Some important mathematical notation to describe a discrete Markov chain is [2,3]:
\[ P(E_{n+1}|E_{n})=i,E_{n-1}=i_{n-1},...,E_1=i_1,E_0=i_0)=P(E_{n+1}=j|E_n=i)\\ \mathbb{A \space discrete \space Markov \space is \space homogeneous \space if:}\\ P(E_{n+1}|E_{n})=P(E_i=j|E_{i-1}=i) \space \mathbb{for \space all} \space n, \mathbb{so} \space P(E_i=j|E_{i-1})=P_{ij}\\ P_{ij} \gt 0;i,j \ge 0; \sum_{j=0}^{\infty}P_{ij}=1 \\ \mathbb{A \space transition \space probability \space matrix \space can \space be \space assembled \space to \space represent \space the \space probabilities \space of \space} P_{ij} \space \\ P_{ij}= \begin{pmatrix} p_{00} & p_{01} & ... & p_{0M}\\ p_{00} & p_{11} & ... & p_{1M} \\ .\\ .\\ .\\ p_{M0} & p_{M1} & ... & p_{MM} \end{pmatrix} \\ \mathbb{This \space is \space an \space stochastic \space matrix, so \space all \space rows \space must \space sum \space to \space 1} \] The states of a discrete Markov chain are classified into the following categories [2,3,4]:

Recurrent:State \(E_i\) can reenter \(E_i\) often infinitely when the chain starts at this particular \(E_i\)., i.e. \(f_{i}=1\).
Transient: State \(E_i\) is transient if, starting at state \(E_i\), the number of time periods that the process will be in \(E_i\) has a geometric distribution with mean \(\frac{1}{1-f_i}\).
Absorbing: An state \(E_i\) is absorbinf if the probability of ever visiting any other state is zero.

Next, we are going to classify the chains in two different classes:

Irreducible: All states can communicate with one another[3].
Ergodic: Irreducible, recurrent, and aperiodic. Aperiodic means all of the Markov chain states are aperiodic [3].

For the remainder of this vignette, it is important to denote that we will be working with Markov chains that have no periodicities in them, which hardly arise in any genetical applications [2]. Finally, we need to define the concept of stationary distribution of a Markov chain. A stationary distribution may be defined with the following mathematical terms [3]:

\[ \mathbb{Consider \space}P^t \space \mathbb{as \space t \space gets \space large}:\\ \lim_{t\to\infty}(\pi^{(0)})^TP^t=\pi^T ; \space \mathbb{the \space \pi \space vector \space is \space the \space limiting \space distribution \space of \space the \space Markov \space chain}\\ \mathbb{Stationarity \space property \space of \space \pi}:\\ \pi^T=\pi^TP; \space \mathbb{because} \space \pi_i=\sum_k \pi_kP_{ki} \] For ergodic chains, the limiting distribution is also the stationary distribution and there is only one unique stationary distribution (i.e., equilibrium distribution) [3]. To obtain the stationary distribution of the Markov chain, we can use matrix exponentiation (each row of the resulting matrix will have \(\pi\) in each row), solve linear equations (solve for \(\pi^T=\pi^TP\)), and use eigendecomposition (\(\pi\) is left eigenvector of \(P\), so eigendecomposition can lead to \(\pi\)).

Markov Chains in Genetic Drift

A discrete Markov chain can be quite helpful to describe genetic drift in an ensemble of numerous identical and independent populations of a Wright-Fisher model [1]. A Markov chain model will help us understand how drift modifies, on average, the alleles frequencies of a biallelic locus in various population replicates through time. The transition probability for the Markov chain describing allele frequencies changes in a specific population (from an ensemble of replicate populations) is [3]:

\[ P_{ij}=P(p_{t+1}=\frac{j}{2N}|p_t=\frac{i}{2N})={2N \choose j}(\frac{i}{2N})^j(1-\frac{i}{2N})^{2N-j} \\ \mathbb{E}[p_{t+1}=\frac{j}{2N}|p_t=\frac{i}{2N}]=p_t\\ Var(p_{t+1}=\frac{j}{2N}|p_t=\frac{i}{2N})=\frac{p_t(1-p_t)}{2N} \] This transition probability equation describes the probability that a given Wright-Fisher population with a exactly \(i\) copies of allele \(A\) in generation t has exactly \(j\) copies of \(A\) in generation \(t+1\). This equation is another confirmation that the allele changes between discrete time units is Markovian, which means that the allele trajectories in \(t+1\) will exclusively depend on the frequencies at \(t\)[1-4]. A Markov chain describing genetic drift (no mutation and no selection) can have two absorbing states, which are when the allele frequency \(p_i=0\) (loss) or \(p_i=1\) (fixation) [5]. Besides these two absobing states, allele frequency can, in principle, change from any state to other states in a single generation. However, how big or small is the change between generations, will depend, in principle, in \(N\) and the frequency itself (remember the binomial sampling from the introductory vignette).

Code Simulations of Genetic Drifts using Markov Chains

We will use some simple code to simulate and model random genetic drift as a simple discrete Markov chain.

# We are going to give some examples of random genetic drift trajectories using the transition probability (Pij). 
#We are going to use the CRAN R package learnPopGen created by Liam J. Revell [5]

library(learnPopGen)

# Initial frequency p=0.01,N=50,500 and 50 population replicates

genetic.drift(p0=0.01,Ne=50,nrep=50, time=100, show = "p",pause = 0.1)

Version	Author	Date
dcdda13	jaurbanChicago	2022-03-18

genetic.drift(p0=0.01,Ne=500,nrep=50, time=100, show = "p",pause = 0.1)

Version	Author	Date
dcdda13	jaurbanChicago	2022-03-18

# Initial frequency p=0.5,N=50,500 and 50 population replicates

genetic.drift(p0=0.5,Ne=50,nrep=50, time=100, show = "p",pause = 0.1)

Version	Author	Date
dcdda13	jaurbanChicago	2022-03-18

genetic.drift(p0=0.5,Ne=500,nrep=50, time=100, show = "p",pause = 0.1)

Version	Author	Date
dcdda13	jaurbanChicago	2022-03-18

# Initial frequency p=0.8,N=50,500 and 50 population replicates

genetic.drift(p0=0.8,Ne=50,nrep=50, time=100, show = "p",pause = 0.1)

Version	Author	Date
dcdda13	jaurbanChicago	2022-03-18

genetic.drift(p0=0.8,Ne=500,nrep=50, time=100, show = "p",pause = 0.1)

Version	Author	Date
dcdda13	jaurbanChicago	2022-03-18

After taking a look at these simulations, we can observe that the size of \(N\) indeed affects the variability of the outcome of the random sampling in the Markov chain. We can also see that once \(A\) reaches fixation or gets lost in one of the replicates, it can never get out of that particular absorbing state. Finally, the frequency in each state \(t\) affects the variability of the trajectories and this becomes even more apparent when the allele \(A\) is either close to fixation or loss in each replicate.

Intractability of Genetic Drift as a Markov chain in a Wright-Fisher Model

One of the caveats of modeling genetic drift as a Markov chain in a Wright-Fisher population is the fact that it becomes mathematically intractable. This means that it is very difficult to obtain simple analytical results on the distributions of allele frequencies after \(t\) generations of genetic drift [6]. We can iterate over and over the Markov chain transition probabilities and we would never derive analytical formulas to describe the allele frequencies distributions. Diffusion theory has been used to approximate certain features of the distribution (it´s still not easy and will probably require some numerical techniques!!!) and other interesting results (fixation probabilities,equilibrium allele frequency spectrum, etc) [6]. You can find more about diffusion approximations of genetic drift in the following vignette.

References

Hamilton, M. B. (2008). Population genetics. John Wiley & Sons, 53-73.
Ewens, W. J. (2004). Mathematical population genetics: theoretical introduction (Vol. 1). New York: Springer, 86-91.
Hartl, D. L. (2020).A primer of population genetics and genomics. Oxford University Press,147-173.
Novembre,J. “Lecture: Discrete Markov Chains”,February 2022, University of Chicago, Chicago, IL. 5.Revell, L. J. (2019). learnPopGen: An R package for population genetic simulation and numerical analysis. Ecology and Evolution, 9(14), 7896-7902.
Berg,J.J. “Lecture:Forward Models 1”,February 2022, University of Chicago, Chicago, IL.

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] learnPopGen_1.0.4

loaded via a namespace (and not attached):
 [1] phangorn_2.8.1          gtools_3.9.2            xfun_0.29              
 [4] bslib_0.3.1             lattice_0.20-45         phytools_1.0-1         
 [7] vctrs_0.3.8             expm_0.999-6            htmltools_0.5.2        
[10] yaml_2.2.2              utf8_1.2.2              rlang_1.0.2            
[13] jquerylib_0.1.4         later_1.3.0             pillar_1.7.0           
[16] glue_1.6.2              lifecycle_1.0.1         stringr_1.4.0          
[19] combinat_0.0-8          workflowr_1.7.0         codetools_0.2-18       
[22] coda_0.19-4             evaluate_0.14           knitr_1.37             
[25] fastmap_1.1.0           httpuv_1.6.5            parallel_4.1.2         
[28] fansi_1.0.2             highr_0.9               Rcpp_1.0.8.2           
[31] promises_1.2.0.1        plotrix_3.8-2           clusterGeneration_1.3.7
[34] scatterplot3d_0.3-41    jsonlite_1.8.0          tmvnsim_1.0-2          
[37] fs_1.5.2                fastmatch_1.1-3         mnormt_2.0.2           
[40] digest_0.6.29           stringi_1.7.6           numDeriv_2016.8-1.1    
[43] rprojroot_2.0.2         grid_4.1.2              quadprog_1.5-8         
[46] cli_3.2.0               tools_4.1.2             magrittr_2.0.2         
[49] sass_0.4.0              maps_3.4.0              tibble_3.1.6           
[52] crayon_1.5.0            ape_5.6-2               whisker_0.4            
[55] pkgconfig_2.0.3         ellipsis_0.3.2          MASS_7.3-55            
[58] Matrix_1.4-0            rmarkdown_2.11          rstudioapi_0.13        
[61] R6_2.5.1                igraph_1.2.11           nlme_3.1-155           
[64] git2r_0.29.0            compiler_4.1.2

This site was created with R Markdown

Markov Chain Model of Genetic Drift

Jose Antonio Urban Aragon, Ethan Zhong