Last updated: 2019-06-12
Checks: 7 0
Knit directory: fiveMinuteStats/analysis/
This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(12345)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.Rhistory
Ignored: analysis/bernoulli_poisson_process_cache/
Untracked files:
Untracked: _workflowr.yml
Untracked: analysis/CI.Rmd
Untracked: analysis/gibbs_structure.Rmd
Untracked: analysis/libs/
Untracked: analysis/results.Rmd
Untracked: analysis/shiny/tester/
Untracked: docs/MH_intro_files/
Untracked: docs/citations.bib
Untracked: docs/figure/The Ancestral Process.Rmd/
Untracked: docs/figure/The Coalescent Process.Rmd/
Untracked: docs/hmm_files/
Untracked: docs/libs/
Untracked: docs/shiny/tester/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 7da8091 | Matthew Stephens | 2019-06-12 | workflowr::wflow_publish(“analysis/intro_to_em.Rmd”) |
html | 5f62ee6 | Matthew Stephens | 2019-03-31 | Build site. |
html | 34bcc51 | John Blischak | 2017-03-06 | Build site. |
Rmd | 5fbc8b5 | John Blischak | 2017-03-06 | Update workflowr project with wflow_update (version 0.4.0). |
Rmd | 391ba3c | John Blischak | 2017-03-06 | Remove front and end matter of non-standard templates. |
html | fb0f6e3 | stephens999 | 2017-03-03 | Merge pull request #33 from mdavy86/f/review |
html | 0713277 | stephens999 | 2017-03-03 | Merge pull request #31 from mdavy86/f/review |
Rmd | d674141 | Marcus Davy | 2017-02-27 | typos, refs |
html | c3b365a | John Blischak | 2017-01-02 | Build site. |
Rmd | 67a8575 | John Blischak | 2017-01-02 | Use external chunk to set knitr chunk options. |
Rmd | 5ec12c7 | John Blischak | 2017-01-02 | Use session-info chunk. |
Rmd | 3bb3b73 | mbonakda | 2016-02-24 | add two mixture model vignettes + merge redundant info in markov chain vignettes |
This document assumes basic familiarity with mixture models.
In this note we introduced mixture models. Recall that if our observations Xi come from a mixture model with K mixture components, the marginal probability distribution of Xi is of the form: P(Xi=x)=K∑k=1πkP(Xi=x|Zi=k) where Zi∈{1,…,K} is the latent variable representing the mixture component for Xi, P(Xi|Zi) is the mixture component, and πk is the mixture proportion representing the probability that Xi belongs to the k-th mixture component.
In this note, we will introduce the expectation-maximization (EM) algorithm in the context of Gaussian mixture models. Let N(μ,σ2) denote the probability distribution function for a normal random variable. In this scenario, we have that the conditional distribution Xi|Zi=k∼N(μk,σ2k) so that the marginal distribution of Xi is: P(Xi=x)=K∑k=1P(Zi=k)P(Xi=x|Zi=k)=K∑k=1πkN(x;μk,σ2k)
Similarly, the joint probability of observations X1,…,Xn is therefore: P(X1=x1,…,Xn=xn)=n∏i=1K∑k=1πkN(xi;μk,σ2k)
This note describes the EM algorithm which aims to obtain the maximum likelihood estimates of πk,μk and σ2k given a data set of observations {x1,…,xn}.
Suppose we have n observations X1,…,Xn from a Gaussian distribution with unknown mean μ and known variance σ2. To find the maximum likelihood estimate for μ, we find the log-likelihood ℓ(μ), take the derivative with respect to μ, set it equal zero, and solve for μ: L(μ)=n∏i=11√2πσ2exp−(xi−μ)22σ2⇒ℓ(μ)=n∑i=1[log(1√2πσ2)−(xi−μ)22σ2]⇒ddμℓ(μ)=n∑i=1xi−μσ2
Setting this equal to zero and solving for μ, we get that μMLE=1n∑ni=1xi. Note that applying the log function to the likelihood helped us decompose the product and removed the exponential function so that we could easily solve for the MLE.
Now we attempt the same strategy for deriving the MLE of the Gaussian mixture model. Our unknown parameters are θ={μ1,…,μK,σ1,…,σK,π1,…,πK}, and so from the first section of this note, our likelihood is: L(θ|X1,…,Xn)=n∏i=1K∑k=1πkN(xi;μk,σ2k) So our log-likelihood is: ℓ(θ)=n∑i=1log(K∑k=1πkN(xi;μk,σ2k))
Taking a look at the expression above, we already see a difference between this scenario and the simple setup in the previous section. We see that the summation over the K components “blocks” our log function from being applied to the normal densities. If we were to follow the same steps as above and differentiate with respect to μk and set the expression equal to zero, we would get: n∑i=11∑Kk=1πkN(xi;μk,σk)πkN(xi;μk,σk)(xi−μk)σ2k=0
Now we’re stuck because we can’t analytically solve for μk. However, we make one important observation which provides intuition for whats to come: if we knew the latent variables Zi, then we could simply gather all our samples Xi such that Zi=k and simply use the estimate from the previous section to estimate μk.
Intuitively, the latent variables Zi should help us find the MLEs. We first attempt to compute the posterior distribution of Zi given the observations: P(Zi=k|Xi)=P(Xi|Zi=k)P(Zi=k)P(Xi)=πkN(μk,σ2k)∑Kk=1πkN(μk,σk)=γZi(k)
Now we can rewrite equation (1), the derivative of the log-likelihood with respect to μk, as follows: n∑i=1γZi(k)(xi−μk)σ2k=0
Even though γZi(k) depends on μk, we can cheat a bit and pretend that it doesn’t. Now we can solve for μk in this equation to get: ^μk=∑ni=1γzi(k)xi∑ni=1γzi(k)=1Nkn∑i=1γzi(k)xi.
Where we set Nk=∑ni=1γzi(k). We can think of Nk as the effective number of points assigned to component k. We see that ^μk is therefore a weighted average of the data with weights γzi(k). Similarly, if we apply a similar method to finding ^σ2k and ^πk, we find that: ^σ2k=1Nkn∑i=1γzi(k)(xi−μk)2^πk=Nkn
Again, remember that γZi(k) depends on the unknown parameters, so these equations are not closed-form expressions. This looks like a vicious circle. But, as Cosma Shalizi says, “one man’s vicious circle is another man’s successive approximation procedure.”
We are now in the following situation:
The EM algorithm, motivated by the two observations above, proceeds as follows:
The EM algorithm is sensitive to the initial values of the parameters, so care must be taken in the first step. However, assuming the initial values are “valid,” one property of the EM algorithm is that the log-likelihood increases at every step. This invariant proves to be useful when debugging the algorithm in practice.
The EM algorithm attempts to find maximum likelihood estimates for models with latent variables. In this section, we describe a more abstract view of EM which can be extended to other latent variable models.
Let X be the entire set of observed variables and Z the entire set of latent variables. The log-likelihood is therefore: log(P(X|Θ))=log(∑ZP(X,Z|Θ))
where we’ve simply marginalized Z out of the joint distribution.
As we noted above, the existence of the sum inside the logarithm prevents us from applying the log to the densities which results in a complicated expression for the MLE. Now suppose that we observed both X and Z. We call {X,Z} the complete data set, and we say X is incomplete. As we noted previously, if we knew Z, the maximization would be easy.
We typically don’t know Z, but the information we do have about Z is contained in the posterior P(Z|X,Θ). Since we don’t know the complete log-likelihood, we consider its expectation under the posterior distribution of the latent variables. This corresponds to the E-step above. In the M-step, we maximize this expectation to find a new estimate for the parameters.
In the E-step, we use the current value of the parameters θ0 to find the posterior distribution of the latent variables given by P(Z|X,θ0). This corresponds to the γZi(k) in the previous section. We then use this to find the expectation of the complete data log-likelihood, with respect to this posterior, evaluated at an arbitrary θ. This expectation is denoted Q(θ,θ0) and it equals: Q(θ,θ0)=EZ|X,θ0[log(P(X,Z|θ))]=∑ZP(Z|X,θ0)log(P(X,Z|θ))
In the M-step, we determine the new parameter ˆθ by maximizing Q: ˆθ=argmaxθQ(θ,θ0)
Now we derive the relevant quantities for Gaussian mixture models and compare it to our “informal” derivation above. The complete likelihood takes the form P(X,Z|μ,σ,π)=n∏i=1K∏k=1πI(Zi=k)kN(xi|μk,σk)I(Zi=k) so the complete log-likelihood takes the form: log(P(X,Z|μ,σ,π))=n∑i=1K∑k=1I(Zi=k)(log(πk)+log(N(xi|μk,σk)))
Note that for the complete log-likelihood, the logarithm acts directly on the normal density which leads to a simpler solution for the MLE. As we said, in practice, we do not observe the latent variables, so we consider the expectation of the complete log-likelihood with respect to the posterior of the latent variables.
The expected value of the complete log-likelihood is therefore: EZ|X[log(P(X,Z|μ,σ,π))]=EZ|X[n∑i=1K∑k=1I(Zi=k)(log(πk)+log(N(xi|μk,σk)))]=n∑i=1K∑k=1EZ|X[I(Zi=k)](log(πk)+log(N(xi|μk,σk))) Since EZ|X[I(Zi=k)]=P(Zi=k|X), we see that this is simply γZi(k) which we computed in the previous section. Hence, we have
EZ|X[log(P(X,Z|μ,σ,π))]=n∑i=1K∑k=1γZi(k)(log(πk)+log(N(xi|μk,σk)))
EM proceeds as follows: first choose initial values for μ,σ,π and use these in the E-step to evaluate the γZi(k). Then, with γZi(k) fixed, maximize the expected complete log-likelihood above with respect to μk,σk and πk. This leads to the closed form solutions we derived in the previous section.
In this example, we will assume our mixture components are fully specified Gaussian distributions (i.e the means and variances are known), and we are interested in finding the maximum likelihood estimates of the πk’s.
Assume we have K=2 components, so that: Xi|Zi=0∼N(5,1.5)Xi|Zi=1∼N(10,2)
The true mixture proportions will be P(Zi=0)=0.25 and P(Zi=1)=0.75. First we simulate data from this mixture model:
# mixture components
mu.true = c(5, 10)
sigma.true = c(1.5, 2)
# determine Z_i
Z = rbinom(500, 1, 0.75)
# sample from mixture model
X <- rnorm(10000, mean=mu.true[Z+1], sd=sigma.true[Z+1])
hist(X,breaks=15)
Now we write a function to compute the log-likelihood for the incomplete data, assuming the parameters are known. This will be used to determine convergence: ℓ(θ)=n∑i=1log(2∑k=1πkN(xi;μk,σ2k)⏟L[i,k])
compute.log.lik <- function(L, w) {
L[,1] = L[,1]*w[1]
L[,2] = L[,2]*w[2]
return(sum(log(rowSums(L))))
}
Since the mixture components are fully specified, for each sample Xi we can compute the likelihood P(Xi|Zi=0) and P(Xi|Zi=1). We store these values in the columns of L:
L = matrix(NA, nrow=length(X), ncol= 2)
L[, 1] = dnorm(X, mean=mu.true[1], sd = sigma.true[1])
L[, 2] = dnorm(X, mean=mu.true[2], sd = sigma.true[2])
Finally, we implement the E and M step in the EM.iter
function below. The mixture.EM
function is the driver which checks for convergence by computing the log-likelihoods at each step.
mixture.EM <- function(w.init, L) {
w.curr <- w.init
# store log-likehoods for each iteration
log_liks <- c()
ll <- compute.log.lik(L, w.curr)
log_liks <- c(log_liks, ll)
delta.ll <- 1
while(delta.ll > 1e-5) {
w.curr <- EM.iter(w.curr, L)
ll <- compute.log.lik(L, w.curr)
log_liks <- c(log_liks, ll)
delta.ll <- log_liks[length(log_liks)] - log_liks[length(log_liks)-1]
}
return(list(w.curr, log_liks))
}
EM.iter <- function(w.curr, L, ...) {
# E-step: compute E_{Z|X,w0}[I(Z_i = k)]
z_ik <- L
for(i in seq_len(ncol(L))) {
z_ik[,i] <- w.curr[i]*z_ik[,i]
}
z_ik <- z_ik / rowSums(z_ik)
# M-step
w.next <- colSums(z_ik)/sum(z_ik)
return(w.next)
}
#perform EM
ee <- mixture.EM(w.init=c(0.5,0.5), L)
print(paste("Estimate = (", round(ee[[1]][1],2), ",", round(ee[[1]][2],2), ")", sep=""))
[1] "Estimate = (0.29,0.71)"
Finally, we inspect the evolution of the log-likelihood and note that it is strictly increases:
plot(ee[[2]], ylab='incomplete log-likelihood', xlab='iteration')
sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.4.0 Rcpp_1.0.1 digest_0.6.19 rprojroot_1.3-2
[5] backports_1.1.4 git2r_0.25.2 magrittr_1.5 evaluate_0.14
[9] stringi_1.4.3 fs_1.3.1 whisker_0.3-2 rmarkdown_1.13
[13] tools_3.5.2 stringr_1.4.0 glue_1.3.1 xfun_0.7
[17] yaml_2.2.0 compiler_3.5.2 htmltools_0.3.6 knitr_1.23
This site was created with R Markdown