Loading [MathJax]/jax/output/HTML-CSS/jax.js
  • Pre-requisites
  • Overview
  • Review: MLE of Normal Distribution
  • MLE of Gaussian Mixture Model
    • EM, informally
  • EM, formally
    • Gaussian Mixture Models
  • Example
  • Session information

Last updated: 2017-01-02

Code version: 55e11cf8f7785ad926b716fb52e4e87b342f38e1

Pre-requisites

This document assumes basic familiarity with mixture models.

Overview

In this note we introduced mixture models. Recall that if our observations Xi come from a mixture model with K mixture components, the marginal probability distribution of Xi is of the form: P(Xi=x)=Kk=1πkP(Xi=x|Zi=k) where Zi{1,,K} is the latent variable representing the mixture component for Xi, P(Xi|Zi) is the mixture component, and πk is the mixture proportion representing the probability that Xi belongs to the k-th mixture component.

In this note, we will introduce the expectation-maximization (EM) algorithm in the context of Gaussian mixture models. Let N(μ,σ2) denote the probability distribution function for a normal random variable. In this scenario, we have that the conditional distribution Xi|Zi=kN(μk,σ2k) so that the marginal distribution of Xi is: P(Xi=x)=Kk=1P(Zi=k)P(Xi=x|Zi=k)=Kk=1πkN(x;μk,σ2k)

Similarly, the joint probability of observations X1,,Xn is therefore: P(X1=x1,,Xn=xn)=ni=1Kk=1πkN(xi;μk,σ2k)

This note describes the EM algorithm which aims to obtain the maximum likelihood estimates of πk,μk and σ2k given a data set of observations {x1,,xn}.

Review: MLE of Normal Distribution

Suppose we have n observations X1,,Xn from a Gaussian distribution with unknown mean μ and known variance σ2. To find the maximum likelihood estimate for μ, we find the log-likelihood (μ), take the derivative with respect to μ, set it equal zero, and solve for μ: L(μ)=ni=11σ2πexp(xiμ)22σ2(μ)=ni=1log(1σ2π)((xiμ)22σ2)ddμ(μ)=ni=1xiμ2σ2

Setting this equal to zero and solving for μ, we get that μMLE=1nni=1xi. Note that applying the log function to the likelihood helped us decompose the product and removed the exponential function so that we could easily solve for the MLE.

MLE of Gaussian Mixture Model

Now we attempt the same strategy for deriving the MLE of the Gaussian mixture model. Our unknown parameters are θ={μ1,,μK,σ1,,σK,π1,,πK}, and so from the first section of this note, our likelihood is: L(θ|X1,,Xn)=ni=1Kk=1πkN(xi;μk,σ2k) So our log-likelihood is: (θ)=ni=1log(Kk=1πkN(xi;μk,σ2k))

Taking a look at the expression above, we already see a difference between this scenario and the simple setup in the previous section. We see that the summation over the K components “blocks” our log function from being applied to the normal densities. If we were to follow the same steps as above and differentiate with respect to μk and set the expression equal to zero, we would get: ni=11Kk=1πkN(xi;μk,σk)πkN(xi;μk,σk)(xiμk)σ2k=0

Now we’re stuck because we can’t analytically solve for μk. However, we make one important observation which provides intuition for whats to come: if we knew the latent variables Zi, then we could simply gather all our samples Xi such that Zi=k and simply use the estimate from the previous section to estimate μk.

EM, informally

Intuitively, the latent variables Zi should help us find the MLEs. We first attempt to compute the posterior distribution of Zi given the observations: P(Zi=k|Xi)=P(Xi|Zi=k)P(Zi=k)P(Xi)=πkN(μk,σ2k)Kk=1πkN(μk,σk)=γZi(k)

Now we can rewrite equation (1), the derivative of the log-likelihood with respect to μk, as follows: ni=1γZi(k)(xiμk)σ2k=0

Even though γZi(k) depends on μk, we can cheat a bit and pretend that it doesn’t. Now we can solve for μk in this equation to get: ^μk=ni=1γzi(k)xini=1γzi(k)=1Nkni=1γzi(k)xi.

Where we set Nk=ni=1γzi(k). We can think of Nk as the effective number of points assigned to component k. We see that ^μk is therefore a weighted average of the data with weights γzi(k). Similarly, if we apply a similar method to finding ^σ2k and ^πk, we find that: ^σ2k=1Nkni=1γzi(k)(xiμk)2^πk=Nkn

Again, remember that γZi(k) depends on the unknown parameters, so these equations are not closed-form expressions. This looks like a vicious circle. But, as Cosma Shalizi says, “one man’s vicious circle is another man’s successive approximation procedure.”

We are now in the following situation:

  1. If we knew the parameters, we could compute the posterior probabilities γZi(k)
  2. If we knew the posteriors γZi(k), we could easily compute the parameters

The EM algorithm, motivated by the two observations above, proceeds as follows:

  1. Initialize the μk’s, σk’s and πk’s and evaluate the log-likelihood with these parameters.
  2. E-step: Evaluate the posterior probabilities γZi(k) using the current values of the μk’s and σk’s with equation (2)
  3. M-step: Estimate new parameters ^μk, ^σ2k and ^πk with the current values of γZi(k) using equations (3), (4) and (5).
  4. Evaluate the log-likelihood with the new parameter estimates. If the log-likelihood has changed by less than some small ϵ, stop. Otherwise, go back to step 2.

The EM algorithm is sensitive to the initial values of the parameters, so care must be taken in the first step. However, assuming the initial values are “valid,” one property of the EM algorithm is that the log-likelihood increases at every step. This invariant proves to be useful when debugging the algorithm in practice.

EM, formally

The EM algorithm attempts to find maximum likelihood estimates for models with latent variables. In this section, we describe a more abstract view of EM which can be extended to other latent variable models.

Let X be the entire set of observed variables and Z the entire set of latent variables. The log-likelihood is therefore: log(P(X|Θ))=log(ZP(X,Z|Θ))

where we’ve simply marginalized Z out of the joint distribution.

As we noted above, the existence of the sum inside the logarithm prevents us from applying the log to the densities which results in a complicated expression for the MLE. Now suppose that we observed both X and Z. We call {X,Z} the complete data set, and we say X is incomplete. As we noted previously, if we knew Z, the maximization would be easy.

We typically don’t know Z, but the information we do have about Z is contained in the posterior P(Z|X,Θ). Since we don’t know the complete log-likelhood, we consider its expectation under the posterior distribution of the latent variables. This corresponds to the E-step above. In the M-step, we maximize this expectation to find a new estimate for the parameters.

In the E-step, we use the current value of the parameters θ0 to find the posterior distribution of the latent variables given by P(Z|X,θ0). This corresponds to the γZi(k) in the previous section. We then use this to find the expectation of the complete data log-likelihood, with respect to this posterior, evaluated at an arbitrary θ. This expectation is denoted Q(θ,θ0) and it equals: Q(θ,θ0)=EZ|X,θ0[log(P(X,Z|θ))]=ZP(Z|X,θ0)log(P(X,Z|θ))

In the M-step, we determine the new parameter ˆθ by maximizing Q: ˆθ=argmaxθQ(θ,θ0)

Gaussian Mixture Models

Now we derive the relevant quantities for Gaussian mixture models and compare it to our “informal” derivation above. The complete likelihood takes the form P(X,Z|μ,σ,π)=ni=1Kk=1πI(Zi=k)kN(xi|μk,σk)I(Zi=k) so the complete log-likelihood takes the form: log(P(X,Z|μ,σ,π))=ni=1Kk=1I(Zi=k)(log(πk)+log(N(xi|μk,σk)))

Note that for the complete log-likelihood, the logarithm acts directly on the normal density which leads to a simpler solution for the MLE. As we said, in practice, we do not observe the latent variables, so we consider the expectation of the complete log-likelihood with respect to the posterior of the latent variables.

The expected value of the complete log-likelihood is therefore: EZ|X[log(P(X,Z|μ,σ,π))]=EZ|X[ni=1Kk=1I(Zi=k)(log(πk)+log(N(xi|μk,σk)))]=ni=1Kk=1EZ|X[I(Zi=k)](log(πk)+log(N(xi|μk,σk))) Since EZ|X[I(Zi=k)]=P(Zi=k|X), we see that this is simply γZi(k) which we computed in the previous section. Hence, we have

EZ|X[log(P(X,Z|μ,σ,π))]=ni=1Kk=1γZi(k)(log(πk)+log(N(xi|μk,σk)))

EM proceeds as follows: first choose initial values for μ,σ,π and use these in the E-step to evaluate the γZi(k). Then, with γZi(k) fixed, maximize the expected complete log-likelihood above with respect to μk,σk and πk. This leads to the closed form solutions we derived in the previous section.

Example

In this example, we will assume our mixture components are fully specified Gaussian distributions (i.e the means and variances are known), and we are interested in finding the maximum likelihood estimates of the πk’s.

Assume we have K=2 components, so that: Xi|Zi=0N(5,1.5)Xi|Zi=1N(10,2)

The true mixture proportions will be P(Zi=0)=0.25 and P(Zi=1)=0.75. First we simulate data from this mixture model:

# mixture components
mu.true    = c(5, 10)
sigma.true = c(1.5, 2)

# determine Z_i
Z = rbinom(500, 1, 0.75)
# sample from mixture model

X <- rnorm(10000, mean=mu.true[Z+1], sd=sigma.true[Z+1])
hist(X,breaks=15)

Now we write a function to compute the log-likelihood for the incomplete data, assuming the parameters are known. This will be used to determine convergence: (θ)=ni=1log(2k=1πkN(xi;μk,σ2k)L[i,k])

compute.log.lik <- function(X, L, w) {
  L[,1] = L[,1]*w[1]
  L[,2] = L[,2]*w[2]
  return(sum(log(rowSums(L))))
}

Since the mixture components are fully specified, for each sample Xi we can compute the likelihood P(Xi|Zi=0) and P(Xi|Zi=1). We store these values in the columns of L:

L = matrix(NA, nrow=length(X), ncol= 2)
L[, 1] = dnorm(X, mean=mu.true[1], sd = sigma.true[1])
L[, 2] = dnorm(X, mean=mu.true[2], sd = sigma.true[2])

Finally, we implement the E and M step in the EM.iter function below. The mixture.EM function is the driver which checks for convergence by computing the log-likelihoods at each step.

mixture.EM <- function(w.init, L, X) {
  
  w.curr <- w.init
  
  # store log-likehoods for each iteration
  log_liks <- c()
  ll       <- compute.log.lik(X, L, w.curr)
  log_liks <- c(log_liks, ll)
  delta.ll <- 1
  
  while(delta.ll > 1e-5) {
    w.curr   <- EM.iter(w.curr, L)
    ll       <- compute.log.lik(X, L, w.curr)
    log_liks <- c(log_liks, ll)
    delta.ll <- log_liks[length(log_liks)]  - log_liks[length(log_liks)-1]
  }
  return(list(w.curr, log_liks))
}

EM.iter <- function(w.curr, L, ...) {
  
  # E-step: compute E_{Z|X,w0}[I(Z_i = k)]
  z_ik <- L
  for(i in seq_len(ncol(L))) {
    z_ik[,i] <- w.curr[i]*z_ik[,i]
  }
  z_ik     <- z_ik / rowSums(z_ik)
  
  # M-step
  w.next   <- colSums(z_ik)/sum(z_ik)
  return(w.next)
}
#perform EM
ee <- mixture.EM(w.init=c(0.5,0.5), L, X)
print(paste("Estimate = (", round(ee[[1]][1],2), ",", round(ee[[1]][2],2), ")", sep=""))
[1] "Estimate = (0.29,0.71)"

Finally, we inspect the evolution of the log-likelihood and note that it is strictly increases:

plot(ee[[2]], ylab='incomplete log-likelihood', xlab='iteration')

Session information

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rmarkdown_1.1

loaded via a namespace (and not attached):
 [1] magrittr_1.5    assertthat_0.1  formatR_1.4     htmltools_0.3.5
 [5] tools_3.3.2     yaml_2.1.13     tibble_1.2      Rcpp_0.12.7    
 [9] stringi_1.1.1   knitr_1.14      stringr_1.0.0   digest_0.6.9   
[13] gtools_3.5.0    evaluate_0.9   

This site was created with R Markdown