This blog post is part of the Series: Monte Carlo Methods.

You can find this blog post on weisser-zwerg.dev or on github as either html or via nbviewer.

Monte Carlo Fundamental Concepts

Table of Contents

Boundary between statistics and machine learning / AI

Only very recently I found the work of 2010 Turing Award winner Leslie Valiant. In his book Probably Approximately Correct he defines the two concepts theoryful and theoryless. This would be the place where I would draw the line between statistics and machine learning.

Theoryful: a human has to provide a model and there may be some data that helps to tune the model. A good example would be Newtonian motion of the planets around the sun where Newton's laws provide the model and the data helps to pinpoint the gravitational constant.

Theoryless: you take a very generic algorithm and a lot of data and the algorithm figures out the rest. A good example would be a multilayer perceptron and pictures of objects, where the algorithm learns via data to identify the objects. The same multilayer perceptron architecture could be used to pilot a self-driving car.

Monte Carlo sampling methodology

The general Monte Carlo sampling methodology relies on 2 ingredients:

The construction of sampling algorithms based on fine-grained composable abstractions (FCA) will be the main focus of this blog post series.

In the case of Importance Sampling, you could argue that there is not only a single model, but a sequence of related models that form a "bridge" from something simple to the target model of interest. But we will get there in the course of this blog post series.

A word of caution about the picture of "Bayes updating"

Initially, when I started to read about the Bayesian method of statistics and its inference methodology, I often heard the term Bayes updating. The picture that this terminology evoked in me was that there is a sequential process of taking a prior, a likelihood for a single event and then perform "Bayes updating" to receive the posterior, with that posterior becoming the new prior for the next data-point from where the process continuous until all data-items are consumed.

$ \displaystyle \begin{array}{rcl} d_i&\in&\{d_1, d_2, ..., d_N\}\\ p(d_1,\theta)&=&p_0(\theta)\cdot p(d_1\,|\,\theta)\\ p(\theta\,|\,d_1)&=&\displaystyle \frac{p(d_1,\theta)}{\int p(d_1,\theta)\,\mathrm{d}\theta}=p_1(\theta)\\ p(d_2,\theta)&=&p_1(\theta)\cdot p(d_2\,|\,\theta)\\ p(\theta\,|\,d_2)&=&\displaystyle \frac{p(d_2,\theta)}{\int p(d_2,\theta)\,\mathrm{d}\theta}=p_2(\theta)\\ &...&\\ &=&p_N(\theta) \end{array} $

While this picture is of course correct and is what you would do if you calculated the posterior manually by using (for example) conjugate priors, this picture prevented me from understanding what is going on in especially MCMC methods, but also in SMC methods and I found it much more useful to think of the model as the (immutable – without updating) total joint probability including all data-points, e.g.:

$ \displaystyle \begin{array}{rcl} p(d_1,d_2,...,d_N,\theta)&=&p_0(\theta)\cdot p(d_1\,|\,\theta)\cdot p(d_2\,|\,\theta)\cdot ... \cdot p(d_N\,|\,\theta)\\ \end{array} $

That is all you need for the main Monte Carlo sampling methods, because they only depend on an unnormalized probability density function and if you take $p(d_1=d_1,d_2=d_2,...,d_N=d_N,\theta)$ (you simply take the total joint probability function and partially apply its data arguments so that you receive a function of the set of unknown parameters $\theta$ alone, e.g. $f(\theta)=p(d_1=d_1,d_2=d_2,...,d_N=d_N,\theta)$) then you are ready to go, because $f(\theta)\propto p(\theta\,|\,d_1=d_1,d_2=d_2,...,d_N=d_N)=\frac{f(\theta)}{Z}$ where $Z$ (the so called partition function)) is independent from $\theta$ and therefore a constant for the concrete data-set you're looking at.

Statistical Models

In (Bayesian) statistics a model is basically a total probability mass function (PMF; in the discrete case) or a probability density function (PDF; in the continuous case). It comes often in the form of a probabilistic graphical model (PGM; also see Bayesian hierarchical model). As an example, you see below the graphical representation of multiple coin tosses using the same coin:

In standard mathematical notation this would be:

$\displaystyle \begin{array}{rcl} p(\theta) &=& \mathrm{Beta}(\alpha=1,\beta=1) \\ p(X_i=d_i\,|\,\theta) &=& \mathrm{Bernoulli}(X_i=d_i;p=\theta) \\ \displaystyle p(X_1=d_1, X_2 = d_2, ..., X_n = d_n, \theta)&=& \displaystyle p(\theta) \cdot\prod_{i=1}^N p(X_i=d_i\,|\,\theta) \end{array} $

Where on the left hand side you have the total probability and on the right hand side you have the prior probability $p(\theta)$ (here given as a Beta distribution) multiplied by the conditional probabilities $p(X_i=d_i\,|\,\theta)$ (here given as a Bernoulli distribution). The important point to remember is YOU NEED THE TOTAL PROBABILITY FUNCTION. How you get it does not matter. The total probability function is a scalar function, e.g. once you provide all arguments you get a scalar.

For pure continuous probability density functions and pure discrete probability mass functions this is straight forward. In the pure continuous case you typically have a function like a product of Gauss/Normal: $f(x_1,x_2;\mu,\sigma)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}\frac{(x_1-\mu)^2}{\sigma^2} }\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}\frac{(x_2-\mu)^2}{\sigma^2} }$. In the pure discrete case, you get a multi-dimensional array like for two dice rolls:

For mixed cases I often got confused and therefore I elaborated two examples in painstaking detail in the appendix:

Quality of a model: model comparison, model selection, model averaging

While this topic is beyond the scope of this blog post series, I nevertheless wanted to mention it. Once you have several models that are supposed to describe the same observations how do you then decide which model is actually the "best" model?

The landmark rule is that a model is only as good as it is able to describe/predict future/unseen data-points. An approximation of that characteristic can be obtained via cross-validation, where one variant is leave-one-out cross-validation (LOOCV). You can either apply this directly or use some approximation like an information criterion. In "Statistical Rethinking: A Bayesian Course with Examples in R and STAN" in chapter 7. "Ulysses' Compass" Richard McElreath proposes to use Pareto-Smoothed Importance Sampling Cross-Validation (PSIS) or the Widely Applicable Information Criterion (WAIC). He adds a word of caution, though: Deviance is an assessment of predictive accuracy, not of truth!

John Kruschke proposes in "Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan" to use the Bayes factor (see for example my earlier blog post: Model Comparison via Bayes Factor) or to create a mixture model out of the different competing (sub-)models and then simply apply the machinery of the Monte Carlo sampling methodology to get probabilities for each sub-model as $p_1, p_2, ..., p_M$. The idea then is to say that the model with the highest probability describes the data best. This approach is also the basis for model averaging to create a new and better model by simply using this mixture model for your predictions.

How to construct models: keep the Bayes' theorem and the law of total probability (extending the conversation) always in mind

Bayes' theorem

$ \displaystyle p(x,y)=p(y|x)p(x)=p(x|y)p(y) $

This is how you typically construct (generative) models. You basically say: if I knew $x$ then knowing $y$ would be simple. Let's take a physical example like throwing a ball. In a vacuum the trajectory of the ball would be a parabola. Under realistic conditions you have wind and other disturbances. But if you say, if I know where the ball measured by location $x$ (a 3 dimensional vector) was at $t=t_0$ then I could tell where the ball would be at $t=t_1$:

$ \displaystyle \begin{array}{rcl} x_1& = & x_0+v(t_1-t_0)+(g/2)(t_1-t_0)^2+\varepsilon_m;\quad\text{with }\varepsilon_m\sim\mathcal{N}(0,\sigma_m)\\ \Rightarrow p(x_1\,|\,x_0) & = &\mathcal{N}(x_1; x_0+v(t_1-t_0)+(g/2)(t_1-t_0)^2,\sigma_m) \end{array} $

If you then in addition say that at time $t_0$ the ball was roughly at $\mu_0+\varepsilon_l$ with $\varepsilon_l\sim\mathcal{N}(0,\sigma_l)$, e.g. $p(x_0) = \mathcal{N}(x_0; \mu_0,\sigma_l)$ then you can tell the probability distribution of $p(x_1)$ by:

$ \displaystyle \begin{array}{rcl} p(x_0,x_1) &=& p(x_0)\cdot p(x_1\,|\,x_0) = \mathcal{N}(x_0; \mu_0,\sigma_l)\cdot\mathcal{N}(x_1; x_0+v(t_1-t_0)+(g/2)(t_1-t_0)^2,\sigma_m)\\ p(x_1)&=&\int p(x_0,x_1)\mathrm{d}x_0 \end{array} $

Law of total probability (extending the conversation)

Initially, when I saw the law of total probability for the first time I neglected it, because it looks like redundant to the Bayes' theorem:

$ p(x)=\int p(x\,|\,y)\cdot p(y)\,\mathrm{d}y $

I only learned to appreciate this law really when reading chapter 1.3.6 "The Law of Total Probability" in Causal Inference in Statistics - A Primer by Judea Pearl (Turing Award winner 2011), Madelyn Glymour and Nicholas P. Jewell. There they also presented it under different synonymous names like "the law of alternatives" or "extending the conversation". In the book they call it "conditionalizing on y", but the term "extending the conversation" was what stuck with me.

Sometimes you just don't know how to get to $p(x)$, but if you knew $y$ you also would know how to get to $x$, e.g. you know $p(x\,|\,y)$. You're "extending the conversation" to include $y$ in addition to $x$. And often getting to a $p(y)$ is also possible.

It is basically the reverse idea from above in the example with the motion of the ball. If you just said I want to know $p(x_1)$ I would not know how to get there, but if you introduce $p(x_1\,|\,x_0)$ this opens up the path for a recursive calculation to the point where the ball was thrown and at that point you either know the position exactly, e.g. $p(x)=\delta(x = x_\text{start})$ (where the $\delta$ is the Dirac delta) or you give it some uncertainty, e.g. $p(x)=\mathcal{N}(x; x_\text{start},\sigma_l)$. Eitherway, you can construct a generative model.

What is the "right" representation of (multi-dimensional) probability distributions?

That is another point that took me some time to digest. The names like probability-density-function (PDF) or probability-mass-function (PMF) suggest a mathematical form like a function. The typical example is the Gauss bell shape:

In this case the probability density function can be given as a mathematical function:

$\displaystyle \frac{1}{\sqrt{2\pi}\sigma}e^{- \frac{\left(x - \mu\right)^{2}}{2 \sigma^{2}}} $

But in most cases that is not possible. So what should you do then?

I cannot remember anymore in which source I've read it, but the author said that for probability distributions the "right" representation might actually be collections of samples.

Let's look at a toy example. We sample 1000 values from a normal distribution with mean 0 and standard deviation 1. Then we act as if we would only know that the samples are drawn from a normal distribution, but with unknown $\mu$ and $\sigma$. In PyMC3 this example looks like this:

Just as a reminder and to make the link to what we said above (about the model) more clear, let's be explicit and give the probability density function in sympy for two data-points instead of the 1000. All that PyMC3 does is to construct such a total joint probability density function and evaluate it at the given data-points. So in our example $x_1$ and $x_2$ are given and $\mu$ and $\sigma$ are "free" variables.

As another remark: the Gamma distribution is sometimes parametrized via $\alpha, \beta$ and sometimes via $k, \theta$: $\alpha = k; \beta=1/\theta$

PyMC3 created a trace object containing samples of the "free"/unknown variables. We can look at them as follows:

If we would want to know $\mu$ and $\sigma$ we could take the mean:

There is a built-in PyMC3 method that gives us more details in addition to the mean values:

The hdi stands for Highest (Posterior) Density Interval and gives the range in which credible values are located. As you can see, our real values 0.0 for $\mu$ and 1.0 for $\sigma$ are within those hdi invervals.

To get a better idea of the plausible parameter values we'll have a look at a density plot.

We will talk about it in more detail below, but what we did above is that we looked at the marginal likelihood of only $\mu$ or only $\sigma$. If your probability density function is given as samples then marginalizing out other parameters is trivial. You only take the parameter(s) (the columns in the sample dataframe) you're interested in.

It is important to understand that the samples, e.g. each row in the sample dataframe, are jointly plausible, e.g. they represent the joint probability distribution. If two parameters would be correlated you would see that in the pairsplot like below. In our example the two parameters are obviously uncorrelated.

Operations on probability distributions: marginalization and conditional probability

While the marginalization and conditional probability operations are quite involved when you have to deal with probability distributions in their mathematical functions form these operations are trivial if you have samples representing the probability distributions.

Let's start with marginalization. Let's assume you have a joint probability density function $p(x,y)$ for two random variables $X$ and $Y$. Then you marginalize out $y$ by integrating over all values of $y$:

$ \displaystyle \begin{array}{rcl} p(x)&=&\displaystyle\int p(x,y)\,\mathrm{d}y \end{array} $

This integration may be quite involved even in simple cases like a multi-variate normal distribution. Have a look at Bayes’ Theorem for Gaussians by Chris Bracegirdle for details.

In the case of samples, on the other hand, marginalization is trivial! You simply project (in the sense of an SQL projection) on the column(s) you're interested in and you get the marginal distribution for that parameter set.

The conditional probability is defined as:

$ \displaystyle \begin{array}{rcl} p(x,y)&=&\displaystyle p(y\,|\, x)p(x) \\ \Rightarrow p(y\,|\, x=x_0)&=&\displaystyle \frac{p(x_0,y)}{p(x_0)}\\ &=&\displaystyle \frac{p(x=x_0,y)}{\int p(x=x_0,y)\mathrm{d}y} \text{ you simply normalize the function} f(y)=p(x=x_0,y) \end{array} $

You have to divide two distributions and often have to calculate integrals. I also show the variation where you use the marginalization $\int p(x=x_0,y)\mathrm{d}y$, because sometimes that's useful to remember. $p(x=x_0,y)$ is no probability distribution in $y$ any longer, because $\int p(x=x_0,y)\mathrm{d}y < 1$.

If you are in the world of samples then the conditional probability is calculated like an SQL where clause, e.g. you sub-select the dataframe for values where $x=x_0$. For continuous variables this condition is nearly never true. Therefore you have to sub-select with a where condition like $x_0-\delta<x<x_0+\delta$. Finally you only keep the columns (SQL projection) of the variables that are on the left hand side of the conditional probability bar ($(\cdot\,|\,\cdot)$), e.g. $y$ in our example.

Calculations with random variables: convolutions and other nasty stuff

Calculating with random variables when their distribution is given as mathematical functions is just painfully difficult. The book The algebra of random variables by Melvine Dale Springer gives a detailed overview. But even the simplest example like adding two random variables like $X=X_1+X_2$ where $X_1\sim p_1(x_1)$ and $X_2\sim p_2(x_2)$ involves convolutions, e.g. $X\sim p(x)= \int p_1(x-x_2)p_2(x_2)\,\mathrm{d}x_2$. The library PaCal (ProbAbilistic Calculator) or SymPy may help. Nevertheless it is just best to avoid these intricate calculations completely.

Luckily, in the world of samples, all of that is simple again! Imagine you have a distribution given as samples from $x_1$ and $x_2$. You then construct samples of $x_1+x_2$ by simply adding them up for every row in your samples dataframe, e.g. you create a new column $x$ by row-wise calculating $x = x_1 + x_2$. And there you are, you have samples of the random variable $X$!

Another example might be like above, like when we calculated the trajectory of the ball. You could also see that example as $X_1 = X_0 + \Delta X$, where $\Delta X$ would be a distribution depending on the time delta $t_2-t_1$, the velocity $v$ and the acceleration $g$. In such cases it is typically easiest to see one of the variables as the starting point, e.g. $X_0$ and then use the law of total probability to construct $p(x_0, x_1)=p(x_0)p(x_1\,|\,x_0)$ and then calculate on the "conditional" side of $p(x_1\,|\,x_0)$, e.g.

$ \displaystyle p(x_1\,|\,x_0) = \mathcal{N}(x_1; x_0+v(t_1-t_0)+(g/2)(t_1-t_0)^2,\sigma_m) $

As you can see, on the "conditional" side of $p(x_1\,|\,x_0)$ (the side involving $x_0$) we simply calculate like normal! We simply add, subtract, multiply and whatever else we want to do. We also avoid the convolution by sampling from the total joint probability distribution $p(x_0, x_1)$ and marginalizing on the samples by simply dropping the $x_0$ column if we really wanted to. No convolutions and no other difficult operations on distributions! Just normal math and sampling on the extended joint space.

Density Estimation

In the past, whenever I was dealing with machine learning methods, I always felt disappointed; disappointed about the lack of "insight" and "understanding" and that machine learning methods only give me a uni-directional way of reasoning.

Let's take regression or classification for example. Even the recent really impressive success stories like object detection and image segmentation only give you a way from an input to an output. If you wanted to know the most distinguishing features of a dog vs. a cat you're out of luck (there may be ways, but no straight-forward ones).

In my first encounter with Bayesian networks I was so impressed with the fact that you can not only reason forward from disease to symptoms, but also backward like given the symptoms of a patient, what are the most likely root causes, e.g. which disease might be the root cause and which actions/drugs might be the best way forward.

It took me quite some time to understand "what I really wanted". I wanted "all the answers", not only partial answers. And again some more time went by until I found the "term" that refers to what I want: density estimation. This term is often accompanied by other terms like unsupervised learning and recent advances in machine learning are also targeting density estimation directly via deep generative models. In the end it all boils down to "discovering" the causal network of the world, so that in the end you can answer any question you'd come up with, not only the ones the designer of the "system" wanted you to ask.

The algorithm has to always perform the same heavy lifting: there is no ancestral sampling

When I read the paper Mixed Hamiltonian Monte Carlo for Mixed Discrete and Continuous Variables by Guangyao Zhou he used an exmaple of a 24 dimensional Gaussian Mixture Model without observations to show the power of his approach. Initially, I felt as if this was cheating, because in the end forward/ancestral sampling is trival in that case. I had to remind myself that the MCMC algorithm he invented has to perform the same heavy lifting in the given example without observations as it has to perform in a situation that involves observations and the same number of free variables. The probabilistic programming languages and frameworks like Stan or PyMC3 do not employ any additional intelligence to decide which method might be most appropriate in which case.

You always take a model in the form of a scalar valued function representing the (unnormalized) total joint probability function and then you always apply the same hammer like an SMC or MCMC algorithm.

If you want to be smart on that level you have to do it yourself as a human with your own brain.

Appendix

Examples of mixed discrete continuous probability functions

Below I am exaggerating the level of explicitness at each step of the calculation and the level to which I use SymPy to achieve maximum clarity. I hope that this will help you to avoid a similar level of confusion that I found myself in. In addition you might learn to appreciate to use SymPy, which is really great for symbolic algebra.

Two independent coin tosses

The first example is about two independent coin tosses where we don't know $\theta\in[0,1]$, the probability for heads coming up. The total joint probability function (I only say "probability function" and avoid the terms densite or mass, because in the mixed case it is not clear in which situation we are) is then given by: $p(d_1,d_2,\theta)=p(X_0=d_0\,|\,\theta)\cdot p(X_1=d_1\,|\,\theta)\cdot p(\theta)$. If we take $p(\theta)=\mathrm{Beta}(1,1)=1$ for $\theta\in[0,1]$ then we get for each coin a one dimensional matrix with two entries:

The entry with index 0 belongs to the case where we observe "tail" and the entry with index 1 belongs to the case where we observe "head". We can index into that vector like so:

And the indexing operation can be seen as a function that takes an integer and returns a scalar value. Exactly what we need for a probability function.

For the second coin we take a 2x1 matrix (a horizontal lying vector), so that when we multiply the two we will get a 2x2 matrix for the 2x2 combinations of tail and head for each coin:

We said above that a probability function needs to be a function that takes as arguments the values of all random variables and delivers a scalar. In python this would look like:

If we then wanted to know the probability of one head and one tail for a fair coin we would get:

The other way we could have done this would be to use the Kronecker delta instead of vector/matrix indexing and using the fact that any number to the power of $0$ is $1$: $x^0=1 \quad\forall x\in \mathbb{R}\setminus 0$, e.g. we receive the neutral element for our product of probability functions.

Below, the first $1$ is to remind us of the $p(\theta)=\mathrm{Beta}(1,1)=1$. In case that we used another prior this would be a different factor.

And of course you get the same result:

The Kronecker delta variant looks more like a function, but a computer works better with an index lookup in a matrix. Finally we could use an if-then-else structure, but many probabilistic programming environments behave clumsy when it comes to if-then-else constructs.

Two component mixture model

The second example is a two component mixture model. In most cases people will show a mixture of two Gaussians, but with two Gaussians you can get away with using an array of $\mu$ and $\sigma$. Let's look at mixing one Gaussian with one Exponential distribution:

For the mixture to work we use a mixture indicator $i\in\{0,1\}$. If $i$ is 0 then we draw from the Exponential distribution and when $i$ is 1 we draw from the normal distribution.

For a single observation the total joint probability function would look as follows:

$ \displaystyle p(p_1,i,\lambda,\mu,\sigma,x)=p(p_1)p(i\,|\,p_1)\begin{cases}p(\lambda)p(x\,|\,\lambda) & \text{for}\: i = 0\\ p(\mu)p(\sigma)p(x\,|\,\mu,\sigma) & \text{for}\: i = 1 \\0 & \text{otherwise} \end{cases} $

Below we're using SymPy to construct the total joint probability function for two observations:

Often, when you ask how to model a mixture model in a probabilistic programming system like Stan or PyMC3, people will recommend that you marginalize out the discrete variables like $i$ and $j$ above. Systems like Stan cannot deal with discrete free random variables at all. PyMC3 can deal with them, but sampling will be slow and hard, so that this recommendation actually makes sense.

Let's use SymPy to marginalize out the discrete variables $i$ and $j$. Marginalizing out discrete variables means summing over them:

For clarity we may only look at the likelihood part without the prior components:

Obviously, we can do the same in the matrix picture rather than using Kronecker deltas:

We can receive concrete values for a situation like if we assume that both data-points are from the exponential distribution:

The marginalization works again by summing over the discrete variables. This time we have to use sympy.Add rather than sympy.summation, because we don't perform a symbolic summation:

In this form it is not immediately obvious that the two expressions are actually the same, but we can let SymPy answer that by subtracting the two terms and showing that the result is 0: