Last updated: 2018-03-27

Code version: f7bbf4f

# Overview

The goal here is to introduce some basic ideas from decision theory, and particularly the notions of loss, decision rule, and integrated risk, in the context of a simple prediction problem.

To understand this vignette you will need to be familiar with the concept of probability distributions and expectations.

# The Prediction Problem

Consider the problem of predicting an outcome $$Y$$ on the basis of inputs (or “features” or “predictors”) $$X$$. Typically $$Y$$ might be a one-dimensional outcome (discrete or continuous) and $$X$$ a multi-dimensional input. If $$Y$$ is discrete then this is often referred to as a “classification problem”; if $$Y$$ is continuous then this is often referred to as a “regression problem”.

As a concrete example, here we attempted to classify an elephant tusk as being from a forest or savanna elephant ($$Y$$) based on its genetic data ($$X$$).

## Loss function

To make a rational decision about what value $$\hat{Y}$$ to predict for $$Y$$ we must specify how “bad” different types of errors are.

That is, we must specify, for each possible value of $$Y$$, and each possible prediction $$\hat{Y}$$, a (real) value $$L(\hat{Y},Y)$$ that measures how “wrong” the prediction is. Big values of $$L$$ indicate worse predictions. $$L$$ is called the “loss function”.

For example, if $$Y$$ is continuous and real-valued then some simple common loss functions are:

• squared loss: $$L(\hat{Y},Y) = (Y-\hat{Y})^2$$
• absolute loss: $$L(\hat{Y},Y) = |Y-\hat{Y}|$$

If $$Y$$ is discrete then a simple common loss function is 0-1 loss, which is 0 if the prediction is correct and 1 otherwise: $$L(\hat{Y},Y) = I(\hat{Y} \neq Y)$$ where $$I$$ denotes the indicator function.

## Decision Rule

In this context a decision rule is simply a way of predicting $$Y$$ from $$X$$. That is it is a function $$\hat{Y}(X)$$, which for any given $$X$$ produces a predicted value $$\hat{Y}$$ for $$Y$$.

## Expected Loss (Integrated Risk)

Now consider applying the decision rule $$\hat{Y}(X)$$ to a series of $$(X,Y)$$ pairs coming from some probability distribution $$p(X,Y)$$. A natural way to measure how good (or bad) the decision rule is, is to compute the expected loss (sometimes referred to as the Integrated Risk, and here denoted $$r$$):

$r(\hat{Y}) := \int \int L(\hat{Y}(X),Y) p(X,Y) \,dX \,dY.$

## The optimal decision rule

So what decision rule $$\hat{Y}$$ is “optimal” in terms of minimizing the expected loss $$r$$? It is easy to show that the following decision rule minimizes $$r$$:

Optimal Decision Rule: For each $$X$$ choose the prediction $$\hat{Y}_\text{opt}(X)$$ that minimizes the conditional expected loss: $\hat{Y}_\text{opt}(X) = \arg \min_a \int L(a, Y) \, p(Y | X) dY.$

The proof is straightforward. Since $$p(X,Y)= p(Y|X) p(X)$$, we can rewrite the expected loss for any decision rule $$\hat{Y}$$ as: $r(\hat{Y}) = \int \biggl[ \int L(\hat{Y}(X),Y) p(Y|X) \, dY \biggr] p(X) \, dX$ Note that, by definition, $$\hat{Y}_\text{opt}(X)$$ minimizes the term inside $$[]$$. Thus $r(\hat{Y}) \leq \int \biggl[ \int L(\hat{Y}_\text{opt}(X),Y) p(Y|X) \, dY \biggr] p(X) \, dX = r(\hat{Y}_\text{opt}).$

## Examples

As defined above, finding $$\hat{Y}_\text{opt}(X)$$ requires solving an optimization problem. For the standard loss functions given above, solving this optimization problem is easy provided that the conditional distribution $$p(Y|X)$$ is easy to work with.

### Squared loss:

For example, using squared loss, the optimization problem becomes $\hat{Y}_\text{opt}(X) = \arg \min_a f(a)$ where $f(a) = \int (a-Y)^2 \, p(Y | X) \, dY.$ We can optimize this by differentiating with respect to $$a$$ and setting the derivative to 0. Differentiating $$f(a)$$ gives $f'(a) = 2 \int (a-Y) p(Y|X) \, dY = 2(a - E(Y|X)).$ which is zero at $$a=E(Y|X)$$.

Thus, for squared loss, the optimal decision rule is to predict $$Y$$ using its conditional mean given $$X$$: $$\hat{Y}_\text{opt}(X) = E(Y | X)$$.

### Absolute loss:

Although not quite as easy to show, under absolute loss the optimal decision rule is to set $$\hat{Y}$$ to the median of the conditional distribution $$Y|X$$.

### 0-1 loss:

Under 0-1 loss, with $$Y$$ discrete,
the optimal decision rule is to set $$\hat{Y}$$ to the mode of the conditional distribution $$p(Y|X)$$. That is $\hat{Y}_\text{opt}(X) = \arg \max_a p(Y=a |X).$ Showing this is left as an Exercise.

## Connection with Bayesian inference: Bayes risk and Bayes decision rules

The conditional distribution $$Y|X$$ is sometimes be referred to as the “posterior” distribution of $$Y$$ given data $$X$$, and computing this distribution is sometimes referred to as “performing Bayesian inference for $$Y$$”.

Thus, the above result (“Optimal Decision Rule” section) can be thought of as characterizing the optimality of Bayesian inference in terms of a “frequentist” measure ($$r$$) which measures long-run performance across many samples $$(X,Y)$$ from $$p(X,Y)$$. For example, predicting $$Y$$ by its posterior mean, $$E(Y|X)$$, is optimal in terms of expected squared loss (with expectation taken across $$p(X,Y)$$).

Because of this connection with Bayesian inference, the optimal value $$r(\hat{Y}_\text{opt})$$ is sometimes referred to as the “Bayes risk”, and $$\hat{Y}_\text{opt}$$ is referred to as a “Bayes decision rule”.

## Theory vs Practice

Note that the optimal decision rule depends on the distribution $$p(Y,X)$$ – or, more specifically, on the conditional distribution $$p(Y|X)$$. Typically one does not know this distribution exactly, and so one cannot implement the optimal decision rule. (An exception is in artificial simulation experiments, where the “true” distribution $$p(Y,X)$$ is known; in these cases the optimal rule can be computed, and may provide a useful yardstick against which other rules can be compared.)

One way (but not the only way) to proceed in practice is to perform Bayesian inference for $$Y$$ anyway, by simpely positing (assuming) some “prior” distribution $$p(Y)$$, and a “likelihood” $$p(X|Y)$$, and using these to compute a posterior distribution $$p(Y|X)$$. The result above shows that inference based on this posterior will be optimal, on average, across large numbers of samples of $$(X,Y)$$ drawn from $$p(X,Y)= p(Y)p(X|Y)$$. But, of course, the result does not guarantee optimality, on average, across samples from some other distribution, $$q(X,Y)$$ say. One might summarize this as “Bayesian inference is optimal, on average, if both the prior distribution and likelihood are correct’”.

## Session information

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] backports_1.1.2 magrittr_1.5    rprojroot_1.3-2 tools_3.3.2
[5] htmltools_0.3.6 yaml_2.1.16     Rcpp_0.12.14    stringi_1.1.6
[9] rmarkdown_1.8   knitr_1.18      git2r_0.20.0    stringr_1.2.0
[13] digest_0.6.13   evaluate_0.10.1`

This site was created with R Markdown