Last updated: 2026-01-06

Checks: 7 0

Knit directory: fiveMinuteStats/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 0c715fc. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


working directory clean

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/likelihood_ratio_simple_models.Rmd) and HTML (docs/likelihood_ratio_simple_models.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html 74deb19 Peter Carbonetto 2026-01-06 Various misc updates.
Rmd 551db67 Peter Carbonetto 2026-01-06 A few small updates to the likelihood_ratio_simple_continuous_data vignette.
Rmd a361e11 Peter Carbonetto 2026-01-06 A bunch of small edits to the bayes_multiclass vignette.
html 9781cd8 Peter Carbonetto 2026-01-06 Ran wflow_publish("analysis/likelihood_ratio_simple_models.Rmd").
Rmd 0593f14 Peter Carbonetto 2026-01-06 wflow_publish("analysis/likelihood_ratio_simple_models.Rmd")
Rmd fc8b1d8 Peter Carbonetto 2026-01-06 Some edits to the exercises in the likelihood_ratio_simple_models vignette.
Rmd 8a0af86 Peter Carbonetto 2026-01-06 A few more small edits to the likelihood_ratio_simple_models vignette.
Rmd c67ccfc Peter Carbonetto 2026-01-06 A few more edits to the likelihood_ratio_simple_models vignette.
Rmd 306bf92 Peter Carbonetto 2026-01-06 A few minor edits to the likelihood_ratio_simple_models vignette.
Rmd 7a8ffe4 Peter Carbonetto 2025-12-29 Created pdf version of LR_and_BF vignette.
html 7a8ffe4 Peter Carbonetto 2025-12-29 Created pdf version of LR_and_BF vignette.
Rmd 49b0c33 Peter Carbonetto 2025-12-29 Added a link.
html 49b0c33 Peter Carbonetto 2025-12-29 Added a link.
Rmd 8c88ffc Peter Carbonetto 2025-12-29 More basic infrastructure work.
html 8c88ffc Peter Carbonetto 2025-12-29 More basic infrastructure work.
Rmd 599a8d7 Peter Carbonetto 2025-12-29 A few small adjustments to the workflowr settings.
Rmd 04994f5 Peter Carbonetto 2025-12-29 Added basic Makefile infrastructure.
html 6cefaaa Peter Carbonetto 2025-12-29 Build site.
Rmd c8270f5 Peter Carbonetto 2025-12-29 workflowr::wflow_publish("analysis/likelihood_ratio_simple_models.Rmd",
html 35b85cd Peter Carbonetto 2025-12-29 Build site.
html 96cc4f4 Peter Carbonetto 2025-12-29 Build site.
Rmd cd79db8 Peter Carbonetto 2025-12-29 wflow_publish("analysis/likelihood_ratio_simple_models.Rmd",
Rmd c345a1a Nan Xiao 2024-03-03 Fix broken and moved URL
html 5f62ee6 Matthew Stephens 2019-03-31 Build site.
Rmd 0cd28bd Matthew Stephens 2019-03-31 workflowr::wflow_publish(all = TRUE)
html 219e6c4 stephens999 2018-03-27 Build site.
html 34bcc51 John Blischak 2017-03-06 Build site.
Rmd 2aa70fe John Blischak 2017-03-06 Manually udpate some Rmd files.
Rmd 5fbc8b5 John Blischak 2017-03-06 Update workflowr project with wflow_update (version 0.4.0).
Rmd 391ba3c John Blischak 2017-03-06 Remove front and end matter of non-standard templates.
html 8e61683 Marcus Davy 2017-03-03 rendered html using wflow_build(all=TRUE)
Rmd d674141 Marcus Davy 2017-02-26 typos, refs
Rmd f7af269 Marcus Davy 2017-02-25 knitr style citations
Rmd 3be128c Marcus Davy 2017-02-24 start bibtex ref, urls etc
html c3b365a John Blischak 2017-01-02 Build site.
Rmd 67a8575 John Blischak 2017-01-02 Use external chunk to set knitr chunk options.
Rmd 5ec12c7 John Blischak 2017-01-02 Use session-info chunk.
Rmd 6a2fec7 stephens999 2016-04-07 fix typo
Rmd b616830 Arjun Biddanda 2016-01-18 Fixed Typo between fF in table and in code
Rmd d6f4bea stephens999 2016-01-12 add examples
Rmd 6c7d0e0 stephens999 2016-01-08 updated exercises
Rmd 93476ef stephens999 2016-01-07 update LR to make clear it is for discrete data
Rmd 8f61232 stephens999 2016-01-07 add simple LR example

See here for a PDF version of this vignette.

Summary

This document introduces the concept of likelihood ratio for comparing two simple (“fully specified”) statistical models, one of the most fundamental concepts in statistical inference.

Prerequisites

Be familiar with basic concepts from probability, particularly the concept of a probability distribution.

Motivating Problem

Technical note: To simplify this problem I have assumed that elephants are [haploid][plody], which they are not. If you do not know what this means you should simply ignore this comment.

There are two subspecies of African Elephant: savannah and forest elephants, which differ in their genetic makeup. Interpol have seized an illegally-smuggled elephant tusk, and they want to know which subspecies of elephant the tusk came from. To try to answer this, they collect DNA from the tusk and measure it at a number of locations (“markers” in genetics jargon) along the elephant genome. At each marker, the DNA can be one of two types (“alleles” in genetics jargon), which for simplicity we will label 0 and 1. So the available data on the tusk from one elephant might look something like this:

marker allele
1 1
2 0
3 1
4 0
5 0
6 1

Interpol also have information on the frequency of each allele in each of the two subspecies — this was obtained by measuring the DNA of a large number of savanna elephants and a large number of forest elephants. We will use \(f_{Sj}\) and \(f_{Fj}\) to denote the frequency of the “1” allele at marker \(j\) in savanna and forest elephants, respectively. (And since there are only two alleles, the frequency of the “0” allele is \(1 - f_{Sj}\) and \(1 - f_{Fj}\), respectively.) Here is a table of this information:

marker, \(j\) \(f_{Sj}\) \(f_{Fj}\)
1 0.40 0.80
2 0.12 0.20
3 0.21 0.11
4 0.12 0.17
5 0.02 0.23
6 0.32 0.25

The question before us is: which subspecies of elephant did the tusk come from, and how confident should we be in this conclusion?

To get some intuition, let us examine the data at the first few markers. At marker 1, our tusk has the allele 1. This allele is less common in savanna elephants than forest elephants (40% of savanna elephants carry this allele, vs. 80% of forest elephants), so this observation seems to support the sample coming from forest. However, at the same time 40% of savanna samples carry this allele, so it remains plausible that the sample came from savanna.

Moving to marker 2, our tusk has the allele 0, which is more common in savanna (88%) than forest elephants (80%). And at marker 3 the tusk has allele 1 which is also more common in savanna (21% vs. 11%). So, in contrast to marker 1, the data at these two markers are more consistent with the tusk coming from a savanna elephant than a forest elephant.

We could continue in this way, but the point should be clear: each marker contains some information, but not a huge amount, and sometimes the information points in different directions. By taking a statistical approach to this problem, we aim to quantify this kind of information, and to efficiently combine the information across markers to come to an overall conclusion.

(This is a simplified version of a real problem: Interpol and other authorities want to know the origin of poached tusks to help focus efforts on curbing this illegal activity. In practice, they are interested in much finer-level discrimination, and measure many genetic markers to get more information. See Wasser et al (2007) and Wasser et al (2008) for more details.)

Solution

We can phrase this problem as a “model comparison” problem. We have data \(X=x\) from our tusk, and we have two different models for how those data might have arisen: it could have been sampled from a savanna elephant, or it could have been sampled from a forest elephant. We will use \(M_S\) and \(M_F\) as shorthand for these two models. A key point is that these two models imply different probability distributions for \(X\): some values of \(X\) are more common under \(M_S\) and others are more common under \(M_F\).

Denoting the probability mass functions of these two distributions as \(p(x \mid M_S)\) and \(p(x \mid M_F)\), and assuming the data at different markers are independent, these probability distributions are \[ p(x \mid M_S) = \prod_{j=1}^J f_{Sj}^{x_j} (1-f_{Sj})^{1-x_j} \] and \[p(x \mid M_F) = \prod_{j=1}^J f_{Fj}^{x_j} (1-f_{Fj})^{1-x_j}, \] where \(J = 6\) is the number of markers, and the values of \(f_{Sj}\) and \(f_{Fj}\) are given in the table above.

Note that a key feature of these models is that they are fully specified. This means there are no unknown values in the probability distributions. Comparing fully-specified models is the simplest kind of model comparison, and so is a good place to start.

The likelihood ratio

A key idea is that a useful summary of how strongly the data \(x\) support one model vs. another model is given by the “likelihood ratio” (LR).

The LR comparing two fully-specified models is simply the ratio of the probabilities of the data under each model. (In saying the “probability of the data”, we are assuming that the data and models involved are discrete. For continuous data and models, the LR is the ratio of the probability densities of the two models evaluated at the data, as discussed in detail here.)

The name “likelihood ratio” comes from the fact that the probability of the data under each model is called the “likelihood” for that model. I recommended saying “likelihood for” the model, or “likelihood under” the model, rather than “likelihood of” the model, to help avoid confusing likelihood with probability.

That is, given data \(x\), the likelihood for a fully-specified (discrete) model \(M\) is defined as \[ L(M) := p(x \mid M), \] where \(p(x \mid M)\) denotes the probability mass function for model \(M\). And the likelihood ratio comparing two fully-specified (discrete) models \(M_1\) vs. \(M_0\) is defined as \[ \mathrm{LR}(M_1, M_0) := \frac{L(M_1)}{L(M_0)}. \] Note that the likelihood for model \(M\) depends on the data \(x\). To make this dependence explicit, the likelihood is sometimes written as \(L(M;x)\) instead of just \(L(M)\).

Large values of \(\mathrm{LR}(M_1, M_0)\) indicate that the data are much more probable under model \(M_1\) than under model \(M_0\), and so indicate support for \(M_1\). Conversely, small values of \(\mathrm{LR}(M_1, M_0)\) indicate support for model \(M_0\).

Example

Using the numbers in the tables above, we can compute the likelihood and LR for \(M_S\) vs. \(M_F\):

x <- c(1,0,1,0,0,1)
fS <- c(0.40,0.12,0.21,0.12,0.02,0.32)
fF <- c(0.8,0.2,0.11,0.17,0.23,0.25)
L <- function (f, x)
  prod(f^x * (1-f)^(1-x))
LR <- L(fS,x)/L(fF,x)
LR
# [1] 1.81359

So \(\mathrm{LR}(M_S, M_F)\) is 1.8135904. This means that the data favor the tusk coming from a savanna elephant by a factor of about 1.8. This is a fairly modest factor — not large enough to draw a convincing conclusion. We will have more to say about interpreting LRs, and what values might be considered “convincing” later.

Note that we have deliberately focused on the likelihood ratio, and not the actual likelihood values themselves. This is because actual likelihood values are generally not useful — it is only the ratios that matter when comparing the models. One way of thinking about this is that the actual likelihood values are very context dependent, and so likelihoods from different data sets are not comparable with one another. However, the meaning of the likelihood ratio is in some sense consistent across contexts: LR = 1.8 means that the data favour the first model by a factor of 1.8, whatever the context.

The inverse likelihood ratio and the log-likelihood ratio

Notice that, from the definition of the LR, the LR for \(M_0\) vs. \(M_1\) is the just inverse of the LR for \(M_1\) vs. \(M_0\). That is, \[ \mathrm{LR}(M_0,M_1) = \frac{1}{\mathrm{LR}(M_1,M_0)}. \] So for example, if \(\mathrm{LR}(M_1,M_0) = 0.01\), then \(\mathrm{LR}(M_0,M_1) = 100\), and the data favour \(M_0\) vs. \(M_1\) by a factor of 100.

For many reasons, it is common to work with the log-likelihood ratio, \(\mathrm{LLR} := \log\mathrm{LR}\). Usually mathematicians work with logarithms in base \(e\), and we will use that convention unless otherwise stated. So for example, if LR = 1.8, then \(\mathrm{LLR} = \log 1.8 =\) 0.5877867.

Although the usual convention is to use base \(e\), it can sometimes be useful to work with logarithms in base 10 to make the inverse logarithm operation easier for human calculation. For example, if I tell you that the LLR, base 10, is 3, then you can immediately tell me that the LR is 1,000.

Exercises

  1. You are playing a game with a friend. The friend has two six-sided dice, one blue and one green. The sides on the blue die are numbered 1, 2, 3, 3, 3 and 4. The sides on the green die are labeled 1, 2, 2, 3, 4 and 4. He picks one of the dice without telling you, and rolls it 10 times, obtaining the results 3, 3, 2, 3, 1, 2, 3, 3, 4, 3. Looking at these results, does intuition say that they support the green die or the blue die? Strongly or weakly? Phrase the problem as a model comparison problem. State your modelling assumptions, and compute a likelihood ratio. Does it support your intuition?

  2. (Read the whole question before starting!)

    1. Perform the following simulation study. Simulate 1,000 tusks (values of \(x\)) from each of the models \(M_S\) and \(M_F\). For each simulated tusk, compute the LR for \(M_S\) vs. \(M_F\), so you have computed 2,000 LRs. Now consider using the LR to classify each tusk as being from a savanna or a forest elephant. Recall that large values for LR indicate support for \(M_S\), so a natural classification rule is “classify as savanna if \(\mathrm{LR} > c\), otherwise classify as forest”, for some threshold \(c\). Plot the misclassification rate (number of tusks wrongly classified divided by 2,000) for this rule, as \(c\) ranges from 0.01 to 100. What value of \(c\) minimizes the misclassification rate? [Hint: the plot will look best if you do things on the log scale, so you could let \(\log_{10}c\) vary from \(-2\) to \(2\) using an equally spaced grid, then plot the misclassification rate on the Y axis vs. \(\log_{10}c\) on the X axis.]

    2. Repeat the above simulation study using 100 tusks from \(M_S\) and 1,900 tusks from \(M_F\). What value of \(c\) minimizes the misclassification rate? Comment on this result.

Note: it is good practice when writing computer code for a simulation study like this to separate the code into small functions each of which accomplishes a clearly defined task. So for example, you should write a function to simulate data, and another to compute the LR (possibly using the “L” function defined above), another to compute the misclassification rate, etc. It is also good practice to comment your functions, e.g., start by describing the inputs and outputs. I strongly suggest using roxygen2 format. See Karl Broman’s tutorial for example.

DO NOT REPEAT THE SIMULATION STUDY BY COPYING AND PASTING CODE AND MODIFYING A FEW NUMBERS. GET INTO THE HABIT OF DOING IT PROPERLY.

To be more explicit, at the top level your code should include a function that looks something like this:

#' @title Run Elephant Tusk Simulation STudy
#'
#' @description Simulate data from forest and savanna tusks with
#'   prespecified frequencies, then use LRs to classify them.
#'
#' @param nS Number of savanna tusks.
#'
#' @param nF Number of forest tusks.
#'
#' @param fS Allele frequencies of the savanna elephants.
#'
#' @param fF Allele frequencies of the forest elephants.
#'
#' @param lc Vector of `log(c)` values of the LR cutoff to use.
#'
#' @param seed Set the random seed using `set.seed` for reproducibility.
#'
#' @return A plot of misclassification rate vs. lc.
#'
#' @example
#' simulation_study(1000,1000,
#'                  c(0.40,0.12,0.21,0.12,0.02,0.32),
#'                  c(0.8,0.2,0.11,0.17,0.23,0.25),
#'                  seq(-2,2,length.out = 100),seed)
#'
elephant_simulation_study <- function (nS, nF, fS, fF, lc, seed) {
  # Add your code here.
}

Then calling this function twice should provide the results needed to complete 2a and 2b.

  1. Now consider modifying the model to allow for errors in the data. Specifically, suppose that there is an error probability of 0.02 when measuring each marker: with probability 0.98 you observe the true \(x_j\), but with probability 0.02 an error is made, and you observe \(1-x_j\). Assume that errors occur independently at each marker \(j\). Let \(M'_S\) and \(M'_F\) denote the models with this error process incorporated. Derive expressions for the likelihood for \(M'_S\) and \(M'_F\), and compute the LR for the tusk data given at the beginning of the example.

Summary

Repeat after me:

  1. “The likelihood for a model is the probability of the data under the model.”

  2. “Individual likelihood values are mostly irrelevant; it is likelihood ratios that matter.”

  3. “If the likelihood ratio for model 1 vs. model 2 is \(x > 1\), this means the data favour model 1 by a factor of \(x\).” (Or, if \(x < 1\), this means the data favour model 2 by a factor of \(1/x\).)


sessionInfo()
# R version 4.3.3 (2024-02-29)
# Platform: aarch64-apple-darwin20 (64-bit)
# Running under: macOS 15.7.1
# 
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
# LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# time zone: America/Chicago
# tzcode source: internal
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# loaded via a namespace (and not attached):
#  [1] vctrs_0.6.5       cli_3.6.5         knitr_1.50        rlang_1.1.6      
#  [5] xfun_0.52         stringi_1.8.7     promises_1.3.3    jsonlite_2.0.0   
#  [9] workflowr_1.7.1   glue_1.8.0        rprojroot_2.0.4   git2r_0.33.0     
# [13] htmltools_0.5.8.1 httpuv_1.6.14     sass_0.4.10       rmarkdown_2.29   
# [17] evaluate_1.0.4    jquerylib_0.1.4   tibble_3.3.0      fastmap_1.2.0    
# [21] yaml_2.3.10       lifecycle_1.0.4   whisker_0.4.1     stringr_1.5.1    
# [25] compiler_4.3.3    fs_1.6.6          Rcpp_1.1.0        pkgconfig_2.0.3  
# [29] later_1.4.2       digest_0.6.37     R6_2.6.1          pillar_1.11.0    
# [33] magrittr_2.0.3    bslib_0.9.0       tools_4.3.3       cachem_1.1.0