Last updated: 2026-01-06
Checks: 6 1
Knit directory: fiveMinuteStats/analysis/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of
the R Markdown file created these results, you’ll want to first commit
it to the Git repo. If you’re still working on the analysis, you can
ignore this warning. When you’re finished, you can run
wflow_publish to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(12345) was run prior to running the
code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 2d6d41e. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Untracked files:
Untracked: analysis/#bayes_multiclass.Rmd#
Unstaged changes:
Modified: analysis/likelihood_ratio_simple_continuous_data.Rmd
Modified: analysis/likelihood_ratio_simple_models.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown
(analysis/likelihood_ratio_simple_models.Rmd) and HTML
(docs/likelihood_ratio_simple_models.html) files. If you’ve
configured a remote Git repository (see ?wflow_git_remote),
click on the hyperlinks in the table below to view the files as they
were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | a361e11 | Peter Carbonetto | 2026-01-06 | A bunch of small edits to the bayes_multiclass vignette. |
| html | 9781cd8 | Peter Carbonetto | 2026-01-06 | Ran wflow_publish("analysis/likelihood_ratio_simple_models.Rmd"). |
| Rmd | 0593f14 | Peter Carbonetto | 2026-01-06 | wflow_publish("analysis/likelihood_ratio_simple_models.Rmd") |
| Rmd | fc8b1d8 | Peter Carbonetto | 2026-01-06 | Some edits to the exercises in the likelihood_ratio_simple_models vignette. |
| Rmd | 8a0af86 | Peter Carbonetto | 2026-01-06 | A few more small edits to the likelihood_ratio_simple_models vignette. |
| Rmd | c67ccfc | Peter Carbonetto | 2026-01-06 | A few more edits to the likelihood_ratio_simple_models vignette. |
| Rmd | 306bf92 | Peter Carbonetto | 2026-01-06 | A few minor edits to the likelihood_ratio_simple_models vignette. |
| Rmd | 7a8ffe4 | Peter Carbonetto | 2025-12-29 | Created pdf version of LR_and_BF vignette. |
| html | 7a8ffe4 | Peter Carbonetto | 2025-12-29 | Created pdf version of LR_and_BF vignette. |
| Rmd | 49b0c33 | Peter Carbonetto | 2025-12-29 | Added a link. |
| html | 49b0c33 | Peter Carbonetto | 2025-12-29 | Added a link. |
| Rmd | 8c88ffc | Peter Carbonetto | 2025-12-29 | More basic infrastructure work. |
| html | 8c88ffc | Peter Carbonetto | 2025-12-29 | More basic infrastructure work. |
| Rmd | 599a8d7 | Peter Carbonetto | 2025-12-29 | A few small adjustments to the workflowr settings. |
| Rmd | 04994f5 | Peter Carbonetto | 2025-12-29 | Added basic Makefile infrastructure. |
| html | 6cefaaa | Peter Carbonetto | 2025-12-29 | Build site. |
| Rmd | c8270f5 | Peter Carbonetto | 2025-12-29 | workflowr::wflow_publish("analysis/likelihood_ratio_simple_models.Rmd", |
| html | 35b85cd | Peter Carbonetto | 2025-12-29 | Build site. |
| html | 96cc4f4 | Peter Carbonetto | 2025-12-29 | Build site. |
| Rmd | cd79db8 | Peter Carbonetto | 2025-12-29 | wflow_publish("analysis/likelihood_ratio_simple_models.Rmd", |
| Rmd | c345a1a | Nan Xiao | 2024-03-03 | Fix broken and moved URL |
| html | 5f62ee6 | Matthew Stephens | 2019-03-31 | Build site. |
| Rmd | 0cd28bd | Matthew Stephens | 2019-03-31 | workflowr::wflow_publish(all = TRUE) |
| html | 219e6c4 | stephens999 | 2018-03-27 | Build site. |
| html | 34bcc51 | John Blischak | 2017-03-06 | Build site. |
| Rmd | 2aa70fe | John Blischak | 2017-03-06 | Manually udpate some Rmd files. |
| Rmd | 5fbc8b5 | John Blischak | 2017-03-06 | Update workflowr project with wflow_update (version 0.4.0). |
| Rmd | 391ba3c | John Blischak | 2017-03-06 | Remove front and end matter of non-standard templates. |
| html | 8e61683 | Marcus Davy | 2017-03-03 | rendered html using wflow_build(all=TRUE) |
| Rmd | d674141 | Marcus Davy | 2017-02-26 | typos, refs |
| Rmd | f7af269 | Marcus Davy | 2017-02-25 | knitr style citations |
| Rmd | 3be128c | Marcus Davy | 2017-02-24 | start bibtex ref, urls etc |
| html | c3b365a | John Blischak | 2017-01-02 | Build site. |
| Rmd | 67a8575 | John Blischak | 2017-01-02 | Use external chunk to set knitr chunk options. |
| Rmd | 5ec12c7 | John Blischak | 2017-01-02 | Use session-info chunk. |
| Rmd | 6a2fec7 | stephens999 | 2016-04-07 | fix typo |
| Rmd | b616830 | Arjun Biddanda | 2016-01-18 | Fixed Typo between fF in table and in code |
| Rmd | d6f4bea | stephens999 | 2016-01-12 | add examples |
| Rmd | 6c7d0e0 | stephens999 | 2016-01-08 | updated exercises |
| Rmd | 93476ef | stephens999 | 2016-01-07 | update LR to make clear it is for discrete data |
| Rmd | 8f61232 | stephens999 | 2016-01-07 | add simple LR example |
See here for a PDF version of this vignette.
This document introduces the concept of likelihood ratio for comparing two simple (“fully specified”) statistical models, one of the most fundamental concepts in statistical inference.
Be familiar with basic concepts from probability, particularly the concept of a probability distribution.
Technical note: To simplify this problem I have assumed that elephants are [haploid][plody], which they are not. If you do not know what this means you should simply ignore this comment.
There are two subspecies of African Elephant: savannah and forest elephants, which differ in their genetic makeup. Interpol have seized an illegally-smuggled elephant tusk, and they want to know which subspecies of elephant the tusk came from. To try to answer this, they collect DNA from the tusk and measure it at a number of locations (“markers” in genetics jargon) along the elephant genome. At each marker, the DNA can be one of two types (“alleles” in genetics jargon), which for simplicity we will label 0 and 1. So the available data on the tusk from one elephant might look something like this:
| marker | allele |
|---|---|
| 1 | 1 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |
| 5 | 0 |
| 6 | 1 |
Interpol also have information on the frequency of each allele in each of the two subspecies — this was obtained by measuring the DNA of a large number of savanna elephants and a large number of forest elephants. We will use \(f_{Sj}\) and \(f_{Fj}\) to denote the frequency of the “1” allele at marker \(j\) in savanna and forest elephants, respectively. (And since there are only two alleles, the frequency of the “0” allele is \(1 - f_{Sj}\) and \(1 - f_{Fj}\), respectively.) Here is a table of this information:
| marker, \(j\) | \(f_{Sj}\) | \(f_{Fj}\) |
|---|---|---|
| 1 | 0.40 | 0.80 |
| 2 | 0.12 | 0.20 |
| 3 | 0.21 | 0.11 |
| 4 | 0.12 | 0.17 |
| 5 | 0.02 | 0.23 |
| 6 | 0.32 | 0.25 |
The question before us is: which subspecies of elephant did the tusk come from, and how confident should we be in this conclusion?
To get some intuition, let us examine the data at the first few markers. At marker 1, our tusk has the allele 1. This allele is less common in savanna elephants than forest elephants (40% of savanna elephants carry this allele, vs. 80% of forest elephants), so this observation seems to support the sample coming from forest. However, at the same time 40% of savanna samples carry this allele, so it remains plausible that the sample came from savanna.
Moving to marker 2, our tusk has the allele 0, which is more common in savanna (88%) than forest elephants (80%). And at marker 3 the tusk has allele 1 which is also more common in savanna (21% vs. 11%). So, in contrast to marker 1, the data at these two markers are more consistent with the tusk coming from a savanna elephant than a forest elephant.
We could continue in this way, but the point should be clear: each marker contains some information, but not a huge amount, and sometimes the information points in different directions. By taking a statistical approach to this problem, we aim to quantify this kind of information, and to efficiently combine the information across markers to come to an overall conclusion.
(This is a simplified version of a real problem: Interpol and other authorities want to know the origin of poached tusks to help focus efforts on curbing this illegal activity. In practice, they are interested in much finer-level discrimination, and measure many genetic markers to get more information. See Wasser et al (2007) and Wasser et al (2008) for more details.)
We can phrase this problem as a “model comparison” problem. We have data \(X=x\) from our tusk, and we have two different models for how those data might have arisen: it could have been sampled from a savanna elephant, or it could have been sampled from a forest elephant. We will use \(M_S\) and \(M_F\) as shorthand for these two models. A key point is that these two models imply different probability distributions for \(X\): some values of \(X\) are more common under \(M_S\) and others are more common under \(M_F\).
Denoting the probability mass functions of these two distributions as \(p(x \mid M_S)\) and \(p(x \mid M_F)\), and assuming the data at different markers are independent, these probability distributions are \[ p(x \mid M_S) = \prod_{j=1}^J f_{Sj}^{x_j} (1-f_{Sj})^{1-x_j} \] and \[p(x \mid M_F) = \prod_{j=1}^J f_{Fj}^{x_j} (1-f_{Fj})^{1-x_j}, \] where \(J = 6\) is the number of markers, and the values of \(f_{Sj}\) and \(f_{Fj}\) are given in the table above.
Note that a key feature of these models is that they are fully specified. This means there are no unknown values in the probability distributions. Comparing fully-specified models is the simplest kind of model comparison, and so is a good place to start.
A key idea is that a useful summary of how strongly the data \(x\) support one model vs. another model is given by the “likelihood ratio” (LR).
The LR comparing two fully-specified models is simply the ratio of the probabilities of the data under each model. (In saying the “probability of the data”, we are assuming that the data and models involved are discrete. For continuous data and models, the LR is the ratio of the probability densities of the two models evaluated at the data, as discussed in detail here.)
The name “likelihood ratio” comes from the fact that the probability of the data under each model is called the “likelihood” for that model. I recommended saying “likelihood for” the model, or “likelihood under” the model, rather than “likelihood of” the model, to help avoid confusing likelihood with probability.
That is, given data \(x\), the likelihood for a fully-specified (discrete) model \(M\) is defined as \[ L(M) := p(x \mid M), \] where \(p(x \mid M)\) denotes the probability mass function for model \(M\). And the likelihood ratio comparing two fully-specified (discrete) models \(M_1\) vs. \(M_0\) is defined as \[ \mathrm{LR}(M_1, M_0) := \frac{L(M_1)}{L(M_0)}. \] Note that the likelihood for model \(M\) depends on the data \(x\). To make this dependence explicit, the likelihood is sometimes written as \(L(M;x)\) instead of just \(L(M)\).
Large values of \(\mathrm{LR}(M_1, M_0)\) indicate that the data are much more probable under model \(M_1\) than under model \(M_0\), and so indicate support for \(M_1\). Conversely, small values of \(\mathrm{LR}(M_1, M_0)\) indicate support for model \(M_0\).
Using the numbers in the tables above, we can compute the likelihood and LR for \(M_S\) vs. \(M_F\):
x <- c(1,0,1,0,0,1)
fS <- c(0.40,0.12,0.21,0.12,0.02,0.32)
fF <- c(0.8,0.2,0.11,0.17,0.23,0.25)
L <- function (f, x)
prod(f^x * (1-f)^(1-x))
LR <- L(fS,x)/L(fF,x)
LR
# [1] 1.81359
So \(\mathrm{LR}(M_S, M_F)\) is 1.8135904. This means that the data favor the tusk coming from a savanna elephant by a factor of about 1.8. This is a fairly modest factor — not large enough to draw a convincing conclusion. We will have more to say about interpreting LRs, and what values might be considered “convincing” later.
Note that we have deliberately focused on the likelihood ratio, and not the actual likelihood values themselves. This is because actual likelihood values are generally not useful — it is only the ratios that matter when comparing the models. One way of thinking about this is that the actual likelihood values are very context dependent, and so likelihoods from different data sets are not comparable with one another. However, the meaning of the likelihood ratio is in some sense consistent across contexts: LR = 1.8 means that the data favour the first model by a factor of 1.8, whatever the context.
Notice that, from the definition of the LR, the LR for \(M_0\) vs. \(M_1\) is the just inverse of the LR for \(M_1\) vs. \(M_0\). That is, \[ \mathrm{LR}(M_0,M_1) = \frac{1}{\mathrm{LR}(M_1,M_0)}. \] So for example, if \(\mathrm{LR}(M_1,M_0) = 0.01\), then \(\mathrm{LR}(M_0,M_1) = 100\), and the data favour \(M_0\) vs. \(M_1\) by a factor of 100.
For many reasons, it is common to work with the log-likelihood ratio, \(\mathrm{LLR} := \log\mathrm{LR}\). Usually mathematicians work with logarithms in base \(e\), and we will use that convention unless otherwise stated. So for example, if LR = 1.8, then \(\mathrm{LLR} = \log 1.8 =\) 0.5877867.
Although the usual convention is to use base \(e\), it can sometimes be useful to work with logarithms in base 10 to make the inverse logarithm operation easier for human calculation. For example, if I tell you that the LLR, base 10, is 3, then you can immediately tell me that the LR is 1,000.
You are playing a game with a friend. The friend has two six-sided dice, one blue and one green. The sides on the blue die are numbered 1, 2, 3, 3, 3 and 4. The sides on the green die are labeled 1, 2, 2, 3, 4 and 4. He picks one of the dice without telling you, and rolls it 10 times, obtaining the results 3, 3, 2, 3, 1, 2, 3, 3, 4, 3. Looking at these results, does intuition say that they support the green die or the blue die? Strongly or weakly? Phrase the problem as a model comparison problem. State your modelling assumptions, and compute a likelihood ratio. Does it support your intuition?
(Read the whole question before starting!)
Perform the following simulation study. Simulate 1,000 tusks (values of \(x\)) from each of the models \(M_S\) and \(M_F\). For each simulated tusk, compute the LR for \(M_S\) vs. \(M_F\), so you have computed 2,000 LRs. Now consider using the LR to classify each tusk as being from a savanna or a forest elephant. Recall that large values for LR indicate support for \(M_S\), so a natural classification rule is “classify as savanna if \(\mathrm{LR} > c\), otherwise classify as forest”, for some threshold \(c\). Plot the misclassification rate (number of tusks wrongly classified divided by 2,000) for this rule, as \(c\) ranges from 0.01 to 100. What value of \(c\) minimizes the misclassification rate? [Hint: the plot will look best if you do things on the log scale, so you could let \(\log_{10}c\) vary from \(-2\) to \(2\) using an equally spaced grid, then plot the misclassification rate on the Y axis vs. \(\log_{10}c\) on the X axis.]
Repeat the above simulation study using 100 tusks from \(M_S\) and 1,900 tusks from \(M_F\). What value of \(c\) minimizes the misclassification rate? Comment on this result.
Note: it is good practice when writing computer code for a simulation study like this to separate the code into small functions each of which accomplishes a clearly defined task. So for example, you should write a function to simulate data, and another to compute the LR (possibly using the “L” function defined above), another to compute the misclassification rate, etc. It is also good practice to comment your functions, e.g., start by describing the inputs and outputs. I strongly suggest using roxygen2 format. See Karl Broman’s tutorial for example.
DO NOT REPEAT THE SIMULATION STUDY BY COPYING AND PASTING CODE AND MODIFYING A FEW NUMBERS. GET INTO THE HABIT OF DOING IT PROPERLY.
To be more explicit, at the top level your code should include a function that looks something like this:
#' @title Run Elephant Tusk Simulation STudy
#'
#' @description Simulate data from forest and savanna tusks with
#' prespecified frequencies, then use LRs to classify them.
#'
#' @param nS Number of savanna tusks.
#'
#' @param nF Number of forest tusks.
#'
#' @param fS Allele frequencies of the savanna elephants.
#'
#' @param fF Allele frequencies of the forest elephants.
#'
#' @param lc Vector of `log(c)` values of the LR cutoff to use.
#'
#' @param seed Set the random seed using `set.seed` for reproducibility.
#'
#' @return A plot of misclassification rate vs. lc.
#'
#' @example
#' simulation_study(1000,1000,
#' c(0.40,0.12,0.21,0.12,0.02,0.32),
#' c(0.8,0.2,0.11,0.17,0.23,0.25),
#' seq(-2,2,length.out = 100),seed)
#'
elephant_simulation_study <- function (nS, nF, fS, fF, lc, seed) {
# Add your code here.
}
Then calling this function twice should provide the results needed to complete 2a and 2b.
Repeat after me:
“The likelihood for a model is the probability of the data under the model.”
“Individual likelihood values are mostly irrelevant; it is likelihood ratios that matter.”
“If the likelihood ratio for model 1 vs. model 2 is \(x > 1\), this means the data favour model 1 by a factor of \(x\).” (Or, if \(x < 1\), this means the data favour model 2 by a factor of \(1/x\).)
sessionInfo()
# R version 4.3.3 (2024-02-29)
# Platform: aarch64-apple-darwin20 (64-bit)
# Running under: macOS 15.7.1
#
# Matrix products: default
# BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# time zone: America/Chicago
# tzcode source: internal
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# loaded via a namespace (and not attached):
# [1] vctrs_0.6.5 cli_3.6.5 knitr_1.50 rlang_1.1.6
# [5] xfun_0.52 stringi_1.8.7 promises_1.3.3 jsonlite_2.0.0
# [9] workflowr_1.7.1 glue_1.8.0 rprojroot_2.0.4 git2r_0.33.0
# [13] htmltools_0.5.8.1 httpuv_1.6.14 sass_0.4.10 rmarkdown_2.29
# [17] evaluate_1.0.4 jquerylib_0.1.4 tibble_3.3.0 fastmap_1.2.0
# [21] yaml_2.3.10 lifecycle_1.0.4 whisker_0.4.1 stringr_1.5.1
# [25] compiler_4.3.3 fs_1.6.6 Rcpp_1.1.0 pkgconfig_2.0.3
# [29] later_1.4.2 digest_0.6.37 R6_2.6.1 pillar_1.11.0
# [33] magrittr_2.0.3 bslib_0.9.0 tools_4.3.3 cachem_1.1.0