Last updated: 2026-01-06


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/likelihood_ratio_simple_continuous_data.Rmd) and HTML (docs/likelihood_ratio_simple_continuous_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	551db67	Peter Carbonetto	2026-01-06	A few small updates to the likelihood_ratio_simple_continuous_data vignette.
html	551db67	Peter Carbonetto	2026-01-06	A few small updates to the likelihood_ratio_simple_continuous_data vignette.
Rmd	cb29296	Peter Carbonetto	2025-12-29	Added pdf version of likelihood_ratio_simple_continuous_data vignette.
html	56054d9	Matthew Stephens	2022-04-26	Build site.
Rmd	4a62d37	GitHub	2022-04-26	Fix typo in likelihood_ratio_simple_continuous_data.Rmd
html	5f62ee6	Matthew Stephens	2019-03-31	Build site.
Rmd	0cd28bd	Matthew Stephens	2019-03-31	workflowr::wflow_publish(all = TRUE)
html	34bcc51	John Blischak	2017-03-06	Build site.
Rmd	5fbc8b5	John Blischak	2017-03-06	Update workflowr project with wflow_update (version 0.4.0).
Rmd	391ba3c	John Blischak	2017-03-06	Remove front and end matter of non-standard templates.
html	8e61683	Marcus Davy	2017-03-03	rendered html using wflow_build(all=TRUE)
html	5d0fa13	Marcus Davy	2017-03-02	wflow_build() rendered html files
Rmd	d674141	Marcus Davy	2017-02-26	typos, refs
html	c3b365a	John Blischak	2017-01-02	Build site.
Rmd	67a8575	John Blischak	2017-01-02	Use external chunk to set knitr chunk options.
Rmd	5ec12c7	John Blischak	2017-01-02	Use session-info chunk.
Rmd	9714939	stephens999	2016-03-28	add example to show how approximation can break down
Rmd	a424331	stephens999	2016-01-19	minor update
Rmd	8b7278d	stephens999	2016-01-11	add LR for continuous data

See here for a PDF version of this vignette.

Summary

This document introduces the likelihood ratio for continuous data and models, and explains its connection with discrete models.

Prerequisites

Be familiar with the likelihood ratio for discrete data.

Definition

Recall that if \(M_0\) and \(M_1\) are fully-specified models for discrete data \(X = x\), with probability mass functions \(p(x \mid M_0)\) and \(p(x \mid M_1)\), then the likelihood ratio for \(M_1\) vs. \(M_0\) is \[ \mathrm{LR}(M_1,M_0) := \frac{p(x \mid M_1)}{p(x \mid M_0)}. \] Now suppose that the data and models are continuous. So instead of a probability mass function, each model has a probability density function. Then the likelihood ratio for \(M_1\) vs. \(M_0\) is usually defined as the ratio of the probability density functions. That is, we have exactly the same expression for the LR, \[ \mathrm{LR}(M_1,M_0) := \frac{p(x \mid M_1)}{p(x \mid M_0)}, \] but now \(p(\cdot \mid M_1)\) and \(p(\cdot \mid M_0)\) are probability density functions instead of probability mass functions.

Example

A medical screening test for a disease involves measuring the concentration (\(X\)) of a protein in the blood. In normal individuals, \(X\) has a Gamma distribution with mean 1 and shape 2. In diseased individuals, the protein becomes elevated, and \(X\) has a Gamma distribution with mean 2 and shape 2. Plotting the probability density functions of these distributions yields:

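# Densities of protein concentration: normal group in blue (mean 1),
# diseased group in red (mean 2); both have shape 2, so scale = mean/2.
x <- seq(0,10,length.out = 100)
plot(x,dgamma(x,scale = 0.5,shape = 2),type = "l",col = "blue",lwd = 2,
     xlab = "protein concentration",ylab = "probability density")
lines(x,dgamma(x,scale = 1,shape = 2),col = "red",lwd = 2)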
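The example describes each Gamma distribution by its mean and shape, whereas the plotting code passes a scale. The sketch below shows one way to check the conversion, using the standard relation mean = shape × scale. The helper names `gamma_scale` and `gamma_mean` are hypothetical and are not part of the original vignette.

```r
# Convert a (mean, shape) description of a Gamma distribution into the
# scale parameter expected by dgamma(), using mean = shape * scale.
# (gamma_scale and gamma_mean are illustrative helpers, not vignette code.)
gamma_scale <- function (mean, shape)
  mean/shape

scale_normal   <- gamma_scale(mean = 1, shape = 2)
scale_diseased <- gamma_scale(mean = 2, shape = 2)

# Numerical check: recover each mean by integrating x * density over (0, Inf).
gamma_mean <- function (shape, scale)
  integrate(function (x) x * dgamma(x, shape = shape, scale = scale),
            lower = 0, upper = Inf)$value

mean_normal   <- gamma_mean(shape = 2, scale = scale_normal)
mean_diseased <- gamma_mean(shape = 2, scale = scale_diseased)
```

Under these assumptions, the two scales should come out as 0.5 and 1, matching the values used in the calls to dgamma above.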

Suppose that for a particular patient we observe \(X = 4.02\). Then the likelihood ratio for the model that this patient is from the normal group (\(M_n\)) vs. the model that the patient is from the diseased group (\(M_d\)) is dgamma(4.02,scale = 0.5,shape = 2)/dgamma(4.02,scale = 1,shape = 2), which is 0.0718. Equivalently, the likelihood ratio for \(M_d\) vs. \(M_n\) is \(1/0.0718 \approx 14\). That is, the data favour this individual being diseased by a factor of approximately 14.
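As a minimal sketch, the calculation for this patient can be reproduced as follows. The variable names are illustrative. The closed form \(4 e^{-x}\) is not stated in the vignette; it follows from dividing the two Gamma densities when both have shape 2 and the scales are 0.5 and 1.

```r
# Observed protein concentration for the patient.
x_obs <- 4.02

# LR for the normal model (scale 0.5) vs. the diseased model (scale 1).
lr_normal_vs_diseased <- dgamma(x_obs, shape = 2, scale = 0.5) /
                         dgamma(x_obs, shape = 2, scale = 1)

# Reversing the order of the models inverts the LR.
lr_diseased_vs_normal <- 1/lr_normal_vs_diseased

# With a common shape of 2, the density ratio simplifies to 4 * exp(-x).
lr_closed_form <- 4 * exp(-x_obs)
```

Writing the LR both ways makes the direction of the evidence explicit: a value well below 1 for normal vs. diseased is the same statement as a value well above 1 for diseased vs. normal.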

Connection with discrete models

Often the likelihood ratio for continuous models is simply defined as the ratio of the densities, as above. However, an alternative approach, which can yield greater insight, is instead to derive this result as an approximation, from the definition of likelihood ratio for discrete models, as follows.

The first step is to recognize that in practice all observations are actually discrete, because of finite precision. Sometimes the measurement precision is made explicit, but often it is implicit in the number of decimal places used to report an observation. For example, in the example above, where we were told that we observed a protein concentration of \(X=4.02\), it would be reasonable to think that the measurement precision is 2 decimal places, and that this observation actually corresponds to “\(X\) lies in the interval \([4.015,4.025)\)”. The probability of this observation, under a continuous model for \(X\), is the integral of the probability density function from \(4.015\) to \(4.025\). In other words, it is \(F_X(4.025) - F_X(4.015)\), where \(F_X\) denotes the cumulative distribution function for \(X\).

With this view, the likelihood for the “observation” \(X=4.02\) under \(M_n\) is actually pgamma(4.025,scale = 0.5,shape = 2) - pgamma(4.015,scale = 0.5,shape = 2), which is 0.00005183. Similarly, the likelihood under \(M_d\) is 0.0007217, and therefore the likelihood ratio is 0.0718126.
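The interval-based calculation can be sketched as below, assuming the observation is read as \(X \in [4.015, 4.025)\). The helper `interval_prob` is a hypothetical name introduced here for illustration.

```r
# Probability that a Gamma random variable falls between lower and upper,
# computed as a difference of cumulative distribution function values.
# (interval_prob is an illustrative helper, not vignette code.)
interval_prob <- function (lower, upper, shape, scale)
  pgamma(upper, shape = shape, scale = scale) -
    pgamma(lower, shape = shape, scale = scale)

# Likelihoods of the "observation" X = 4.02 under each model.
lik_normal   <- interval_prob(4.015, 4.025, shape = 2, scale = 0.5)
lik_diseased <- interval_prob(4.015, 4.025, shape = 2, scale = 1)

# LR from the discretized observation, and from the densities for comparison.
lr_interval <- lik_normal/lik_diseased
lr_density  <- dgamma(4.02, shape = 2, scale = 0.5) /
               dgamma(4.02, shape = 2, scale = 1)
```

Comparing `lr_interval` with `lr_density` shows how close the two definitions are for this observation.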

As you can see, this approach yields an LR that is numerically very close to that obtained using the ratio of the densities, as above. This is not a coincidence! Here is why we should expect this to happen more generally. Suppose we assume that the measurement precision is \(\epsilon\). So the “observation” \(X = x\) really means \(X \in [x-\epsilon, x+\epsilon]\). (In the example above, \(\epsilon = 0.005\).) Then the likelihood for a model \(M\), given this observation, is \(\Pr(X \in [x-\epsilon,x+\epsilon] \mid M)\). Provided that the density \(p(x \mid M)\) is approximately constant in the region within radius \(\epsilon\) around \(x\), this probability is approximately \(2 \epsilon p(x \mid M)\). Thus, the LR for two models \(M_1\) vs. \(M_0\) is \[ \mathrm{LR}(M_1,M_0) = \frac{\Pr(X \in [x-\epsilon,x+\epsilon] \mid M_1)} {\Pr(X \in [x-\epsilon,x+\epsilon] \mid M_0)} \approx \frac{2\epsilon p(x \mid M_1)}{2\epsilon p(x \mid M_0)} = \frac{p(x \mid M_1)}{p(x \mid M_0)}. \]
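To illustrate the argument numerically, the sketch below evaluates the interval-based LR for the screening example at several values of \(\epsilon\) and compares each with the density ratio. The function `interval_lr` and the particular grid of \(\epsilon\) values are assumptions made for illustration only.

```r
# Interval-based LR for the normal model (scale 0.5) vs. the diseased model
# (scale 1), treating the observation as X in [x - eps, x + eps].
# (interval_lr is an illustrative helper, not vignette code.)
interval_lr <- function (x, eps, shape = 2, scale_n = 0.5, scale_d = 1) {
  p_n <- pgamma(x + eps, shape = shape, scale = scale_n) -
         pgamma(x - eps, shape = shape, scale = scale_n)
  p_d <- pgamma(x + eps, shape = shape, scale = scale_d) -
         pgamma(x - eps, shape = shape, scale = scale_d)
  p_n/p_d
}

# LR defined as the ratio of the densities at x = 4.02.
density_lr <- dgamma(4.02, shape = 2, scale = 0.5) /
              dgamma(4.02, shape = 2, scale = 1)

# Absolute error of the density-ratio approximation as eps shrinks.
eps     <- c(1, 0.1, 0.01, 0.005)
abs_err <- abs(sapply(eps, function (e) interval_lr(4.02, e)) - density_lr)
```

The errors in `abs_err` should shrink as \(\epsilon\) decreases, which is the behaviour the approximation argument predicts when the densities are smooth near \(x\).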

An example where the approximation breaks down

The approximation usually works well, but here is a simple example to illustrate how the approximation could break down in principle.

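Suppose we observe a single data point \(X\) and compare the models \(M_0: X \sim N(0, \sigma_0)\) vs. \(M_1: X \sim N(0, \sigma_1)\), where the second parameter denotes the standard deviation. Suppose that we observe \(X = 0.00\), assumed to be correct to the nearest 0.01, so that the observation really means \(X \in [-0.005, 0.005]\). Then the “true” LR for \(M_0\) vs. \(M_1\) is given by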

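# "True" LR for M0 vs. M1: ratio of the probabilities that X lies in
# [-0.005, 0.005] under each model (s0 and s1 are standard deviations).
trueLR <- function (s0, s1) {
  L0 <- pnorm(0.005,sd = s0) - pnorm(-0.005,sd = s0)
  L1 <- pnorm(0.005,sd = s1) - pnorm(-0.005,sd = s1)
  return(L0/L1)
}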

and the approximation is

approxLR <- function (s0, s1)
  dnorm(0,sd = s0)/dnorm(0,sd = s1)

Now, if \(\sigma_0\) and \(\sigma_1\) are both not too small, the approximation works fine. For example, for \(\sigma_0 = 0.5, \sigma_1 = 1\), the true LR and the approximation are 1.999975 and 2, respectively.

But if \(\sigma_0\) or \(\sigma_1\) is small, we have the problem that the density is not approximately constant within the region \([-0.005, 0.005]\). For example, at \(\sigma_0 = 0.001, \sigma_1 = 1\), the true LR and the approximation are 250.6637282 and 1000, respectively.
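The sketch below tabulates the true and approximate LRs over a range of \(\sigma_0\) values with \(\sigma_1 = 1\), to show where the two start to diverge. The functions `true_lr` and `approx_lr` restate the vignette's trueLR and approxLR so that the block is self-contained; the `half_width` argument and the grid of \(\sigma_0\) values are assumptions added for illustration.

```r
# Interval-based ("true") LR for M0 vs. M1, with the observation X = 0.00
# interpreted as X in [-half_width, half_width].
true_lr <- function (s0, s1, half_width = 0.005) {
  L0 <- pnorm(half_width, sd = s0) - pnorm(-half_width, sd = s0)
  L1 <- pnorm(half_width, sd = s1) - pnorm(-half_width, sd = s1)
  L0/L1
}

# Density-ratio approximation, which here reduces to s1/s0.
approx_lr <- function (s0, s1)
  dnorm(0, sd = s0)/dnorm(0, sd = s1)

# Compare the two as sigma_0 shrinks relative to the interval half-width.
s0 <- c(0.5, 0.1, 0.01, 0.001)
comparison <- data.frame(s0     = s0,
                         true   = true_lr(s0, 1),
                         approx = approx_lr(s0, 1))
comparison$approx_over_true <- comparison$approx/comparison$true
print(comparison)
```

Because the interval probability under \(M_0\) can never exceed 1, the true LR is bounded above by the reciprocal of the interval probability under \(M_1\), whereas the density ratio \(\sigma_1/\sigma_0\) grows without bound as \(\sigma_0\) shrinks. This is why the two disagree once \(\sigma_0\) is comparable to, or smaller than, the measurement precision.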

Summary

In most cases, the likelihood ratio for model \(M_1\) vs. model \(M_0\) for a continuous random variable \(X\), given an observation \(X = x\), can be well approximated by the ratio of the model densities of \(X\), evaluated at \(x\). This approximation comes from assuming that the model density functions are approximately constant within the neighborhood of \(x\) that has radius equal to the measurement precision.

sessionInfo()
# R version 4.3.3 (2024-02-29)
# Platform: aarch64-apple-darwin20 (64-bit)
# Running under: macOS 15.7.1
# 
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
# LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# time zone: America/Chicago
# tzcode source: internal
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# loaded via a namespace (and not attached):
#  [1] vctrs_0.6.5       cli_3.6.5         knitr_1.50        rlang_1.1.6      
#  [5] xfun_0.52         stringi_1.8.7     promises_1.3.3    jsonlite_2.0.0   
#  [9] workflowr_1.7.1   glue_1.8.0        rprojroot_2.0.4   git2r_0.33.0     
# [13] htmltools_0.5.8.1 httpuv_1.6.14     sass_0.4.10       rmarkdown_2.29   
# [17] evaluate_1.0.4    jquerylib_0.1.4   tibble_3.3.0      fastmap_1.2.0    
# [21] yaml_2.3.10       lifecycle_1.0.4   whisker_0.4.1     stringr_1.5.1    
# [25] compiler_4.3.3    fs_1.6.6          Rcpp_1.1.0        pkgconfig_2.0.3  
# [29] later_1.4.2       digest_0.6.37     R6_2.6.1          pillar_1.11.0    
# [33] magrittr_2.0.3    bslib_0.9.0       tools_4.3.3       cachem_1.1.0

The likelihood ratio for continuous data

Matthew Stephens

January 6, 2025
