Last updated: 2019-03-31

Pre-requisites

This note assumes familiarity with the likelihood ratio for
- discrete data
- continuous data.

Overview

The aim here is to give some simple (somewhat artificial) examples to illustrate the idea of a likelihood ratio, and to mention some pitfalls to be avoided.

Infinite and zero LRs

Suppose you are throwing a six-sided die with sides marked 1, 2, 3, 4, 5 and 6. Consider comparing the models $M_0$: the die is fair (i.e. each face has probability 1/6) vs $M_1$: the die is loaded, and will always land 6.

If we observe a “6” then the likelihood ratio for $M_1$ vs $M_0$ is $1/(1/6) = 6$.

If we observe any other number then the likelihood ratio for $M_1$ vs $M_0$ is $0/(1/6) = 0$.

Note that $LR=0$ in the latter case because the data are impossible under $M_1$. Indeed, $LR(M_1,M_0)=0$ if and only if the data are impossible under $M_1$ (given that they have positive probability under $M_0$), and so $LR=0$ means that the data exclude $M_1$. Note also that in this case the LR for $M_0$ vs $M_1$ is infinite. However, in general, strong support for $M_0$ vs $M_1$ does not imply that $M_0$ is “true”, or even a good model. It only implies that $M_0$ is favored over $M_1$. There could always be other models that explain the data much better than $M_0$!
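The die calculation can be sketched in a few lines of R. This is only an illustration of the two models described above; the variable names are made up for this sketch.

```r
# Probability of each face (1..6) under the two models from the text
p_M0 <- rep(1 / 6, 6)          # M0: fair die
p_M1 <- c(0, 0, 0, 0, 0, 1)    # M1: loaded die that always lands 6

# Likelihood ratio for M1 vs M0, one value per possible observed face
lr_M1_vs_M0 <- p_M1 / p_M0

# Likelihood ratio for M0 vs M1 is the reciprocal;
# R returns Inf where the denominator is zero
lr_M0_vs_M1 <- p_M0 / p_M1

print(lr_M1_vs_M0)
print(lr_M0_vs_M1)
```

Faces 1 to 5 give an LR of 0 for $M_1$ vs $M_0$ (and an infinite LR in the other direction), while face 6 gives an LR of 6.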

Avoid focussing on the likelihood itself: only ratios matter

Suppose we toss a coin 100 times, and get 50 Heads and 50 Tails (in some order). If the coin is fair (i.e. it has a 50% chance of landing heads, independently for each toss), then the probability of any given sequence with 50 heads and 50 tails is $0.5^{100}$. That is, given these data, the likelihood for the model $M$ “the coin is fair” is $0.5^{100}$.

Is this a big likelihood or a small likelihood? The point here is that this is not really a meaningful question. Although the number $0.5^{100}$ is, in most contexts, “small”, in this context it would be wrong to call this a “small” likelihood. Indeed, the data are entirely consistent with the model!

Don’t focus on likelihoods: focus on likelihood ratios.
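To make this concrete, here is a small R sketch that compares the fair-coin model with one alternative. The alternative (heads probability 0.6) is an assumption chosen purely for illustration; it does not appear in the example above.

```r
n_heads <- 50
n_tails <- 50

# Likelihood of one particular sequence with these counts,
# when each toss lands heads independently with probability p
seq_lik <- function(p) p^n_heads * (1 - p)^n_tails

lik_fair   <- seq_lik(0.5)   # equals 0.5^100, a tiny number
lik_biased <- seq_lik(0.6)   # hypothetical alternative model

# The ratio is what carries the information
lr_fair_vs_biased <- lik_fair / lik_biased

print(lik_fair)
print(lr_fair_vs_biased)
```

Both likelihoods are tiny, yet their ratio is greater than 1: the data favor the fair coin over this particular alternative, which is not visible from either number on its own.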

Similarly, when looking at log-likelihoods, it is the difference between log-likelihoods (i.e. the log-likelihood ratio) that matters, not the actual log-likelihoods. For example, suppose the log-likelihood (all logs base $e$ here) for model $M_0$ is -33999445.1 and for model $M_1$ is -33998325.7. Because both these numbers are very big (in absolute terms) it is tempting to view the difference (1119.4) as not very big relative to these big numbers. But remember that the actual log-likelihoods themselves are irrelevant! It is only the logLR, or the difference in the log-likelihoods, that matters. So here the logLR is 1119.4 and the data support $M_1$ over $M_0$ by a factor of more than $\exp(1000)$.
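The arithmetic for the two log-likelihoods quoted above can be checked in R. The sketch also shows why it is convenient to stay on the log scale: the LR itself is too large to represent as a double.

```r
loglik_M0 <- -33999445.1
loglik_M1 <- -33998325.7

# Log-likelihood ratio for M1 vs M0 (natural logs)
log_lr <- loglik_M1 - loglik_M0

# Same quantity on the base-10 scale, to read off the order of magnitude
log10_lr <- log_lr / log(10)

print(log_lr)
print(log10_lr)

# exp(log_lr) exceeds the largest representable double, so R returns Inf
print(exp(log_lr))
```

The base-10 value lies between 486 and 487, so the LR is a number with several hundred digits, whatever the size of the individual log-likelihoods.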

Only compare likelihoods for the same data!

Remember that the likelihood ratio is computed for two different models on the same data. It must really be exactly the same data. In the continuous case this means it can’t even be a 1-1 transform of the same data, because transforming a continuous variable changes its density by a Jacobian factor. It has to be the same data.

For example, suppose you observe data $x_1,\dots,x_n$, and you want to compare the models $M_0: x_1,\dots,x_n$ are normally distributed vs $M_1: \log(x_1),\dots,\log(x_n)$ are normally distributed. You have to rephrase $M_1$ in terms of the original $x_j$: that is, $M_1: x_1,\dots,x_n$ are log-normally distributed.
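A minimal R sketch of this pitfall follows. The data are simulated, and the parameter values for both models are hypothetical, fixed only so that the likelihoods can be evaluated; none of these numbers come from the text.

```r
set.seed(1)
x <- rlnorm(20, meanlog = 0, sdlog = 1)   # simulated positive data, illustration only

# M0: x_j ~ Normal(mean = 1.5, sd = 2)   (hypothetical fixed parameters)
loglik_M0 <- sum(dnorm(x, mean = 1.5, sd = 2, log = TRUE))

# M1: log(x_j) ~ Normal(0, 1), rephrased as x_j being log-normal,
# so the density is evaluated on the original x scale
loglik_M1 <- sum(dlnorm(x, meanlog = 0, sdlog = 1, log = TRUE))

# Pitfall: this is the log-likelihood of log(x), not of x,
# so it must not be compared with loglik_M0
loglik_M1_wrong <- sum(dnorm(log(x), mean = 0, sd = 1, log = TRUE))

# Valid log-likelihood ratio: both terms are computed for the same data x
log_lr <- loglik_M1 - loglik_M0

# The two versions of M1 differ by the Jacobian term -sum(log(x))
jacobian_term <- loglik_M1 - loglik_M1_wrong

print(log_lr)
print(jacobian_term)
```

The difference between `loglik_M1` and `loglik_M1_wrong` depends on the data, so using the wrong one would change the LR by an amount unrelated to how well either model fits.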

Here is an extended version of this example.

Dealing with missing data; and the missing at random assumption

Consider the tusk example, and suppose now that at marker 1 the DNA assay failed, and so the data are “missing”. How does this impact the LR?

The trick here is to note that “the data are missing” is really an “observation”. The likelihood ratio for an observation is the ratio of the probability of that observation under the two models, so the LR for this marker alone for $M_S$ vs $M_F$ is $LR(M_S,M_F) = \Pr(\text{data missing} | M_S)/\Pr(\text{data missing} | M_F).$

If the probability of getting missing data is the same for both models then the LR is 1 (and we don’t actually have to worry about what that probability of getting missing data is).

On the other hand, it is conceivable that missing data occurs more commonly in one group than in another, for one reason or another. In this case the LR for a missing observation could be something other than 1. This is called “informative missingness”, and to compute the LR our models would have to explicitly incorporate probabilities for observations to be missing.
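The two cases can be sketched in R as follows. The missingness probabilities are hypothetical values picked for illustration, and `missing_lr` is a helper name introduced only for this sketch.

```r
# LR contribution of a "missing" observation at one marker, for M_S vs M_F
missing_lr <- function(p_missing_S, p_missing_F) p_missing_S / p_missing_F

# Same missingness probability under both models: the marker is uninformative
lr_same_rate <- missing_lr(0.05, 0.05)

# Hypothetical informative missingness: the assay fails twice as often under M_S
lr_informative <- missing_lr(0.10, 0.05)

print(lr_same_rate)
print(lr_informative)
```

In the first case the LR is 1 whatever the shared probability is, which is why that probability never needs to be specified. In the second case the missing observation itself shifts the LR, so the models must state the missingness probabilities explicitly.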

Likelihood Ratios: examples and pitfalls

Matthew Stephens

2016-01-11