  • Pre-requisites
  • Overview
  • Infinite and zero LRs
  • Avoid focussing on the likelihood itself: only ratios matter
  • Only compare likelihoods for the same data!
  • Dealing with missing data; and the missing at random assumption

Last updated: 2019-03-31

These are the previous versions of the R Markdown and HTML files.

File Version Author Date Message
html 34bcc51 John Blischak 2017-03-06 Build site.
Rmd 5fbc8b5 John Blischak 2017-03-06 Update workflowr project with wflow_update (version 0.4.0).
Rmd 391ba3c John Blischak 2017-03-06 Remove front and end matter of non-standard templates.
html fb0f6e3 stephens999 2017-03-03 Merge pull request #33 from mdavy86/f/review
html 0713277 stephens999 2017-03-03 Merge pull request #31 from mdavy86/f/review
Rmd d674141 Marcus Davy 2017-02-27 typos, refs
html c3b365a John Blischak 2017-01-02 Build site.
Rmd 67a8575 John Blischak 2017-01-02 Use external chunk to set knitr chunk options.
Rmd 5ec12c7 John Blischak 2017-01-02 Use session-info chunk.
Rmd ae24830 stephens999 2016-09-04 add example of computing LR on different data
Rmd d6f4bea stephens999 2016-01-12 add examples

Pre-requisites

Overview

The aim here is to give some simple (somewhat artificial) examples to illustrate the idea of a likelihood ratio, and to mention some pitfalls to be avoided.

Infinite and zero LRs

Suppose you are throwing a six-sided die with sides marked 1,2,3,4,5 and 6. Consider comparing the models M0: the die is fair (i.e. each face has probability 1/6) vs M1: the die is loaded, and will always land 6.

If we observe a “6” then the likelihood ratio for M1 vs M0 is 1/(1/6) = 6.

If we observe any other number then the likelihood ratio for M1 vs M0 is 0/(1/6) = 0.

Note that LR=0 in the latter case because the data are impossible under M1. Indeed, LR(M1,M0)=0 if and only if the data are impossible under M1 (while being possible under M0), and so LR=0 means that the data exclude M1. Note also that in this case the LR for M0 vs M1 is infinite. However, in general strong support for M0 vs M1 does not imply that M0 is “true”, or even a good model. It only implies that M0 is favored over M1. There could always be other models that explain the data much better than M0!
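The die example can be checked numerically. The following is a minimal sketch in R; the helper name lr_M1_vs_M0 is hypothetical and introduced only for illustration.

```r
# Probability of each face (1..6) under the two models from the text.
p_M0 <- rep(1/6, 6)            # M0: fair die
p_M1 <- c(0, 0, 0, 0, 0, 1)    # M1: loaded die that always lands 6

# Hypothetical helper: LR for M1 vs M0 given a single observed face.
lr_M1_vs_M0 <- function(face) p_M1[face] / p_M0[face]

lr_M1_vs_M0(6)       # observing a "6" gives LR = 6 in favour of M1
lr_M1_vs_M0(3)       # any other face gives LR = 0: impossible under M1
1 / lr_M1_vs_M0(3)   # the LR for M0 vs M1 is then Inf
```

An LR of 6 is modest support for M1, whereas a single non-6 observation rules M1 out entirely.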

Avoid focussing on the likelihood itself: only ratios matter

Suppose we toss a coin 100 times, and get 50 Heads and 50 Tails (in some order). If the coin is fair (i.e. 50% chance of landing heads, independently for each toss), then the probability of any given sequence with 50 heads and 50 tails is 0.5^100. That is, given these data, the likelihood for the model M “the coin is fair” is 0.5^100.

Is this a big likelihood or a small likelihood? The point here is that this is not really a meaningful question. Although the number 0.5^100 is, in most contexts, “small”, in this context it would be wrong to call this a “small” likelihood. Indeed, the data are entirely consistent with the model!

Don’t focus on likelihoods: focus on likelihood ratios.
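To make this concrete, here is a small sketch in R that compares the fair-coin model with one alternative. The alternative (probability of heads 0.6) and the helper name seq_lik are assumptions made for illustration; they are not part of the original example.

```r
n_heads <- 50
n_tails <- 50

# Hypothetical helper: probability of one specific sequence of tosses
# when each toss lands heads independently with probability p.
seq_lik <- function(p) p^n_heads * (1 - p)^n_tails

lik_fair   <- seq_lik(0.5)   # equals 0.5^100
lik_biased <- seq_lik(0.6)   # assumed alternative model, not from the text

lik_fair                # tiny in absolute terms
lik_fair / lik_biased   # LR > 1: the data favour the fair coin over p = 0.6
```

Both likelihoods are minuscule, yet their ratio is an ordinary-sized number that tells us which model the data support.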

Similarly, when looking at log-likelihoods, it is the difference between log-likelihoods (i.e. the log-likelihood ratio) that matters, not the actual log-likelihoods. For example, suppose the log-likelihood (all logs base e here) for model M0 is -33999445.1 and for model M1 is -33998325.7. Because both these numbers are very big (in absolute terms), it is tempting to view the difference (1119.4) as not very big relative to these big numbers. But remember that the actual log-likelihoods themselves are irrelevant! It is only the logLR, or the difference in the log-likelihoods, that matters. So here the logLR is 1119.4, and the data support model M1 over M0 by a factor of more than exp(1000).
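The arithmetic can be reproduced in R. This sketch also illustrates a practical reason to stay on the log scale: an LR this large cannot be represented as a double-precision number.

```r
loglik_M0 <- -33999445.1
loglik_M1 <- -33998325.7

log_lr <- loglik_M1 - loglik_M0   # logLR for M1 vs M0, approximately 1119.4
log_lr

exp(log_lr)          # overflows to Inf in double precision
log_lr / log(10)     # the same logLR expressed in base 10
```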

Only compare likelihoods for the same data!

Remember that the likelihood ratio is computed for two different models on the same data. It must really be exactly the same data. In the continuous case this means it can’t even be a 1-1 transform of the same data - it has to be the same data.

For example, suppose you observe data x1, ..., xn, and you want to compare the models M0: x1, ..., xn are normally distributed vs M1: log(x1), ..., log(xn) are normally distributed. You have to rephrase M1 in terms of the original xj: that is, M1: x1, ..., xn are log-normally distributed.
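The pitfall can be seen numerically. This is a sketch in R under stated assumptions: the data are simulated, and the parameter values of both models are fixed arbitrarily so that each model is fully specified. None of these values come from the original example.

```r
set.seed(1)
x <- rlnorm(100, meanlog = 0, sdlog = 1)   # simulated positive data, illustration only

# Both models are written as densities for the same data x.
# Parameter values are assumptions of this sketch.
loglik_M0 <- sum(dnorm(x, mean = 1.5, sd = 2, log = TRUE))        # M0: x is normal
loglik_M1 <- sum(dlnorm(x, meanlog = 0, sdlog = 1, log = TRUE))   # M1: x is log-normal

# Incorrect version of M1: a density for log(x), i.e. for different data.
loglik_wrong <- sum(dnorm(log(x), mean = 0, sd = 1, log = TRUE))

loglik_M1 - loglik_M0      # valid logLR: both terms are densities of x
loglik_wrong - loglik_M1   # equals sum(log(x)), the Jacobian term that was dropped
```

The incorrect version differs from the correct log-likelihood by sum(log(x)), which is exactly the change-of-variables term that arises when moving from the density of log(x) to the density of x.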

Here is an extended version of this example.

Dealing with missing data; and the missing at random assumption

Consider the tusk example, and suppose now that at marker 1 the DNA assay failed, and so the data are “missing”. How does this impact the LR?

The trick here is to note that “the data are missing” is really an “observation”. The likelihood ratio for an observation is the ratio of the probabilities of that observation under the two models, so the LR for this marker alone for MS vs MF is LR(MS,MF) = Pr(data are missing | MS) / Pr(data are missing | MF).

If the probability of getting missing data is the same for both models then the LR is 1 (and we don’t actually have to worry about what that probability of getting missing data is).

On the other hand, it is conceivable that missing data occurs more commonly in one group than in another, for one reason or another. In this case the LR for a missing observation could be something other than 1. This is called “informative missingness”, and to compute the LR our models would have to explicitly incorporate probabilities for observations to be missing.
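The two cases can be sketched in R. The missingness probabilities below are made-up values chosen only for illustration, and lr_missing is a hypothetical helper name.

```r
# Hypothetical helper: LR for MS vs MF based on the single observation
# "the data at this marker are missing".
lr_missing <- function(p_miss_S, p_miss_F) p_miss_S / p_miss_F

# Same missingness probability under both models: the LR is 1,
# whatever that common probability is.
lr_missing(0.1, 0.1)
lr_missing(0.3, 0.3)

# Informative missingness (assumed probabilities): the LR differs from 1.
lr_missing(0.02, 0.1)
```

When the LR for the missing marker is 1, multiplying it into the overall LR across markers changes nothing, which is why the marker can simply be dropped in that case.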


