Pre-requisite

You should know what a $p$ value is.

Background

A key problem with $p$ values, when testing null hypotheses, is that they can be difficult to calibrate. That is, it is hard to answer the question: “If I get a $p$ value of 0.01 (or any other number), how strong is the evidence against the null hypothesis?”

Example

Here we give a simple (but artificial) example of a test in which a $p$ value of 0.01 actually corresponds to evidence for the null, even though 0.01 is usually considered to be strong evidence against the null. (This example is modified from the book Bayesian Analysis, by J Berger, p25.)

Suppose $x \in \{1,2,3\}$ and $\theta \in \{0,1\}$ with

| $x$ | 1 | 2 | 3 |
|---|---|---|---|
| $p(x \mid \theta=0)$ | 0.005 | 0.005 | 0.99 |
| $p(x \mid \theta=1)$ | 0.999 | 0.001 | 0 |

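Write $H_0$ for the null hypothesis $\theta=0$ and $H_1$ for the alternative $\theta=1$. Note that the likelihood ratios for $H_1$ vs $H_0$, that is $p(x \mid \theta=1)/p(x \mid \theta=0)$, for $x=1,2,3$ are $999/5$, $1/5$ and $0$ respectively. So as $x$ increases, the evidence against $H_0$ decreases.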
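As a quick check of these ratios, here is a minimal Python sketch that stores the two rows of the table as exact fractions and divides them. The names `p_x_given_h0`, `p_x_given_h1` and `likelihood_ratio` are illustrative choices for this sketch, not part of the original example.

```python
from fractions import Fraction

# Sampling distributions from the table, stored as exact fractions
# so that the ratios come out as 999/5, 1/5 and 0 without rounding.
p_x_given_h0 = {1: Fraction(5, 1000), 2: Fraction(5, 1000), 3: Fraction(990, 1000)}
p_x_given_h1 = {1: Fraction(999, 1000), 2: Fraction(1, 1000), 3: Fraction(0)}


def likelihood_ratio(x):
    """Likelihood ratio for H1 vs H0 at outcome x: p(x | theta=1) / p(x | theta=0)."""
    # Well defined here because p(x | theta=0) > 0 for every x in the table.
    return p_x_given_h1[x] / p_x_given_h0[x]


likelihood_ratios = {x: likelihood_ratio(x) for x in (1, 2, 3)}

for x, lr in likelihood_ratios.items():
    print(f"x = {x}: likelihood ratio (H1 vs H0) = {lr}")
```

Only $x=1$ gives a ratio above 1, so it is the only outcome that favours $H_1$ over $H_0$.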

Now, let us suppose that we observe $x=2$. Then, by definition, the $p$ value for this observation is \[p:= \Pr(\text{we would see evidence as strong or stronger against $H_0$ than $x=2$} \mid \theta=0).\]

Here “evidence as strong or stronger against $H_0$ than $x=2$” is $x \in \{1,2\}$. And the probability of this under $H_0$ is \[\Pr(x \in \{1,2\} | H_0) = 0.005+0.005 = 0.01.\]

So the $p$ value for $x=2$ is 0.01.

And yet, the observation $x=2$ is 5 times more probable under $H_0$ than under $H_1$! So $x=2$ has $p$ value 0.01 but is actually evidence for $H_0$.
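The two numbers in this argument can be reproduced side by side with a short Python sketch. It assumes, as the example does implicitly, that outcomes are ranked as evidence against $H_0$ by their likelihood ratio for $H_1$ vs $H_0$. The function names `likelihood_ratio` and `p_value` are hypothetical helpers written for this illustration.

```python
from fractions import Fraction

# Sampling distributions from the table, as exact fractions.
p_x_given_h0 = {1: Fraction(5, 1000), 2: Fraction(5, 1000), 3: Fraction(990, 1000)}
p_x_given_h1 = {1: Fraction(999, 1000), 2: Fraction(1, 1000), 3: Fraction(0)}


def likelihood_ratio(x):
    """Likelihood ratio for H1 vs H0 at outcome x."""
    # Well defined here because p(x | H0) > 0 for every x in the table.
    return p_x_given_h1[x] / p_x_given_h0[x]


def p_value(x_obs):
    """Probability under H0 of an outcome at least as extreme as x_obs.

    Assumption of this sketch: "at least as extreme" means a likelihood
    ratio (H1 vs H0) at least as large as the one observed.
    """
    threshold = likelihood_ratio(x_obs)
    at_least_as_extreme = [x for x in p_x_given_h0 if likelihood_ratio(x) >= threshold]
    return sum(p_x_given_h0[x] for x in at_least_as_extreme)


p_at_2 = p_value(2)
h0_over_h1_at_2 = p_x_given_h0[2] / p_x_given_h1[2]

print(f"p value for x = 2: {float(p_at_2)}")
print(f"p(x=2 | H0) / p(x=2 | H1): {float(h0_over_h1_at_2)}")
```

The $p$ value sums probabilities over the set $\{1,2\}$, whereas the likelihood ratio uses only the observed outcome $x=2$. That difference is why the two measures can point in opposite directions here.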

Discussion

This example is obviously contrived to make a point, so it demonstrates only that it is possible to construct a situation where $p=0.01$ corresponds to evidence for $H_0$.

However, given this it seems natural to ask: in “typical” situations, does $p=0.01$ correspond to evidence for or against $H_0$? Of course, the answer to this depends on what one views as “typical”. For a start towards answering this question see here.

Session information

sessionInfo()

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10         highr_0.6            git2r_0.18.0        
 [4] BiocInstaller_1.24.0 workflowr_0.4.0      bitops_1.0-6        
 [7] iterators_1.0.8      tools_3.3.2          digest_0.6.12       
[10] evaluate_0.10        lattice_0.20-34      Matrix_1.2-8        
[13] foreach_1.4.3        graph_1.52.0         BiocCheck_1.10.1    
[16] yaml_2.1.14          parallel_3.3.2       httr_1.2.1          
[19] stringr_1.2.0        knitr_1.15.1         REBayes_0.73        
[22] stats4_3.3.2         rprojroot_1.2        grid_3.3.2          
[25] getopt_1.20.0        optparse_1.3.2       Biobase_2.34.0      
[28] R6_2.2.0             XML_3.98-1.5         RBGL_1.50.0         
[31] rmarkdown_1.4        ashr_2.1-10          magrittr_1.5        
[34] backports_1.0.5      codetools_0.2-15     htmltools_0.3.5     
[37] biocViews_1.42.0     MASS_7.3-45          BiocGenerics_0.20.0 
[40] RUnit_0.4.31         assertthat_0.2.0     stringi_1.1.2       
[43] RCurl_1.95-4.8       pscl_1.4.9           doParallel_1.0.10   
[46] truncnorm_1.0-7      SQUAREM_2016.8-2


Example of difficulty of calibrating p values

Matthew Stephens

2017-04-17
