The “Multiple Comparisons Problem” is the problem that standard statistical procedures can be misleading when researchers conduct a large group of hypothesis tests. When a researcher does more than one test of a hypothesis (or set of closely related hypotheses), the chances are that some finding will appear “significant” even when there’s nothing going on.
Classical hypothesis tests assess statistical significance by calculating the probability under a null hypothesis of obtaining estimates as large or larger as the observed estimate. When multiple tests are conducted, however, classical pvalues can mislead — they no longer reflect the true probability under the null.
This guide ^{1} will help you guard against drawing false conclusions from your experiments. We focus on the big ideas and provide examples and tools that you can use in R.
Let’s set up 1 test: we have 1 coin and we will flip it 10 times. We don’t know if the coin is fair but decide that if we flip over 9 heads and 1 tails, we will say that the coin is unfair. Let’s suppose that this is a fair coin; a coin that has a 50% chance of showing a head in a single flip. We are therefore not likely to flip over 9 heads and 1 tails and will most likely find that the coin is fair.
Now, let’s set up 5 tests: we have 5 coins and we will flip each of them 10 times. Again, we decide that the coin is unfair if we flip 9 heads and 1 tails; again, let’s suppose that all 5 coins are fair. The chance that we see 9 heads and 1 tails for a single coin is still very low. However, after doing 5 tests, it is much more likely that at least 1 of the coins will show 9 heads and 1 tails purely by chance. This result may then lead us to think that some or all of the coins are unfair, even though in reality all the coins are fair.
We can see from the coin flipping example that the more tests we run, the more likely we are to see an effect or relationship and the more likely we are to mistakenly claim to have detected an effect when in reality there is no effect and what we observe occurred by chance; this is the multiple comparisons problem.
In this guide, we will describe three main approaches for addressing the multiple comparisons problem:
In classical hypothesis testing, the “\(\alpha\) level” describes how willing the researcher is to make a certain kind of mistake: a false positive or a “Type I error” where a researcher falsely concludes that an observed difference is “real,” when in fact there is no difference. From the coin flipping example, concluding that we have an unfair coin because we flipped 9 heads and 1 tails when in reality we have fair coins is an example of a Type I error or false positive error. After setting the \(\alpha\) level, the researcher conducts a hypothesis test and if the pvalue \(\leq \alpha\), we call the result “statistically significant”. In many social science applications, the alpha level, or Type I error rate, is set to 0.05. This means that the researcher is willing to commit a Type I error 5% of the time.
In the world of multiple testing, the Type I error rate is associated with one of two things: the FamilyWise Error Rate (FWER) or the False Discovery Rate (FDR). To explain their differences, let’s first consider Table 1 which shows different types of errors. After conducting our hypothesis tests, we observe \(m\) (the total number of hypothesis tests), \(R\) (the total number of hypothesis tests that are statistically significant), \(mR\), (the total number of hypothesis tests that are not statistically significant). However, we do not know how many tests from \(R\) are false positives (FP, Type I error) or true positives (TP).
Fail to reject null hypothesis (p > 0.05) 
Reject null hypothesis (p \(\leq\) 0.05) 
Total Hypotheses  

Null hypothesis is true  TN (True Negative) 
FP (Type I error, False Positive) 
\(m_0\) 
Null hypothesis is false  FN (Type II error, False Negative) 
TP (True Positive) 
\(m  m_0\) 
Total Hypotheses  \(m  R\)  \(R\)  \(m\) 
The FWER is the probability of incorrectly rejecting even one null hypothesis, or \(P(FP \geq 1)\) across all of our tests. Suppose we have three null hypotheses, all of which are true. When the null hypothesis is true, but we nevertheless reject it in favor of some alternative, we commit a Type I error. If we set \(\alpha\) (the Type I error rate) to be 0.05, we have a [\(1−(1−0.05)^3=0.142\)] chance of rejecting at least one of them. In order to control the FWER (i.e., reduce it from 14.2% back down to 5%), we need to employ a correction. We’ll explore three ways to control the FWER (Bonferroni, Holm, and simulation) in the sections below.
The FDR is subtly different. It is the expected proportion of false discoveries among all discoveries, or \(E[FP/R]\)^{2}. In the case where no discoveries are found (\(R=0\)), then the “share” of false discoveries is taken to be zero. There are some connections to the FWER. For example, if in fact all null hypotheses are true then the FWER and FDR are the same. To see this, note that in this case if no significant rejections of the null are found, then the share that are false discoveries is zero. But if some are found (no matter how many), then the share that are false is 100% (since all null hypotheses are true). So the false discovery rate in this case is just the probability that some true null hypothesis is falsely rejected—the FWER. Beyond this case however the FDR is less stringent than the FWER. We’ll also explore some ways to control FDR in the sections below.
The Bonferroni correction is a simple and commonlyused approach for addressing the multiple comparisons problem although it usually understates how much information is available to detect an effect. If you conduct \(m\) tests but want to make sure that you make no more than \(\alpha\) errors out of the total \(m\) tests, the target significance level should be \(\alpha/m\), or, equivalently, you multiply your pvalues by \(m\), and apply the standard \(\alpha\) level. (The trouble with multiplying the pvalues is sometimes you end up with values over one, rendering the interpretation of the pvalues incoherent so software just replaces those pvalues with the number 1.)
For example, suppose you conduct an experiment that has 3 dependent variables. You conduct three differenceinmeans tests that yield the following classical pvalues: 0.004, 0.020, and 0.122. If your \(\alpha\) level is the standard 0.05 threshold, then you would usually declare the first two tests statistically significant and the last test insignificant. The Bonferroni correction, however, adjusts the target pvalue to \(0.05/3 = 0.016\). We then declare only the first test to be statistically significant.
The Bonferroni correction works under the most extreme circumstances, that is, when all \(m\) tests are independent from one another. To see how this works, imagine we are testing three true null hypotheses using a classical \(\alpha\) level of 0.05. Each test, therefore has a 5% chance of yielding the wrong answer that the null hypothesis is false.
But our chances of making at least one mistake are much greater than 5% because we have three chances to get it wrong. As above, this probability is in fact [\(1  (1  0.05)^3 = 0.142\)]. If we use the Bonferroni correction, however, our chances of getting it wrong fall back to our target \(\alpha\) value: [\(1  (1  0.05/3)^3 \approx 0.05\)] .
This correction works in the worstcase scenario that all tests are independent. But in most cases, tests are not independent. That is, if your treatment moves outcome A, it probably moves outcome B too, at least a little. So what tends to happen is, researchers report that their results “withstand” Bonferroni when they are extremely strong, but decry Bonferroni as too extreme when the results are weaker.
Instead of using the Bonferroni correction, you can use the Holm correction. It is strictly more powerful than Bonferroni, and is valid under the same assumptions. It also controls the FWER. Suppose you have \(m\) pvalues. Order them from smallest to largest. Find the smallest pvalue that satisfies this condition: \(p_{k}>\frac{\alpha}{m+1−k}\), where \(k\) is the pvalue’s index. This and all larger pvalues are insignificant; all smaller pvalues are significant.
Taking our three pvalues from above: 0.004, 0.020, and 0.122: \[0.004<\frac{0.05}{3+1−1}=0.017\] \[0.020<\frac{0.05}{3+1−2}=0.025\] \[0.122>\frac{0.05}{3+1−3}=0.050\]
Under the Holm correction, the first two tests are significant, but the last test is not.
The Benjamini–Hochberg (BH) procedure controls the FDR. Like the Holm correction, you also begin by ordering \(m\) pvalues. Then you find the largest pvalue that satisfies: \(p_{k}≤\frac{k}{m}\alpha\). This test, and all tests with smaller pvalues are declared significant.
\[0.004<\frac{1}{3}0.05=0.017\] \[0.020<\frac{2}{3}0.05=0.033\] \[0.122>\frac{3}{3}0.05=0.050\]
Using the Benjamini–Hochberg procedure, the first two tests are significant, but the third is not. Notice that the FDR does not control the same error rate as the FWER. It is common for those controlling the FDR error rate to take “significant tests” to indicate places to direct more attention for confirmatory tests (which, would then be analyzed using the stricter FWER control). The idea that one uses the FDR to explore and FWER to confirm is also common in genetics and other applications of large scale hypothesis testing.
In R, the p.adjust()
function contains many of the
corrections devised by statisticians to address the multiple comparisons
problem. The p.adjust()
function is in base R, so no
additional packages are required. The p.adjust()
function
gives adjusted pvalues after implementing a correction.
# Set seed for reproducibility
set.seed(343)
# Generate 50 test statistics
# Half are drawn from a normal with mean 0
# The other half are drawn from a normal with mean 3
x < rnorm(50, mean = c(rep(0, 25), rep(3, 25)))
# Obtain 50 pvalues
p < round(2*pnorm(sort(abs(x))), 3)
# Choose alpha level
alpha < 0.05
# Without any corrections
sig < p < alpha
# Conduct three corrections
# and compare to target alpha
bonferroni_sig < p.adjust(p, "bonferroni") < alpha
holm_sig < p.adjust(p, "holm") < alpha
BH_sig < p.adjust(p, "BH") <alpha
The results of this simulation are presented in the table and figure below.
Correction Type  No Correction  BenjaminiHochberg  Holm  Bonferroni 

Statistically Significant  25  22  11  8 
Not Statistically Significant  25  28  39  42 
Of the 25 null hypotheses that would be rejected if no correction were made, the Bonferroni correction only rejects 8, the Holm procedure rejects 11, and the Benjamini–Hochberg procedure rejects 22 (or tags 22 hypotheses as promising for future exploration). Of these three corrections, Bonferroni is the most stringent while Benjamini–Hochberg is the most lenient.
Instead of R, you can also use this calculator to adjust your pvalues.
The trouble with the corrections above is that they struggle to address the extent to which the multiple comparisons are correlated with one another. A straightforward method of addressing this problem is simulation under the sharp null hypothesis of no effect for any unit on any dependent variable. Note that this is a familywise sharp null.
If the treatment has no effect at all on any outcome, then we observe all potential outcomes for all subjects. We can rerandomize the experiment 1000 or more times and conduct all \(m\) hypothesis tests each time. We know for sure that all \(m\) null hypotheses are true, because the treatment has no effect by construction.
The next step is picking the right threshold value below which results are deemed statistically significant. If \(\alpha\) is 0.05, we need to find the target pvalue that, across all simulations under the sharp null, yields 5% significant hypothesis tests.
Once we have the right threshold value, it’s as easy as comparing the uncorrected pvalues to the threshold value — those below the threshold are deemed significant.
# Control the FWER through simulation
rm(list=ls())
library(mvtnorm)
library(randomizr)
# Helper functions
do_t_test < function(Y, Z){
t.test(Y[Z==1], Y[Z==0])$p.value
}
permute_treat < function(){
treatment_sim < complete_ra(n, m=n/2)
ps_sim < apply(outcomes, 2, do_t_test, Z = treatment_sim)
return(ps_sim)
}
threshold_finder< function(threshold){
mean(apply(many_ps, 2, x < function(x) sum(x <= threshold) > 0 ))
}
# Set a seed
set.seed(343)
# Generate correlated outcomes
# Outcomes are unrelated to treatment
# All null hypotheses are true
n < 1000
k < 100; r < .7; s < 1
sigma < matrix(s*r, k,k)
diag(sigma) < s
outcomes < rmvnorm(n=n, mean=rep(0, k), sigma=sigma)
# Complete Random Assignment
treatment < complete_ra(n, m=n/2)
# Conduct k hypothesis tests
p_obs < apply(outcomes, 2, do_t_test, Z = treatment)
# Simulate under the sharp null
many_ps < replicate(1000, permute_treat(), simplify = TRUE)
# Obtain the Type I error rate for a series of thresholds
thresholds < seq(0, 0.05, length.out = 1000)
type_I_rate < sapply(thresholds, threshold_finder)
# Find the largest threshold that yields an alpha type I error rate
target_p_value < thresholds[max(which(type_I_rate <=0.05))]
# Apply target p_value to observed p_values
sig_simulated < p_obs <= target_p_value
# Compare to raw pvalues
sig < p_obs <= 0.05
The target pvalue obtained by the simulation is 0.002 — hypothesis tests with raw pvalues below 0.002 are deemed significant. Compare this with the Bonferroni method, which would require a pvalue below 0.05/100 = 0.0005, an order of magnitude smaller. The closer the correlation of the tests (the parameter “r” in the code above) is to zero, the closer the two methods will be.
The flexibility of the simulation method is both an advantage and a disadvantage. The advantage is that it can accommodate any set of testing procedures, returning a studyspecific correction that will generally be more powerful than other methods to control the FWER. The disadvantage is that it requires the researcher to code up a simulation — there are no prewritten functions that will apply across research contexts.
Here are some guidelines and tips for writing your own simulation.
Recall from earlier that the BenjaminiHochberg procedure to control FDR can also be too conservative when all null hypotheses are true. There are a number of different variants of the FDR and with each variant, their own procedures for control. In general, these are more powerful than the BH procedure. Below is a table of a few types:
FDR Variants  Methods for Control  Software 

pFDR: positive FDR, \(E[V/RR>0]\)  The StoreyBH procedure sets a rejection area and estimates the corresponding error rate. This contrasts the BH procedure sets an error rate \(\alpha\) and estimates its rejection area with the adjustments (Storey, 2002).  See R package
qvalue 
mFDR or Fdr: marginal FDR, \(E[FP]/E[R]\).  \(\alpha\)investing, a procedure where, after rejecting a null hypothesis, the researcher can “invest” in the \(\alpha\) threshold by increasing the threshold for subsequent tests (Foster and Stine, 2008).  See R package
onlineFDR

fdr: Local false discovery rate, \(fdr(z)\)  An empirical Bayes approach to estimate local false discovery rate as a function of the size of the test statistic \(z\) (Efron, 2004).  See R pacakge
locfdr 
There is the increased chance of accepting more false discoveries with these other methods over the basic BH approach. However, in cases with many, many tests, where the researcher might be willing to accept a few more false discoveries along with making more true discoveries, these methods for controlling FDR are good alternatives.
Suppose^{3} a researcher measures \(k>1\) dependent variables. Indexing allows the researcher to reduce these \(k\) outcomes into a single measure (or several thematicallygrouped measures). These indexing methods effectively condense the number of dependent variables that investigators test in order to address the multiple comparisons problem. There are a number of ancillary benefits of these indexing methods:
There are two principal indexing methods in the literature:
Kling, Liebman, and Katz (2004) employ a mean effects index, constructed as follows:
\[\tilde{y}_{ik}= \frac{y_{ik} \bar{y}_k^{Z=0}}{\sigma_{k}^{y,Z=0}}\]
Sum the \(z\)scores, \(\sum_{i=1}^K \tilde{y}_{ik}\) (optionally) divide by \(K\) to generate the index.
Optional: It may be desirable to normalize the final index by the control group mean and standard deviation.
In the presence of missing outcomes, one of two approaches could be employed:
Anderson (2008) provides a similar approach that constructs an index that employs inverse covariance weighting. This weighting scheme improves efficiency relative to the mean effects index above by affording less weight to highly correlated outcomes. The Anderson index can be constructed through the following procedure:
\[\tilde{y}_{ik}= \frac{y_{ik} \bar{y}_k^{Z=0}}{\sigma_{k}^{y,Z=0}}\]
Construct and invert the (variance)covariance matrix of the resultant matrix of \(z\)scores calculated in step 2. Call this \(k \times k\) inverted (variance)covariance matrix \(\hat{\boldsymbol{\Sigma}}^{1}\).
The weighted indexed outcome, \(\bar{s}_i\) can be estimated via the following procedure, where \(\textbf{1}\) is a \(k \times 1\) vector of ones and \(\textbf{y}_{ik}\) is the \(n \times k\) matrix of \(z\)scores calculated in step 2.
\[\bar{s}_i = (\textbf{1}^T \hat{\boldsymbol{\Sigma}}^{1} \textbf{1})^{1}(\textbf{1}^T \hat{\boldsymbol{\Sigma}}^{1} \textbf{y}_{ik})\]
As with the mean effects index, this varible \(\bar{s}_i\) the serves as the dependent variable in your analysis. One potential drawback to the inverse covariance weighting index is that there is no guarantee that elements in the inverted covariance matrix (\(\boldsymbol{\Sigma}^{1}\)) are positive. As such, it is possible to generate negative weights using this indexing method. Given that outcomes are oriented in the same direction, a negative weight effectively reverses the direction of the effect on negativelyweighted outcomes in the construction of the index.
The following functions implement both the mean effects and inverse covariance weighted index methods and evaluate both functions on a DGP with 50 outcome measures:
stopifnot(require(mvtnorm))
stopifnot(require(dplyr))
stopifnot(require(randomizr))
stopifnot(require(ggplot2))
set.seed(1234)
calculate_mean_effects_index < function(Z, outcome_mat, to_reorient, reorient = FALSE, greedy = TRUE,
impute = FALSE){
if(length(Z) != nrow(outcome_mat)) stop("Error: Treatment assignment, outcome matrix require same n!")
if(impute == TRUE){
R < 1 * is.na(outcome_mat)
means_for_imputation < rbind(apply(outcome_mat[Z==0,], MAR = 2, FUN = mean, na.rm = T),
apply(outcome_mat[Z==1,], MAR = 2, FUN = mean, na.rm = T))
to_impute < R * means_for_imputation[Z+1,]
outcome_mat[is.na(outcome_mat)] < 0
outcome_mat < outcome_mat + to_impute
}
c_mean < apply(X = outcome_mat[Z==0,], MARGIN = 2, FUN = mean, na.rm = T)
c_sd < apply(X = outcome_mat[Z==0,], MARGIN = 2, FUN = sd, na.rm = T)
z_score < t(t(sweep(outcome_mat, 2, c_mean))/ c_sd)
index_numerator < rowSums(z_score)
if(greedy == TRUE){
n_outcomes < rowSums(!is.na(z_score))
}
else if(greedy == FALSE){
n_outcomes < ncol(outcome_mat)
}
index < index_numerator/n_outcomes
index < (index  mean(index[Z==0], na.rm =T))/sd(index[Z==0], na.rm =T)
return(index)
}
calculate_inverse_covariance_weighted_index < function(Z, outcome_mat, to_reorient, reorient = FALSE){
if(length(Z) != nrow(outcome_mat)) stop("Error: Treatment assignment, outcome matrix require same n!")
if(reorient == TRUE){
outcome_mat[, c(to_reorient)] < outcome_mat[, c(to_reorient)]
}
c_mean < apply(X = outcome_mat[Z==0,], MARGIN = 2, FUN = mean, na.rm = T)
c_sd < apply(X = outcome_mat[Z==0,], MARGIN = 2, FUN = sd, na.rm = T)
z_score < t(t(sweep(outcome_mat, 2, c_mean))/ c_sd)
Sigma_hat < solve(cov(z_score, y = z_score, use = "complete.obs"))
one_vec < as.vector(rep(1, ncol(outcome_mat)))
if(sum(is.na(outcome_mat))>0){
z_score[is.na(z_score)] < 0
}
w_ij < t(solve(t(one_vec) %*% Sigma_hat %*% one_vec) %*% (t(one_vec) %*% Sigma_hat))
if(sum(w_ij < 0) > 0){warning('Warning, at least one weight is negative!')}
s_ij < t(solve(t(one_vec) %*% Sigma_hat %*% one_vec) %*% (t(one_vec) %*% Sigma_hat %*% t(z_score)))
index < (s_ij  mean(s_ij[Z==0], na.rm = T))/sd(s_ij[Z==0], na.rm = T)
return(s_ij)
}
We can see how these indices perform in a setting with \(k = 5\) outcome variables.
# A DGP with K outcome variables
# Untreated potential outcomes drawn from multivariate normal distribution
K < 5
r < runif(n = K, min = .9, max = .9)
sigma < outer(r, r, FUN = "*")
diag(sigma) < 1
mat < rmvnorm(n = 200, mean = rep(0, K), sigma = sigma)
# Treatment assignment
Z < complete_ra(200)
# Created observed potential outcomes
# Assume that ATEs are all oriented in the same direction for the time being
ATEs < rnorm(K, mean = .25, sd = 1)
for(i in 1:K){
mat[,i] < mat[,i] + rnorm(n = 200, mean = Z * ATEs[i], sd = 1)
}
mean_effects_index < calculate_mean_effects_index(Z = Z, outcome_mat = mat, reorient = F)
inv_cov_weighted_index < calculate_inverse_covariance_weighted_index(Z = Z, outcome_mat = mat,reorient = F)
First, we can examine the properties of the indices alongside our five outcome variables by looking at the covariance matrix.
knitr::kable(cov(data.frame(mat, mean_effects_index, inv_cov_weighted_index)), digits = 3)
X1  X2  X3  X4  X5  mean_effects_index  inv_cov_weighted_index  

X1  2.096  0.247  0.330  0.394  0.443  0.448  0.249 
X2  0.247  2.059  0.195  0.057  0.054  0.747  0.247 
X3  0.330  0.195  2.614  0.172  0.450  0.641  0.295 
X4  0.394  0.057  0.172  1.784  0.249  0.534  0.242 
X5  0.443  0.054  0.450  0.249  2.525  0.755  0.305 
mean_effects_index  0.448  0.747  0.641  0.534  0.755  0.999  0.428 
inv_cov_weighted_index  0.249  0.247  0.295  0.242  0.305  0.428  0.189 
We can also plot the two indices to show their similarities (or differences). Note that with the final normalization included in the functions above, both indices are on the same scale.
data.frame(Z, mean_effects_index, inv_cov_weighted_index) %>%
mutate(Group = ifelse(Z==1, "Treatment", "Control")) %>%
ggplot(aes(x = mean_effects_index, y = inv_cov_weighted_index, colour = Group)) +
geom_point() + theme_bw() + xlab("Mean Effects Index") + ylab("Inverse Covariance Weighted Index")