- 1. What is meta-analysis?
- 2. How does a meta-analysis differ from a literature review?
- 3. What types of data are used as input for a meta-analysis?
- 4. What is the estimand in meta-analysis?
- 5. Meta-analysis and Bayes’ Rule
- 6. Fixed effects versus random effects estimation
- 7. Is it okay to summarize both experimental and observational research findings?
- 8. Publication bias as a threat to meta-analysis
- 9. Modeling inter-study heterogeneity using meta-regression
- 10. Methods for assessing the accuracy of meta-analytic results

Meta-analysis is a method for summarizing the statistical findings from a research literature. For example, if five experiments have been conducted using the same intervention and outcome measure on the same population of people, yielding five separate estimates of an average treatment effect, one might imagine pooling these five studies together into a single dataset and analyzing them jointly. In broad strokes, in such a case, we could act as though the studies came from five blocks within a single experiment rather than five separate experiments. The benefit of such an approach would be more statistical power in the estimation of one overall average treatment effect. In essence, a meta-analysis produces a weighted average of the five studies’ results. As explained below, this method is also used to summarize research literatures that comprise a diverse array of interventions and outcomes measured in diverse settings, under the assumption that the interventions are theoretically similar and the outcome measures tap into a shared underlying trait.

Meta-analysis is often characterized as a form of systematic review
insofar as it involves a specific set of data collection and analysis
procedures. These procedures are spelled out in great detail by the Campbell
Collaboration and by the James
Lind Library. Particular attention is paid to gathering both
published and unpublished studies (see below). By comparison, most
conventional literature reviews cite the most noteworthy theoretical or
empirical contributions but rarely attempt to be comprehensive or to
summarize the findings quantitatively. Critics of conventional reviews
point out the possibility that the most noteworthy or memorable studies
present findings that are unrepresentative of the broader research
literature and therefore may be a poor guide for policy. On the other
hand, critics of meta-analysis point out that the flaws of individual
studies are often lost sight of when their estimates are blended
together to generate an overarching conclusion. As Uri Simonsohn once
quipped, “Meta-analysis is a sausage factory that uses sausages by other
factories as inputs.”^{1}

In principle, meta-analysis could be applied to the original data from each study, but in practice such data are seldom available for all relevant studies. Instead, researchers typically cull estimated treatment effects from research papers or other reports. This process presents scholars conducting a meta-analysis with an array of decisions when the research papers present results involving multiple outcomes, treatments, and estimation approaches. Often meta-analysis focuses on the “main” results, but identifying the main or primary results can be a judgment call. The reproducibility of meta-analysis hinges on careful documentation of such decisions. In addition to locating the key estimates, the meta-analyst must also track down measures of statistical uncertainty (e.g., standard errors, confidence intervals), as these statistics will be used to assign weights to each study in the averaging process. As noted below, meta-analysis tends to assign more weight to studies with less sampling variability such as studies with larger sample sizes.
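When a paper reports only a confidence interval, the standard error can often be backed out from the interval's width. The sketch below assumes the interval is symmetric and based on a normal approximation, which will not hold for every study (for example, intervals built from t-distributions with few degrees of freedom or from bootstrap percentiles). The function name `se_from_ci` is hypothetical.

```python
from statistics import NormalDist


def se_from_ci(lower, upper, level=0.95):
    """Recover a standard error from a symmetric, normal-based confidence interval.

    Assumes the interval was built as estimate +/- z * SE, where z is the
    normal critical value for the stated confidence level.
    """
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    if not 0 < level < 1:
        raise ValueError("level must be strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    return (upper - lower) / (2 * z)


# A reported 95% interval of [0.02, 0.10] implies an SE of roughly 0.02.
print(round(se_from_ci(0.02, 0.10), 4))
```

Recording which studies required this kind of conversion is part of the documentation that makes a meta-analysis reproducible.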

Some meta-analyses are narrowly tailored to specific treatments and outcomes. For example, a vast literature dating back to the 1920s focuses on the extent to which mailings that encourage voting in fact cause people to vote. In this case, one could imagine a population parameter that represents the average causal effect of mailed voting encouragements on a population of people in a specified region over some specified time period.

Other meta-analyses are more abstract, focusing on a broad class of
treatments and outcomes. For example, the literature on prejudice
reduction comprises hundreds of studies on the effects of interpersonal
contact between people with different racial, ethnic, religious, age, or
gender backgrounds (Pettigrew and Tropp 2006).^{2} Contact ranges from a
brief conversation to a year of co-habitation in a college dormitory.
Outcomes also range widely from overt behaviors to self-reported
feelings about in-groups and out-groups. Since these interventions and
outcomes are on different scales and may refer to different concepts,
the underlying population parameter is ambiguous. Researchers try to
sidestep this issue by standardizing the outcomes (e.g., by dividing the
outcome by the standard deviation in the control group), but there
remains the problem of what to make of treatments that vary in intensity
and duration. In effect, the population parameter in such cases becomes
the average extent to which an ad hoc collection of interventions
changes putative measures of prejudice within some location and time
period.
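The standardization mentioned above can be sketched as follows. This is a minimal illustration that assumes access to individual-level outcomes for each arm; in practice, meta-analysts usually work from reported means and standard deviations. The function name `standardized_effect` is hypothetical, and the resulting quantity is sometimes called Glass's delta.

```python
from statistics import mean, stdev


def standardized_effect(treated, control):
    """Difference in means divided by the control-group standard deviation."""
    sd_control = stdev(control)  # sample standard deviation (n - 1 denominator)
    if sd_control == 0:
        raise ValueError("control-group outcomes have no variation")
    return (mean(treated) - mean(control)) / sd_control


# Outcomes on an arbitrary scale; the result is in control-group SD units.
control = [1, 2, 3, 4, 5]
treated = [3, 4, 5, 6, 7]
print(round(standardized_effect(treated, control), 3))
```

Standardizing puts outcomes on a common scale, but it does nothing to make treatments of different intensity or duration comparable.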

Meta-analyses can also struggle when the underlying population represented by the component studies is vague or abstract. For example, in a literature dominated by laboratory experiments conducted in the United States, the meta-analysis will implicitly assess an underlying average treatment effect in which the “average” gives disproportionate weight to American undergraduates.

Consider the simple case in which two experiments are conducted using the same treatments and outcomes. Imagine that both studies draw their subject pool from the same population. In this case, if we had the individual data for the experiments, we could pool them together into a single dataset and analyze them as though they were part of the same block-randomized experiment. But suppose that we did not have the individual data; instead, we only have the estimated ATE and estimated standard error from each study. How might we combine the two studies to form our best guess of the average treatment effect in the population from which the subjects were drawn? With some simplifying assumptions, we could apply Bayes’ Rule. Let’s assume that the sampling distribution of each experimental estimate is normal. (This is a reasonable assumption under the Central Limit Theorem, since we are using an average to estimate the average treatment effect and we assume that each experiment has at least a few dozen subjects and that the outcome distribution is not too skewed.) Since these experiments are independent of one another, Bayes’ Rule takes a simple form: take a weighted average of the two estimates, where the weights are the inverse of each study’s squared standard error (\(\hat{\sigma}_j^2\) is the squared estimated standard error for study \(j\)).

\[ \hat{ATE_{pooled}} = \frac{\frac{1}{\hat{\sigma}_1^2}}{\frac{1}{\hat{\sigma}_1^2} + \frac{1}{\hat{\sigma}_2^2}}\hat{ATE_1} + \frac{\frac{1}{\hat{\sigma}_2^2}}{\frac{1}{\hat{\sigma}_1^2} + \frac{1}{\hat{\sigma}_2^2}}\hat{ATE_2} \]

This formula turns out to be the same as a so-called “fixed effects” meta-analysis. It is sometimes called a “precision-weighted average,” where the term “precision” refers to the inverse of the squared standard error. In a simple two-arm, completely randomized study, the standard error of the simple estimator of the average treatment effect is a function of sample size, variation in the outcome, and the ratio of treated to control units. So, in this case, the study with the smaller standard error (i.e., larger sample, less variable outcome, more equal ratio of treated to control units) receives more weight in the pooled meta-analytic result.
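The precision-weighted average can be sketched in a few lines. This is an illustration under the assumptions stated above (independent studies, approximately normal sampling distributions); the function name `fixed_effects` and the example numbers are hypothetical. The pooled standard error is the square root of the inverse of the summed precisions, which is the standard result for an inverse-variance weighted average.

```python
import math


def fixed_effects(estimates, std_errors):
    """Precision-weighted average of study estimates and its standard error."""
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    precisions = [1.0 / se ** 2 for se in std_errors]  # weight = 1 / SE^2
    total_precision = sum(precisions)
    pooled = sum(p * y for p, y in zip(precisions, estimates)) / total_precision
    pooled_se = math.sqrt(1.0 / total_precision)
    return pooled, pooled_se


# Two hypothetical studies: the more precise one (SE = 0.05) gets 80% of the weight.
pooled, pooled_se = fixed_effects([0.10, 0.30], [0.05, 0.10])
print(round(pooled, 3), round(pooled_se, 4))
```

With these inputs the pooled estimate is about 0.14, much closer to the more precise study's 0.10 than to the simple average of 0.20, and the pooled standard error (about 0.045) is smaller than either study's own.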

Most meta-analysis software^{3} presents users with a choice between fixed
effects estimation and random effects estimation. Fixed effects
estimation is simply a precision-weighted average.^{4} And random effects
estimation is a special case of more general Bayesian meta-analysis. In
either case, the studies with the smallest standard errors are accorded
the most weight. Random effects estimation applies a different set of
weights depending on the extent to which the estimates vary more than
would be expected by chance under a fixed effects model. Therefore, the
weights for the random effects estimation not only consider the variance
within each study but also an estimate of the between-study variance
(\(\tau^2\)).

\[ \hat{ATE_{pooled}^*} = \frac{\frac{1}{\hat{\sigma}_1^2+\tau^2}}{\frac{1}{\hat{\sigma}_1^2+\tau^2} + \frac{1}{\hat{\sigma}_2^2+\tau^2}}\hat{ATE_1} + \frac{\frac{1}{\hat{\sigma}_2^2+\tau^2}}{\frac{1}{\hat{\sigma}_1^2+\tau^2} + \frac{1}{\hat{\sigma}_2^2+\tau^2}}\hat{ATE_2} \]

The more heterogeneous the estimated effects (perhaps due to variations in experimental techniques, outcome measurement, or context), the larger the estimate of \(\tau^2\) and the more the resulting weighted average approaches a simple average rather than a precision-weighted average, because a large \(\tau^2\) swamps the differences among the studies’ squared standard errors. Typically, researchers start with a fixed effects meta-analysis, test whether the estimates are significantly overdispersed given the fixed effects model, and, if so, estimate and report a random effects meta-analysis.
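The workflow just described can be sketched as follows. The text does not say how \(\tau^2\) is estimated; this sketch assumes the DerSimonian-Laird method-of-moments estimator, which is one common choice among several. The function name `random_effects` is hypothetical. The overdispersion statistic Q is conventionally compared to a chi-squared distribution with k - 1 degrees of freedom, where k is the number of studies.

```python
import math


def random_effects(estimates, std_errors):
    """Random effects pooling with a DerSimonian-Laird estimate of tau^2.

    Returns (pooled estimate, pooled SE, tau^2 estimate, Q statistic).
    """
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies, each with a standard error")
    w = [1.0 / se ** 2 for se in std_errors]  # fixed effects weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    # Q: weighted squared deviations from the fixed effects estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    # Method-of-moments estimate of between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random effects weights add tau^2 to each study's sampling variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum_w_star
    return pooled, math.sqrt(1.0 / sum_w_star), tau2, q


# Two equally precise but discordant hypothetical studies.
pooled, pooled_se, tau2, q = random_effects([0.0, 0.4], [0.1, 0.1])
print(round(pooled, 3), round(pooled_se, 3), round(tau2, 3), round(q, 3))
```

In this example the estimates disagree far more than their standard errors would suggest, so the estimated \(\tau^2\) is positive and the pooled standard error is larger than the fixed effects standard error. When the estimates are no more dispersed than chance would predict, the truncation sets \(\tau^2\) to zero and the result coincides with the fixed effects estimate.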

Beware of meta-analyses that combine experimental and observational
estimates. When properly executed, experiments provide unbiased
estimates of the average treatment effect. An observational study, on
the other hand, is prone to bias insofar as the treatments are not
randomly assigned. The nominal standard errors associated with
observational studies ignore the potential for bias; the standard errors
are biased downward because they assume the best-case scenario, namely,
that nature assigned treatments in a manner that was as good as random.
Gerber, Green, and Kaplan (2004)^{5} show that merely being uncertain about the
bias of an observational study is equivalent to according it a larger
standard error. Although many prominent meta-analyses include both
experiments and observational studies (e.g., Lau and Sigelman^{6}; Pettigrew
and Tropp 2006), this practice is frowned upon by leading scholars
conducting biomedical meta-analyses.
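One stylized way to read the point about uncertain bias: suppose the analyst's beliefs about an observational study's bias are described by a normal distribution centered at zero with standard deviation `bias_sd`, independent of sampling error. Under that assumption, the study behaves as if its variance were the sum of its nominal sampling variance and the variance of the bias. The sketch below is an illustration of that reading, not necessarily the authors' exact formulation; the function and parameter names are hypothetical.

```python
import math


def bias_adjusted_se(nominal_se, bias_sd):
    """Effective standard error when the bias is uncertain.

    Assumes a zero-centered normal prior on the bias with standard
    deviation bias_sd, independent of the sampling error.
    """
    if nominal_se <= 0 or bias_sd < 0:
        raise ValueError("nominal_se must be positive and bias_sd non-negative")
    return math.sqrt(nominal_se ** 2 + bias_sd ** 2)


# A nominally precise observational estimate loses most of its weight
# once even modest uncertainty about bias is acknowledged.
nominal = 0.03
adjusted = bias_adjusted_se(nominal, bias_sd=0.04)
print(round(adjusted, 3), round((nominal / adjusted) ** 2, 2))
```

The second printed number is the ratio of the adjusted precision to the nominal precision, that is, the fraction of its original weight that the study would retain in a precision-weighted average.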

Because meta-analyses draw their data from reported results, publication bias presents a serious threat to the interpretability of meta-analytic results. If the only results that see the light of day are splashy or statistically significant, meta-analysis may simply amplify publication bias. Methodological guidance to meta-analytic researchers therefore places special emphasis on conducting and carefully documenting a broad-ranging search for relevant studies, whether published or not, including studies written in languages other than English. This task is, in principle, aided by pre-registration of studies in public archives; unfortunately, pre-registration in the social sciences is not sufficiently comprehensive to make this a dependable approach on its own.

When assembling a meta-analysis, it is often impossible to know
whether one has missed relevant studies. Some statistical methods have
been developed in order to detect publication bias, but these tests tend
to have low power and therefore may give more reassurance than is
warranted. For example, one common approach is to construct a
scatterplot to assess the relationship between study size (whether
measured by the N of subjects or the standard error of the estimated
treatment effect) and effect size. A telltale symptom of publication
bias is a tendency for smaller studies to produce larger effects, as
would be the case if studies were published only if they showed
statistically significant results: to reach the significance bar, small
studies (with large standard errors) would need to generate larger
effect estimates. Unfortunately, this test often produces ambiguous
results (Bürkner and Doebler 2014),^{7} and methods to correct publication bias in
the wake of such diagnostic tests (e.g., the trim-and-fill method) may
do little to reduce bias. Given growing criticism of statistical tests
for publication bias and accompanying statistical correctives, there is
an increasing sense that the quality of a meta-analysis hinges on
whether research reports in a given domain can be assembled in a
comprehensive manner.
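The scatterplot diagnostic described above has a regression analogue: regress the effect estimates on their standard errors, weighting each study by its precision. A clearly positive slope means that less precise studies tend to report larger effects. The sketch below is in the spirit of Egger's regression test, which is an assumption on our part since the text describes only the scatterplot; the function name `small_study_slope` is hypothetical, and no significance test is computed.

```python
def small_study_slope(estimates, std_errors):
    """Precision-weighted regression of effect estimates on standard errors.

    Returns (intercept, slope). The slope summarizes how estimates change
    as studies become less precise.
    """
    if len(estimates) != len(std_errors) or len(estimates) < 3:
        raise ValueError("need at least three studies, each with a standard error")
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    x_bar = sum(wi * se for wi, se in zip(w, std_errors)) / sum_w
    y_bar = sum(wi * y for wi, y in zip(w, estimates)) / sum_w
    s_xx = sum(wi * (se - x_bar) ** 2 for wi, se in zip(w, std_errors))
    if s_xx == 0:
        raise ValueError("standard errors do not vary across studies")
    s_xy = sum(
        wi * (se - x_bar) * (y - y_bar)
        for wi, se, y in zip(w, std_errors, estimates)
    )
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical literature in which noisier studies report larger effects.
ses = [0.05, 0.10, 0.20, 0.40]
effects = [0.1 + 2 * se for se in ses]
print(small_study_slope(effects, ses))
```

As the surrounding discussion cautions, a diagnostic like this has low power in small literatures, and a positive slope can also arise for reasons other than selective publication, such as smaller studies using more intensive treatments.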

Researchers often seek to investigate systematic sources of treatment effect heterogeneity. These systematic sources may reflect differences among subjects (Do certain drugs work especially well for men or women?), contexts (Do lab studies of exposure to mass media produce stronger effects than field studies?), outcomes (Are treatment effects especially large when outcomes are measured via opinion surveys as opposed to direct observation of behavior?), or treatments (Are partisan messages more effective at mobilizing voters than nonpartisan messages?). Quite often, these investigations are best studied directly, via an experimental design. For example, variation in treatment may be studied by randomly assigning different treatment arms. Variation in effects associated with different outcome measures may also be studied in the context of a given experiment by gathering data on more than one outcome or by randomly assigning how outcomes are measured.

A second-best approach is to compare studies that differ on one or more of these dimensions (subjects, treatments, context, or outcomes). The drawback of this approach is that it is essentially descriptive rather than causal – the researcher is basically characterizing the features of studies that contribute to especially large or small effect sizes. That said, this exercise can be conducted via meta-regression: the estimated effect size is the dependent variable, while study attributes (e.g., whether outcomes were measured through direct observation or via survey self-reports) constitute the independent variables. Note that meta-regression is a generalization of random effects meta-analysis, with measured predictors of effect sizes as well as unmeasured sources of heterogeneity.
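A meta-regression with a single study attribute can be sketched as a weighted least squares fit. This illustration assumes one moderator and takes the between-study variance as given: the `tau2` argument (a hypothetical name, defaulting to zero) must be supplied by the analyst, whereas dedicated software typically estimates it jointly with the coefficients. The function name `meta_regression` is also hypothetical.

```python
def meta_regression(estimates, std_errors, moderator, tau2=0.0):
    """Weighted least squares of effect estimates on one study attribute.

    Weights are 1 / (SE^2 + tau2). Returns (intercept, slope).
    """
    n = len(estimates)
    if n < 3 or n != len(std_errors) or n != len(moderator):
        raise ValueError("need at least three studies with complete data")
    w = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sum_w = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, moderator)) / sum_w
    y_bar = sum(wi * y for wi, y in zip(w, estimates)) / sum_w
    s_xx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, moderator))
    if s_xx == 0:
        raise ValueError("moderator does not vary across studies")
    s_xy = sum(
        wi * (x - x_bar) * (y - y_bar)
        for wi, x, y in zip(w, moderator, estimates)
    )
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data: moderator = 1 if outcomes were self-reported in a survey,
# 0 if they were measured through direct observation.
effects = [0.10, 0.14, 0.30, 0.34]
ses = [0.05, 0.05, 0.05, 0.05]
survey = [0, 0, 1, 1]
print(meta_regression(effects, ses, survey))
```

With a binary moderator, the intercept is the weighted average effect among studies coded 0 and the slope is the difference between the two groups' weighted averages. Because study attributes are not randomly assigned, the slope describes how effects covary with the attribute rather than what would happen if the attribute were changed.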

Since meta-analysis is a technique for combining information across different studies, we do not here discuss the detection or modeling of heterogeneous treatment effects within any single study. See our guide 10 Things to Know About Heterogeneous Treatment Effects for more on this topic.

A skeptic might ask whether meta-analysis improves our understanding of cause-and-effect in any practical way. Do we learn anything from pooling existing studies via a weighted average versus presenting the studies one at a time and leaving the synthesis to the reader? To address this question, EGAP conducted an experiment among the academics and policy experts attending a conference held to reveal the results of the first round of EGAP’s Metaketa Initiative, which focused on conducting a coordinated meta-analysis on the impact of information and accountability programs on electoral outcomes. The round consisted of six studies measuring the impact of the same causal mechanism.

To test the idea that accumulated knowledge (in the form of
meta-analysis) allows for better inferences about the effect of a given
program, the Metaketa committee randomized the audience to hear a
presentation of the meta-analysis, each component study, a placebo, and
an external study of a similar intervention that was not part of the
Metaketa round or the subsequent meta-analysis. Each group of
participants was left unexposed to one of these presentations and was
then asked to predict the results of the study it had not seen. This
allowed the committee to measure the effect of each study type on
attendees’ predictive abilities. The resulting analysis found that
exposure to the meta-analysis led to
greater accuracy in predicting the effect in the left-out study in
comparison to the external study (which, as a reminder, was not part of
the meta-analysis in any way). For more on this Metaketa round, along
with a more substantial discussion of this “evidence summit,” see the
book Information, Accountability, and Cumulative Learning: Lessons from
Metaketa I.^{8}

^{2} Pettigrew, T.F., & Tropp, L.R. (2006). A Meta-Analytic Test of Intergroup Contact Theory. *Journal of Personality and Social Psychology, 90(5)*, 751–783.

^{3} For a list of R packages useful in conducting meta-analysis, see https://cran.r-project.org/web/views/MetaAnalysis.html

^{4} See, for example, https://www.stata.com/support/faqs/statistics/meta-analysis/ and https://cran.r-project.org/web/views/MetaAnalysis.html

^{5} Gerber, A.S., Green, D.P., & Kaplan, E.H. (2004). The illusion of learning from observational research. In I. Shapiro, R.M. Smith, & T.E. Masoud (Eds.), *Problems and Methods in the Study of Politics* (pp. 251-273). Cambridge, England: Cambridge University Press.

^{6} Lau, R.R., Sigelman, L., & Rovner, I.B. (2007). The Effects of Negative Political Campaigns: A Meta‐Analytic Reassessment. *The Journal of Politics, 69(4)*, 1176-1209.

^{7} Bürkner, P. C., & Doebler, P. (2014). Testing for publication bias in diagnostic meta‐analysis: a simulation study. *Statistics in Medicine, 33(18)*, 3061-3077.

^{8} Dunning, T., Grossman, G., Humphreys, M., Hyde, S. D., McIntosh, C., & Nellis, G. (Eds.). (2019). *Information, accountability, and cumulative learning: Lessons from Metaketa I.* Cambridge: Cambridge University Press.