Field experiments enable estimation of the causal effects of an intervention. Estimates of causal effects, such as the average treatment effect (ATE), can in turn generate different types of learning. In this Standards Discussion, I consider two forms of learning from experimental estimates. I first consider what we can learn from the magnitude of estimated treatment effects. I then describe what we can learn from the statistical significance (or lack thereof) of hypothesis tests, building on recent arguments by Abadie (2020). I explain below how learning about the magnitude of effects (the estimation-based approach) differs from learning about arguments from tests of hypotheses (the testing-based approach), and how a “null effect” means something different under each approach. In particular, I argue that deciding ex ante which type of learning is sought can help us reason about the costs and benefits of running an experiment and inform our decision to run an experiment in the first place.

Learning from Estimated Effects

Estimates of causal quantities such as the average treatment effect (ATE) are particularly useful for studying the policy implications of interventions. A finding that a conditional cash transfer (CCT) program for mothers improved child health indicators or school attendance may persuade program architects or taxpayers that such policies should continue. Even if such a finding tells us that a program should not be canceled, however, new questions arise: by how much did children’s health improve? How expensive was the intervention? Even if we can detect an effect using a hypothesis test, is the trade-off between the intervention’s cost and the benefit to the outcome worthwhile? Point or interval estimates of causal effects may help to bring clarity to such policy questions. For example, in the study of CCTs, we may wonder how large a decrease in child stunting is needed to justify the costs of the transfers. From this perspective, the main learning from an experiment comes from the estimated effect size.

As consumers of research, we are often predisposed to seek large positive or negative effects of an intervention. However, from the perspective of using results to inform policy via some type of welfare or cost-benefit analysis, there is no reason that a large positive effect of an intervention is more informative—or useful—than a zero effect. Suppose we were testing the effects of a cost-cutting measure. A zero effect on outcomes could be persuasive justification for adoption of the policy: the cost of services was reduced without sacrificing service quality or citizen satisfaction.

To be clearer about this point, I suggest we think of learning from a policy experiment as a process in which some decision maker uses Bayesian updating: starting from a prior belief over the distribution of possible treatment effects, she forms posterior beliefs on the basis of evidence generated by the experiment. Prior beliefs are just what they sound like: beliefs, held before the experiment is conducted, about which treatment effects sound far-fetched and which sound plausible. Posterior beliefs are the beliefs we have about the plausibility (or surprisingness) of particular treatment effects after the experiment has been done and the new observations have been taken into account. The process of combining prior beliefs with observations to create new, posterior beliefs can be formalized using Bayes rule (see the Bayesian updating link for more details). This sounds very technical, but formalizing “learning” as Bayesian updating allows us to simplify and clarify our thinking about how status quo knowledge (the prior) changes given what we observe in an experiment: if the posterior, or the beliefs held after the experiment is done, differs from the prior, then we say that the decision maker has learned. The question, then, is how much can be learned from different approaches to experiments.
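To make the mechanics concrete, here is a minimal sketch of this kind of updating in the conjugate normal-normal case, in which a normal prior over the ATE is combined with a normally distributed estimate. The function name and all numbers below are hypothetical and chosen purely for illustration.

```python
# A minimal sketch of Bayesian updating for an ATE, assuming a normal prior
# and a normally distributed estimate (the conjugate normal-normal model).
# All names and numbers here are hypothetical.

def update_normal(prior_mean, prior_sd, ate_hat, se_hat):
    """Combine a normal prior with a normal likelihood via Bayes rule."""
    prior_prec = 1 / prior_sd**2   # precision = 1 / variance
    data_prec = 1 / se_hat**2
    post_var = 1 / (prior_prec + data_prec)
    # The posterior mean is a precision-weighted average of the prior mean
    # and the estimated ATE: more precise evidence pulls beliefs harder.
    post_mean = post_var * (prior_prec * prior_mean + data_prec * ate_hat)
    return post_mean, post_var ** 0.5

# A decision maker who expected a 5-point effect sees an estimate of 0:
print(update_normal(prior_mean=5, prior_sd=2, ate_hat=0, se_hat=1))
```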

In policy-making contexts, a theory of change often posits a straightforward expectation. For example, we may have a theory that a reminder nudge should make citizens more likely to complete a task. But even these straightforward expectations may correspond to a relatively diffuse prior about the effects of the nudge: yes, we think that the nudge should matter, but we would have trouble making a bet that the treatment effect is greater than 1 percentage point, or less than 10 percentage points, or any other concrete number. Relevant questions include: how much does an estimated ATE of a nudge treatment allow us to update our beliefs about the efficacy of the intervention? Does this change in belief also change the content of a policy recommendation about whether a government agency should invest in disseminating reminders?

A very simple illustration of how a Bayesian policy maker might learn about the effect sizes of an intervention is shown in Figure 1. The means of the priors (black solid lines) and the posteriors (blue dashed lines) lie directly underneath the peaks of the distributions. In that figure, I hold the estimated ATE constant at 0. The results suggest that the extent of learning depends both on the prior and on the precision of the estimated ATE (the new observations). Comparing the priors and the posteriors, there is greater movement in the mean ATE from prior to posterior as the distance between the prior and the evidence, or new observations, increases. Moreover, as the precision of the estimated ATE increases (as the standard error shrinks), the precision of the posterior similarly increases.

Figure 1. Priors (black solid lines) and posteriors (blue dashed lines). All plots assume that the estimated ATE from the observed data is 0 and that N = 10,000. The plots vary the prior mean and the standard error of the ATE estimated from the observed data. When the ATE estimates from the observed data are very precise, the posterior beliefs become more precise. When the prior mean and the observed-data ATE diverge, the difference between the prior and the posterior increases: there is more learning.
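The two patterns in Figure 1 can be reproduced with the same conjugate-updating formula. The prior means and standard errors below are illustrative assumptions, not the values used to draw the figure.

```python
# Illustrating Figure 1's two patterns with the conjugate normal-normal update:
# (1) more movement of the posterior mean when the prior is far from the
#     evidence, and (2) a tighter posterior when the estimate is more precise.
# The specific prior means and standard errors below are assumptions.

def update_normal(prior_mean, prior_sd, ate_hat, se_hat):
    prior_prec, data_prec = 1 / prior_sd**2, 1 / se_hat**2
    post_var = 1 / (prior_prec + data_prec)
    return post_var * (prior_prec * prior_mean + data_prec * ate_hat), post_var ** 0.5

ate_hat = 0.0  # the estimated ATE, held at 0 as in Figure 1
for prior_mean in (1.0, 5.0):      # prior close to vs. far from the evidence
    for se_hat in (1.0, 0.1):      # imprecise vs. precise estimate
        post_mean, post_sd = update_normal(prior_mean, prior_sd=2.0,
                                           ate_hat=ate_hat, se_hat=se_hat)
        print(f"prior mean {prior_mean}, SE {se_hat}: "
              f"posterior mean {post_mean:.2f}, posterior SD {post_sd:.2f}")
```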


Learning by Testing Hypotheses

Academic literature often justifies experiments as tests of a directional theory or argument. We examine whether get-out-the-vote (GOTV) appeals increase turnout or whether social contact with outgroup members reduces negative sentiment toward outgroups. In standard practice, researchers test a null hypothesis that the causal effect of interest is equal to zero. Rejection of this null hypothesis (presumably in the predicted direction) constitutes evidence in favor of the argument. Failure to reject this null, or a “null result,” is often interpreted as evidence against an argument even if, in fact, failure to reject means “not enough information.”

Notice that directional arguments do not make predictions about effect size. Such claims may make predictions about relative effect sizes, e.g., that Treatment A has a larger ATE than Treatment B. However, they do not make predictions that GOTV appeals increase voter turnout by two percentage points or that social contact yields a 0.2 standard deviation reduction in negative sentiment. Even when directional theories are formalized, the tests of common causal estimands are (almost always) reduced-form tests. This practice means that, in this mode of research, our core inferences come from rejection of, or failure to reject, the null hypothesis even when we have more elaborate theories relating the cause to the effect.
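To see the contrast, a directional claim and an effect-size prediction can be written side by side; here \(\tau\) denotes the ATE of a GOTV appeal on turnout, and the two-percentage-point figure is purely illustrative:

\[
H_0\colon \tau = 0 \quad \text{vs.} \quad H_1\colon \tau > 0 \qquad \text{(a directional claim)}
\]
\[
\tau = 0.02 \qquad \text{(a point prediction: turnout rises by two percentage points)}
\]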

In a context in which the central inference we seek is rejection of or failure to reject a null hypothesis, what do we learn from either outcome? In a recent paper, Abadie (2020) contends that we may learn more from a null finding than from a non-null finding. In particular, if Bayesian consumers of research expect an intervention to have an effect, a null finding can generate greater updating than a significant (non-null) finding.

Figure 2 illustrates this insight. Here, we are testing a null hypothesis that a quantity of interest (e.g., a causal effect that I call \(\theta\)) is equal to zero. The upper left panel reproduces the first example in Abadie (2020). A research consumer has a prior belief that \(\theta \sim \mathcal{N}(1,1)\), shown by the black curve in the density plot. Conditional on the realization of a null result from an observed experiment, the research consumer’s posterior belief is given by the red curve. This posterior puts greater density closer to \(\theta = 0\). In contrast, conditional on the realization of a significant result (i.e., a hypothesis test rejecting the hypothesis of no effects with \(p < .05\)), her posterior belief puts less density on \(\theta = 0\) and more density on negative and positive values of \(\theta\), given the two-tailed hypothesis test.
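A minimal numerical sketch of this updating logic follows the setup in Abadie (2020): the research consumer learns only whether the test was significant and conditions her prior on that event. The grid bounds, the standard error of 1, and the 0.5 cutoff for "near zero" are assumptions chosen for illustration.

```python
# A sketch of updating on the *outcome of the test* rather than on the estimate
# itself, following Abadie (2020). Prior: theta ~ N(1, 1); estimator:
# theta_hat ~ N(theta, 1). Grid bounds and the "near zero" window are assumptions.
import numpy as np
from scipy.stats import norm

theta = np.linspace(-4, 6, 2001)           # grid of candidate effect sizes
d = theta[1] - theta[0]
prior = norm.pdf(theta, loc=1, scale=1)    # prior belief: N(1, 1)

se, crit = 1.0, 1.96                       # standard error and two-tailed cutoff
# Probability of a nonsignificant / significant result given each theta
p_nonsig = norm.cdf(crit * se, loc=theta, scale=se) - norm.cdf(-crit * se, loc=theta, scale=se)
p_sig = 1 - p_nonsig

# Posterior densities conditional on learning only the test outcome
post_nonsig = prior * p_nonsig / np.sum(prior * p_nonsig * d)
post_sig = prior * p_sig / np.sum(prior * p_sig * d)

# A null result shifts belief toward theta = 0; a significant result away from it
near_zero = np.abs(theta) < 0.5
for label, dens in [("prior", prior), ("after null", post_nonsig), ("after significant", post_sig)]:
    print(f"P(|theta| < 0.5) {label}: {np.sum(dens[near_zero]) * d:.2f}")
```

Under this prior, the posterior mass near zero rises after a null result and falls after a significant one, which is the asymmetry in learning that Abadie (2020) highlights.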

The remaining panels of Figure 2 show that the amount of learning from a null vs. non-null finding depends on:

  1. The prior belief. In the upper panels, the prior belief is given by \(\theta \sim \mathcal{N}(1,1)\) as in Abadie (2020). The bottom panels employ a \(\theta \sim \mathcal{N}(0,1)\) prior.
  2. The hypothesis being tested. The right panels test a one-tailed (upper) hypothesis instead of the two-tailed hypothesis in the left panels.