Field experiments enable estimation of the causal effects of an intervention. Estimates of causal effects—like the average treatment effect (ATE)—can, in turn, generate different types of learning. In this Standards Discussion, I consider two forms of learning from experimental estimates. I first consider what we can learn from the magnitude of estimated treatment effects. I then describe what we can learn from the statistical significance (or lack thereof) of hypothesis tests, building on recent arguments by Abadie (2020). I explain below how learning about the magnitude of effects (the estimation-based approach) differs from learning about arguments from the outcomes of hypothesis tests (the testing-based approach), and how a “null effect” means something different under each approach. In particular, I argue that determining ex-ante which type of learning is sought can help us reason about the costs and benefits of running an experiment and inform our decision to run an experiment in the first place.

Learning from Estimated Effects

Estimates of causal quantities such as the average treatment effect (ATE) are particularly useful for studying the policy implications of interventions. A finding that a conditional cash transfer (CCT) program for mothers improved child health indicators or school attendance may teach program architects or taxpayers that such policies should continue. Even if a program should not be canceled, however, new questions arise: by how much did children’s health improve? How expensive was the intervention? Even if we can detect an effect using a hypothesis test, is the trade-off between the intervention’s costs and its benefits worthwhile? Point or interval estimates of causal effects may help to bring clarity to such policy questions. For example, in the study of CCTs, we may wonder how large a decrease in child stunting is needed to justify the costs of the transfers. From this perspective, the main learning from an experiment comes from the estimated effect size.

As consumers of research, we are often predisposed to seek large positive or negative effects of an intervention. However, from the perspective of using results to inform policy via some type of welfare or cost-benefit analysis, there is no reason that a large positive effect of an intervention is more informative—or useful—than a zero effect. Suppose we were testing the effects of a cost-cutting measure. A zero effect on outcomes could provide a persuasive justification for adopting the policy: the cost of services was reduced without sacrificing service quality or citizen satisfaction.

To be clearer about this point, I suggest we think of learning from a policy experiment as a process in which some decision maker uses Bayesian updating: starting from a prior belief over the distribution of possible treatment effects, she forms posterior beliefs on the basis of evidence generated by the experiment. Prior beliefs are just what they sound like: beliefs, held before the experiment is conducted, about which treatment effects sound far-fetched and which sound plausible. Posterior beliefs are the beliefs we hold about the plausibility (or surprisingness) of particular treatment effects after the experiment has been done and the new observations have been taken into account. The process of combining prior beliefs with observations to create new, posterior beliefs can be formalized using Bayes rule, as sketched below. This sounds very technical, but formalizing “learning” as Bayesian updating simplifies and clarifies our thinking about how status quo knowledge (the prior) changes given what we observe in an experiment: if the posterior, or beliefs after the experiment is done, differs from the prior, then we say that the decision maker has learned. The question then is how much can be learned from different approaches to experiments.
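
To fix ideas, consider the standard conjugate normal-normal case, an assumption made here purely for illustration: the true ATE \(\tau\) has a normal prior, \(\tau \sim \mathcal{N}(\mu_0, \sigma_0^2)\), and the estimated ATE \(\hat{\tau}\), with standard error \(\sigma\), is treated as a draw from \(\mathcal{N}(\tau, \sigma^2)\). Bayes rule then gives a normal posterior with

\[
\mathbb{E}[\tau \mid \hat{\tau}] = \frac{\sigma_0^{-2}\,\mu_0 + \sigma^{-2}\,\hat{\tau}}{\sigma_0^{-2} + \sigma^{-2}}, \qquad \mathrm{Var}[\tau \mid \hat{\tau}] = \frac{1}{\sigma_0^{-2} + \sigma^{-2}},
\]

so the posterior mean is a precision-weighted average of the prior mean and the estimate, and the posterior is always at least as precise as the prior.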

In policy-making contexts, a theory of change often posits a straightforward expectation. For example, we may have a theory that a reminder nudge should make citizens more likely to complete a task. But even these straightforward expectations may correspond to a relatively diffuse prior about the effects of the nudge: yes, we think that the nudge should matter, but we would have trouble betting that the treatment effect is greater than 1 percentage point, less than 10 percentage points, or any other concrete number. Relevant questions include: how much does an estimated ATE of a nudge treatment allow us to update our beliefs about the efficacy of the intervention? Does this change in belief also change the content of a policy recommendation about whether a government agency should invest in disseminating reminders?

A very simple illustration of how a Bayesian policy maker might learn about the effect sizes of an intervention is shown in Figure 1. The means of the priors (black solid lines) and the posteriors (blue dashed lines) lie directly beneath the peaks of the distributions. In that figure, I hold the estimated ATE constant (at 0). The figure suggests that the extent of learning depends both on the prior and on the precision of the estimated ATE (the new observations). Comparing the priors and the posteriors, there is greater movement in the mean belief about the ATE from prior to posterior as the distance between the prior and the evidence, or new observations, increases. Moreover, as the precision of the estimated ATE increases (as its standard error shrinks), the precision of the posterior similarly increases.

Figure 1. Priors (black solid lines) and posteriors (blue dashed lines). All plots assume that the estimated ATE from the observed data is 0 and that N=10,000. The plots vary the prior mean and the standard error of the ATE estimated from the observed data. When the ATE estimates from the observed data are very precise, the posterior beliefs become more precise. When the prior mean and the observed data ATE diverge, the difference between the prior and the posterior increases — there is more learning.
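
The following is a minimal computational sketch of this update under the conjugate normal-normal assumption above; the function name and the specific numbers are illustrative and are not taken from the study behind Figure 1.

```python
import numpy as np

def normal_posterior(prior_mean, prior_sd, ate_hat, ate_se):
    """Conjugate normal-normal update: posterior mean and sd of the ATE,
    given a N(prior_mean, prior_sd^2) prior and an estimated ATE treated
    as a draw from N(true ATE, ate_se^2)."""
    prior_prec = 1 / prior_sd**2          # precision = 1 / variance
    data_prec = 1 / ate_se**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * ate_hat)
    return post_mean, np.sqrt(post_var)

# Illustrative values: the estimated ATE is 0, as in Figure 1; the prior
# mean and the standard error of the estimate vary across cases.
for prior_mean in (0.5, 2.0):
    for ate_se in (1.0, 0.1):
        m, s = normal_posterior(prior_mean, prior_sd=1.0, ate_hat=0.0, ate_se=ate_se)
        print(f"prior mean {prior_mean}, SE {ate_se}: "
              f"posterior mean {m:.2f}, posterior sd {s:.2f}")
```

As in Figure 1, the posterior mean moves farther from the prior mean when the estimate is more precise, and the posterior tightens as the standard error of the estimated ATE shrinks.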

Learning by Testing Hypotheses

Academic literature often justifies experiments as tests of a directional theory or argument. We examine whether get-out-the-vote (GOTV) appeals increase turnout or whether social contact with outgroup members reduces negative sentiment toward outgroups. In standard practice, researchers test a null hypothesis that the causal effect of interest is equal to zero. Rejection of this null hypothesis (presumably in the predicted direction) constitutes evidence in favor of the argument. Failure to reject this null, or a “null result,” is often interpreted as evidence against an argument even if, in fact, failure to reject means “not enough information.”

Notice that directional arguments do not make predictions about effect size. Such claims may make predictions about relative effect sizes, e.g., that Treatment A has a larger ATE than Treatment B. However, they do not make predictions that GOTV appeals increase voter turnout by two percentage points or that social contact yields a 0.2 standard deviation reduction in negative sentiment. Even when directional theories are formalized, the common causal estimands are (almost always) tested in reduced form. This practice means that, in this mode of research, our core inferences come from rejection of or failure to reject the null hypothesis, even when we have more elaborate theories relating the cause to the effect.

In a context in which the central inference we seek is rejection of or failure to reject a null hypothesis, what do we learn from either outcome? In a recent paper, Abadie (2020) contends that we may learn more from a null finding than from a non-null finding. In particular, if Bayesian consumers of research expect an intervention to have an effect, a null finding can generate greater updating than a significant (non-null) finding.

Figure 2 illustrates this insight. Here, we are testing a null hypothesis that a quantity of interest (e.g., a causal effect that I call \(\theta\)) is equal to zero. The upper left panel reproduces the first example in Abadie (2020). A research consumer has a prior belief that \(\theta \sim \mathcal{N}(1,1)\), shown by the black curve in the density plot. Conditional on the realization of a null result from an observed experiment, the research consumer’s posterior belief is given by the red curve, which puts greater density closer to \(\theta = 0\). In contrast, conditional on the realization of a significant result (i.e., a hypothesis test rejecting the hypothesis of no effect with \(p \leq 0.05\)), her posterior belief (the blue curve) puts less density on \(\theta = 0\) but more density on both negative and positive values of \(\theta\), given the two-tailed hypothesis test.

The remaining panels of Figure 2 show that the amount of learning from a null vs. non-null finding depends on:

  1. The prior belief. In the upper panels, the prior belief is given by \(\theta \sim \mathcal{N}(1,1)\) as in Abadie (2020). The bottom panels employ a \(\theta \sim \mathcal{N}(0,1)\) prior.
  2. The hypothesis being tested. The right panels test a one-tailed (upper) hypothesis instead of the two-tailed hypothesis tested in the left panels.
Figure 2. Prior beliefs (black solid lines), posterior beliefs conditional on a null finding (red dotted lines), and posterior beliefs conditional on a significant finding (blue dashed lines). A significant finding is a hypothesis test that rejects the hypothesis that \(\theta = 0\) with \(p \leq 0.05\). The plots vary the prior mean and the hypothesis test (one- vs. two-tailed).

In these graphs, “learning” is the difference between prior and posterior beliefs. For example, in the two-tailed test with a \(\mathcal{N}(1,1)\) prior (upper left), a null finding (the red line) is “more informative” than a significant finding (the blue line) because the distance between the red posterior and the prior is larger than the distance between the blue posterior and the prior across possible values of \(\theta\). One could measure learning in different ways; Abadie (2020) suggests total variation distance, which, in this exercise, summarizes these distances across all values of \(\theta\).

In the case of total variation distance, null results are more informative in both two-tailed panels: notice in the left column that the “distance” between the red posterior (after a null finding) and the prior is greater than the distance between the blue posterior (after a significant finding) and the prior. With a one-tailed test, however, which outcome is more informative depends on the prior. In the top right panel, where the prior is \(\mathcal{N}(1,1)\), a null result is more informative than a significant result. In the bottom right panel, where the prior is \(\mathcal{N}(0,1)\), a significant result is more informative than a null result: the distance between the blue posterior (a significant result) and the prior is greater than the distance between the red posterior (a null result) and the prior when evaluated across all possible values of \(\theta\). Thus the relative information conveyed by a null or a significant result depends on the hypothesis test and, for some hypothesis tests, on the prior.
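
To make the mechanics concrete, the sketch below reconstructs posteriors like those in Figure 2 under assumptions I impose only for illustration: a \(\mathcal{N}(1,1)\) prior, an estimator that is approximately normal with a hypothetical standard error of 0.25, and a two-sided 5% test. Conditioning on the test outcome alone reweights the prior by the probability of that outcome at each value of \(\theta\), and total variation distance, \(\tfrac{1}{2}\int \lvert p(\theta) - q(\theta)\rvert \, d\theta\), summarizes how far each posterior moves from the prior.

```python
import numpy as np
from scipy.stats import norm

theta = np.linspace(-4, 4, 4001)          # grid of possible effect sizes
prior = norm.pdf(theta, loc=1, scale=1)   # prior: theta ~ N(1, 1)

se = 0.25                                 # hypothetical SE of the estimator
crit = norm.ppf(0.975) * se               # two-sided 5% rejection threshold

# Probability of a significant result at each theta (the power function)
p_sig = norm.sf(crit, loc=theta, scale=se) + norm.cdf(-crit, loc=theta, scale=se)

def normalize(density):
    return density / np.trapz(density, theta)

post_sig = normalize(prior * p_sig)          # posterior given "significant"
post_null = normalize(prior * (1 - p_sig))   # posterior given "null"
prior = normalize(prior)

def tv_distance(p, q):
    """Total variation distance between two densities on the theta grid."""
    return 0.5 * np.trapz(np.abs(p - q), theta)

print("TV(prior, posterior | significant):", round(tv_distance(prior, post_sig), 3))
print("TV(prior, posterior | null):       ", round(tv_distance(prior, post_null), 3))
```

Comparing the two distances shows which outcome moves beliefs more under these settings; changing the prior mean or replacing the two-sided rejection region with a one-sided one changes the comparison, as in the remaining panels of Figure 2.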

“Types” of learning need not be mutually exclusive

The above discussion describes learning from estimated effects and from the outcomes of hypothesis tests as distinct. However, these “types” of learning need not be mutually exclusive. There are many examples of research that focus on the outcome of hypothesis tests to evaluate an argument while also considering policy implications premised on some causal estimand, like an intent-to-treat effect or another average of causal effects.

A Role for Pre-Analysis Plans

To the extent that we learn different things—or to different degrees—from estimates versus hypothesis tests, clarifying anticipated learning ex-ante, for example in a pre-analysis plan, can be quite useful. From the perspective of Abadie (2020), we can learn from a null finding without resorting to a focus on effect sizes. Clarifying what learning is sought ex-ante can discipline discussions of what is learned from an experiment and encourage greater consideration of the goals of a project at earlier stages.

When should we run experiments? A set of considerations

Conducting experiments with human subjects generally introduces some risks or costs to subjects. The benefits of experiments typically come in the form of learning, or contribution to knowledge. While IRBs and other ethics discussions and guidelines emphasize risk mitigation, the above discussion suggests that the benefit accrued from experiments is an important—if understudied—variable.

The above discussions about learning from null and non-null results, and from large and small estimated effects, suggest that ex-ante consideration of the scope for learning should address several aspects of a project’s motivation, design, and priors, as follows:

  1. Motivation/goals
    • Is the project motivated by learning the magnitude of a treatment effect, testing a theory/argument, or both?
    • If a researcher is motivated by the magnitude of treatment effects to inform policy, what is the minimum degree of updating (however defined) that would be necessary to change the policy recommendation? For example, if any positive treatment effect would justify adoption of a policy and the practitioner’s prior has support only on positive treatment effects, how large a treatment effect would we need to find to update beliefs sufficiently to justify a change in the policy?
    • If a researcher is motivated to test an argument, how obvious are the results ex-ante? Per Abadie (2020), significant findings may convey (comparatively) less information than the literature generally assumes. From this stance, if a treatment effect is very “obvious” ex-ante, there may be limited justification for running the experiment.
  2. Priors
    • Which stakeholders’ priors are relevant? In general, researchers, the general public, and practitioners may disagree in their prior beliefs about the outcomes of an experiment. Consideration of how to reconcile (or weigh) varying priors is important to understanding the scope for learning. For example, a policymaker who is wildly optimistic about the program that they designed may converse with a curmudgeonly, skeptical researcher.
    • How precise is the prior? If a prior is highly precise or the literature suggests highly predictable results, less learning about the estimated ATE is possible. Where priors are extremely precise, running an experiment may have few benefits in terms of learning. For example, with a very precise prior, even an estimated effect far from the prior mean will not “change” the posterior as substantially as it would if the prior were more diffuse. Since that “change” constitutes our measure of learning, highly precise priors constrain possible learning.
  3. Research design
    • How can precision in the estimate of the ATE (or other causal estimand) be maximized? Researchers should consider the \(n\) of the experiment in addition to blocking or covariate-adjustment research designs that can improve the precision of estimates (see the sketch after this list).
    • If readers should be updating beliefs on the basis of inferences from a hypothesis test, which test is used matters for the informativeness of each outcome, as shown in Figure 2.¹
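
As one way to make the precision point concrete, here is a small sketch, under assumptions chosen purely for illustration (a difference-in-means estimator, a conventional two-sided 5% test, and 80% power), of how sample size and variance reduction from covariate adjustment translate into the smallest effect an experiment can reliably detect.

```python
import numpy as np
from scipy.stats import norm

def minimum_detectable_effect(sd_outcome, n, alpha=0.05, power=0.80, share_treated=0.5):
    """Approximate minimum detectable effect for a difference-in-means
    estimator with total sample size n and outcome standard deviation
    sd_outcome (a standard normal-approximation power calculation)."""
    se = sd_outcome * np.sqrt(1 / (n * share_treated) + 1 / (n * (1 - share_treated)))
    return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * se

# Illustration: quadrupling n halves the minimum detectable effect, and
# covariate adjustment that (hypothetically) removes half of the outcome
# variance shrinks it by a further factor of 1/sqrt(2).
for n in (1000, 4000):
    unadjusted = minimum_detectable_effect(sd_outcome=1.0, n=n)
    adjusted = minimum_detectable_effect(sd_outcome=np.sqrt(0.5), n=n)
    print(f"n = {n}: MDE = {unadjusted:.3f} (unadjusted), {adjusted:.3f} (adjusted)")
```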

These considerations complement ethical considerations premised on the costs of experiments to subjects. Because researchers justify experiments in terms of learning, the ability of an experiment to generate learning or knowledge gains should similarly be weighed.

Reference

Abadie, Alberto. 2020. “Statistical Nonsignificance in Empirical Economics.” American Economic Review: Insights, 2 (2): 193-208.


  1. It is beyond the scope of this essay to discuss test statistics, but one could easily show that, say, learning from a rank-based test statistic might differ from learning from a mean-based test statistic or other summary of the data.