Four field experiments meet at the polished wooden bar of their neighborhood coffee shop.
“Y’all do not look like the bright and energetic agents of evidence-based policy making that I am used to seeing each morning. What’s up?” The barista climbs down from dusting the sign hanging over the rows of coffee cups and teapots lining the wall behind the bar and turns to the customers. The sign says, “No Rules of Thumb.”
“We are block randomized today,” grumps the first experiment.
“Randomizing treatment assignment within pre-defined strata or blocks was supposed to make our findings more precise without adding any more complications to our analyses,” sighs the second experiment.
“Yeah. They told us, ‘A block-randomized experiment is just a series of independent mini-experiments.’ And, of course, we know how to estimate treatment effects and standard errors for one experiment, but now we can’t agree on how to combine our estimates across more than one block,” explains the third.
“And I say it just doesn’t matter how we estimate our effects. We should just always use fixed effects,” grumbles the fourth.
“No! That could be a big problem!” The first experiment almost jumps off their stool.
“Hey!” The barista blasts a bit of steam from the steam wand and holds up a hand to pause the conflict. “Aren’t block-randomized experiments just a series of small independent experiments? Within each block you should be able to estimate the ATE and SE, right? And then you just need to combine those estimates somehow, right? So shouldn’t it be simple?”
“Ah,” the first experiment leans over to look at what the barista is doing. “Right. None of us has very small blocks — we each have only three blocks, with more than 10 people in each block. So we can calculate randomization-justified ATEs and SEs within each block. But we are arguing about the weights. Say I found an effect of 5 in one block and 1 in another block. Should the overall effect just be a simple average that weights each block the same, like (5+1)/2 = 3? What if the first block has 100 people and the second block has 10 people? Shouldn’t the effect then be (5 \(\times\) 100)/110 + (1 \(\times\) 10)/110 = 4.64?”
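In code, the two candidate averages the first experiment describes take only a few lines. A minimal base-R sketch, using the made-up block effects and sizes from the example:

```r
# Hypothetical block-level estimates from the example above
effects <- c(5, 1)    # estimated effect in each block
n       <- c(100, 10) # number of people in each block

# Simple average: weights each block equally
mean(effects)                 # (5 + 1) / 2 = 3

# Block-size weighted average: weights each block by its share of the sample
w_blocksize <- n / sum(n)     # 100/110 and 10/110
sum(w_blocksize * effects)    # about 4.64
```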
“So, you are saying that you should combine block-level estimates using the proportion of the total sample in each block as a weight?” asks the barista, pouring new coffee beans into bowls set on two digital scales. “That makes sense to me. Why don’t y’all just do that? You don’t want your overall effect to overrepresent what happened in the small block, right? You have a lot more information about the treatment effect in the bigger block, and so that information should play a bigger role in the overall story of the experiment.”
“We should,” says the first experiment a bit grumpily. “After all, lots of previous literature and even fairly simple math tells us that this block-size weighted estimator is unbiased and that other weights produce biased estimators. For example, we have been reading Gerber and Green (2012, Chapter 3), the blog post The trouble with ‘controlling for’ blocks, and the analytics in Humphreys (2009), which make exactly this argument.”
“By the way, what are you doing with the scales this morning? Usually you just start making our regular drink orders.” The second experiment is also looking at the scales, bowls, and beans.
“Oh. These are new beans. I know that a good espresso depends on striking a balance between bitter and sour. And, in general, with the beans I’m used to getting, I know that about 17 grams of beans ground at about level 2 and extracted at 9 bars for about 24 seconds makes a good double shot with this grinder, this machine, and 40 percent humidity in the shop. But, since these are new beans, I have to explore the values of these variables a bit before deciding on the right balance. I’ll be back to pulling shots quickly tomorrow. Today, however, I need to make some test coffees that I may not sell in order to learn how to make good coffee with these beans. Just a sec, it will be noisy while I grind this.” With the newly ground beans in the portafilter, the barista puts a scale under a cup, zeros out the weight, touches the button on a timer, and flips the chrome switch on the espresso machine. “So, back to field experiments: why don’t you just use the idea that blocks with more information should contribute more to the estimate, and get back to enjoying your day?”
“Well, that is kind of the problem,” says the second experiment a bit wearily, as if rehashing an old argument. “It is very expensive to do policy-relevant field experiments — so we might want to trade a little bit of bias for increases in precision. We also know that a different form of weighting that we call ‘precision-weighting’ is optimal from that perspective.1”
“Can you explain the intuition behind the precision-weighting approach?” The third experiment gestures to the barista, “Meanwhile, I’m happy to help you test free espresso shots while you figure out your own procedure for this new batch of beans.”
“Look, take the example of two blocks with treatment effects of 5 and 1 respectively, and imagine that they each have 100 people. But now imagine that the first block assigned 50 people to treatment and 50 to control, while in the second block the administrators of the program only let you randomize 5 people to treatment. Obviously the first block, with 50 in each arm, tells us a lot more about the treatment effect than the second block, with only 5 people in the treatment condition. But both blocks are the same size, so block-size weights would over-emphasize the effect in the block with only 5 treated people. From the perspective of information or precision, we should instead weight by both block size and the proportion assigned to treatment, so that the second block gets a lower weight. It turns out that the precision weights are exactly this combination: block share \(\times\) proportion treated \(\times\) proportion untreated. In this case, the first block gets a weight proportional to (100/200)(50/100)(1 - 50/100) = .125, where (100/200) is the block-size weight and (50/100) is the proportion assigned to treatment. The second block gets a weight proportional to (100/200)(5/100)(1 - 5/100) = .0238. After making them sum to 1, we have precision weights of .84 for the first block and .16 for the second (compared to .5 and .5 if we only paid attention to block size). So, we could report the block-size weighted effect of 5 \(\times\) (100/200) + 1 \(\times\) (100/200) = 3, or the precision-weighted effect of 5 \(\times\) .84 + 1 \(\times\) .16 = 4.36.”
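The same arithmetic, extended to precision weights; again a sketch using only the made-up numbers from the example:

```r
# Hypothetical block-level estimates from the example above
effects <- c(5, 1)
n       <- c(100, 100)    # both blocks have 100 people
p       <- c(50, 5) / 100 # proportion assigned to treatment in each block

# Unnormalized precision weights:
# block share x proportion treated x proportion untreated
w_raw <- (n / sum(n)) * p * (1 - p)  # 0.125 and 0.02375
w_precision <- w_raw / sum(w_raw)    # about 0.84 and 0.16

sum((n / sum(n)) * effects) # block-size weighted effect: 3
sum(w_precision * effects)  # precision-weighted effect: about 4.36
```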
“That is a pretty big difference in estimated effects. You said the first one, the estimated ATE of 3, comes from an unbiased estimator, so we should choose the block-size weighting in this case, right?” The barista looks puzzled and grinds more beans. The sound of grinding coffee drowns out further conversation for a few seconds. When the grinding stops, the barista says, “Ok. That makes sense. So, when you have blocks of different sizes or blocks with different probabilities of treatment assignment, you should use the ‘precision weight,’ and when every block has the same probability of treatment assignment, the ‘precision weight’ is just the same as the ‘block-size weight.’ So you should always use the precision weight if you want overall estimates to reflect the blocks with more information, huh? You are actual experiments. I just make coffee. Shouldn’t you have figured that out?”
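The barista’s observation about equal probabilities can be checked with one line of algebra. Writing the unnormalized precision weight for block \(j\), with \(n_j\) people and treated proportion \(p_j\), as

\[
w_j \propto n_j \, p_j (1 - p_j),
\]

a common \(p_j = p\) across blocks gives \(w_j \propto n_j \, p(1 - p) \propto n_j\), which after normalizing is exactly the block-size weight \(n_j / N\).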
The first experiment shakes their head. “It is not so simple. ‘Always use precision weights / fixed effects’ is a rule of thumb that could lead you astray sometimes. And, of course, you should know that there are more than two ways to calculate these weights, and that the only way guaranteed to produce unbiased estimates is the block-size weight. You should see Pashley and Miratrix (2020a, 2020b) to learn about estimating average treatment effects and their standard errors in all kinds of block-randomized experiments — including pair-randomized experiments, where you really can’t calculate standard errors within a pair.2 We block-randomized experiments raise lots of interesting statistical problems.” The first experiment looks a bit proud saying this. The others chuckle.
“Well. What is the problem then?” The barista tamps the first shot. “You have lots of guidance. Just follow that advice.”
“The problem is that we have two rules of thumb to follow, given designs like ours with a few large blocks. Some people say that we should ‘use fixed effects’ (we know that this is the same as saying that we should use precision weights), and other people say that we should use block-size weights,” complains the third experiment.
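In R, the two rules of thumb correspond to two familiar estimators from the estimatr package. Here is a sketch, assuming a data frame `dat` with an outcome `Y`, a treatment `Z`, and a `block` variable (hypothetical names):

```r
library(estimatr)

# Rule of thumb 1: block-size weights.
# difference_in_means() with a blocks argument estimates the effect within
# each block and averages, weighting each block by its share of the sample.
difference_in_means(Y ~ Z, blocks = block, data = dat)

# Rule of thumb 2: "always use fixed effects," i.e., precision weights.
# OLS with block fixed effects weights each block by n_j * p_j * (1 - p_j).
lm_robust(Y ~ Z, fixed_effects = ~ block, data = dat)
```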
“I see.” The barista pulls the first shot, and the sound of foaming milk fills the air. “When I was first learning to make espresso and cappuccino, I found a lot of advice on the internet. But that advice didn’t work all the time. I really liked the advice from the chemists, for example; it gave me good starting places. But nice graphs of acidity by particle size don’t really tell me what to do in detail with a given batch of beans. I have to make my own decision about this, knowing that I do not have an ultra-clean lab and fancy equipment. So I’ve found I sometimes need to play around and try different approaches to see what works. I only do this when I get new beans, or a new grinder, or something else changes, of course. It would be a waste of time and money to have to throw away a lot of espresso every morning.” The barista pats the gigantic espresso machine with some affection. “Can you do something like this? Try each approach?”
“Hey. This third espresso is better than the first, by the way.” The fourth experiment drums their fingers on the bar, taps their toes on the brass footrest of the bar stool, and addresses the other experiments, “I like our barista’s idea! We have the DeclareDesign package that we can use. Why don’t we compare the two approaches?” The other experiments nod. “We’ll take our cappuccinos at the table while we work, if you don’t mind. Sorry to step away.”
“Sure. No problem.” The barista motions them away. “I’ll be interested to hear what you find.”
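At the table, the experiments might set up something like the following DeclareDesign sketch, using the two-block numbers from the example above (in older versions of the package, the `.method` argument was called `model`):

```r
library(DeclareDesign)

# Two blocks of 100 with true effects of 5 and 1, as in the example above.
# Block A assigns 50/100 to treatment; block B assigns only 5/100.
design <-
  declare_model(
    N = 200,
    block = rep(c("A", "B"), each = 100),
    U = rnorm(N),
    potential_outcomes(Y ~ ifelse(block == "A", 5, 1) * Z + U)
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = block_ra(blocks = block, block_prob = c(0.5, 0.05))) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  # Block-size weights: within-block differences in means, averaged by block share
  declare_estimator(Y ~ Z, blocks = block, .method = difference_in_means,
                    label = "block-size weights") +
  # Fixed effects, i.e., precision weights
  declare_estimator(Y ~ Z, fixed_effects = ~ block, .method = lm_robust,
                    label = "precision weights")

# Simulate the design many times; compare bias, SD, and RMSE of the estimators
diagnose_design(design)
```

If the earlier reasoning is right, the diagnosis should show the block-size estimator roughly unbiased for the ATE of 3, while the fixed-effects estimator clusters more tightly but nearer the precision-weighted 4.36.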
(TIME PASSES)
“Hey, look at this!” The experiments have come back to the counter carrying their empty coffee cups and holding up one of their laptops. The morning rush is over. The barista is wiping down the bar.