Abstract
1 What is regression?
2 What is a regression equation?
3 What are the main purposes of regression?
4 What are the standard errors, t-statistics, p-values, and degrees of freedom?
5 What the confidence intervals mean
6 Watch out for “researcher degrees of freedom”
7 Watch out for other possible biases
8 What the R² means
9 Be careful when comparing coefficients
10 Meet the whole family

Abstract

This guide¹ explains the basics of interpreting the results of ordinary least squares (OLS) regression in social science research. It focuses on regression but also discusses general concepts such as confidence intervals.

The table used throughout this methods guide is adapted from a study by EGAP members Miriam Golden, Eric Kramon, and their colleagues (J. Asunka et al., “Protecting the Polls: The Effect of Observers on Election Fraud”). The authors ran a field experiment in Ghana in 2012 to test how effective domestic election observers are in combating two common forms of electoral fraud: ballot stuffing and overvoting. Ballot stuffing occurs when more ballots are found in a ballot box than are known to have been distributed to voters. Overvoting occurs when more votes are cast at a polling station than the number of voters registered. The table reports a multiple regression (a concept explained further below) from their experiment that explores the effects of domestic election observers on ballot stuffing. The sample consists of 2,004 polling stations.

1 What is regression?

Regression is a method for calculating the line of best fit. The regression line uses the “independent variables” to predict the outcome or “dependent variable.” The dependent variable represents the output or response. The independent variables represent inputs or predictors, or they are variables that are tested to see if they predict the outcome.

Independent and dependent variables have many synonyms, so it helps to be familiar with them. They are the explanatory and response variables, input and output variables, right hand side and left hand side variables, explanans and explanandum, regressor and regressand, predictor and criterion variable, among many others. The first thing you need to do when you see a regression table is to figure out what the dependent variable is—this is often written at the top of the column. Afterwards identify the most important independent variables. You will base your interpretation on these.

A positive relationship in a regression means that high values of the independent variable are associated with high values of the dependent variable. A negative relationship means that units which have high values on the independent variable tend to have low values on the dependent variable, and vice versa. Regressions can be run to estimate or test many different relationships. You might run a regression to predict how much more money people earn on average for every additional year of education, or to predict the likelihood of success based on hours practiced in a given sport.

Use the app linked here to get a feel for what a regression is and what it does. Fill in values for x and for y and then look to see how the line of best fit changes to capture the average relationship between x and y. As the line changes, so too does the key information in the regression table. Below we will talk through the output of the regression table.

2 What is a regression equation?

This is the formula for a regression that contains only two variables:

\[Y=α+βX+ε\]

The Y on the left side of the equation is the dependent variable. The α or Alpha coefficient represents the intercept, which is where the line hits the y-axis in the graph, i.e., the predicted value of Y when X equals 0. The β or Beta coefficient represents the slope, the predicted change in Y for each one-unit increase in X.

It’s really all about that Beta. The Beta coefficient represents the predicted increase or decrease in the rate of ballot stuffing when the independent variable increases by one unit. For instance (see the Table), when the presence of observers increases by one unit, the predicted occurrence of ballot stuffing decreases by .037 units, and for every one-unit increase in competition, it increases by .019 units. Note there is an assumed linear relationship (though different models can relax this): when X goes up by so much, Y goes up or down by so much. The ε is the epsilon or “error term,” representing the remaining variation in Y that cannot be explained by a linear relationship with X.

We observe Y and X in our data, but not ε. The coefficients α and β are parameters—unknown quantities that we use the data to estimate.
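To make the roles of α and β concrete, here is a minimal Python sketch of how the two coefficients can be estimated from data in the two-variable case. The education and earnings numbers are made up for illustration, and `fit_line` is a hypothetical helper name of our own, not something taken from the study.

```python
def fit_line(x, y):
    """Return (alpha_hat, beta_hat) for the least-squares line y = alpha + beta * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Slope: sum of cross-deviations of x and y over the sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means (x_bar, y_bar).
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Made-up data: years of education (x) and hourly earnings (y).
education = [10, 12, 12, 14, 16, 16, 18]
earnings = [11.0, 14.5, 13.0, 17.0, 19.5, 21.0, 23.0]

alpha_hat, beta_hat = fit_line(education, earnings)
# Residuals are the sample counterparts of the unobserved error term.
residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(education, earnings)]
print(f"intercept = {alpha_hat:.3f}, slope = {beta_hat:.3f}")
```

The slope printed here is read exactly as described above: the predicted change in earnings for each additional year of education in this made-up sample.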

A regression with one dependent variable and more than one independent variable is called a multiple regression. This type of regression is very commonly used. It is a statistical tool to predict the value of the dependent variable, using several independent variables. The independent variables can include quadratic or other nonlinear transformations: for example, if the dependent variable Y is earnings, we might include gender, age, and the square of age as independent variables, in which case the assumption of a “linear” relationship between Y and the three regressors actually allows the possibility of a quadratic relationship with age.
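Written out as an equation, the earnings example might look as follows. The variable names (Female, Age) and the numbered subscripts on β are our own illustrative notation, not taken from any particular study.

\[Y=α+β_1\,\text{Female}+β_2\,\text{Age}+β_3\,\text{Age}^2+ε\]

Because age enters twice, the predicted change in Y for a small increase in age is approximately \(β_2+2β_3\,\text{Age}\), which itself depends on age. The model is still “linear” in the sense that Y is a linear function of the coefficients.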

The example table above examines how the dependent variable, fraud in the form of ballot stuffing, is associated with the following factors/independent variables: election observers, how saturated the area is, the electoral competition in the area, and the density. The regression will show if any of these independent variables help to predict the dependent variable.

3 What are the main purposes of regression?

Regressions can be run for any of several distinct purposes, including (1) to give a descriptive summary of how the outcome varies with the explanatory variables; (2) to predict the outcome, given a set of values for the explanatory variables; (3) to estimate the parameters of a model describing a process that generates the outcome; and (4) to study causal relationships. As Terry Speed writes, the “core” textbook approach to regression “is unlikely to be the right thing in any of these cases. Sharpening the question is just as necessary when considering regression as it is with any other statistical analysis.”

For descriptive summaries, there’s a narrow technical sense in which ordinary least squares (OLS) regression gets the job done: OLS shows us the best-fitting linear relationship, where “best” is defined as minimizing the sum of squares of the residuals (the differences between the actual outcomes and the values predicted from the explanatory variables). Furthermore, if we have a sufficiently large sample that was randomly drawn from a much larger population, OLS estimates the best-fitting line in the population, and we can use the estimated coefficients and “robust” standard errors to construct confidence intervals (see section 5) for the coefficients of the population line.² However, the summary provided by OLS may miss important features of the data, such as outliers or nonlinear relationships; see the famous graphs of Anscombe’s quartet.
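In symbols, for the two-variable case in section 2, the OLS estimates are the intercept and slope that minimize the sum of squared residuals over the n observations in the sample. The hats, the index i, and the candidate values a and b are notation we introduce here for illustration.

\[(\hat{α},\hat{β})=\arg\min_{a,\,b}\sum_{i=1}^{n}(Y_i-a-bX_i)^2\]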

Similarly, for prediction, OLS regression gives the best linear predictor in the sample, and if the sample is drawn randomly from a larger population, OLS is a consistent estimator of the population’s best linear predictor. However, (a) the best linear predictor from a particular set of regressors may not be the best predictor that can be constructed from the available data, and (b) a prediction that works well in our sample or in similar populations may not work well in other populations. Regression and many other methods for prediction are discussed in the freely downloadable book An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.

Estimating the parameters of a model is the purpose that receives the most discussion in traditional textbooks. However, studying causal relationships is often the real motivation for regression. Many researchers use regression for causal inference but are not interested in all the parameters of the regression model. To estimate an average causal effect of one particular explanatory variable (the treatment) on the outcome, the researchers may regress the outcome on a treatment indicator and other explanatory variables known as covariates. The covariates are included in the regression to reduce bias (in an observational study) or variance (in a randomized experiment), but the coefficients on the covariates are typically not of interest in themselves. Strong assumptions are needed for regression to yield valid inferences about treatment effects in observational studies, but weaker assumptions may suffice in randomized experiments.³

4 What are the standard errors, t-statistics, p-values, and degrees of freedom?

Standard Error

The standard error (SE) is an estimate of the standard deviation of an estimated coefficient.⁴ It is often shown in parentheses next to or below the coefficient in the regression table. It can be thought of as a measure of the precision with which the regression coefficient is estimated. The smaller the SE, the more precise is our estimate of the coefficient. SEs are of interest not so much for their own sake as for enabling the construction of confidence intervals (CIs) and significance tests. An often-used rule of thumb is that when the sample is reasonably large, the margin of error for a 95% CI is approximately twice the SE. However, explicit CI calculations are preferable. We discuss CIs in more detail in the next section.

The table above from Asunka et al. shows “robust” standard errors, which have attractive properties in large samples because they remain valid even when some of the regression model assumptions are violated. The key assumptions that “conventional” or “classical” SEs make and robust SEs relax are that (1) the expected value of Y, given X, is a linear function of X, and (2) the variance of Y does not depend on X (conditional homoskedasticity). Robust SEs do assume (unless they are “clustered”) either that the observations are statistically independent or that the treatment was randomly assigned to the units of observation (the polling stations in this example).⁵
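For readers who want to see the difference between the two kinds of SEs, the following Python sketch computes both for the slope in a two-variable regression. It uses the simplest robust variant (often labeled HC0) and made-up data; statistical software typically applies small-sample corrections or clustering, so this is an illustration under those assumptions rather than the calculation behind the table.

```python
from math import sqrt

def slope_standard_errors(x, y):
    """Return (slope, classical_se, robust_se) for a regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical SE: assumes one common error variance, estimated with n - 2 degrees of freedom.
    sigma2 = sum(e ** 2 for e in resid) / (n - 2)
    classical_se = sqrt(sigma2 / s_xx)
    # Robust (HC0) SE: each observation contributes its own squared residual,
    # so the error variance is allowed to differ across values of x.
    meat = sum(((xi - x_bar) ** 2) * (e ** 2) for xi, e in zip(x, resid))
    robust_se = sqrt(meat) / s_xx
    return slope, classical_se, robust_se

# Made-up data in which the scatter around the line grows as x grows.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.4, 3.6, 5.9, 5.2, 8.3, 6.5]
slope, classical_se, robust_se = slope_standard_errors(x, y)
print(f"slope = {slope:.3f}, classical SE = {classical_se:.3f}, robust SE = {robust_se:.3f}")
```

The two SEs use the same residuals and differ only in how those residuals are combined, which is why they can diverge when the spread of the outcome depends on X.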

t-Statistic

The t-statistic (in square brackets in the example table) is the ratio of the estimated coefficient to its standard error. T-statistics usually appear in the output of regression procedures but are often omitted from published regression tables, as they’re just a tool for constructing confidence intervals and significance tests.

p-Values and Significance Tests

In the table above, if an estimated coefficient (in bold) is marked with one or more asterisks,⁶ that means the estimate is “statistically significant” at the 1%, 5%, or 10% level. In other words, the p-value (from a two-sided test⁷ of the null hypothesis that the true coefficient is zero) is below 0.01, 0.05, or 0.1.

To calculate a p-value, we typically assume that the data on which you run your regression are a random sample from some larger population. We then imagine that you draw a new random sample many times and run your regression for every new sample. (Alternatively, we may imagine randomly assigning some treatment many times. See our guide on hypothesis testing for more details.) This procedure would create a distribution of estimates and t-statistics. Given this distribution, the p-value captures the probability that the absolute value of the t-statistic would have been at least as large as the value that you actually observed if the true coefficient were zero. If the p-value is greater than or equal to some conventional threshold (such as 0.05 or 0.1), the estimate is “not statistically significant” (at the 5% or 10% level). According to convention, estimates that are not statistically significant are not considered evidence that the true coefficient is nonzero.
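The “imagine randomly assigning some treatment many times” idea can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used for the table: the data are made up (a 0/1 ballot-stuffing indicator for 20 hypothetical polling stations), the test statistic is the absolute difference in means rather than a t-statistic, and the null hypothesis is that observers had no effect on any station.

```python
import random

def randomization_p_value(outcomes, treated, draws=2000, seed=1):
    """Return (observed difference in means, two-sided p-value) by re-assigning treatment."""
    rng = random.Random(seed)

    def diff_in_means(assignment):
        t = [y for y, d in zip(outcomes, assignment) if d == 1]
        c = [y for y, d in zip(outcomes, assignment) if d == 0]
        return sum(t) / len(t) - sum(c) / len(c)

    observed = diff_in_means(treated)
    assignment = list(treated)
    extreme = 0
    for _ in range(draws):
        rng.shuffle(assignment)  # a new random assignment with the same number treated
        if abs(diff_in_means(assignment)) >= abs(observed):
            extreme += 1
    # Share of re-assignments at least as extreme as what was actually observed.
    return observed, extreme / draws

# Hypothetical data: 1 = ballot stuffing detected; the first 10 stations had an observer.
outcomes = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1]
treated = [1] * 10 + [0] * 10
observed, p_value = randomization_p_value(outcomes, treated)
print(f"observed difference = {observed:.2f}, p-value = {p_value:.3f}")
```

A large p-value here means that differences as big as the observed one arise often under the null hypothesis purely from the luck of the assignment.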

In the table, the only estimated coefficient that is statistically significant at any of the conventional levels is the intercept (which is labeled “Constant/Intercept” because in the algebra of regression, the intercept is the coefficient on the constant 1). The intercept is the predicted value of the outcome when the values of the explanatory variables are all zero. In this example, the question of whether the true intercept is zero is of no particular interest, but the table reports the significance test for completeness. The research question is about observer effects on ballot stuffing (as shown in the heading of the table). The estimated coefficient on “Observer Present (OP)” is of main interest, and it is not statistically significant.

It is easy to misinterpret p-values and significance tests. Many scholars believe that although significance tests can be useful as a restraining device, they are often overemphasized. Helpful discussions include the American Statistical Association’s 2016 statement on p-values; the invited comments on the statement, especially Sander Greenland et al.’s “Statistical Tests, P values, Confidence Intervals, and Power: A Guide to Misinterpretations” (republished here); and short posts by David Aldous and Andrew Gelman.

F-Test and Degrees of Freedom

The bottom section of the table includes a row with the heading “F(5, 59)”, the value 1.43 (the F-statistic), and the p-value .223. This F-test is a test of the null hypothesis that the true values of the regression coefficients, excluding the intercept, are all zero. In other words, the null hypothesis is that none of the explanatory variables actually help predict the outcome. In this example, the p-value associated with the F-statistic is 0.223, so the null hypothesis is not rejected at any of the conventional significance levels. However, since our main interest is in the effects of observers, the F-test isn’t of much interest in this application. (We already knew that the estimated coefficient on “Observer Present” is not statistically significant, as noted above.)

The numbers 5 and 59 in parentheses are the degrees of freedom (df) associated with the numerator and denominator in the F-statistic formula. The numerator df (5) is the number of parameters that the null hypothesis claims are zero. In this example, those parameters are the coefficients on the 5 explanatory variables shown in the table. The denominator df (59) equals the sample size minus the total number of parameters estimated. (In this example, the sample size is 2,004 and there are only 6 estimated parameters shown in the table, but the regression also included many dummy variables for constituencies that were used in blocking.)

5 What the confidence intervals mean

Confidence intervals (CIs) are frequently reported in social science research papers and occasionally shown in regression tables. They communicate some of the uncertainty in estimation: for example, the point estimate of the coefficient on “Observer Present” is a specific value, –0.037, but the CI (calculated as the point estimate plus or minus a margin of error) is the range of values from –0.09 to 0.01, implying that any value in that range is compatible with the data. (In other words, having an observer present may have reduced the rate of ballot stuffing by 9 percentage points, or it may have actually increased the rate by 1 percentage point, or the effect may have been somewhere in between.) The coverage probability (or confidence level) of a CI is the probability that the CI contains the true value of the parameter. Reported CIs usually have a nominal (claimed) coverage probability of 95%, so they are called 95% confidence intervals.

Coverage probabilities are easy to misinterpret. In the example table, (–0.09, 0.01) is the 95% CI for the observer effect on ballot stuffing. This does not mean that there’s a 95% probability that the true effect was in the range between –0.09 and 0.01. Statements of that nature can be made in Bayesian statistics (with posterior intervals, also known as credible intervals), but confidence intervals are a construct of frequentist statistics. The coverage probability answers the following question: Imagine that we can replicate the experiment a large number of times, and the only thing that varies from one replication to another is which units are randomly assigned to treatment.⁸ How often will the CI capture the true effect of election observers? In this framework, the observer effect is fixed, but the endpoints of the CI are random. For example, if the true effect is –0.02, then it is –0.02 on every replication. But because different units are randomly assigned to treatment on each replication, we could see the following CIs in three replications of the experiment: (–0.10, –0.01), (–0.03, 0.03), and (0.00, 0.10). The first and second CIs capture the true value of –0.02, but the third misses it. The nominal coverage probability of 95% means that in a million replications, about 950,000 of the CIs would capture the true value of –0.02. It’s a claim about the ex ante reliability of our method for reporting a range, not about the ex post probability that the true observer effect is in the range between –0.09 and 0.01.
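The replication thought experiment can be simulated. In the Python sketch below, everything is hypothetical: 200 made-up units, a fixed true effect of –0.02 for every unit, a difference-in-means estimate, and a CI built with the large-sample critical value 1.96. The only thing that changes across replications is which units are assigned to treatment.

```python
import random
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def coverage_rate(replications=2000, n_units=200, true_effect=-0.02, seed=7):
    """Share of replications in which the 95% CI contains the fixed true effect."""
    rng = random.Random(seed)
    # Fixed potential outcomes: y0 if untreated, y1 if treated (same effect for every unit).
    y0 = [rng.gauss(0.10, 0.15) for _ in range(n_units)]
    y1 = [y + true_effect for y in y0]
    units = list(range(n_units))
    half = n_units // 2
    hits = 0
    for _ in range(replications):
        rng.shuffle(units)  # re-randomize which units are treated
        yt = [y1[i] for i in units[:half]]
        yc = [y0[i] for i in units[half:]]
        estimate = mean(yt) - mean(yc)
        se = sqrt(sample_variance(yt) / len(yt) + sample_variance(yc) / len(yc))
        lower, upper = estimate - 1.96 * se, estimate + 1.96 * se
        if lower <= true_effect <= upper:
            hits += 1
    return hits / replications

rate = coverage_rate()
print(f"share of CIs that captured the true effect: {rate:.3f}")
```

The share should land close to the nominal 95%, even though any single interval either contains –0.02 or does not.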

Greenland et al. give a helpful discussion of the benefits, limitations, and common misinterpretations of CIs. As they note, “many authors agree that confidence intervals are superior to tests and P values because they allow one to shift focus away from the null hypothesis, toward the full range of effect sizes compatible with the data—a shift recommended by many authors and a growing number of journals.” However, “the confidence interval is computed from many assumptions, the violation of which may have led to the results. Thus it is the combination of the data with the assumptions, along with the arbitrary 95% criterion, that are needed to declare an effect size outside the interval is in some way incompatible with the observations. Even then, judgements as extreme as saying the effect size has been refuted or excluded will require even stronger conditions.”

The CONSORT Explanation and Elaboration document notes that in medicine, “Many journals require or strongly encourage the use of confidence intervals.” In the social sciences, CIs aren’t always explicitly reported; some authors report only point estimates and standard errors. If the degrees of freedom for the distribution of the t-statistic are reported, readers with sufficient technical background can construct a CI on their own (although it would obviously be more helpful if authors reported CIs explicitly). In our example table, the df for the t-statistics is the same as the denominator df (59) for the F-statistic. To construct the margin of error for a 95% CI, we multiply the SE by the appropriate critical value, the 0.975 quantile of the t-distribution with 59 degrees of freedom, which is 2.001 (in R, use the command qt(.975, df = 59)). Thus, the rule of thumb that we mentioned in the section on SEs (“the margin of error for a 95% CI is approximately twice the SE”) works well here. However, if we had, say, only 20 degrees of freedom, the appropriate critical value would be about 2.09, and the 95% CI should be wider than the rule of thumb would suggest.
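The arithmetic behind a CI is short enough to sketch in Python. The point estimate and the critical values below come from the text; the standard error of 0.025 is an assumption made purely for illustration, since the SE is not quoted in this guide's text.

```python
def confidence_interval(estimate, se, critical_value):
    """Return (lower, upper) = estimate minus/plus critical_value * se."""
    margin = critical_value * se
    return estimate - margin, estimate + margin

estimate = -0.037    # coefficient on "Observer Present" quoted in the text
assumed_se = 0.025   # ASSUMPTION: illustrative SE, not taken from the table

# Critical value for 59 degrees of freedom, as given in the text.
lower, upper = confidence_interval(estimate, assumed_se, 2.001)
# Rule of thumb: margin of error is about twice the SE.
rule_lower, rule_upper = confidence_interval(estimate, assumed_se, 2.0)
# With only about 20 degrees of freedom the interval is wider.
small_lower, small_upper = confidence_interval(estimate, assumed_se, 2.09)

print(f"59 df:         ({lower:.3f}, {upper:.3f})")
print(f"rule of thumb: ({rule_lower:.3f}, {rule_upper:.3f})")
print(f"20 df:         ({small_lower:.3f}, {small_upper:.3f})")
```

With this assumed SE, the 59-df interval rounds to the (–0.09, 0.01) range discussed in this section, and the rule-of-thumb interval is nearly identical.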

6 Watch out for “researcher degrees of freedom”

The SEs, p-values, significance tests, and CIs reported in regression tables typically assume that the researchers would have made all the same analytic decisions (which observations and variables to include in the regression, which hypothesis to test, etc.) if the outcome data had shown different patterns or if (in a randomized experiment) different units had been randomly assigned to treatment. This assumption is credible if all those decisions were pre-specified before the researchers saw any data on outcomes or treatment assignments. Otherwise, researchers may make decisions that consciously or unconsciously tilt a study toward a desired result. This problem is known as “fishing,”⁹ “researcher degrees of freedom,”¹⁰ or “the garden of forking paths.”¹¹

In an instructive and entertaining paper, Joseph Simmons, Leif Nelson, and Uri Simonsohn use simulations as well as actual experiments to show how easy it is for researcher degrees of freedom to invalidate significance tests. In simulations, they show that when researchers have unlimited discretion about which outcome to analyze, when to stop recruiting subjects, how to model the effect of a covariate, and which treatment conditions to include in the analysis, a significance test that claims to have a Type I error probability (false-positive rate) of 5% can easily be made to have an actual Type I error probability as high as 61%. In other words (and as said in the paper’s title), “Undisclosed flexibility in data collection and analysis allows presenting anything as significant.” Allowing themselves unlimited flexibility in the data collection and analysis for an actual experiment, Simmons et al. manage to reach the necessarily false conclusion that listening to the Beatles’ song “When I’m Sixty-Four” made the subjects “nearly a year-and-a-half younger,” with a p-value of .04.
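A much simpler simulation than the one in the paper already shows the mechanism. The Python sketch below is our own illustration, not a reproduction of Simmons et al.: it models only one kind of flexibility (choosing among several independent outcomes after seeing the data), uses made-up normal data with no true effect, and applies a large-sample z-test, so it will not reproduce the 61% figure.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_statistic(a, b):
    """Difference in means divided by its estimated standard error."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    return (mean_a - mean_b) / sqrt(var_a / len(a) + var_b / len(b))

def false_positive_rate(n_outcomes, simulations=2000, n_per_group=30, seed=3):
    """Share of null studies in which at least one outcome looks 'significant' at 5%."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(0.975)  # about 1.96
    false_positives = 0
    for _study in range(simulations):
        for _outcome in range(n_outcomes):
            # No true effect: both groups are drawn from the same distribution.
            treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
            control = [rng.gauss(0, 1) for _ in range(n_per_group)]
            if abs(z_statistic(treated, control)) > critical:
                false_positives += 1
                break  # the researcher reports this outcome and stops looking
    return false_positives / simulations

rate_one = false_positive_rate(1)
rate_five = false_positive_rate(5)
print(f"one pre-specified outcome: {rate_one:.3f}")
print(f"best of five outcomes:     {rate_five:.3f}")
```

With one pre-specified outcome the false-positive rate stays near the nominal 5%; with the freedom to pick among five, it is several times higher.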

One remedy is for researchers to pre-specify and publicly archive their decisions about data collection and analysis (e.g., stopping rules, outcome measures, covariates, regression models, sample exclusions, and subgroup definitions) before they see the outcome data (and, ideally, before they assign treatments). Documents of these decisions are known as pre-analysis plans (PAPs). Critics of PAPs worry that they inhibit exploratory data analysis. Proponents argue that deviations from the plans are not prohibited, but should be fully disclosed and highlighted, to help readers distinguish between exploratory and confirmatory analyses. For valuable discussions, see the symposia in Political Analysis (Winter 2013) and Journal of Economic Perspectives (Summer 2015). Also take a look at our guide on pre-registration.

7 Watch out for other possible biases

Just because you use a regression to estimate a relationship does not mean that the relationship you estimate truly captures the type of relationship you are interested in. Here are some of the possible sources of bias to be aware of:

Selection bias can arise when there are systematic, unmeasured differences in characteristics between the individuals who are selected into the sample or the treatment and those who are not selected. In other words, selection bias can refer to either of two concerns:
1. If treatment is determined by some process other than random assignment (e.g., if subjects self-select into treatment), then treated subjects may differ from untreated subjects in ways that affect the outcome. Such differences can easily lead to bias in a regression of the outcome on treatment, even if measured characteristics of the subjects are included as covariates, because treated and untreated subjects may differ in unmeasured ways.
2. If the sample that is included in the regression isn’t a random sample of the population of interest, then the regression may yield biased estimates of the population relationship between the outcome and the explanatory variables.
Attrition bias is a form of selection bias that can occur when outcome data are missing for a nonrandom subset of the original sample. In studies of treatment effects, attrition bias can be especially challenging to address if the treatment may have affected attrition (the loss of outcome data): when the rates or patterns of attrition differ between treated and untreated subjects, even a randomized experiment may not yield unbiased treatment effect estimates for any population.¹² See our guide on missing data for details.
Similarly, if the treatment affects the measurement of the outcome, the symmetry that random assignment created is threatened, and estimated treatment effects may be biased even in a randomized experiment.
Adjustment for covariates that may have been affected by the treatment can lead to bias, as explained in 10 Things to Know About Covariate Adjustment.¹³
Publication bias, also known as the file drawer problem, arises when entire studies go unpublished not because their quality is any lower than that of other studies on the same topic, but because of the nature of their results (e.g., because the results are considered unsurprising, or because they do not reach conventional thresholds for statistical significance). As Robert Rosenthal wrote in a classic article, “The extreme view of the ‘file drawer problem’ is that journals are filled with the 5% of the studies that show Type I errors, while the file drawers are filled with the 95% of the studies that show nonsignificant results.”¹⁴

8 What the R² means

R² is the squared multiple correlation coefficient, also known as the Coefficient of Determination. R² shows the proportion of the variance of the outcome that is “explained” by the regression. In other words, it is the variance of the outcome values predicted from the explanatory variables, divided by the variance of the actual outcome values. The larger the R² is, the better the fit of the regression model: a model fits the data well if the differences between the actual values and the values predicted by the regression are small. The R² is generally of secondary importance, unless your main concern is using the regression equation to make accurate predictions. In an OLS regression that includes an intercept, it is always between 0 and 1, and the stronger the independent variables are as predictors, the closer the R² will be to 1. It is possible, however, that a statistically significant relationship between X and Y is found even though the R² is low; this just means we have evidence of a relationship between X and Y, but X does not explain a large proportion of the variation in Y.

In the example table, the R² value is .011, showing that in this case, the explanatory variables account for only a small portion of the variance of the outcome. If a model could explain all of the variance, the values predicted by the regression would always equal the actual values observed, so the regression line would fit the data perfectly and the R² would equal 1.
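The two descriptions of R² used in this section (explained variance over total variance, and small differences between actual and predicted values) give the same number in OLS with an intercept. The Python sketch below checks this on made-up data for a two-variable regression; `r_squared` is a hypothetical helper name.

```python
def r_squared(x, y):
    """Return R-squared computed two ways for a regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    total = sum((yi - y_bar) ** 2 for yi in y)                 # variation in actual outcomes
    explained = sum((f - y_bar) ** 2 for f in fitted)          # variation in predicted outcomes
    residual = sum((yi - f) ** 2 for yi, f in zip(y, fitted))  # variation left unexplained
    return explained / total, 1 - residual / total

# Made-up data with a noisy upward trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 1.5, 3.5, 3.0, 5.5, 4.0, 6.5, 6.0]
by_explained, by_residual = r_squared(x, y)
print(f"R-squared = {by_explained:.3f} (explained/total), {by_residual:.3f} (1 - residual/total)")
```

If every point lay exactly on the line, the residual variation would be zero and both versions would equal 1.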

Although R² summarizes how well the model fits the data, any single-number summary has limitations. In Anscombe’s quartet, all four regressions have the same R², but the four graphs look very different.

9 Be careful when comparing coefficients

If one coefficient is bigger than another, does that mean the outcome is more sensitive to that explanatory variable? No—the interpretation of coefficients depends on the scales the variables are measured on. If you convert an explanatory variable from feet to miles, the coefficient will get a lot bigger, without any real change in the underlying relationship between the explanatory variable and the outcome.
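A quick check in Python shows the rescaling effect. The distances and outcomes are made up; the only fact relied on is that one mile equals 5,280 feet.

```python
def slope(x, y):
    """Least-squares slope from a regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

# Made-up data: a distance measured in feet and some outcome.
distance_feet = [1000, 2500, 5280, 8000, 10560, 15840]
outcome = [0.62, 0.60, 0.55, 0.51, 0.47, 0.40]

# Same distances, expressed in miles.
distance_miles = [d / 5280 for d in distance_feet]

slope_feet = slope(distance_feet, outcome)
slope_miles = slope(distance_miles, outcome)
print(f"per foot: {slope_feet:.8f}")
print(f"per mile: {slope_miles:.8f}")
print(f"ratio:    {slope_miles / slope_feet:.1f}")
```

The per-mile coefficient is 5,280 times the per-foot coefficient, yet both describe exactly the same fitted relationship.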

10 Meet the whole family

So far, this guide has focused on ordinary least squares regression, one of the most commonly used estimation methods in the social sciences. In fact, there are many other regression methods, including weighted least squares and generalized least squares, as well as all sorts of nonlinear models for limited dependent variables—outcomes that are limited to a particular range of values, such as binary (0/1), categorical (A,B,C,…), or count (0,1,2,…) outcomes.

Researchers might use a weighted least squares regression when the variance of ε differs from one observation to another and can be modeled as a function of one or more predictors (unequal error variance of this kind is called heteroskedasticity).¹⁵

You might see a logit or a probit regression when the outcome is binary, meaning it has only two possible values: yes/no, 0/1, or True/False. Logit and probit differ in terms of the assumptions about the underlying data-generating process, but they often yield similar results.¹⁶
Ordered logit and ordered probit models may be used for outcomes with multiple ordered categories (such as “strongly disagree,” “disagree,” “agree,” “strongly agree”).
Multinomial logit or multinomial probit models may be used for outcomes with multiple unordered categories (“Labour,” “Conservative,” “Lib Dem”).
Poisson or Negative Binomial models may be used when the outcome is a count (“how many riots this year”).
Tobit models are sometimes used for non-negative outcomes (“How much time spent working this month”).
and many more …

For the simple linear case, the coefficient tells you the change in Y you get for each unit change in X, but for nonlinear regressions the interpretation can be much more difficult. For nonlinear models you should generally expect authors to provide substantive interpretations of the coefficients. The program Clarify by Gary King and colleagues helps with this for Stata users; the Zelig package for R (also by King and coauthors) supports analysis and interpretation of these models in R.
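To see why interpretation is harder, consider a logit model with a single explanatory variable. In the Python sketch below the intercept and coefficient are hypothetical values chosen for illustration; the point is only that the same coefficient implies different changes in the predicted probability depending on where you start.

```python
from math import exp

def logistic(z):
    """Map a value on the log-odds scale to a probability between 0 and 1."""
    return 1 / (1 + exp(-z))

# HYPOTHETICAL logit estimates, not taken from any study.
intercept, coefficient = -2.0, 0.8

def predicted_probability(x):
    return logistic(intercept + coefficient * x)

# The same one-unit increase in x, starting from two different baselines.
change_low = predicted_probability(1) - predicted_probability(0)
change_high = predicted_probability(3) - predicted_probability(2)
print(f"x from 0 to 1: probability rises by {change_low:.3f}")
print(f"x from 2 to 3: probability rises by {change_high:.3f}")
```

This is why authors of papers with nonlinear models often report predicted probabilities or average effects rather than leaving readers with raw coefficients.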

1. Originating author: Abby Long. Revisions: Winston Lin, 21 July 2016. The guide is a live document and subject to updating by EGAP members at any time; contributors listed are not responsible for subsequent edits. Thanks to Don Green, Macartan Humphreys, and Tod Mijanovich for helpful discussions.
2. For in-depth discussions, see: Joshua D. Angrist & Jörn-Steffen Pischke (2009), Mostly Harmless Econometrics, chapters 3 and 8; Richard A. Berk et al. (2014), “Misspecified Mean Function Regression: Making Good Use of Regression Models That Are Wrong,” Sociological Methods and Research 43: 422–451; Andreas Buja et al., “Models as Approximations—A Conspiracy of Random Regressors and Model Misspecification Against Classical Inference in Regression,” working paper; Bruce Hansen, Econometrics, online textbook.
3. On regression in observational studies, see: 10 Strategies for Figuring out if X Caused Y; David A. Freedman (1991), “Statistical Models and Shoe Leather” (with discussion), Sociological Methodology 21: 291–358; Angrist & Pischke, Mastering ’Metrics: The Path from Cause to Effect (2015) and Mostly Harmless Econometrics; Guido W. Imbens & Jeffrey M. Wooldridge (2009), “Recent Developments in the Econometrics of Program Evaluation,” Journal of Economic Literature 47: 5–86; Imbens (2015), “Matching Methods in Practice: Three Examples,” Journal of Human Resources 50: 373–419. On regression adjustment in randomized experiments, see 10 Things to Know About Covariate Adjustment and Winston Lin’s Development Impact blog posts (here and here).
4. Strictly speaking, the true SE is the standard deviation of the estimated coefficient, while what we see in the regression table is the estimated SE. However, in common parlance, people often say “standard error” when they mean the estimated SE, and we’ll do the same.
5. Robust SEs are also known as Huber–White or sandwich SEs. On the properties of robust SEs, see: Mostly Harmless Econometrics, section 3.1.3 and chapter 8; Guido W. Imbens & Michal Kolesár (2016), “Robust Standard Errors in Small Samples: Some Practical Advice,” Review of Economics and Statistics 98: 701–712; Charles S. Reichardt & Harry F. Gollob (1999), “Justifying the Use and Increasing the Power of a t Test for a Randomized Experiment with a Convenience Sample,” Psychological Methods 4: 117–128; Cyrus Samii & Peter M. Aronow (2012), “On Equivalencies Between Design-Based and Regression-Based Variance Estimators for Randomized Experiments,” Statistics and Probability Letters 82: 365–370; Winston Lin (2013), “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique,” Annals of Applied Statistics 7: 295–318; Alberto Abadie, Susan Athey, Guido W. Imbens, & Jeffrey M. Wooldridge (2014), “Finite Population Causal Standard Errors,” NBER Working Paper No. 20325.
6. The use of asterisks to flag statistically significant results is common but not universal. Our intention here is merely to explain what the asterisks mean, not to recommend that they should or should not be used.
7. Two-sided tests are the default in most software packages and in some research fields, so when tables do not explicitly note whether the p-values associated with regression coefficients are one- or two-sided, they are usually two-sided. Ben Olken (2015, pp. 67, 70) notes that since “convention typically dictates two-sided hypothesis tests,” researchers who prefer one-sided tests should commit to that choice in a pre-analysis plan so “they cannot be justly accused of cherry-picking the test after the fact.” Sander Greenland et al. (2016, p. 342) argue against the view that one should always use two-sided p-values, but write, “Nonetheless, because two-sided P values are the usual default, it will be important to note when and why a one-sided P value is being used instead.”
8. The framework where “the only thing that varies from one replication to another is which units are randomly assigned to treatment” is known as randomization-based inference. This isn’t the only framework for frequentist inference. In the classical regression framework, the only thing that varies is that on each replication, different values of ε are randomly drawn. And in the random sampling framework, on each replication a different random sample is drawn from the population. On randomization-based inference, see the Reichardt & Gollob, Samii & Aronow, Lin, and Abadie et al. references in note 5; on the random sampling framework, see the references in note 2.
9. See, e.g., Macartan Humphreys, Raul Sanchez de la Sierra, & Peter van der Windt (2013), “Fishing, Commitment, and Communication: A Proposal for Comprehensive Nonbinding Research Registration,” Political Analysis 21: 1–20.
10. Joseph P. Simmons, Leif D. Nelson, & Uri Simonsohn (2011), “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” Psychological Science 22: 1359–1366.
11. Andrew Gelman & Eric Loken (2013), “The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No ‘Fishing Expedition’ or ‘P-Hacking’ and the Research Hypothesis Was Posited Ahead of Time,” preprint.
For more discussion of attrition bias in randomized experiments, see, e.g., Alan S. Gerber & Donald P. Green (2012), Field Experiments: Design, Analysis, and Interpretation, chapter 7.↩︎
See also: Mostly Harmless Econometrics, section 3.2.3; Paul R. Rosenbaum (1984), “The Consequences of Adjustment for a Concomitant Variable That Has Been Affected by the Treatment,” Journal of the Royal Statistical Society. Series A (General) 147: 656–666.↩︎
Robert Rosenthal (1979), “The ‘File Drawer Problem’ and Tolerance for Null Results,” Psychological Bulletin 86: 638–641. On reforms to counter publication bias, see: Brendan Nyhan (2015), “Increasing the Credibility of Political Science Research: A Proposal for Journal Reforms,” PS: Political Science and Politics 48 (S1): 78–83; Michael G. Findley, Nathan M. Jensen, Edmund J. Malesky, & Thomas B. Pepinsky (forthcoming), “Can Results-Free Review Reduce Publication Bias? The Results and Implications of a Pilot Study,” Comparative Political Studies.↩︎
Weighting by the inverse of the variance of ε is a form of generalized least squares (GLS). The classical argument is that GLS is more efficient (i.e., has lower variance) than OLS under heteroskedasticity. However, when the goal is to estimate an average treatment effect, some researchers question the relevance of the classical theory, because if treatment effects are heterogeneous, GLS and OLS are not just more efficient and less efficient ways to estimate the same treatment effect. Instead, they estimate different weighted average treatment effects. In other words, they answer different questions, and choosing GLS for efficiency is arguably like looking for your keys where the light’s better. In “The Credibility Revolution in Empirical Economics,” Angrist & Pischke (2010, pp. 11–12) write: “Today’s applied economists have the benefit of a less dogmatic understanding of regression analysis. Specifically, an emerging grasp of the sense in which regression and two-stage least squares produce average effects even when the underlying relationship is heterogeneous and/or nonlinear has made functional form concerns less central. The linear models that constitute the workhorse of contemporary empirical practice usually turn out to be remarkably robust, a feature many applied researchers have long sensed and that econometric theory now does a better job of explaining. Robust standard errors, automated clustering, and larger samples have also taken the steam out of issues like heteroskedasticity and serial correlation. A legacy of White’s (1980a) paper on robust standard errors, one of the most highly cited from the period, is the near death of generalized least squares in cross-sectional applied work. In the interests of replicability, and to reduce the scope for errors, modern applied researchers often prefer simpler estimators though they might be giving up asymptotic efficiency.” Similarly, Jim Stock (2010, p. 85) comments: “The 1970s procedure for handling potential heteroskedasticity was either to ignore it or to test for it, to model the variance as a function of the regressors, and then to use weighted least squares. While in theory weighted least squares can yield more statistically efficient estimators, modeling heteroskedasticity in a multiple regression context is difficult, and statistical inference about the effect of interest becomes hostage to the required subsidiary modeling assumptions. White’s (1980) important paper showed how to get valid standard errors whether there is heteroskedasticity or not, without modeling the heteroskedasticity. This paper had a tremendous impact on econometric practice: today, the use of heteroskedasticity-robust standard errors is standard, and one rarely sees weighted least squares used to correct for heteroskedasticity.” (Emphasis added in both quotations.)↩︎
Logit, probit, and other limited dependent variable (LDV) models do not immediately yield estimates of the average treatment effect (ATE). To estimate ATE, one needs to compute an average marginal effect (or average predictive comparison) after estimating the LDV model (see, e.g., Andrew Gelman & Iain Pardoe [2007], “Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components,” Sociological Methodology 37: 23–51). Some researchers argue that the complexity of marginal effect calculations for LDV models is unnecessary because OLS tends to yield similar ATE estimates (see Mostly Harmless Econometrics, section 3.4.2, and the debate between Angrist and his discussants in “Estimation of Limited Dependent Variable Models with Dummy Endogenous Regressors” [2001], Journal of Business and Economic Statistics 19: 2–28). In randomized experiments, the robustness of OLS is supported by both asymptotic theory and simulation evidence. For theory, see Lin, “Agnostic Notes on Regression Adjustments to Experimental Data.” For simulations, see Humphreys et al., “Fishing, Commitment, and Communication,” and David R. Judkins & Kristin E. Porter (2016), “Robustness of Ordinary Least Squares in Randomized Clinical Trials,” Statistics in Medicine 35: 1763–73. See also Lin’s comments on this MHE blog post.↩︎
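To make the last note concrete, here is a minimal sketch in Python (NumPy only) of the comparison it describes. It does not use the Ghana data: the experiment is simulated, and the function names (`fit_logit`, `simulate`, `compare_estimates`) and all parameter values are hypothetical choices made for illustration. The sketch fits a logit model, computes an average predictive comparison by predicting each unit's outcome probability under treatment and under control and averaging the difference, and sets the result next to the OLS coefficient on treatment.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logit(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson.

    X must already contain a constant column.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        weights = p * (1.0 - p)
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * weights[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


def simulate(n=5000, seed=0):
    """Simulate a randomized experiment with a binary outcome (illustrative values)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)              # one pre-treatment covariate
    t = rng.integers(0, 2, size=n)      # treatment assigned by coin flip
    prob = sigmoid(-0.5 + 0.8 * t + 1.0 * x)
    y = (rng.random(n) < prob).astype(float)
    return x, t, y


def compare_estimates(x, t, y):
    """Return (OLS coefficient on t, logit average predictive comparison, raw logit coefficient on t)."""
    X = np.column_stack([np.ones_like(x), t, x])

    # OLS: the coefficient on t is read directly as an ATE estimate.
    ols_beta = np.linalg.lstsq(X, y, rcond=None)[0]

    # Logit: the raw coefficient is on the log-odds scale, so an extra step is needed.
    logit_beta = fit_logit(X, y)
    X_treated = X.copy()
    X_treated[:, 1] = 1.0
    X_control = X.copy()
    X_control[:, 1] = 0.0
    apc = np.mean(sigmoid(X_treated @ logit_beta) - sigmoid(X_control @ logit_beta))

    return ols_beta[1], apc, logit_beta[1]


if __name__ == "__main__":
    x, t, y = simulate()
    ols_ate, apc, logit_coef = compare_estimates(x, t, y)
    print("OLS coefficient on treatment:        ", round(ols_ate, 3))
    print("Logit average predictive comparison: ", round(apc, 3))
    print("Raw logit coefficient (log-odds):    ", round(logit_coef, 3))
```

In a simulation like this one, the OLS coefficient and the logit-based average predictive comparison should land close together, while the raw logit coefficient is on a different scale and cannot be read as an ATE. This is only one simulated design and does not settle the debate cited in the note.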