(by Michael Lavine)
We’re all familiar with hypothesis tests, having learned them at our parents’ knees in our statistical infancy. It’s time for them to die.
By hypothesis testing, I mean the scenario in which one divides the world into H0 and H1, calculates a continuous statistic such as a p-value – which may itself be a function of a more fundamental test statistic – or Bayes factor, then reports either “reject” or “fail to reject” (I will use “accept” as shorthand for “fail to reject”) according to whether the statistic is above or below a threshold. To be clear, I do not oppose dividing the world into H0 and H1 and calculating a continuous statistic for comparing them. But I do oppose the practice of either rejecting or accepting H0. There are two primary reasons for my opposition.
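To make the setup concrete, here is a minimal sketch (the conjugate normal model, the numbers, and the cutoff of 3 are all assumptions chosen for illustration, not anything from Lavine’s argument) showing that a Bayes factor is every bit as continuous a statistic as a p-value, and that thresholding it dichotomizes in just the same way:

```python
# Hypothetical sketch: the Bayes factor comparing H0 to H1 is a continuous
# statistic, and thresholding it (here at 3, one conventional cutoff)
# dichotomizes it exactly as a p-value cutoff does.
# Assumed setup: ybar ~ N(theta, sigma^2/n); H0: theta = 0; H1: theta ~ N(0, tau^2).
from math import sqrt
from scipy.stats import norm

n, sigma, tau, ybar = 50, 1.0, 1.0, 0.25    # assumed data summary and prior scale
se = sigma / sqrt(n)                         # standard error of ybar

# Marginal densities of ybar under each hypothesis (conjugate normal forms).
bf01 = norm.pdf(ybar, 0, se) / norm.pdf(ybar, 0, sqrt(tau**2 + se**2))
print(f"BF01 = {bf01:.2f}")                  # continuous evidence, about 1.5 here
print("threshold verdict:", "H0" if bf01 > 3 else "H1" if bf01 < 1 / 3 else "neither")
```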
First, dichotomizing a continuous variable must entail a loss of information, so reporting the continuous statistic, whether p-value, Bayes factor, or something else, must be at least as informative as reporting the accept/reject decision. And if we report the continuous statistic, there is no additional information in also reporting whether it falls above or below an arbitrary threshold. Of course, statistical users may care only whether the statistic clears the threshold, and it’s tempting to say that’s their lookout. But that’s too simplistic. As statistical experts, it’s our job to guide their interpretation and use of statistical analysis. We should insist forcefully that weighing the evidence comparing H0 to H1 is more subtle than merely noting whether the continuous statistic is above a cutoff.
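A toy illustration of that information loss (the z-statistics below are made up): two studies with nearly identical evidence land on opposite sides of the conventional 0.05 cutoff, so the binary report flips even though the continuous statistic barely moves.

```python
# Hypothetical illustration: nearly identical evidence, opposite verdicts.
from scipy.stats import norm

alpha = 0.05                          # conventional, arbitrary threshold
for label, z in [("study A", 1.97), ("study B", 1.95)]:
    p = 2 * norm.sf(abs(z))           # two-sided p-value for a z-statistic
    verdict = "reject H0" if p < alpha else "accept H0"
    print(f"{label}: z = {z:.2f}, p = {p:.4f} -> {verdict}")

# study A: z = 1.97, p = 0.0488 -> reject H0
# study B: z = 1.95, p = 0.0512 -> accept H0
```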
Second, hypothesis tests are often described as decisions, as though we decide which hypothesis to use. But in my experience that description is misleading more often than not, because there is usually no use to which the hypothesis is put. In such cases, thinking about a binary decision diverts our attention from the more important problem of quantifying the evidence favoring H0 or H1. And in the cases where there is a real decision, that decision comes with real consequences, and the associated utilities should play a role in making it. In short, formal binary hypothesis testing is not up to the job of making real decisions.
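A minimal sketch of the point about utilities (the posterior probability and the loss table below are invented for illustration): a sensible decision minimizes expected loss, and with asymmetric consequences it can differ from what any fixed threshold on the evidence alone would give.

```python
# Hypothetical sketch: a real decision weighs consequences. Suppose
# P(H1 | data) = 0.30, and suppose acting on H1 when H0 is true costs 1
# unit while acting on H0 when H1 is true costs 10 (all numbers assumed).
p_h1 = 0.30
loss = {                                # loss[action][true hypothesis]
    "act_on_H0": {"H0": 0.0, "H1": 10.0},
    "act_on_H1": {"H0": 1.0, "H1": 0.0},
}
expected = {a: (1 - p_h1) * l["H0"] + p_h1 * l["H1"] for a, l in loss.items()}
print(expected)                         # {'act_on_H0': 3.0, 'act_on_H1': 0.7}
print(min(expected, key=expected.get))  # act_on_H1, even though P(H1) < 0.5
```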
(Michael Lavine teaches statistics at the University of Massachusetts at Amherst.)
Editor’s note: I pretty much agree with all of the above, and I appreciate that Lavine slams so-called Bayesian hypothesis tests as well. The general goal of statistical inference is to summarize the information in data, typically with respect to some model or class of models. Testing model fit is important, but ultimately all models are false, and what’s interesting is where our data depart from our models, not whether we currently have enough data to “reject.”
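As one sketch of what that stance looks like in practice (a made-up example: exponential data fit by a normal model, with skewness as the check quantity), a simple plug-in predictive check reports where and by how much the model misses, rather than a thresholded verdict:

```python
# Hypothetical sketch: look at *where* the data depart from the model,
# not at a binary reject/accept. Assumed setup: fit a normal model to
# non-normal data, then compare observed skewness to its distribution
# under replicated datasets drawn from the fitted model.
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(1.0, size=200)      # data with a heavy right tail
mu, sd = y.mean(), y.std()              # fitted (plug-in) normal model

def skew(x):
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

reps = [skew(rng.normal(mu, sd, size=y.size)) for _ in range(1000)]
print(f"observed skewness: {skew(y):.2f}")
print(f"replicated skewness: {np.mean(reps):.2f} +/- {np.std(reps):.2f}")
# The direction and size of the discrepancy (a long right tail the normal
# model cannot produce) is the interesting output, not a verdict.
```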
Even more barfable to me is the idea, deeply ingrained in old-fashioned statistical theory, that interval estimation is equivalent to inverting hypothesis tests. No, it isn’t, except in some special cases with pivotal test statistics. But that’s another story . . .
That said, maybe we should all think more about the appeal of hypothesis testing and try to understand how brilliant people, from Neyman and Jeffreys on to the present day, have found hypothesis testing to be useful and important.