(by Michael Lavine)

We’re all familiar with hypothesis tests, having learned them at our parents’ knees in our statistical infancy. It’s time for them to die.

By hypothesis testing, I mean the scenario in which one divides the world into H0 and H1, calculates a continuous statistic such as a p-value – which may itself be a function of a more fundamental test statistic – or Bayes factor, then reports either “reject” or “fail to reject” (I will use “accept” as shorthand for “fail to reject”) according to whether the statistic is above or below a threshold. To be clear, I do not oppose dividing the world into H0 and H1 and calculating a continuous statistic for comparing them. But I do oppose the practice of either rejecting or accepting H0. There are two primary reasons for my opposition.

First, dichotomizing a continuous variable must entail a loss of information, and reporting the continuous statistic, whether p-value, Bayes factor, or something else, must be at least as informative as reporting the accept/reject decision. And if we report the continuous statistic, there is no additional information in also reporting whether it falls above or below an arbitrary threshold. Of course, statistical users may care only whether the statistic clears the threshold, and it’s tempting to say that’s their lookout. But that’s too simplistic. As statistical experts, it’s our job to guide their interpretation and use of statistical analysis. We should insist forcefully that weighing the evidence comparing H0 to H1 is more subtle than merely noting whether the continuous statistic is above a cutoff.
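The information loss from dichotomizing can be made concrete with a toy calculation (the z-values here are invented purely for illustration): two studies with very different strength of evidence produce the identical dichotomized report.

```python
import math

def p_value_two_sided(z):
    """Two-sided p-value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

# Two hypothetical studies with very different strength of evidence:
p_strong = p_value_two_sided(4.0)   # roughly 6e-5
p_weak   = p_value_two_sided(2.0)   # roughly 0.046

# The dichotomized report is identical for both:
alpha = 0.05
print(p_strong < alpha, p_weak < alpha)  # True True -- both simply "reject"
```

Reporting the two p-values preserves the hundredfold difference in evidence; reporting only “reject” discards it.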

Second, hypothesis tests are often described as a decision, as though we decide which hypothesis to use. But in my experience, that description is misleading more often than not, because there is no use to which the hypothesis is put. In such cases, thinking about a binary decision diverts our attention from the more important problem of quantifying the evidence favoring H0 or H1. And in the cases where there is a real decision, that decision comes with real consequences and utilities that should play a role in how we choose. In short, formal, binary hypothesis testing is not up to the job of making real decisions.

(Michael Lavine teaches statistics at the University of Massachusetts at Amherst.)

Editor’s note: I pretty much agree with all of the above, and I appreciate that Lavine slams so-called Bayesian hypothesis tests as well. The general goal of statistical inference is to summarize the information in data, typically with respect to some model or class of models. Testing model fit is important, but ultimately all models are false, and what’s interesting is where our data depart from our models, not whether we currently have enough data to “reject.”

Even more barfable to me is the idea, deeply ingrained in old-fashioned statistical theory, that interval estimation is equivalent to inversions of hypothesis tests. No it isn’t, except in some special cases of pivotal test statistics. But that’s another story . . .

That said, maybe we should all think more about the appeal of hypothesis testing and try to understand how brilliant people from Neyman and Jeffreys, on to the present day, have found hypothesis testing to be useful and important.

Also relevant are this article by mathematical psychologist Dave Krantz and this blog discussion.

First, thanks to Dr. Scott Evans at Harvard University for showing me this interesting website. I agree with Professor Lavine’s reservations on hypothesis testing (I used “reservations,” not “objections”). The results from hypothesis testing have very little value in practice, especially from a risk-benefit perspective. So a p-value should not be the end of the story. On the other hand, testing can be very useful, for example, in a fishing expedition for variable selection in a big pond. In my humble opinion, we should teach students more about prediction than just deriving the “best” p-value for association in regression; after all, no model is correct, but an approximation can be very useful.

Michael,

Thanks for the thoughtful article. I agree that hypothesis testing can be reductionist, but it is also convenient in the case that LJ Wei cites, where you want to understand a phenomenon but have little idea where to start. For instance, I’m studying community college completion, about which we know almost nothing: Dept of Ed surveys find that half of all dropouts leave for “personal reasons,” twice as many as leave for family or financial reasons. I used logistic regression with the usual controls as a quick way to identify which of 200+ variables are worth further exploration and higher quality study. Some factors not only correlated with multiple college outcomes but were thematically similar to each other. What would you suggest as an alternative method of variable exploration? Or would you say that hypothesis testing should simply not be the end of the research project?

Janet

Hi Janet,

With over 200 variables, I appreciate the need to select just a few for further study. I would try to use selection criteria relevant to the study. For example, I might select those that have the largest estimated coefficients, or those whose confidence intervals (credible regions, for Bayesians) don’t rule out large effects, or those which, if I exclude them from the model, result in appreciably worse out-of-sample predictions.

And it’s true that I may end up dividing the variables into two groups, just as hypothesis testing does. But my division would be chosen for substantive reasons like (a) I have enough time to look only at 20 variables, or (b) I want to look at all variables with the potential to make a substantial change in Pr[completion], or (c) a histogram of estimated coefficients shows an outlying cluster, or (d) some of the variables are subject to intervention and others are not.
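A minimal sketch of screening rules like (a) and (b) above, assuming the regression has already been fit and the predictors were standardized so that coefficient magnitudes are comparable (all variable names and numbers below are hypothetical):

```python
# Hypothetical output from a fitted logistic regression:
# variable -> (estimated coefficient, standard error)
estimates = {
    "hours_worked":   (-0.80, 0.30),
    "has_dependents": (-0.45, 0.25),
    "pell_grant":     ( 0.10, 0.40),
    "age":            ( 0.05, 0.02),
}

def top_k_by_magnitude(est, k):
    """Rule (a): keep the k variables with the largest estimated |coefficient|."""
    return sorted(est, key=lambda v: abs(est[v][0]), reverse=True)[:k]

def could_be_large(est, threshold=0.5):
    """Rule (b): keep variables whose ~95% interval does not rule out a large effect."""
    keep = []
    for name, (coef, se) in est.items():
        lo, hi = coef - 1.96 * se, coef + 1.96 * se
        if lo <= -threshold or hi >= threshold:
            keep.append(name)
    return keep

print(top_k_by_magnitude(estimates, 2))  # ['hours_worked', 'has_dependents']
print(could_be_large(estimates))  # 'age' drops out: precisely estimated near zero
```

Note that the two rules can disagree: an imprecisely estimated variable like `pell_grant` has a small point estimate but has not been ruled out as important, which is exactly the distinction a single accept/reject report would hide.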

In short, you have a real decision to make, with real consequences for the future course of your analysis. I would try to base the decision on criteria that are relevant to the problem.

At ENAR, I attended a session on regulatory and legal statistics. In one of the talks it was noted that, in a legal case, a judge found a p-value of 0.07 “significant” because of a small sample size (one that could not be increased; I think it was a wrongful termination and discrimination suit). I was pleased.

I think the presenter was against overly simplistic interpretation of the p-value as well.

I personally hope this discussion spreads to the stat 101 classes. We need all scientists and doctors to have this critical examination of the p-value. However, we do have to make decisions in many cases, and so I hope the discussion eventually shifts to a decision-making framework (formal or otherwise).

To the editor,

I am interested to know more about your comment on interval estimation and test inversion.

Is your argument that inversion of a set of hypotheses leads to a confidence set? Or is the argument more subtle?

Hi Mark,

Testing procedures are useful tools; for example, they can be used for obtaining interval estimates (one can generalize “pivotal” to “asymptotically pivotal”). However, using a test result as a “final” answer to a real-world problem is not that interesting. We still need to teach those tools in basic statistics courses, but we should tell students how to use them properly.
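To make the inversion concrete in the simplest pivotal case (a normal mean with known sigma; the numbers are chosen purely for illustration), one can scan candidate values of mu0 and keep those the two-sided z-test fails to reject. The surviving set recovers the familiar xbar ± 1.96·sigma/√n interval:

```python
import math

def z_test_rejects(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0 with known sigma (the pivotal case)."""
    z = (xbar - mu0) * math.sqrt(n) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p < alpha

# Invert the test: the 95% confidence set is every mu0 the test does NOT reject.
xbar, sigma, n = 10.0, 2.0, 25
grid = [mu / 100 for mu in range(800, 1201)]  # candidate mu0 values, step 0.01
ci = [mu0 for mu0 in grid if not z_test_rejects(xbar, mu0, sigma, n)]
print(min(ci), max(ci))  # agrees with xbar +/- 1.96 * sigma / sqrt(n)
```

In this pivotal case the inverted test and the standard interval coincide, which is exactly the special situation the editor’s note describes; without pivotality the correspondence breaks down.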