(by Andrew Gelman)

Rob Kass’s article on statistical pragmatism is scheduled to appear in *Statistical Science* along with some discussions. Here are my comments.

I agree with Rob Kass’s point that we can and should make use of statistical methods developed under different philosophies, and I am happy to take the opportunity to elaborate on some of his arguments.

I’ll discuss the following:

– Foundations of probability

– Confidence intervals and hypothesis tests

– Sampling

– Subjectivity and belief

– Different schools of statistics

**Foundations of probability.** Kass describes probability theory as anchored upon physical randomization (coin flips, die rolls and the like) but being useful more generally as a mathematical model. I completely agree but would also add another anchoring point: calibration. Calibration of probability assessments is an objective, not subjective, process, although some subjectivity (or scientific judgment) is necessarily involved in the choice of events used in the calibration. In that way, Bayesian probability calibration is closely connected to frequentist probability statements, in that both are conditional on “reference sets” of comparable events. We discuss these issues further in chapter 1 of *Bayesian Data Analysis*, featuring examples from sports betting and record linkage.
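
To make the idea of calibration against a reference set concrete, here is a minimal sketch, not taken from the article or the book. The function name `calibration_table`, the equal-width binning, and the simulated forecasts are all assumptions made for illustration. It groups stated probabilities into bins and compares the average stated probability in each bin with the observed frequency of the events.

```python
import random

def calibration_table(probs, outcomes, n_bins=5):
    """Group forecasts into equal-width probability bins and compare the
    mean stated probability with the observed event frequency in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        table.append((len(members), mean_p, freq))
    return table

# Simulated reference set (an assumption for illustration): each event occurs
# with exactly its stated probability, so the forecasts are calibrated by
# construction.
rng = random.Random(0)
probs = [rng.random() for _ in range(20000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]

for n, mean_p, freq in calibration_table(probs, outcomes):
    print(f"n={n:5d}  stated={mean_p:.3f}  observed={freq:.3f}")
```

The judgment enters in deciding which events belong in `probs` and `outcomes` in the first place; once that reference set is chosen, the comparison itself is mechanical.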

**Confidence intervals and hypothesis tests.** I agree with Kass that confidence and statistical significance are “valuable inferential tools.” They are treated differently in classical and Bayesian statistics, however. In the Neyman-Pearson theory of inference, confidence and statistical significance are two sides of the same coin, with a confidence interval being the set of parameter values not rejected by a significance test. Unfortunately, this approach falls apart (or, at the very least, is extremely difficult) in problems with high-dimensional parameter spaces that are characteristic of my own applied work in social science and environmental health.
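
The duality is easy to see in one dimension. The sketch below is an illustration under assumed conditions (a normal mean with known standard deviation, a two-sided z-test at the 5% level, and made-up numbers for `xbar`, `sigma`, and `n`): it scans a grid of candidate values, keeps those the test does not reject, and compares the result with the usual closed-form interval.

```python
import math

Z_CRIT = 1.959964  # two-sided 5% critical value of the standard normal

def z_test_rejects(xbar, mu0, sigma, n):
    """Two-sided z-test of H0: mu = mu0, with sigma assumed known."""
    z = abs(xbar - mu0) / (sigma / math.sqrt(n))
    return z > Z_CRIT

def interval_by_inversion(xbar, sigma, n, lo, hi, steps=20001):
    """Keep every grid value of mu0 that the test does not reject."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    kept = [mu0 for mu0 in grid if not z_test_rejects(xbar, mu0, sigma, n)]
    return min(kept), max(kept)

# Hypothetical summary statistics, chosen only for illustration.
xbar, sigma, n = 1.2, 2.0, 50
lower, upper = interval_by_inversion(xbar, sigma, n, lo=-5.0, hi=5.0)
half_width = Z_CRIT * sigma / math.sqrt(n)

print(f"inverted test: ({lower:.3f}, {upper:.3f})")
print(f"closed form:   ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```

A grid scan of this kind is feasible for a single parameter; the number of grid points grows exponentially with the dimension of the parameter, which is one way to see why the inversion becomes so difficult in high-dimensional problems.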

In a modern Bayesian approach, confidence intervals and hypothesis testing are both important but are not isomorphic; they represent two different steps of inference. Confidence statements, or posterior intervals, are summaries of inference about parameters conditional on an assumed model. Hypothesis testing–or, more generally, model checking–is the process of comparing observed data to replications under the model if it were true. Statistical significance in a hypothesis test corresponds to some aspect of the data which would be unexpected under the model. For Bayesians as for other statistical researchers, both these steps of inference are important: we want to make use of the mathematics of probability to make conditionally valid statements about unobserved quantities, and we also want to make use of this same probability theory to reveal areas in which our models do not fit the data.
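
Here is a minimal sketch of the second step, comparing data to replications under the model. Everything specific in it is an assumption for illustration: a normal model with known standard deviation, a flat prior on the mean (so the posterior for the mean is normal around the sample average), the maximum observation as the test statistic, and made-up data with one gross outlier. The function name `posterior_predictive_pvalue` is hypothetical.

```python
import random
import statistics

def posterior_predictive_pvalue(y, sigma=1.0, n_sims=2000, seed=1):
    """Estimate Pr(max(y_rep) >= max(y) | y) under y_i ~ N(mu, sigma^2)
    with sigma known and a flat prior on mu."""
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    t_obs = max(y)  # test statistic: the largest observation
    exceed = 0
    for _ in range(n_sims):
        # Draw mu from its posterior, N(ybar, sigma^2 / n).
        mu = rng.gauss(ybar, sigma / n ** 0.5)
        # Draw a replicated dataset of the same size under the model.
        y_rep = [rng.gauss(mu, sigma) for _ in range(n)]
        if max(y_rep) >= t_obs:
            exceed += 1
    return exceed / n_sims

# Made-up data for illustration: ten unremarkable values and one outlier.
y_ok = [-0.5, 0.3, 0.1, -1.2, 0.8, 0.4, -0.3, 1.1, -0.7, 0.2]
y_outlier = y_ok + [6.0]

p_ok = posterior_predictive_pvalue(y_ok)
p_outlier = posterior_predictive_pvalue(y_outlier)
print(f"p-value without outlier: {p_ok:.3f}")
print(f"p-value with outlier:    {p_outlier:.3f}")
```

A p-value near zero for the second dataset says that the largest observation is unexpected under the model. The posterior interval for the mean remains a valid statement conditional on that model, which is why the two steps are not interchangeable.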

**Sampling.** Kass discusses the role of sampling as a model for understanding statistical inference. But sampling is more than a metaphor; it is crucial in many aspects of statistics. This is evident in analysis of public opinion and health, where analyses rely on random-sample national surveys, and in environmental statistics, where continuous physical variables are studied using space-time samples. But even in areas where sampling is less apparent, it can be important. Consider medical experiments, where the object invariably is inference for the general population, not merely for the patients in the study. Similarly, the goal of Kass and his colleagues in their neuroscience research is to learn about general aspects of human and animal brains, not merely to study the particular creatures on which they have data. Ultimately, sample is just another word for subset, and in both Bayesian and classical inference, appropriate generalization from sample to population depends on a model for the sampling or selection process. I have no problem with Kass’s use of sampling as a framework for inference, and I think this will work even better if he emphasizes the generalization from real samples to real populations–not just mathematical constructs–that is central to so much of our applied inference.

**Subjectivity and belief.** The only two statements in Kass’s article that I clearly disagree with are the following: “the only solid foundation for Bayesianism is subjective,” and “the most fundamental belief of any scientist is that the theoretical and real worlds are aligned.” I will discuss the two statements in turn.

Claims of the subjectivity of Bayesian inference have been much debated, and I am under no illusion that I can resolve them here. But I will repeat my point made at the outset of this discussion that Bayesian probability, like frequentist probability, is, except in the simplest of examples, a model-based activity that is mathematically anchored by physical randomization at one end and calibration to a reference set at the other. I will also repeat the familiar, but true, argument that most of the power of a Bayesian inference typically comes from the likelihood, not the prior, and a person who is really worried about subjective model-building might profitably spend more effort thinking about assumptions inherent in additive models, logistic regressions, proportional hazards models, and the like. Even the Wilcoxon test is based on assumptions! To put it another way, I will accept the idea of subjective Bayesianism when this same subjectivity is acknowledged for other methods of inference. Until that point, I prefer to speak not of “subjectivity” but of “assumptions” and “scientific judgment.” I agree with Kass that scientists and statisticians can and should feel free to make assumptions without falling into a “solipsistic quagmire.”

Finally, I am surprised to see Kass write that scientists believe that the theoretical and real worlds are aligned. It is from acknowledging the discrepancies between these worlds that we can (a) feel free to make assumptions without being paralyzed by fear of making mistakes, and (b) feel free to check the fit of our models (those hypothesis tests again! Although I prefer graphical model checks, supplemented by p-values as necessary). All models are false, etc.

I assume that Kass is using the word “aligned” in a loose sense, to imply that scientists believe that their models are appropriate to reality even if not fully correct. But I would not even want to go that far. Often in my own applied work I have used models that have clear flaws, models that are at best “phenomenological” in the sense of fitting the data rather than corresponding to underlying processes of interest–and often such models don’t fit the data so well either. But these models can still be useful: they are still a part of statistics and even a part of science (to the extent that science includes data collection and description as well as deep theories).

**Different schools of statistics.** Like Kass, I believe that philosophical debates can be a good thing, if they motivate us to think carefully about our unexamined assumptions. Perhaps even the existence of subfields that rarely communicate with each other has been a source of progress in allowing different strands of research to be developed in a pluralistic environment, in a way that might not have been so easily done if statistical communication had been dominated by any single intolerant group. Ideas of sampling, inference, and model checking are important in many different statistical traditions and we are lucky to have so many different ideas on which to draw for inspiration in our applied and methodological research.

Go to the link above to read Rob’s original article. The other discussions, and Rob’s response to the discussions, are at the journal’s website.

When it comes down to it, Bayesian, frequentist, pure Neyman-Pearson, or other statistical frameworks are a collection of assumptions and consequences of those assumptions. This may be obvious to someone who has studied statistics for many years, but not to our clients, and perhaps we don’t think about it enough ourselves.

If the assumptions of frequentist statistics fit, or fit well enough, it’s fine to use frequentist assumptions. However, I am still having trouble with unknown, unknowable fixed parameters for which you can get better estimates over time. The assumption works fine for some things (snapshot of a population average height at one time), but I am wondering if the overapplication of this assumption is getting us into trouble. For example, we spend millions on clinical trials to determine clinical benefit of a drug, but we are finding that a fixed treatment effect over time is a hard assumption to justify (e.g. antibiotic resistance).

John:

In this case I think the variation over time can be modeled directly rather than being viewed as equivalent to a Bayesian uncertainty.