(by Michael Lavine)
In the old days, when I was in graduate school, inference and model assessment relied heavily on numerical summaries like F-tests, t-tests, Bayes factors, and many others. But some of my professors made a point of assessing models and making inferences graphically, rather than, or maybe in addition to, numerically.
I found their arguments convincing and thought the field of statistics was moving in that direction. I was wrong. These days the field of statistics relies heavily on numerical summaries like AIC, BIC, FDR, MCMC output, and others. We’ve replaced one set of quantities with another, but haven’t given up our reliance on numerology.
Recently, my own work reminded me of the hazards of numerology. My students were computing AIC and BIC to compare models for binary regression. Rather than present the full story here, I will use an artificial data set to illustrate the problem we found.
A binary response Y depends on a covariate X. We will compare two logistic regression models: Model 1, where X enters linearly, and Model 2, where X enters quadratically. The AIC values are about 1028 and 1020, for Models 1 and 2, respectively. Apparently, Model 2 wins. Traditional tests agree: the p-value for the quadratic term in Model 2 is less than 0.01.
Figure 1 below shows one way to assess the models graphically. The data points are sorted according to their fitted values, then binned. For each bin, Figure 1 plots the average fitted value on the x-axis and the average observed value (proportion of 1’s) on the y-axis. Model 1 is on the left; Model 2 on the right. For good models, points should lie close to the line. Here, there’s little to choose between the two models.
Figure 2 belows show another graphical assessment; it plots fitted values from Model 2 vs. fitted values from Model 1. Points where Y=0 are on the left; Y=1 is on the right. Most points lie near the diagonal line, showing that the two models agree. But one point, call it A, is far from the line; for that point, Model 1 has a large fitted value while Model 2 has a moderate fitted value. That point happens to have Y=0, so p(A | Model 1) / p(A | Model 2) is small. Point A makes a large contribution to the likelihood ratio and may be driving the numerical comparison between the models.
To investigate further, Figure 3 below shows each point’s contribution to the loglikelihood ratio; i.e., log [ p(y | Model 1) / p(y | Model 2) ]. Points are jittered vertically for clarity. Most points have a likelihood ratio near 0, but one point, A, is far from 0. That point dominates the log likelihood ratio and therefore dominates any numerical summary, like AIC, BIC, DIC, or deviance, based on the likelihood ratio. If that point were removed, the AIC values would be about 1044 and 1045 for models 1 and 2 respectively, and the p-value for comparing models would be around 0.4. In fact, without point A, Model 1 is correct: I constructed the data by generating 1000 points from Model 1, then adding point A.
If this were a real application, and if I strongly believed my models, I might take the likelihood function seriously and prefer Model 2. But in a real application, I wouldn’t strongly believe my models. Whatever the eventual conclusion, relying on numerology obscures, rather than illuminates, an interesting feature.
If you want to play with this example, here’s how I generated the data and fit the models in R.
npts <- 1000
data <- data.frame ( x = runif(npts,0,2.5) )
data$y <- rbinom ( npts, 1, invlogit(data$x) ) # Model 1 is correct for these 1000 points
data <- rbind ( data, c ( x = 7, y=0 ) ) # One point is added
fit1 <- glm ( y ~ x, family = binomial, data = data )
fit2 <- glm ( y ~ x + I(x^2), family = binomial, data = data )