The New York Times reported today on predictive policing, or deploying officers where crimes are predicted to occur in the future. According to the Times, “Based on models for predicting aftershocks from earthquakes, [the method used in Santa Cruz, CA] generates projections about which areas and windows of time are at highest risk for future crimes”. The statistical work was done by George Mohler of Santa Clara University and Martin Short of UCLA. Kudos to them for an interesting and useful application of point processes.
Using a “pure infographic” to explore differences between information visualization and statistical graphics
Published August 10, 2011
(by Andrew Gelman)
Our discussion on data visualization continues.
On one side are three statisticians–Antony Unwin, Kaiser Fung, and myself. We have been writing about the different goals served by information visualization and statistical graphics.
On the other side are graphics experts (sorry for the imprecision, I don’t know exactly what these people do in their day jobs or how they are trained, and I don’t want to mislabel them) such as Robert Kosara and Jen Lowe, who seem a bit annoyed at how my colleagues and I seem to follow the Tufte strategy of criticizing what we don’t understand.
And on the third side are many (most?) academic statisticians, econometricians, etc., who don’t understand or respect graphs and seem to think of visualization as a toy that is unrelated to serious science or statistics.
I’m not so interested in the third group right now–I tried to communicate with them in my big articles from 2003 and 2004–but I am concerned that our dialogue with the graphics experts is not moving forward quite as I’d wished.
I’m not trying to win any arguments here; rather I’m trying to move the discussion away from “good vs. bad” (I know I’ve contributed to that attitude in the past, and I’m sure I’ll do so again) toward a discussion of different goals.
I’ll try to write something more systematic on the topic, but for now I’d like to continue by discussing examples.
My article with Antony had many many examples but we got so involved in the statistical issues of data presentation that I think the main thread of the argument got lost.
For example, Hadley Wickham, creator of the great ggplot2, wrote:
Unfortunately both sides [statisticians and infographics people] seem to be comparing the best of one side with the worst of the other. There are some awful infovis papers that completely ignore utility in the pursuit of aesthetics. There are many awful stat graphics papers that ignore aesthetics in the pursuit of utility (and often fail to achieve that). Neither side is perfect, and it’s a shame that we can’t work more closely together to get the best of both worlds.
I agree about the best of both worlds (and return to this point at the end of the present post). But I don’t agree that we’re comparing to “the worst of the other.” Sure, sometimes this is true (as in the notorious “chartjunk” paper in which pretty graphs are compared to piss-poor plots that violate every principle of visualization and statistical graphics).
But recent web discussions have been about the best, not the worst. In my long article with Unwin, we discussed the “5 best data visualizations of the year”! In our short article, we discuss Florence Nightingale’s spiral graph, which is considered a data visualization classic. And, from the other side, my impression is that infographics gurus are happy to celebrate the best of statistical graphics.
But in this sort of discussion we have to discuss examples we don’t like. There are some infographics that I love love love–for example, Laura and Martin Wattenberg’s Name Voyager, which is on my blogroll and which I’ve often linked to. But I don’t have much to say about these–I consider them to have the best features of statistical graphics.
In much of my recent writing on graphics, I’ve focused on visualizations that have been popular and effective–Wordle is an excellent example here–while not following what I would consider to be good principles of statistical graphics.
When I discuss the failings of Wordle (or of Nightingale’s spiral, or Kosara’s swirl, or this graph), it is not to put them down, but rather to highlight the gap between (a) what these visualizations do (draw attention to a data pattern and engage the viewer both visually and intellectually) and (b) my goal in statistical graphics (to display data patterns, both expected and unexpected). The differences between (a) and (b) are my subject, and a great way to highlight them is to consider examples that are effective as infovis but not as statistical graphics. I would have no problem with Kosara etc. doing the opposite with my favorite statistical graphics: demonstrating that despite their savvy graphical arrangements of comparisons, my graphs don’t always communicate what I’d like them to.
I’m very open to the idea that graphics experts could help me communicate in ways that I didn’t think of, just as I’d hope that graphics experts would accept that even the coolest images and dynamic graphics could be reimagined if the goal is data exploration.
To get back to our exchange with Kosara, I stand firm in my belief that the swirly plot is not such a good way to display time series data–there are more effective ways of understanding periodicity, and no, I don’t think this has anything to do with dynamic vs. static graphics or problems with R. As I noted elsewhere, I think the very feature that makes many infographics appear beautiful is that they reveal the expected in an unexpected way, whereas statistical graphics are more about revealing the unexpected (or, as I would put it, checking the fit to data of models which may be explicitly or implicitly formulated). But I don’t want to debate that here. I’ll quarantine a discussion of the display of periodic data to another blog post.
Instead I’d like to discuss a pure infographic that has no quantitative content at all. It’s a display of strategies of Rock Paper Scissors that Nathan Yau featured a couple weeks ago on his blog:
This is an attractive graphic that conveys some information–but the images have almost nothing to do with the info. It’s really a small bit of content with an attractive design that fills up space.
Difference in perspectives
The graphic in question is titled, “How do I win rock, paper, scissors every time?”, which is completely false. As my literal-minded colleague Kaiser Fung would patiently explain, no, the graph does not tell you how to win the game every time. This is no big deal–it’s nothing but a harmless exaggeration–but it illustrates a difference in perspective. A statistician wouldn’t be caught dead making a knowingly false statement. Conversely, a journalist wouldn’t be caught dead writing a boring headline (for example, “Some strategies that might increase your odds in rock paper scissors”).
Who’s right here–the statistician or the journalist? It depends on your goals. I’ll stick with being who I am–but I also recognize that Nathan’s post got 116 comments and who knows how many thousand viewers. In contrast, my post from a few years ago (titled “How to win at rock-paper-scissors,” a bit misleading but much less so than “How to win every time”) had a lot more information and received exactly 6 comments. This is fair enough, I’m not complaining. Visuals are more popular than text, and “popular” isn’t a bad thing. The goal is to communicate, and sacrificing some information for an appealing look is a tradeoff that is often worth it.
Let me conclude with a suggestion that I’ve been making a lot lately: lead with the pretty graph, but then follow up with more information. In this case, Nathan could post the attractive image (and thus still interest his broad readership and inspire them to those 100+ comments) but set it up so that if you click through you get text (in this case, it’s words, not statistical graphs) with more detailed information:
(Sorry about the tiny font; I was having difficulty with the screen shots.)
Again, I purposely chose a non-quantitative example to move the discussion away from “What’s the best way to display these data?” and focus entirely on the different goals.
(by John Johnson)
Revolution Analytics recently published the results of a poll indicating that JSM 2011 attendees consider themselves “data scientists.” Nancy Geller, President of the ASA, asks statisticians not to “Shun the ‘S’ word.” Yet a third take on the matter is the top tweet from JSM 2011 with Dave Blei’s quote “‘machine learning’ is how you say ‘statistics’ to a computer scientist.”
Comments about selection bias from Revolution’s poll aside (it was conducted as part of the free wifi connection in the expo), the shift from “statistics” to “analytics,” “machine learning,” “data science,” and other terms seems to reflect that calling oneself a “statistician” is just not cool or scares our colleagues. So I open the floor up to the question: has “statistics” become a dirty word?
(by Julien Cornebise)
For my final post about JSM, based on three years’ attendance in a row (DC, Vancouver, Miami), a recap for next year’s potential attendees: Why go to JSM? When is it worth it, and when is it not?
First, the obvious wrong reasons for going: with such a massive monster, whose 15-20 minute talks barely allow for anything but an extended abstract, and with 50 sessions in parallel, you rarely go to JSM for its scientific presentations. JSM is not the place:
- to learn about recent developments in your field: not enough precise content in 20 minutes.
- to get to know someone’s work better: same problem.
- to get advertisement and visibility for your work: same problem; plus, empty sessions happen way too often — you can’t compete with a panel of world-famous speakers, especially when all you offer is a skewer of 20-minute talks.
- to get a wide overview of your field: conflicting sessions on the same topic make it a frustrating experience.
For all of those, specific small conferences (such as MCMCSki in the MCMC field) are way better: more focused interaction, more time for work sessions, more time for exposure of ideas and for constructive feedback. So why the heck come? What makes 5,000 people fly here and spend a whole week? Why am I so glad I attended?
Of course, JSM offers some important community events, most notably its awards sessions and lectures (COPSS, Neyman, Wald, and Medallion Lectures, …) where great contributors to our fields are honored by their peers. Even though we’re all in there for the science, I won’t hide that I, for one, appreciate such public displays of recognition: it is not because we are scientists that we should never tell those who completely wow us that, indeed, we do think they do amazing work and that we want to thank them for it! Still, this would not be a sufficient reason by itself to hold such a gigantic and costly meeting.
But JSM’s incredible strength is truly its social side:
- Nowhere else can you meet all of your US-based colleagues face to face at the same time in the same place, exchanging scientific ideas or just spending some great time in an informal context, getting to know each other better in a relaxed setting.
- Nowhere else can you see former and new people from all the institutions you’ve worked at, keeping up with what they’re up to, keeping them up with what you’re up to!
- Nowhere else can you go to dinner with people from all those institutions, getting them to meet each other, meeting their new colleagues, learning about their recent interests, what’s hot in the field, who’s moving where, why this or that department suddenly busted, how this or that other one is about to double its size and go on a hiring spree, what interesting specialized workshop is in preparation, etc. JSM is the largest grapevine, concentrated over three days.
JSM is like iterating the adjacency matrix of your graph by several steps: not only do you strengthen your links with colleagues/friends you already know and appreciate, but you also get to know those they know, and find great matches! With the obvious caveat: if you don’t know anyone, then it will be quite difficult to meet new people. I’d recommend going there with a few colleagues from your institution the first time. The hardest profile: the isolated statistician from a foreign country; his geographical attachments (alma mater, former employer) won’t even compensate for his lack of people to hang out with — with the notable exception of seizing the occasion to meet someone you’ve only interacted with remotely. The best profile: pretty much any other!
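The adjacency-matrix metaphor can be made literal with a toy sketch (a hypothetical five-person acquaintance graph of my own invention, not anything from the post): the nonzero entries of A² are precisely the friends-of-friends you can get introduced to in one iteration.

```python
# Hypothetical "already know each other" graph on 5 people, a path 0-1-2-3-4.
A = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
n = len(A)

def matmul(X, Y):
    """Plain matrix product; (A^k)[i][j] counts length-k walks from i to j."""
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A2 = matmul(A, A)

# Person 0 knows only person 1 directly...
direct = {j for j in range(n) if A[0][j]}
# ...but one "iteration" later can also be introduced to friends of friends.
one_hop_more = {j for j in range(n) if j != 0 and (A[0][j] or A2[0][j])}
```

With the caveat from the post built in: if your row of A is all zeros, iterating the matrix gains you nothing.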
Of course, all of the above is by no means as formal/opportunistic as it may sound. Most of this happens while going to the beach with friends (after sessions…), going to dinner, sampling terriblific junk food (Five Guys Burgers, 15th and Espanola… I will miss you), living crazy nights on Ocean Drive — note to funding agencies: this never happens, I am just pretending, we are an extremely serious bunch, all of us, no exception. Simple: most of this is essentially hanging out with friends. With one notable difference: those friends are also our colleagues, and lots of colleagues are also our friends.
And that’s why, in spite of all its flaws, this massive meeting is so enjoyable: work and fun do mix, friends and colleagues do mix, and real long-term highlights come out of it. After all, we’re all in here for the different faces of a common passion! See you next year.
(by Julien Cornebise)
That’s it. It’s over. Done. Gone. RIP JSM 2011. ’til next year. A great week!
Yesterday’s convention center was a mix between an airport and Saturday’s ghost town: a fraction of the people were still here, most of them carrying suitcases. There should not be any talks on the last day 😉 And, although there was no big 2-hour Lecture to attend, I still had a hard time choosing between
- Sampling and Sampling Distributions contributed session, and
- Significance Magazine: Communicating Statistics to the World, about Significance, the joint magazine of the RSS and the ASA.
The 15-minute shortness of the former’s talks put me off, while curiosity about this magazine that Xian blogged about, the challenge of talking stats to non-statisticians, and my own wish for a steroid version of “Popular Science” decided me in favor of the latter.
Boy was I glad: after a short introduction outlining the aim of Significance and calling for contributors (think of it, for you or your PhD students, it looks like a great experience!), we were treated to three very enjoyable talks by authors of recent cover papers:
- Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies by Howard Wainer (I could not find the article in Significance’s archive — anyone?)
- The sea, the Census and statistics by Andrew Solow
- Deepwater disaster: how the oil spill estimates got it wrong by Ian MacDonald
Howard Wainer on how missing data can lead to dire policies, and how just a few extra data points can be of precious help in avoiding dramatic mistakes, with striking illustrations from education that are also available in his book. This was thought-provoking: as a first move, I might tend to integrate out the missing data using the EM algorithm or Data Augmentation, hence assuming that the missing data are distributed similarly to the non-missing. Wrong! Howard’s examples were some of those “ah-ah!” moments, where you just realize that the original strategy amounted to standing on your head. Three examples:
- Allowing the students to pick a subset of possible questions in a test, so as to make it fairer. Wrong. A quick study on one class showed that it tends to worsen the inequality: weak students are impaired in their choice and pick the hardest questions, failing them. Consequence of assuming random missing data: augmenting the score gap with the better students who picked the easiest questions.
- Eliminating tenure for teachers to save money. Wrong. Looking back at 1991’s suppression of tenure for superintendents showed that salaries increased massively. Most likely explanation: tenure is a job benefit that costs the employer nothing; removing it requires increasing the salary to compensate. Consequence of assuming random missing data: augmenting the expenses.
- Making SAT score disclosure optional for college entry. Wrong. Studying withheld SAT scores at the one college that has done so for 40 years shows that students choose rationally whether to disclose their score: very few “I did very well at the SAT, but so what?”, many “I scored less than the average entry score, disclosing it won’t help my chances of getting in”. Consequence of assuming random missing data: those students picked classes that they failed, as they lacked too many prerequisites. A thought here: it would also have been interesting to compare them not only with students who divulged their scores, as Howard did, but with students with similar scores who went to other universities: did getting access to harder classes than they would usually have been allowed help them in the long term?
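The common thread of these examples — data missing not at random — can be sketched with a toy simulation (illustrative numbers of my own, not Wainer’s data): when low values are preferentially missing, the naive mean of the observed data is biased upward, exactly the trap of treating the missing data as distributed like the rest.

```python
import random

random.seed(0)

# Hypothetical true scores for 10,000 students.
scores = [random.gauss(50, 15) for _ in range(10_000)]

# Missing-at-random: everyone's score is withheld with the same probability;
# the observed mean stays close to the full-population mean.
mar = [s for s in scores if random.random() > 0.3]

# Missing-not-at-random: the lower the score, the likelier it is withheld
# (kept with probability s/100), mimicking students hiding weak SAT scores.
mnar = [s for s in scores if random.random() < s / 100]

mean_all = sum(scores) / len(scores)
mean_mar = sum(mar) / len(mar)      # roughly unbiased
mean_mnar = sum(mnar) / len(mnar)   # biased upward: the weak scores vanished
```

Integrating out the missing values as if they looked like the observed ones would reproduce `mean_mnar`, not `mean_all` — Wainer’s “standing on your head” in two lines of arithmetic.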
Andrew Solow on the Census of Marine Life (2000-2010): how many species are there, and is a given species extinct? There were some striking statistical problems, again due to non-uniform missingness: the data are missing because the species is harder to observe in its usual surroundings! So there is more to it than the abstract problem of estimating the number of classes in multinomial sampling, or of estimating the end-point of a distribution (a tricky problem in itself already).
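For the class-counting problem, a classic tool is the Chao1 lower bound (a sketch of the standard estimator, not necessarily what Solow used), which infers unseen species from the counts of species observed exactly once and exactly twice:

```python
from collections import Counter

def chao1(observations):
    """Chao1 lower bound on species richness:
    S_est = S_obs + f1^2 / (2 * f2), with f1 (f2) the number of species
    seen exactly once (twice); bias-corrected variant when f2 == 0."""
    freq = Counter(observations)      # species -> times observed
    f = Counter(freq.values())        # abundance -> number of such species
    s_obs = len(freq)
    f1, f2 = f.get(1, 0), f.get(2, 0)
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 * f1 / (2 * f2)

# Hypothetical trawl sample: many singletons suggest many unseen species.
sample = ["cod", "cod", "hake", "eel", "ray", "squid", "squid", "crab"]
estimate = chao1(sample)              # 6 observed, estimate 10
```

Note that this assumes each individual is equally catchable — precisely the uniformity that, as Solow pointed out, fails for rare deep-sea species.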
Finally, and most anchored in current events, Ian MacDonald’s brilliant talk on the BP Discharge in the Gulf of Mexico (I learned it’s a more precise term than “Deepwater oil spill”: it’s not Deepwater in charge but BP, and it is not an overboard spill but a discharge from a reservoir).
This one was one for the records: a precise and scientific study of the estimates of the size of the discharge, based on the speaker’s experience with natural oil seeps occurring every day in the Gulf. Beyond the beautiful/appalling before/after pictures, and the pleasant feeling of the modest scientist being (sadly) proved right against the massive corporation, there was a fascinating scientific chase for the source of the discrepancies among the estimates. Ian brilliantly chased it down to the table linking the thickness of the surface oil spread to its color (rainbow, metallic, light-brown, dark), which is multiplied by the surface area to estimate the volume: while all of the scholarly studies use one table, the oil companies (BP, Exxon) use one provided by the US Coast Guard with a 100-fold downward error for the thickest levels — precisely the ones needed when drama occurs!
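The arithmetic behind that gap is simple enough to sketch (illustrative numbers of my own, not those from the talk): the volume estimate is slick area times the color-implied thickness, so a 100-fold error in the thickness table passes straight through to the discharge estimate.

```python
# Hypothetical slick area; the area term is the same for everyone.
AREA_KM2 = 1_000
AREA_M2 = AREA_KM2 * 1_000_000

# Hypothetical thickness (metres) assigned to "dark" oil by a scholarly
# table vs a table carrying a 100-fold downward error at that level.
THICK_SCHOLARLY_M = 100e-6   # 100 microns
THICK_ERRONEOUS_M = 1e-6     # 100x too thin

vol_scholarly = AREA_M2 * THICK_SCHOLARLY_M   # cubic metres
vol_erroneous = AREA_M2 * THICK_ERRONEOUS_M

# Same area, same colors observed: the entire 100-fold discrepancy
# between the estimates comes from the lookup table alone.
ratio = vol_scholarly / vol_erroneous
```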
The dramatic consequences of this error are well-known: we’re not talking indemnities, but a dramatic error in the pressure escaping the well, leading to the failure of the blockage attempts — an error confirmed when the videos of the leak were finally released and particle-velocity experts were able to confirm overnight that the flow was much greater than officially stated.
Ian concluded not with an obvious “who’s to blame” that would have been too easy (and obvious…), but focused on the question: what will be the long-lasting impact? His study of the spatial distribution of the natural seeps, quite different from that of the BP discharge, puts to rest the idea that the ecosystem is somehow immunized. We’re left with the challenge of designing a statistical test for that unwanted massive experiment. Ian calls for two concrete measures:
- Identify and monitor key habitats and populations to check ecosystem health.
- Put the repayment of the ecosystem at the front of the line, using BP’s fine to that effect.
In conclusion: a most pleasant session, a treat for those of us who could stay this last day, and a most interesting magazine. I’ll definitely think of contributing!
Stay tuned for a final post later tonight, before I hand back the keys of the blog to its editor.
(by Christian Robert)
Another early day at JSM 2011, with a series of appointments at the Loews Hotel, whose only public outcome is that the vignettes on Bayesian statistics I called for in a previous post could end up being published in Statistical Science… I still managed to go back to the conference centre (almost) in time for Chris Holmes’ talk. Although I am sure Julien will be much more detailed about this Medallion Lecture talk, let me say that this was a very enjoyable and informative talk about the research Chris has brilliantly conducted so far! I like very much the emphasis on decision-theory, subjective Bayesianism, and hidden Markov models, while the application section was definitely impressive in the scope of the problems handled and the rich outcome of Chris’ statistical analyses, especially in connection with cancer issues…
In the afternoon I attended a Bayesian non-parametric session, before joining many others for the COPSS Awards session, where the awards were given to
- COPSS: Nilanjan Chatterjee, National Cancer Institute,
- F.N. David: Marie Davidian, North Carolina State University,
- G.W. Snedecor: Nilanjan Chatterjee, National Cancer Institute,
- R.A. Fisher Lecture: Jeff Wu, Georgia Tech,
with the same person, Nilanjan Chatterjee, receiving two of the awards for the first time.
(by John Johnson)
The talks that everyone is talking about are of course very cool, and we can learn a lot from them. However, I came to this Joint Statistical Meetings in search of something a little different. I attended many fewer talks than I have in the past (where I would diligently attend something every session, except maybe Thursday morning when I would check out and go). What I found were a lot of devils in the details.
- It is collected on the whole population, as a census, but is longitudinal in nature
- It is very difficult to curate, and is collected and curated through administrative processes rather than sampling
- It is difficult to quality control, and that control is best done through merging with other data
- Its analysis value increases in merging with other data
- The only source of error is transcription