
Point Process Crime Prediction

The New York Times reported today on predictive policing, or deploying officers where crimes are predicted to occur in the future. According to the Times, “Based on models for predicting aftershocks from earthquakes, [the method used in Santa Cruz, CA] generates projections about which areas and windows of time are at highest risk for future crimes”. The statistical work was done by George Mohler of Santa Clara University and Martin Short of UCLA. Kudos to them for an interesting and useful application of point processes.
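The Times quote does not spell out the model, but aftershock models are usually self-exciting point processes, in which each event temporarily raises the rate of further events nearby in time. Below is a minimal sketch of that idea in one dimension (time only). It assumes an exponentially decaying kernel, and the parameter names and values (`mu`, `alpha`, `beta`) are illustrative, not taken from the Santa Cruz work.

```python
import math

def self_exciting_intensity(t, past_events, mu=0.5, alpha=0.8, beta=1.2):
    """Conditional intensity at time t for a simple self-exciting process.

    mu    : baseline event rate (hypothetical value)
    alpha : jump in the rate right after an event (hypothetical value)
    beta  : how fast that extra rate decays (hypothetical value)
    """
    boost = sum(alpha * math.exp(-beta * (t - ti)) for ti in past_events if ti < t)
    return mu + boost

# One event at time 1.0: the rate is elevated shortly afterwards
# and drifts back toward the baseline as time passes.
soon_after = self_exciting_intensity(1.1, [1.0])
much_later = self_exciting_intensity(5.0, [1.0])
```

In a crime setting the same logic would be applied over space as well as time, so that a recent burglary raises the predicted risk for neighboring blocks over the following days.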


Using a “pure infographic” to explore differences between information visualization and statistical graphics

(by Andrew Gelman)

Our discussion on data visualization continues.

On one side are three statisticians–Antony Unwin, Kaiser Fung, and myself. We have been writing about the different goals served by information visualization and statistical graphics.

On the other side are graphics experts (sorry for the imprecision, I don’t know exactly what these people do in their day jobs or how they are trained, and I don’t want to mislabel them) such as Robert Kosara and Jen Lowe, who seem a bit annoyed at how my colleagues and I seem to follow the Tufte strategy of criticizing what we don’t understand.

And on the third side are many (most?) academic statisticians, econometricians, etc., who don’t understand or respect graphs and seem to think of visualization as a toy that is unrelated to serious science or statistics.

I’m not so interested in the third group right now (I tried to communicate with them in my big articles from 2003 and 2004), but I am concerned that our dialogue with the graphics experts is not moving forward quite as I’d wished.

I’m not trying to win any arguments here; rather I’m trying to move the discussion away from “good vs. bad” (I know I’ve contributed to that attitude in the past, and I’m sure I’ll do so again) toward a discussion of different goals.

I’ll try to write something more systematic on the topic, but for now I’d like to continue by discussing examples.

My article with Antony had many many examples but we got so involved in the statistical issues of data presentation that I think the main thread of the argument got lost.

For example, Hadley Wickham, creator of the great ggplot2, wrote:

Unfortunately both sides [statisticians and infographics people] seem to be comparing the best of one side with the worst of the other. There are some awful infovis papers that completely ignore utility in the pursuit of aesthetics. There are many awful stat graphics papers that ignore aesthetics in the pursuit of utility (and often fail to achieve that). Neither side is perfect, and it’s a shame that we can’t work more closely together to get the best of both worlds.

I agree about the best of both worlds (and return to this point at the end of the present post). But I don’t agree that we’re comparing to “the worst of the other.” Sure, sometimes this is true (as in the notorious “chartjunk” paper in which pretty graphs are compared to piss-poor plots that violate every principle of visualization and statistical graphics).

But recent web discussions have been about the best, not the worst. In my long article with Unwin, we discussed the “5 best data visualizations of the year”! In our short article, we discuss Florence Nightingale’s spiral graph, which is considered a data visualization classic. And, from the other side, my impression is that infographics gurus are happy to celebrate the best of statistical graphics.

But in this sort of discussion we have to discuss examples we don’t like. There are some infographics that I love love love–for example, Laura and Martin Wattenberg’s Name Voyager, which is on my blogroll and which I’ve often linked to. But I don’t have much to say about these–I consider them to have the best features of statistical graphics.

In much of my recent writing on graphics, I’ve focused on visualizations that have been popular and effective–Wordle is an excellent example here–while not following what I would consider to be good principles of statistical graphics.

When I discuss the failings of Wordle (or of Nightingale’s spiral, or Kosara’s swirl, or this graph), it is not to put them down, but rather to highlight the gap between (a) what these visualizations do (draw attention to a data pattern and engage the viewer both visually and intellectually) and (b) my goal in statistical graphics (to display data patterns, both expected and unexpected). The differences between (a) and (b) are my subject, and a great way to highlight them is to consider examples that are effective as infovis but not as statistical graphics. I would have no problem with Kosara etc. doing the opposite with my favorite statistical graphics: demonstrating that despite their savvy graphical arrangements of comparisons, my graphs don’t always communicate what I’d like them to.

I’m very open to the idea that graphics experts could help me communicate in ways that I didn’t think of, just as I’d hope that graphics experts would accept that even the coolest images and dynamic graphics could be reimagined if the goal is data exploration.

To get back to our exchange with Kosara, I stand firm in my belief that the swirly plot is not such a good way to display time series data–there are more effective ways of understanding periodicity, and no, I don’t think this has anything to do with dynamic vs. static graphics or problems with R. As I noted elsewhere, I think the very feature that makes many infographics appear beautiful is that they reveal the expected in an unexpected way, whereas statistical graphics are more about revealing the unexpected (or, as I would put it, checking the fit to data of models which may be explicitly or implicitly formulated). But I don’t want to debate that here. I’ll quarantine a discussion of the display of periodic data to another blog post.

Instead I’d like to discuss a pure infographic that has no quantitative content at all. It’s a display of strategies of Rock Paper Scissors that Nathan Yau featured a couple weeks ago on his blog:

This is an attractive graphic that conveys some information–but the images have almost nothing to do with the info. It’s really a small bit of content with an attractive design that fills up space.

Difference in perspectives

The graphic in question is titled, “How do I win rock, paper, scissors every time?”, which is completely false. As my literal-minded colleague Kaiser Fung would patiently explain, No, the graph does not tell you how to win the game every time. This is no big deal–it’s nothing but a harmless exaggeration–but it illustrates a difference in perspective. A statistician wouldn’t be caught dead making a knowingly false statement. Conversely, a journalist wouldn’t be caught dead making a boring headline (for example, “Some strategies that might increase your odds in rock paper scissors”).

Who’s right here–the statistician or the journalist? It depends on your goals. I’ll stick with being who I am–but I also recognize that Nathan’s post got 116 comments and who knows how many thousand viewers. In contrast, my post from a few years ago (titled “How to win at rock-paper-scissors,” a bit misleading but much less so than “How to win every time”) had a lot more information and received exactly 6 comments. This is fair enough, I’m not complaining. Visuals are more popular than text, and “popular” isn’t a bad thing. The goal is to communicate, and sacrificing some information for an appealing look is a tradeoff that is often worth it.

Moving forward

Let me conclude with a suggestion that I’ve been making a lot lately. Lead with the pretty graph but then follow up with more information. In this case, Nathan could post the attractive image (and thus still interest his broad readership and inspire them to those 100+ comments) but set it up so that if you click through you get text (in this case, it’s words not statistical graphs) with more detailed information:

(Sorry about the tiny font; I was having difficulty with the screen shots.)

Again, I purposely chose a non-quantitative example to move the discussion away from “What’s the best way to display these data” and focus entirely on the different goals.

Data science vs. statistics: has “statistics” become a dirty word?

(by John Johnson)

Revolution Analytics recently published the results of a poll indicating that JSM 2011 attendees consider themselves “data scientists.” Nancy Geller, President of the ASA, asks statisticians not to “Shun the ‘S’ word.” Yet a third take on the matter is the top tweet from JSM 2011 with Dave Blei’s quote “‘machine learning’ is how you say ‘statistics’ to a computer scientist.”

Comments about selection bias from Revolution’s poll aside (it was conducted as part of the free wifi connection in the expo), the shift from “statistics” to “analytics,” “machine learning,” “data science,” and other terms seems to reflect that calling oneself a “statistician” is just not cool or scares our colleagues. So I open the floor up to the question: has “statistics” become a dirty word?

Why Going to JSM?

(by Julien Cornebise)

For my final post about JSM, based on three years’ attendance in a row (DC, Vancouver, Miami), here is a recap for next year’s potential attendees: Why Going to JSM? When is it worth it, when is it not?

First, the obvious wrong reasons for going: with such a massive monster, whose 15-20 minute talks barely allow for anything but an extended abstract, and with 50 sessions in parallel, you rarely go to JSM for its scientific presentations. JSM is not the place:

  • to learn about recent developments in your field: not enough precise content in 20 minutes.
  • to get to know someone’s work better: same problem.
  • to get advertisement and visibility for your work: same problem, plus empty sessions happen way too often; you can’t compete with a panel of world-famous speakers, especially when all you offer is a skewer of 20-minute talks.
  • to get a wide overview of your topic: conflicting sessions on the same topic make it a frustrating experience.

For all those, specific small conferences (such as MCMCSki in the MCMC field) are way better: more focused interaction, more time for work sessions, more time for exposure of ideas, for constructive feedback. So why the heck come? What makes 5,000 people fly here and spend a whole week? Why am I so glad I attended?

Of course, JSM offers some important community events, most notably its awards sessions and lectures (COPSS, Neyman, Wolf, and Medallion Lectures, …) where great contributors to our fields are honored by all their peers. Even though we’re all in there for the science, I won’t hide that I, for one, appreciate such public displays of recognition: being scientists is no reason never to tell those who completely wow us that, indeed, we think they do amazing work and that we want to thank them for it! Still, this would not be a sufficient reason by itself to hold such a gigantic and costly meeting.

But JSM’s incredible strength is truly its social side:

  • Nowhere else can you meet all of your US-based colleagues face to face at the same time in the same place, exchanging scientific ideas or just spending some great time in an informal context, getting to know each other better in a relaxed setting.
  • Nowhere else can you see former and new people from all the institutions you’ve worked at, keeping up with what they’re up to, keeping them up with what you’re up to!
  • Nowhere else can you go for dinner with people from all those places, getting them to meet, meeting their new colleagues, learning about their recent interests, what’s hot in the field, who’s moving where, why this or that department suddenly went bust, how this or that other one is about to double its size and go on a hiring spree, what interesting specialized workshop is in preparation, etc. JSM is the largest grapevine concentrated over three days.

JSM is like iterating the adjacency matrix of your graph by several steps: not only do you strengthen your links with colleagues/friends you already know and appreciate, but you also get to know those they know, and find great matches! With the obvious caveat: if you don’t know anyone, then it will be quite difficult to meet new people. I’d recommend going there with a few colleagues from your institution for the first time. The hardest profile: the isolated statistician from a foreign country, whose geographical ties (alma mater, former employer) won’t even compensate for the lack of people to hang out with, with the notable exception of seizing the occasion to meet someone you’ve only interacted with remotely. The best profile: pretty much any other!
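For readers who want the adjacency-matrix image made concrete, here is a small sketch. Entry (i, j) of the k-th power of an adjacency matrix counts the walks of length k from i to j, so accumulating the first few powers shows who becomes reachable through friends of friends. The four-person network below is made up for illustration.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def reachable_within(adj, steps):
    """reach[i][j] is True if j can be reached from i by a walk of 1..steps edges."""
    n = len(adj)
    power = adj
    reach = [[adj[i][j] > 0 for j in range(n)] for i in range(n)]
    for _ in range(steps - 1):
        power = matmul(power, adj)  # next power of the adjacency matrix
        for i in range(n):
            for j in range(n):
                if power[i][j] > 0:
                    reach[i][j] = True
    return reach

# Hypothetical chain: you (0) - colleague (1) - their coauthor (2) - a stranger (3)
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
after_two_dinners = reachable_within(adj, 2)    # you can now reach person 2
after_three_dinners = reachable_within(adj, 3)  # and person 3 as well
```

The caveat in the text shows up here too: a row of zeros stays a row of zeros however many times the matrix is multiplied, so someone who arrives knowing nobody gains nothing from the iteration.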

Of course, all of the above is by no means as formal/opportunistic as it may sound. Most of this happens while going to the beach with friends (after sessions…), going to dinner, sampling terriblific junk food (Five Guys Burgers, 15th and Espanola… I will miss you), living crazy nights on Ocean Drive (note to funding agencies: this never happens, I am just pretending, we are an extremely serious bunch, all of us, no exception). Simple: most of this is essentially hanging out with friends. With the notable difference: those friends are also our colleagues, lots of colleagues are also our friends.

And that’s why, in spite of all its flaws, this massive meeting is so enjoyable: work and fun do mix, friends and colleagues do mix, and real long-term highlights come out of it. After all, we’re all in here for the different faces of a common passion! See you next year.

JSM treat for the road: Significance Magazine

(by Julien Cornebise)

That’s it. It’s over. Done. Gone. RIP JSM 2011. ’til next year. A great week!
Yesterday’s convention center was a mix between an airport and the ghost town of Saturday: a fraction of the people were still here, most of them carrying suitcases. There should not be any talks on the last day 😉 And, although there was no big two-hour lecture to attend, I still had a hard time choosing between

The 15-minute brevity of the former’s talks put me off, and my curiosity about this magazine that Xian blogged about, the challenge of talking stats to non-statisticians, and my own wish for a steroid version of “Popular Science” made me pick the latter.

Boy was I glad: after a short introduction outlining the aim of Significance and calling for contributors (think of it, for you or your PhD students, it looks like a great experience!), we were treated to three very enjoyable talks by authors of recent cover papers:

Howard Wainer on how missing data can lead to dire policies, and how just a few extra data points can be of precious help in avoiding dramatic mistakes, with striking illustrations in education that are also available in his book. This was thought-provoking: my first move might be to integrate out the missing data using the EM algorithm or Data Augmentation, hence assuming that the missing data is distributed similarly to the non-missing. Wrong! Howard’s examples were some of those “aha!” moments, where you just realize that the original strategy amounted to standing on your head. Three examples:

  • Allowing the students to pick a subset of possible questions in a test, so as to make it fairer. Wrong. A quick study on one class showed that it tends to worsen the inequality: weak students are impaired in their choice and pick the hardest questions, failing them. Consequence of assuming random missing data: widening the score gap with the better students, who picked the easiest questions.
  • Eliminating tenure for teachers to save money. Wrong. Looking back at the 1991 suppression of tenure for superintendents showed that salaries increased massively. Most likely explanation: tenure is a job benefit that costs the employer nothing; removing it requires increasing the salary to compensate. Consequence of assuming random missing data: increasing the expenses.
  • Making SAT score disclosure optional for college admission. Wrong. Studying withheld SAT scores for the one college that has done so for 40 years shows that students choose rationally whether to disclose their score: very few “I did very well on the SAT, but so what?”, many “I scored less than the average entry score, disclosing it won’t help my chances to enter”. Consequence of assuming random missing data: those students picked classes that they failed, as they lacked too many prerequisites. A thought here: it would also have been interesting to compare them not only with students who divulged their score, as Howard did, but with other students with similar scores who went to other universities: did getting access to harder classes than they would usually have been allowed to take help them in the long term?
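The common thread in these three examples is that the data are missing for a reason tied to the very value that is missing. A toy simulation of the SAT case makes the point; it is only a sketch, and it assumes a deliberately crude rule (every applicant below a cutoff withholds their score) and made-up score parameters rather than anything from Howard's study.

```python
import random
from statistics import mean

def simulate_optional_disclosure(n=10_000, cutoff=500.0, seed=0):
    """Simulate applicants who disclose a score only if it is at or above `cutoff`.

    The score distribution (mean 500, sd 100) and the cutoff are hypothetical.
    """
    rng = random.Random(seed)
    scores = [rng.gauss(500.0, 100.0) for _ in range(n)]
    disclosed = [s for s in scores if s >= cutoff]
    withheld = [s for s in scores if s < cutoff]
    return {
        "all": mean(scores),            # what we would like to know
        "disclosed": mean(disclosed),   # what we actually observe
        "withheld": mean(withheld),     # what the missing values really look like
        "share_withheld": len(withheld) / n,
    }

result = simulate_optional_disclosure()
```

Treating the withheld scores as if they resembled the disclosed ones amounts to using `result["disclosed"]` as the estimate for everyone, which is necessarily too high under this rule, since every withheld score lies below every disclosed one.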

Andrew Solow on the Census of Marine Life (2000-2010): how many species, and is a species extinct? There were some striking statistical problems, again due to non-uniform missing data: it is missing because the species is harder to observe in our usual surroundings! So there is more to it than the abstract problem of estimating the number of classes in multinomial sampling, and of estimating the end-point of a distribution (a tricky problem in itself already).
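As a reference point for the "abstract problem" of estimating the number of classes, here is a sketch of one classical estimator, Chao1, which uses the number of species seen exactly once and exactly twice to put a lower bound on the total. This is not necessarily what Andrew presented, and the counts in the example are invented.

```python
def chao1(species_counts):
    """Chao1 lower-bound estimate of the total number of species.

    species_counts: number of individuals observed for each species seen.
    """
    counts = [c for c in species_counts if c > 0]
    s_obs = len(counts)                        # species actually observed
    f1 = sum(1 for c in counts if c == 1)      # species seen exactly once
    f2 = sum(1 for c in counts if c == 2)      # species seen exactly twice
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used when no species was seen exactly twice
    return s_obs + f1 * (f1 - 1) / 2

# Hypothetical sample: 8 species observed, 4 of them only once, 2 of them twice
estimate = chao1([10, 5, 2, 2, 1, 1, 1, 1])
```

Many singletons relative to doubletons signal that plenty of species are still unseen. The estimator only gives a lower bound, and the difficulty raised in the talk remains: species that are systematically hard to observe leave little trace in the counts at all.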

Finally, most anchored in recent events, Ian MacDonald’s brilliant talk on the BP Discharge in the Gulf of Mexico (I learned it’s a more precise term than “Deepwater oil spill”: it’s not Deepwater in charge but BP, and it is not an overboard spill but a discharge from a reservoir).
This one was one for the records: a precise and scientific study of the estimates of the size of the discharge, based on the speaker’s experience with natural oil seeps occurring every day in the Gulf. Beyond the beautiful/appalling before/after pictures, and the pleasant feeling of the modest scientist being (sadly) proved right against the massive corporation, there was a fascinating scientific chase to the source of the discrepancies amongst the estimates. Ian brilliantly chased it down to the table linking the thickness of the surface oil layer with its color (rainbow, metallic, light-brown, dark), which is multiplied by the surface area to estimate the volume: while all of the scholarly studies use one table, oil companies (BP, Exxon) use one provided by the US Coast Guard with a 100-fold downward error for the thickest levels, precisely the ones needed when drama occurs!

The dramatic consequences of this error are well known: we’re not talking indemnities, but a dramatic error on the pressure escaping the well, leading to the failure of the blockage attempts. The error was confirmed when the videos of the leak were finally released and particle-velocity experts were able to confirm overnight that the flow was much greater than officially stated.

Ian concluded not with an obvious “who’s to blame,” which would have been too easy, but focused on the question: what will be the long-lasting impact? His study of the spatial distribution of the natural seeps, much different from that of the BP discharge, puts to rest the idea that the ecosystem is somehow immunized. We’re left with the challenge of designing a statistical test for that unwanted massive experiment. Ian calls for two concrete measures:

  • Identify and monitor key habitats and populations to check ecosystem health.
  • Put the repayment of the ecosystem in the front of the line, using BP’s fine to that effect.

In conclusion, a very pleasant session, a treat for those of us who could stay this last day, and a very interesting magazine: I’ll definitely think of contributing!

Stay tuned for a final post later tonight, before I hand back the keys of the blog to its editor.

JSM impressions (day 4)

(by Christian Robert)

Another early day at JSM 2011, with a series of appointments at the Loews Hotel, whose only public outcome is that the vignettes on Bayesian statistics I called for in a previous post could end up being published in Statistical Science… I still managed to go back to the conference centre (almost) in time for Chris Holmes’ talk. Although I am sure Julien will be much more detailed about this Medallion Lecture talk, let me say that this was a very enjoyable and informative talk about the research Chris has brilliantly conducted so far! I like very much the emphasis on decision-theory, subjective Bayesianism, and hidden Markov models, while the application section was definitely impressive in the scope of the problems handled and the rich outcome of Chris’ statistical analyses, especially in connection with cancer issues…

In the afternoon I attended a Bayesian non-parametric session, before joining many others for the COPSS Awards session, where the awards were given to

seeing the same person, Nilanjan Chatterjee, receive two awards for the first time.

Reflections on JSM – dusting the dusty corners

(by John Johnson)


The talks that everyone is talking about are of course very cool, and we can learn a lot from them. However, I came to this Joint Statistical Meetings in search of something a little different. I attended many fewer talks than I have in the past (where I would diligently attend something every session except maybe Thursday morning, when I would check out and go). What I found were a lot of devils in the details.

On Saturday I attended a continuing education course on the analysis of register data. Register data is administrative data such as what a government would collect. For example, birth and death data are register data in the US and almost every other country with a functioning government. This data is a challenge to work with for the following reasons:
  • It is collected on the whole population, as a census, but is longitudinal in nature
  • It is very difficult to curate, and is collected and curated through administrative processes rather than sampling
  • It is difficult to quality control, and that control is best done through merging with other data
  • Its analysis value increases in merging with other data
  • The only source of error is transcription
While I don’t work with register data, I can appreciate the hardships that come from working with administrative data, or data that is collected as an artifact of a transaction. The challenges in merging come from the subtleties in defining the variables, and making sure that variable definitions are consistent across data. It got me wondering whether many of the challenges and inefficiencies we have in working with this data come from our sample-based approach to handling it.
Speaking of data, a late Sunday session on CDISC data standards was well received, and in fact we ran over by over half an hour with consent from the audience. This talk was sponsored by the statistical programming section, but there was something in there for statisticians as well especially regarding the planning of analysis of clinical trial data. Statisticians would do well to learn these standards to some degree, because they will become more of a centerpiece of statistical analysis of clinical trials.
More generally, I am curious how many statistics departments have a class on data cleaning and handling, and, if so, whether it is required or an elective within a required track. I was almost completely unprepared for this aspect when I came into the industry, having only managed messy data a little bit during a consulting seminar. In planning data collection, it is important for the statistician to look ahead and think about how the data will have to be organized for the desired method, and that requires some data handling experience.
On Monday I attended part of the session on reproducible research, and concluded that at least in the pharma/bio industry we have no clue what reproducible research is. We have an excellent notion that research needs to be repeatable, and that documentation needs to accompany analysis to tell someone else how to interpret the findings. However, we don’t really integrate it as closely as is expected in a true reproducible research setting. Maybe CDISC data standards (as discussed above) will eliminate that need at least from the point of view of an FDA reviewer. However, it won’t within companies, or in studies that are not done with CDISC compliant data.
Monday night, I partied with the stat computing and graphics crowd, and had a mostly delightful time. Maybe they can run their raffle and business more efficiently next year. Hint hint.
On Tuesday I supported a colleague in a poster presentation describing challenges in a registry of chronic pain management, and gained a new appreciation for the poster format. Much of the discussion was thoughtful and insightful, and we were able to explain the challenges. It was at least validating that the attendees who stopped by agreed with our challenges and gave some suggestions along the lines we were thinking, and the depth of discussion was stimulating. Off the success of that, I made a point to stop by the posters and found some really good material. I would encourage more posters, and I found that most of the benefit I get from JSM is from small group discussions (and occasionally from the larger talks as well).
It was somewhere in here that Andrew forwarded me an email with a disturbing statistic about the number of investigators who cannot describe a clinical trial or the data, nor can the consulting statistician explain the trial. I think this is a topic we will return to in this blog, and I think I will submit this idea as a biopharm-sponsored invited session next year. I know that the consulting section has sponsored quality sessions on leadership in the past, and I saw a very good session on leadership at ENAR this year. I think it is time to bring it to a wider audience.
Tuesday night and Wednesday were mostly focused on catching up with old and new friends and going to posters. I’m fairly tired by Wednesday on the week of JSM, and even more so given that I got in on Friday this time, so I debated whether I would get anything out of sitting in talks. I found a couple of fascinating posters on using tumor burden to assess cancer drugs and whether safety monitoring of drug trials has an impact on Type II error rate (it does, and it’s nasty). On the basis of this, I hope to see more well-done posters submitted at next year’s meeting. I love the discussion they generate.
I ended up in a fascinating discussion about evidence needed for FDA drug approval, whether subjective Bayes has any role, and the myth and illusion of objectivity. Some of this discussion relates back to “the difference between statistically significant and not significant is not statistically significant,” but I think there are some deeper philosophical problems with the drug evidence evaluation that keep getting swept under the rug, such as the fact that we assume that drug efficacy and safety are static parameters that do not change over time. (There are obvious exceptions to this treatment, such as antibiotics.) This is a true can of worms, and I’ll let them crawl a bit. And yes, practical considerations come into play, such as the fact that the choice of software is either to write something yourself that is hard to write and to verify as correct, or to spend thousands of dollars on software.
Tomorrow is the last day of the conference, and I’ll try to catch a talk or two before I leave. I hope to see you next year, and before!


The Statistics Forum, brought to you by the American Statistical Association and CHANCE magazine, provides everyone the opportunity to participate in discussions about probability and statistics and their role in important and interesting topics.

The views expressed here are those of the individual authors and not necessarily those of the ASA, its officers, or its staff. The Statistics Forum is edited by Andrew Gelman.
