(by YQnancy Wang)
In the session on Statistical Modeling at the Internet Scale: Understanding User and Advertiser Behavior, Daryl Pregibon from Google Inc. discussed how statistical “plumbing” is used to help search engines rank webpages. Besides query-independent quality and the degree to which a page matches the query, user feedback is heavily employed to build a dynamic ranking system. He used the ranking of football teams as an analogy to illustrate how the quality of opponents can be incorporated into a ranking system. Using Minorization-Maximization (Hunter, 2004), the algorithm aggregates users’ pairwise click / non-click information to assign scores to a list of webpages. Many good questions were brought up after the talk. One concerned the fact that this ranking model leaves out all temporal information, such as which webpage the user clicked on first and which last. Personally, I think the issue is more about how to strike a balance between learning every detail in the data and losing information during aggregation. Also, different user habits might lead to different interpretations of the same order of clicks. It would be useful to learn user habits at the individual level; then again, as Daryl pointed out, a more sophisticated model does not always help solve the problem. It is a balance, and that is why statistics is an art.
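To make the scoring idea concrete, here is a minimal sketch of the Minorization-Maximization update for a Bradley-Terry model, the setting treated in Hunter (2004). It is only an illustration and not the model presented in the talk: the function name `bradley_terry_mm`, the `wins` matrix, and the assumption that a clicked page "beats" a non-clicked page shown alongside it are all hypothetical choices made for this example.

```python
def bradley_terry_mm(wins, n_iter=200, tol=1e-10):
    """Estimate Bradley-Terry scores with the MM update.

    wins[i][j] is the (hypothetical) number of times page i was
    preferred over page j, e.g. i was clicked while j was not.
    Assumes every page has at least one win; returns scores summing to 1.
    """
    n = len(wins)
    scores = [1.0 / n] * n
    # Total number of pairwise "wins" for each page.
    total_wins = [sum(wins[i][j] for j in range(n) if j != i) for i in range(n)]

    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            # Comparisons between i and j, weighted by the current scores.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (scores[i] + scores[j])
                for j in range(n)
                if j != i
            )
            new_scores.append(total_wins[i] / denom if denom > 0 else scores[i])

        # Rescale so the scores stay comparable across iterations.
        norm = sum(new_scores)
        new_scores = [s / norm for s in new_scores]

        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Made-up counts for three pages; each pair was compared ten times.
example_wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
print(bradley_terry_mm(example_wins))
```

As in the football analogy, a page's score depends not only on how often it wins but also on the scores of the pages it was compared against, since those enter the denominator of each update.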
The following two talks in the session focused more on the data side of the story. Eric Sun from Facebook, Inc. showed the power of understanding user behavior using the massive amount of Facebook user data. For example, among the words associated with “vodka” in people’s status updates, younger people tend to use “shot” and “drunk”, while older people use “lime” and “lemon”; males use “bottle”, while females use “too much”. A study of people’s happiness and the positive words they use in their status updates, carried out using Facebook updates and self-reported survey results, confirms that the two are positively correlated.
Jake M. Hofman from Yahoo Research continued to dazzle the audience with the power and the sheer amount of information that lies within the data generated by Internet usage nowadays. He made a good point about studying the content of Internet usage rather than just its frequency. Jake showed that the audience of a website is more diverse than the population of a physical neighborhood, and that the websites an individual visits already tell us a great deal about the user’s identity. Moreover, performance is substantially better if restricted to stereotypes. In the end, the comparison between the “clean” story and the “real” story calls for another balance, this time between applying complicated models and understanding the raw data.