John Cook is an applied mathematician working in Houston, Texas. His career has been a blend of research, software development, consulting, and management. John is a DZone MVB, is not an employee of DZone, and has posted 168 posts at DZone. You can read more from him at his website.

Tweaking Bayes’ Theorem

In Peter Norvig’s talk The Unreasonable Effectiveness of Data, starting at 37:42, he describes a translation algorithm based on Bayes’ theorem. Pick the English word that has the highest posterior probability as the translation. No surprise here. Then at 38:16 he says something curious.

So this is all nice and theoretical and pure, but as well as being mathematically inclined, we are also realists. So we experimented some, and we found out that when you raise that first factor [in Bayes' theorem] to the 1.5 power, you get a better result.

In other words, if we change Bayes’ theorem (!) the algorithm works better. He goes on to explain:

Now should we dig up Bayes and notify him that he was wrong? No, I don’t think that’s it. …
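To make the tweak concrete, here is a minimal sketch of the decision rule. It assumes the "first factor" is the English language model P(e), so the choice becomes the e that maximizes P(e)^1.5 · P(f | e) rather than P(e) · P(f | e). The candidate words, the probabilities, and the function name best_translation are all made up for illustration and do not come from the talk.

```python
import math

# Made-up numbers for illustration only; they are not from Norvig's talk.
# Each candidate English word e maps to (P(e), P(f | e)) for one fixed foreign word f.
CANDIDATES = {
    "house": (0.02, 0.30),
    "home": (0.05, 0.10),
}


def best_translation(candidates, alpha=1.0):
    """Return the word maximizing P(e)**alpha * P(f | e), computed in log space.

    alpha = 1.0 is the plain Bayes rule; alpha = 1.5 is the tweak, under the
    assumption that the exponent applies to the language model P(e).
    """
    def score(word):
        prior, likelihood = candidates[word]
        return alpha * math.log(prior) + math.log(likelihood)

    return max(candidates, key=score)


print(best_translation(CANDIDATES, alpha=1.0))  # plain Bayes
print(best_translation(CANDIDATES, alpha=1.5))  # first factor raised to 1.5
```

With these toy numbers the two rules disagree: raising the prior to a power above 1 gives more weight to how common the English word is, so a candidate with a higher prior can overtake one with a higher likelihood.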

I imagine most statisticians would respond that this cannot possibly be right. While it appears to work, there must be some underlying reason why and we should find that reason before using an algorithm based on an ad hoc tweak.

While such a reaction is understandable, it’s also a little hypocritical. Statisticians constantly draw inferences from empirical data without understanding the underlying mechanisms that generate the data. When analyzing someone else’s data, a statistician will say that of course we’d rather understand the underlying mechanism than fit statistical models, but that’s just not always possible. Reality is too complicated and we’ve got to do the best we can.

I agree, but that same reasoning applied at a higher level of abstraction could be used to accept Norvig’s translation algorithm. Here’s this model (derived from spurious math, but we’ll ignore that). Let’s see empirically how well it works.
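One way to read "see empirically how well it works" is to treat the exponent as a tuning parameter and score each value on held-out examples. The sketch below does that on a tiny, entirely hypothetical data set; the words, the probabilities, and the helper names pick and accuracy are assumptions for illustration, and the numbers say nothing about real translation quality.

```python
import math

# Hypothetical held-out data: each item is (candidates, reference_word), where
# candidates maps an English word e to made-up (P(e), P(f | e)) values.
HELD_OUT = [
    ({"house": (0.02, 0.30), "home": (0.05, 0.10)}, "home"),
    ({"dog": (0.04, 0.50), "hound": (0.001, 0.60)}, "dog"),
    ({"bank": (0.03, 0.20), "shore": (0.01, 0.45)}, "bank"),
]


def pick(candidates, alpha):
    """Choose the word maximizing alpha * log P(e) + log P(f | e)."""
    return max(
        candidates,
        key=lambda e: alpha * math.log(candidates[e][0]) + math.log(candidates[e][1]),
    )


def accuracy(data, alpha):
    """Fraction of held-out items where the chosen word matches the reference."""
    hits = sum(pick(cands, alpha) == truth for cands, truth in data)
    return hits / len(data)


# Compare a few exponents; alpha = 1.0 is the untweaked Bayes rule.
for alpha in (0.5, 1.0, 1.5, 2.0):
    print(alpha, accuracy(HELD_OUT, alpha))
```

The exponent that scores best here is an artifact of the toy data. The point is only the procedure: the model is judged by its measured performance, not by whether the formula is a valid application of Bayes' theorem.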

Published at DZone with permission of John Cook, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Goel Yatendra replied on Thu, 2012/03/15 - 1:31pm

I think the pea was slipped under the other walnut shell when Norvig took the step from “Pr” to “p” — that is, from the population probability to the frequency within a sample. He did offer a very reasonable explanation when he said “I think what’s going on here is that we have more confidence in the first model” that is, we have a much larger sample of English texts than of known French-to-English translations, and hence more confidence that the frequency of a given English string reflects its “true” probability. The assumption of equivalent confidence is necessary to justify application of Bayes’ theorem to any finite sample. I have a truly marvelous proof, but this comment box is too small to contain it.

Kevan Doyle replied on Thu, 2012/03/15 - 7:52pm in response to: Goel Yatendra

Shades of Fermat and the too small margins in his book!
