In our last post, we saw how simple it is to use Cassandra to estimate ad conversion. It’s easy because effectively all you have to do is accumulate counts, and Cassandra is quite good at counting. As we demonstrated in that post, Cassandra can be used as a giant, distributed, redundant, “infinitely” scalable counting framework. In this post we will take the online ad company example a bit further by creating a Cassandra-backed Naive Bayes Classifier. Again, we will see that the “secret sauce” is simply keeping track of the appropriate counts.
In the previous post, we helped equip your online ad company with the ability to track ad conversion rates. But competition is steep, and we’ll need to do a little better than ad conversion rates if your company is to stay on top. Recently, suspicions have arisen that ads are often being shown to unlikely customers. A quick look at the logs confirms this concern. For instance, one internet user clicked almost every single ad that he was shown, so long as it related to camping gear. Several times, he went on to make purchases: a tent, a lantern, and a sleeping bag. But despite this user’s obvious interest in outdoor sporting goods, your logs indicate that fully 90% of the ads he was shown were for women’s apparel. He clicked none of them.
Let’s attack this problem by creating a classifier. Fortunately for us, your company specializes in two main genres: fashion and outdoor sporting goods. If we can determine which type of user we’re dealing with, then we can improve our conversion rates considerably by simply showing users the appropriate ads.
Naive Bayes Classifiers
With this goal in mind, let’s look at the theory behind Naive Bayes Classifiers so that we can build our own. The purpose of a classifier is to identify which group a sample belongs to based upon the given evidence. In this case, our “sample” is an individual user, and based upon the evidence of which ads she clicks, we wish to identify which group she belongs to: fashion or outdoors. To put some math to the problem, consider the following question:
What is the probability that a user is from group $G$, given that this user has clicked on ads $A_1$, $A_2$, and $A_3$?
To put this into equation form, we can write:
$$P(G \mid A_1, A_2, A_3, \ldots)$$
This function returns a probability, a number from 0 to 1, representing how likely it is that this user is from a particular group, given that they have clicked on these ads. The goal, then, is to evaluate this equation for each group and find which group gives the larger result. But how do you evaluate this equation? Fortunately for us, Thomas Bayes, a clergyman from the 18th century, provided an answer in the form of Bayes’s equation:
$$P(G \mid A_1, A_2, A_3, \ldots) = \frac{P(A_1, A_2, A_3, \ldots \mid G)\,P(G)}{P(A_1, A_2, A_3, \ldots)}$$
Here we’ve turned one probability into a function of three separate probabilities:
- $P(G)$: the probability, in the absence of any evidence, that a user is from a particular group (this is called the prior)
- $P(A_1, A_2, A_3, \ldots \mid G)$: the probability that a user from group $G$ will have clicked ads $A_1$, $A_2$, $A_3$, etc.
- $P(A_1, A_2, A_3, \ldots)$: the probability that a user from any group will click ads $A_1$, $A_2$, $A_3$, etc.
This looks a little confusing, but bear with me a moment and we’ll see how this allows us to solve our classification problem. Let’s first look at the prior, $P(G)$. We happen to know that the fashion group and the outdoor group are about equally strong, so for simplicity’s sake, we assume that $P(\text{fashion}) = P(\text{outdoors}) = 0.5$. But remember, we ultimately intend to identify the group which maximizes this equation. Since $P(G) = 0.5$ in both cases, it does not affect the outcome and can safely be disregarded. Next up, $P(A_1, A_2, A_3, \ldots)$. We could estimate the probability that users click on particular groups of ads, but here again we’re looking for the group that maximizes the above equation, and since the value of $P(A_1, A_2, A_3, \ldots)$ is not a function of the group in consideration, this component remains constant across all groups and can also safely be disregarded.
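To make the argument about discarding terms concrete, here is a minimal Python sketch. The likelihood values are made-up numbers chosen purely for illustration (they do not come from any real logs). It computes the full posterior using Bayes’s equation and then checks that comparing the likelihoods alone picks the same group.

```python
# Hypothetical values of P(A_1, A_2, ... | G) for one user's clicks.
# These numbers are invented for illustration only.
likelihood = {"fashion": 0.002, "outdoors": 0.03}

# Equal priors, as assumed in the text.
prior = {"fashion": 0.5, "outdoors": 0.5}

# P(A_1, A_2, ...): the same number no matter which group we evaluate.
evidence = sum(likelihood[g] * prior[g] for g in prior)

# Full Bayes's equation for each group.
posterior = {g: likelihood[g] * prior[g] / evidence for g in prior}

# The winner is the same with or without the prior and the evidence term.
best_by_posterior = max(posterior, key=posterior.get)
best_by_likelihood = max(likelihood, key=likelihood.get)
print(best_by_posterior, best_by_likelihood)
```

Note that this shortcut relies on the priors being equal. If one group were much larger than the other, the prior would have to stay in the comparison.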
The only piece left is $P(A_1, A_2, A_3, \ldots \mid G)$, the probability that a user from group $G$ will click on ads $A_1$, $A_2$, $A_3$, etc. Since this piece is a function of $G$, we cannot disregard it, so we must somehow compute it. And our goal, again, is to find the group which maximizes this probability. But we have a problem. This particular probability, as stated, cannot be estimated in practice. With $n$ ads there are $2^n$ possible sets of clicks, so it is infeasible to gather enough data to estimate the probability that a member of group $G$ will click on any particular set of ads. So we do what any good applied mathematician does when hitting a wall like this: we make a simplifying assumption. If we assume that, within a given group, ad clicks are completely independent of one another, then we can deal with each of them separately. Thus:
$$P(A_1, A_2, A_3, \ldots \mid G) = P(A_1 \mid G)\,P(A_2 \mid G)\,P(A_3 \mid G)\cdots$$
Here, each piece, $P(A_1 \mid G)$, $P(A_2 \mid G)$, $P(A_3 \mid G)$, etc., is actually quite simple to estimate. And though this assumption might be a bit naive (this is, after all, the reason that this classifier is called the Naive Bayes Classifier), the resulting classifier has empirically been shown to work quite well across a wide range of applications, even in certain cases where this assumption is not only naive but actually quite wrong.
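As a rough illustration of why each per-ad probability is easy to estimate, the sketch below derives it from two counts per (group, ad) pair. Everything here is an assumption made for the example: the in-memory `Counter` objects merely stand in for whatever counters the real system keeps, the names `record` and `click_probability` are hypothetical, the estimate is taken as clicks divided by impressions, and the add-one smoothing is an extra that this post does not discuss.

```python
from collections import Counter

# (group, ad) -> count. In-memory stand-ins for distributed counters.
impressions = Counter()
clicks = Counter()

def record(group, ad, clicked):
    """Record that a user from `group` was shown `ad`, and whether they clicked."""
    impressions[(group, ad)] += 1
    if clicked:
        clicks[(group, ad)] += 1

def click_probability(group, ad):
    """Estimate P(ad | group) as clicks / impressions.

    Add-one smoothing (an assumption of this sketch) keeps the estimate
    away from exactly 0 or 1 when there is little or no data.
    """
    return (clicks[(group, ad)] + 1) / (impressions[(group, ad)] + 2)

# Made-up events: the tent ad shown four times to outdoors users, clicked three times.
for clicked in (True, True, True, False):
    record("outdoors", "tent", clicked)

print(click_probability("outdoors", "tent"))   # (3 + 1) / (4 + 2)
print(click_probability("fashion", "tent"))    # no data yet: (0 + 1) / (0 + 2)
```

The point is that nothing beyond incrementing and reading counters is needed, which is why a counting-friendly store suits this classifier.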
Finally, we have arrived at something we can deal with. Let’s take a moment to recap: in order to find the most likely group that a user belongs to based upon their ad clicks $A_1$, $A_2$, $A_3$, etc., we must find which group maximizes the equation:
$$P(G \mid A_1, A_2, A_3, \ldots)$$
But after using Bayes’s equation and discarding some unnecessary pieces, we recognize that we can determine the most likely group by finding the group which maximizes this function:
$$P(A_1, A_2, A_3, \ldots \mid G)$$
And finally, after making the simplifying assumption regarding the independence of clicks, we see that determining the most likely group for the user based upon ad clicks is as simple as finding the group which maximizes this equation:
$$P(A_1 \mid G)\,P(A_2 \mid G)\,P(A_3 \mid G)\cdots$$
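The final decision rule, picking the group with the largest product of per-ad click probabilities, can be sketched in a few lines of Python. The probability table below is invented for illustration, and `classify` is a hypothetical helper name. One design choice goes beyond the text: the sketch sums logarithms instead of multiplying probabilities. Because the logarithm is monotonic, this selects the same group while avoiding numerical underflow when a user has clicked many ads.

```python
import math

# Hypothetical estimates of P(A_i | G). Invented numbers, not real data.
click_prob = {
    "fashion":  {"tent": 0.01, "lantern": 0.02, "dress": 0.30},
    "outdoors": {"tent": 0.25, "lantern": 0.20, "dress": 0.01},
}

def classify(clicked_ads, click_prob):
    """Return the group G that maximizes the product of P(A_i | G).

    Works with the sum of log-probabilities, which has the same argmax
    as the product. Assumes every clicked ad has a nonzero estimate
    for every group.
    """
    scores = {
        group: sum(math.log(probs[ad]) for ad in clicked_ads)
        for group, probs in click_prob.items()
    }
    return max(scores, key=scores.get)

print(classify(["tent", "lantern"], click_prob))
print(classify(["dress"], click_prob))
```

With these made-up numbers, a user who clicked the tent and lantern ads scores 0.25 × 0.20 for outdoors against 0.01 × 0.02 for fashion, so the outdoors group wins.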