Machine Learning – Naive Bayes Classifier
January 28, 2010
There are three ways to establish a classifier:
1. Model a classification rule directly. Examples include k-NN, decision trees, perceptrons and SVMs.
2. Model the probability of class membership given the input data. An example is the multi-layered perceptron with the cross-entropy cost.
3. Make a probabilistic model of the data within each class. Examples are Naive Bayes and model-based classifiers.
With these three methods, we can also say that:
- 1 and 2 are examples of Discriminative classification
- 3 is an example of Generative classification
- 2 and 3 are examples of Probabilistic classification
The Bayesian Rule is:

P(C|X) = P(X|C) * P(C) / P(X)

where we could say that:
- P(C|X) is the posterior: the probability of class C given the input X
- P(X|C) is the likelihood: the probability of the input X given class C
- P(C) is the prior probability of the class
- P(X) is the evidence: the probability of the input X itself
So for example, if we apply this to a Spam Filter, then P(C) would be the probability that a message is Spam, and P(X|C) is the probability that the given word (input) appears in a message, given that the message is Spam. P(X) is just the probability of that word appearing in any message, estimated from the given training data.
For the Bayesian Rule above, we have to extend it so that the likelihood covers every input attribute:

P(C|X) = P(x1|C) * P(x2|C) * ... * P(xn|C) * P(C) / P(X)

where x1, ..., xn are the individual input attributes.
Naive Bayes is so called because it makes the assumption that all the input attributes are independent: for example, one word doesn't affect another in deciding whether or not a message is spam.
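In code, the unnormalised score (the numerator P(C) times the product of the P(xi|C) terms) can be sketched as a small function; the dictionary layout here is just an assumption for illustration:

```python
def naive_bayes_score(x, prior, likelihoods):
    """Unnormalised Naive Bayes score: P(C) * product of P(x_i | C).

    `x` maps attribute names to observed values, `prior` is P(C), and
    `likelihoods` maps each attribute name to a dict of P(value | C)
    estimated from the training data.
    """
    score = prior
    for attribute, value in x.items():
        score *= likelihoods[attribute][value]
    return score
```

The class with the highest score wins; dividing by P(X) is optional for classification, since it is the same for every class.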
Example – Training
Let's say we have a table that records whether we played tennis under certain circumstances. These could be the outlook of the weather, the temperature, the humidity and the strength of the wind:
So here we have 4 attributes. What we need to do is create "look-up tables" for each of these attributes, and write in the probability that a game of tennis will be played based on each attribute value. In these tables we have to note that there are 5 cases of not being able to play a game, and 9 cases of being able to play a game.
|OUTLOOK|Play = Yes|Play = No|Total|
|---|---|---|---|
|Sunny|2/9|3/5|5/14|
|Overcast|4/9|0/5|4/14|
|Rain|3/9|2/5|5/14|
|TEMPERATURE|Play = Yes|Play = No|Total|
|---|---|---|---|
|Hot|2/9|2/5|4/14|
|Mild|4/9|2/5|6/14|
|Cool|3/9|1/5|4/14|
|HUMIDITY|Play = Yes|Play = No|Total|
|---|---|---|---|
|High|3/9|4/5|7/14|
|Normal|6/9|1/5|7/14|
|WIND|Play = Yes|Play = No|Total|
|---|---|---|---|
|Strong|3/9|3/5|6/14|
|Weak|6/9|2/5|8/14|
We also must note the following probabilities for P(C):
- P(Play=Yes) = 9/14
- P(Play=No) = 5/14
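As a quick sketch, these priors are just the class counts over the 14 training days; the label list below is an assumption standing in for the table's Play column:

```python
from collections import Counter
from fractions import Fraction

# Play column of the 14 training days: 9 "Yes" and 5 "No"
labels = ["Yes"] * 9 + ["No"] * 5

counts = Counter(labels)
total = sum(counts.values())
priors = {c: Fraction(n, total) for c, n in counts.items()}

print(priors["Yes"])  # 9/14
print(priors["No"])   # 5/14
```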
Now we move to the testing phase. Say we are given a new instance and want to know whether we can play a game or not; to find out, we look up the results in the tables above. The new instance is:
- X = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
Firstly we look at the probability that we can play the game, so we use the lookup tables to get:
- P(Outlook=Sunny | Play=Yes) = 2/9
- P(Temperature=Cool | Play=Yes) = 3/9
- P(Humidity=High | Play=Yes) = 3/9
- P(Wind=Strong | Play=Yes) = 3/9
- P(Play=Yes) = 9/14
Next we consider the fact that we cannot play a game:
- P(Outlook=Sunny | Play=No) = 3/5
- P(Temperature=Cool | Play=No) = 1/5
- P(Humidity=High | Play=No) = 4/5
- P(Wind=Strong | Play=No) = 3/5
- P(Play=No) = 5/14
Then, using those results, we have to multiply the whole lot together. So we multiply all the probabilities for Play=Yes:
- P(X|Play=Yes)P(Play=Yes) = (2/9) * (3/9) * (3/9) * (3/9) * (9/14) = 0.0053
And this gives us a value that represents ‘P(X|C)P(C)’, or in this case ‘P(X|Play=Yes)P(Play=Yes)’.
We also have to do the same thing for Play=No:
- P(X|Play=No)P(Play=No) = (3/5) * (1/5) * (4/5) * (3/5) * (5/14) = 0.0206
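As a quick check, the two products above can be computed exactly with Python's fractions module:

```python
from fractions import Fraction as F
from math import prod

# P(x_i | Play=Yes) values from the lookup tables, plus the prior P(Play=Yes)
yes_score = prod([F(2, 9), F(3, 9), F(3, 9), F(3, 9), F(9, 14)])
# P(x_i | Play=No) values, plus the prior P(Play=No)
no_score = prod([F(3, 5), F(1, 5), F(4, 5), F(3, 5), F(5, 14)])

print(float(yes_score))  # ≈ 0.0053
print(float(no_score))   # ≈ 0.0206
```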
Finally, we have to divide both results by the evidence, or 'P(X)'. The evidence for both equations is the same, and we can find the values we need within the 'Total' columns of the look-up tables. Therefore:
- P(X) = P(Outlook=Sunny) * P(Temperature=Cool) * P(Humidity=High) * P(Wind=Strong)
- P(X) = (5/14) * (4/14) * (7/14) * (6/14)
- P(X) = 0.021866
Then, dividing the results by this value:
- P(Play=Yes | X) = 0.0053/0.021866 = 0.2424
- P(Play=No | X) = 0.0206/0.021866 = 0.9421
So, given the probabilities, can we play a game or not? To decide, we look at both probabilities and see which one has the highest value, and that is our answer. (Note that the two values need not sum to exactly 1, because P(X) is itself estimated under the independence assumption; only the comparison matters.) Therefore:
- P(Play=Yes | X) = 0.2424
- P(Play=No | X) = 0.9421
Since 0.9421 is greater than 0.2424, the answer is 'no': we cannot play a game of tennis today.
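Putting the whole testing phase together, a minimal sketch using the exact fractions from the lookup tables looks like this; dividing by the evidence rescales the values but does not change the winner:

```python
from fractions import Fraction as F
from math import prod

# Unnormalised scores P(X | C) * P(C) from the lookup tables
scores = {
    "Yes": prod([F(2, 9), F(3, 9), F(3, 9), F(3, 9), F(9, 14)]),
    "No":  prod([F(3, 5), F(1, 5), F(4, 5), F(3, 5), F(5, 14)]),
}

# Evidence P(X) from the 'Total' columns of the lookup tables
evidence = F(5, 14) * F(4, 14) * F(7, 14) * F(6, 14)

posteriors = {c: s / evidence for c, s in scores.items()}
decision = max(posteriors, key=posteriors.get)
print(decision)  # prints: No
```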