Machine Learning – Naive Bayes Classifier


There are three methods for building a classifier:

  1. Model a classification rule directly
    Examples include k-NN, decision trees, perceptrons and SVMs.
  2. Model the probability of class membership given the input data
    An example is a multi-layer perceptron trained with the cross-entropy cost.
  3. Make a probabilistic model of the data within each class
    Examples are Naive Bayes and model-based classifiers.

Of these three methods, we can also say that:

  • 1 and 2 are examples of Discriminative classification
  • 3 is an example of Generative classification
  • 2 and 3 are examples of Probabilistic classification


Bayesian Rule

The Bayesian Rule is:

  P(C|X) = P(X|C) P(C) / P(X)

where we could say that:

  • P(C|X) is the posterior: the probability of class C given the input X
  • P(X|C) is the likelihood: the probability of the input given the class
  • P(C) is the prior probability of the class
  • P(X) is the evidence: the probability of the input itself
So, for example, if we apply this to a spam filter, then P(C) would be the prior probability that a message is spam, P(X|C) is the probability that the given word (the input) appears in a message, given that the message is spam, and P(X) is just the probability of the word appearing in any message, estimated from the given training data.
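To make this concrete, here is a tiny numeric sketch of the rule; the probabilities below are made up purely for illustration:

```python
# Hypothetical spam-filter numbers (not from any real data):
p_spam = 0.4              # P(C): prior probability that a message is spam
p_word_given_spam = 0.25  # P(X|C): probability the word appears, given spam
p_word = 0.12             # P(X): probability the word appears in any message

# Bayes' rule: P(C|X) = P(X|C) * P(C) / P(X)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 4))  # 0.8333
```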


Naive Bayes

For the Bayesian Rule above, we have to extend it so that we have:

  P(C|X1, …, Xn) = P(X1|C) P(X2|C) … P(Xn|C) P(C) / P(X1, …, Xn)

where, if we continue with the spam filter idea, X1, …, Xn would be the inputs, i.e. the words from the training data.

Naive Bayes is so called because it assumes that all the input attributes are independent: for instance, one word does not affect another in deciding whether or not a message is spam.
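Under this assumption, the joint likelihood factorises into a product of per-attribute likelihoods, which is trivial to compute. A minimal sketch, again with made-up per-word probabilities:

```python
import math

# Hypothetical P(word_i | spam) for the three words in a message:
word_probs_given_spam = [0.3, 0.05, 0.2]
p_spam = 0.4  # prior P(C = spam)

# Naive Bayes: P(X1, ..., Xn | C) = P(X1|C) * ... * P(Xn|C)
likelihood = math.prod(word_probs_given_spam)
score = likelihood * p_spam
print(score)  # ≈ 0.3 * 0.05 * 0.2 * 0.4 = 0.0012
```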


Example – Training

Let's say we have a table recording whether we played tennis under certain circumstances. These could be the outlook of the weather, the temperature, the humidity, and the strength of the wind:


Day  Outlook   Temperature  Humidity  Wind    Play Tennis?
 1   Sunny     Hot          High      Weak    No
 2   Sunny     Hot          High      Strong  No
 3   Overcast  Hot          High      Weak    Yes
 4   Rain      Mild         High      Weak    Yes
 5   Rain      Cool         Normal    Weak    Yes
 6   Rain      Cool         Normal    Strong  No
 7   Overcast  Cool         Normal    Strong  Yes
 8   Sunny     Mild         High      Weak    No
 9   Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No


So here we have four attributes. What we need to do is create "look-up tables" for each of these attributes, writing in the probability that a game of tennis will (or will not) be played given each attribute value. In these tables we have to note that there are 5 cases of not being able to play a game, and 9 cases of being able to play a game.


OUTLOOK   Play=Yes  Play=No  Total
Sunny     2/9       3/5      5/14
Overcast  4/9       0/5      4/14
Rain      3/9       2/5      5/14


TEMPERATURE  Play=Yes  Play=No  Total
Hot          2/9       2/5      4/14
Mild         4/9       2/5      6/14
Cool         3/9       1/5      4/14


HUMIDITY  Play=Yes  Play=No  Total
High      3/9       4/5      7/14
Normal    6/9       1/5      7/14


WIND    Play=Yes  Play=No  Total
Strong  3/9       3/5      6/14
Weak    6/9       2/5      8/14


We also must note the following probabilities for P(C):

  • P(Play=Yes) = 9/14
  • P(Play=No) = 5/14
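These look-up tables and the priors can be derived mechanically by counting. A rough sketch (the dataset is hard-coded from the table above, and the helper names are my own):

```python
from collections import Counter

# The 14-day data set from the table above:
# (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

class_counts = Counter(row[-1] for row in data)  # {'Yes': 9, 'No': 5}

# cond_counts[(attribute_index, value, cls)] = number of matching days
cond_counts = Counter()
for row in data:
    cls = row[-1]
    for i, value in enumerate(row[:-1]):
        cond_counts[(i, value, cls)] += 1

def likelihood(attr_index, value, cls):
    """P(attribute = value | class), as a simple frequency ratio."""
    return cond_counts[(attr_index, value, cls)] / class_counts[cls]

print(likelihood(0, "Sunny", "Yes"))    # 2/9, as in the OUTLOOK table
print(class_counts["Yes"] / len(data))  # prior P(Play=Yes) = 9/14
```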



Now we are in the testing phase. Suppose we are given a new instance and want to know whether we can play a game; we need to look up the relevant values in the tables above. The new instance is:

  • X = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)

First we look at the probability that we can play the game, using the look-up tables:

  • P(Outlook=Sunny | Play=Yes) = 2/9
  • P(Temperature=Cool | Play=Yes) = 3/9
  • P(Humidity=High | Play=Yes) = 3/9
  • P(Wind=Strong | Play=Yes) = 3/9
  • P(Play=Yes) = 9/14

Next we do the same for the case that we cannot play a game:

  • P(Outlook=Sunny | Play=No) = 3/5
  • P(Temperature=Cool | Play=No) = 1/5
  • P(Humidity=High | Play=No) = 4/5
  • P(Wind=Strong | Play=No) = 3/5
  • P(Play=No) = 5/14

Then, using those results, we multiply the whole lot together. So we multiply all the probabilities for Play=Yes:

  • P(X|Play=Yes)P(Play=Yes) = (2/9) * (3/9) * (3/9) * (3/9) * (9/14) = 0.0053

And this gives us a value that represents ‘P(X|C)P(C)’, or in this case ‘P(X|Play=Yes)P(Play=Yes)’.

We also have to do the same thing for Play=No:

  • P(X|Play=No)P(Play=No) = (3/5) * (1/5) * (4/5) * (3/5) * (5/14) = 0.0206

Finally, we have to divide both results by the evidence, or ‘P(X)’. The evidence for both equations
is the same, and we can find the values we need within the ‘Total’ columns of the look-up tables. Therefore:

  • P(X) = P(Outlook=Sunny) * P(Temperature=Cool) * P(Humidity=High) * P(Wind=Strong)
  • P(X) = (5/14) * (4/14) * (7/14) * (6/14)
  • P(X) = 0.02186

Then, dividing the results by this value:

  • P(Play=Yes | X) = 0.0053/0.02186 = 0.2424
  • P(Play=No | X) = 0.0206/0.02186 = 0.9424

So, given these probabilities, can we play a game or not? We simply look at both probabilities and take the one with the higher value, and that is our answer. Therefore:

  • P(Play=Yes | X) = 0.2424
  • P(Play=No | X) = 0.9424

Since 0.9424 is greater than 0.2424, the answer is 'no': we cannot play a game of tennis today.
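The whole test-phase calculation can be reproduced in a few lines; the fractions are read straight off the look-up tables above (this is a sketch of the hand calculation, not a general-purpose classifier):

```python
import math

# Likelihoods for X = (Sunny, Cool, High, Strong), from the look-up tables:
yes = [2/9, 3/9, 3/9, 3/9]  # P(Xi | Play=Yes)
no = [3/5, 1/5, 4/5, 3/5]   # P(Xi | Play=No)
p_yes, p_no = 9/14, 5/14    # class priors P(C)

score_yes = math.prod(yes) * p_yes  # P(X|Play=Yes) P(Play=Yes)
score_no = math.prod(no) * p_no     # P(X|Play=No) P(Play=No)
print(round(score_yes, 4), round(score_no, 4))  # 0.0053 0.0206

# The evidence P(X), from the 'Total' columns:
p_x = (5/14) * (4/14) * (7/14) * (6/14)
# Dividing by P(X) scales both scores but cannot change which is larger:
print(score_yes / p_x, score_no / p_x)  # ≈ 0.242 and ≈ 0.941

# The larger score wins:
print("Play=Yes" if score_yes > score_no else "Play=No")  # prints Play=No
```

Note that the two "posteriors" do not sum to 1, because P(X) is itself approximated with the same independence assumption; for classification only the comparison of the two unnormalised scores matters, which is why many implementations skip the division entirely.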


About Badgerati
Computer Scientist, Games Developer, and DevOps Engineer. Fantasy and Sci-fi book lover, also founder of Cadaeic Studios.

8 Responses to Machine Learning – Naive Bayes Classifier

  1. Srego says:

    Nice tutorial, very conceptual and it made the idea easy to digest.
    I have a question;
    What would you do if you saw an unseen event? For example Humidity is Average.
    This will yield 0 probability for the whole test instance.



    • Badgerati says:


      Thanks for the comment.

      When I wrote this post I should have clearly stated that all of the values for training and testing were discrete. With that in mind, if we saw an unseen event, take your example of Humidity being Average, then quite simply, yes, the probability yielded would be 0, as we never originally trained the classifier on that event. This is because, with discrete values, the Naive Bayes classifier can be viewed as a linear classifier; that is, every such Naive Bayes classifier corresponds to a hyperplane decision over the original values, with no unseen events occurring.

      However, if we instead used continuous values rather than discrete ones, then we would map the events onto a curve, meaning that Naive Bayes is no longer linear; we could then assign a probability to unseen events like Humidity being Average.
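      For example (a rough sketch, with made-up humidity readings), a continuous attribute could be modelled with a Gaussian likelihood, so that an unseen value still gets a non-zero probability:

```python
import math

def gaussian_likelihood(x, mean, std):
    """P(x | class) for a continuous attribute, modelled as a Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Made-up humidity readings (as percentages) on days we played:
humidity_yes = [70, 68, 80, 75, 65, 70, 72, 78, 66]
mean = sum(humidity_yes) / len(humidity_yes)
std = math.sqrt(sum((h - mean) ** 2 for h in humidity_yes) / len(humidity_yes))

# An unseen, "average" humidity value still gets a non-zero likelihood:
print(gaussian_likelihood(71, mean, std) > 0)  # prints True
```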

      Hope that helped


  2. Srego says:

    Thank you Badgerati.
    This helped.
    By the way, nice blog; you can be sure that this blog has a follower :D.

  3. Foo says:

    Any reason why you have not divided by P(X1,X2,..Xn) as shown in the formula? You have multiplied P(X1,…|C) with P(C) but you have not divided it by P(X1,X2,…Xn)?

    • Badgerati says:

      That is a very good point, and I’m sorry this took so long to get back to as well. You are correct though, I should have been dividing by P(X1,X2,..Xn). I have now fixed this above.


  4. umek1 says:

    so cool…. easy to understand, mister 😀
    thanks a lot

  5. aamerabbas says:

    Great tutorial. Very concise and easy to read explanation. Thanks!

  6. Preeti Sharma says:

    Very nice way to explain a fresher of Bayesian learning student 🙂 This helped me a lot….Thanks!
