Machine Learning – Decision Trees and Entropy
January 28, 2010
If anyone requires further explanation, please get in touch with me (or anyone else for that matter)!
In machine learning, it can be desirable to come up with meaningful if-then rules in order to predict what will occur in the future. For example, “if this and if that then this will probably happen”. Decision trees can be built automatically, which can then be used to come up with these if-then rules.
A decision tree is used to investigate large amounts of data and predict the most probable outcomes. Each condition is an internal node of the tree. Each outcome is a leaf (external) node.
The concept of Information Gain – pieces of information acquired can affect the level of certainty in a particular prediction. If you know more about an event that is about to occur, you can be more certain of what that event will be. For example, if you know that Apple has decorated its product announcement venue with splashes of paint, you can make a wild guess that they will be releasing a tablet-like device. If you then find out that an executive of an ebook store claims they’ve been working closely with Apple on ebooks, you now have a greater degree of certainty that the product they’re going to announce will be a tablet, and so on.
This concept of information gain can be used to make machine learning predictions.
The level of certainty of a particular decision can be measured as a number between 0 (completely certain) and 1 (completely uncertain).
Information Theory (developed by Claude Shannon in 1948) defines this value of uncertainty as entropy, a probability-based measure of the amount of uncertainty.
For example, if an event has a 50/50 probability, the entropy is 1. If the probability is 25/75, the entropy is a little lower (about 0.81).
The goal in machine learning is to get a very low entropy in order to make the most accurate decisions and classifications. In order to calculate the entropy, the following formula is used:
H = – Σ p(x) · log2( p(x) ), where the sum runs over all possible events x and p(x) is the probability of event x.
H stands for the entropy. The minus sign makes the entropy a positive value (since log2 of a probability is never positive). The base-2 logarithm means the entropy is measured in bits.
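As a quick sketch, the entropy formula can be written in a few lines of Python (the function name is my own, not from the lecture):

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over all events."""
    # By convention 0 * log2(0) is treated as 0, so zero-probability
    # events are skipped.
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # 50/50 event -> 1.0
print(entropy([0.25, 0.75]))  # 25/75 event -> about 0.81
```

This reproduces the two examples above: a 50/50 event has entropy exactly 1 bit, and a 25/75 event has slightly less.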
The uncertainty (entropy) of the decision event can now be calculated for various situations. The entropy changes when the information is narrowed down (i.e. when the dataset is split on a particular piece of information). For each branch at such a split (each new piece of information being considered), the resulting entropy can be calculated. The difference between the initial entropy and the new entropy after following a branch gives the amount of information gained.
Thus, we have the following notations and observations:
H(B) is the entropy of the original decision.
H(B | A = Ci) is the entropy of the original decision given that the piece of information A has the value Ci.
H(B | A) is the weighted average of all H(B | A = Ci), where the Ci are the different alternatives for A.
H(B) – H(B | A) is the information gain.
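To make the notation concrete, here is a small Python sketch that computes H(B), H(B | A) and the information gain on a made-up dataset (the dataset and function names are my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(B): entropy (in bits) of a list of outcomes."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def conditional_entropy(examples, feature, label):
    """H(B | A): weighted average entropy over each value of `feature`."""
    weighted = 0.0
    for value in {e[feature] for e in examples}:
        subset = [e[label] for e in examples if e[feature] == value]
        weighted += (len(subset) / len(examples)) * entropy(subset)
    return weighted

# Made-up dataset: two yes/no features and a yes/no outcome.
data = [
    {"paint": "yes", "ebooks": "yes", "tablet": "yes"},
    {"paint": "yes", "ebooks": "no",  "tablet": "yes"},
    {"paint": "no",  "ebooks": "yes", "tablet": "no"},
    {"paint": "no",  "ebooks": "no",  "tablet": "no"},
]
h_b = entropy([e["tablet"] for e in data])           # H(B) = 1.0 (50/50 split)
h_b_given_a = conditional_entropy(data, "paint", "tablet")
print(h_b - h_b_given_a)  # information gain: 1.0 here (paint predicts perfectly)
```

Splitting on “paint” drives the entropy to 0, so the gain equals the full initial entropy; splitting on “ebooks” leaves both branches at a 50/50 mix, so it gains nothing.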
A continuous piece of information can be made discrete using a threshold (or thresholds).
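For instance, a continuous temperature reading could be split at a threshold of 20 degrees (the threshold, values and names below are made up for illustration):

```python
def discretize(value, threshold):
    """Map a continuous value onto one of two discrete categories."""
    return "high" if value >= threshold else "low"

temperatures = [12.5, 30.1, 18.0, 25.9]
print([discretize(t, threshold=20.0) for t in temperatures])
# -> ['low', 'high', 'low', 'high']
```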
Building the Tree
Decision trees can be constructed using the ID3 algorithm, which recursively splits the data on the feature with the maximum information gain at each branch.
maketree ( featurelist, examples ) returns tree
    BASE CASE: if featurelist is empty, or entropy of examples = 0
        return a leaf node with the majority answer in examples
    find the feature X with the largest information gain
    list_subset = remove X from the featurelist
    create a tree T with X at its root
    for each possible value ‘x’ of feature X
        data_subset = get all examples where X = ‘x’
        t = maketree( list_subset, data_subset )
        add t as the sub-branch of T for value ‘x’
    return T
ID3 Algorithm in English:
The algorithm looks at each feature in the featurelist and determines which one provides the largest information gain (call it X). Once X is found, it is removed from the list of candidate features, giving list_subset. For each possible value of X, the examples are narrowed down to data_subset (those examples where X takes that value), and maketree is called recursively with list_subset and data_subset, so the algorithm repeats these steps on each branch. The base case is reached when the featurelist is empty (the features have been exhausted), or when the entropy is equal to 0 (there is complete certainty). In these cases, the algorithm returns a leaf node containing the most probable outcome.
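The pseudocode above can be sketched in Python roughly as follows; the dictionary-based tree representation, the dataset and the helper names are my own assumptions, not from the lecture:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(examples, feature, label):
    """Entropy drop from splitting `examples` on `feature`."""
    base = entropy([e[label] for e in examples])
    weighted = 0.0
    for value in {e[feature] for e in examples}:
        subset = [e[label] for e in examples if e[feature] == value]
        weighted += (len(subset) / len(examples)) * entropy(subset)
    return base - weighted

def maketree(featurelist, examples, label):
    labels = [e[label] for e in examples]
    # BASE CASE: no features left, or entropy 0 (complete certainty):
    # return a leaf holding the majority answer in the examples.
    if not featurelist or entropy(labels) == 0:
        return Counter(labels).most_common(1)[0][0]
    # Find the feature X with the largest information gain.
    best = max(featurelist, key=lambda f: information_gain(examples, f, label))
    list_subset = [f for f in featurelist if f != best]
    tree = {best: {}}  # the root of this subtree is the chosen feature
    for value in {e[best] for e in examples}:
        data_subset = [e for e in examples if e[best] == value]
        tree[best][value] = maketree(list_subset, data_subset, label)
    return tree

# Tiny made-up dataset echoing the Apple example above.
data = [
    {"paint": "yes", "ebooks": "yes", "tablet": "yes"},
    {"paint": "yes", "ebooks": "no",  "tablet": "yes"},
    {"paint": "no",  "ebooks": "yes", "tablet": "no"},
    {"paint": "no",  "ebooks": "no",  "tablet": "no"},
]
print(maketree(["paint", "ebooks"], data, "tablet"))
# -> {'paint': {'yes': 'yes', 'no': 'no'}}
```

Here the algorithm splits on “paint” (the highest-gain feature), both branches immediately reach entropy 0, and the recursion returns leaves.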
Overfitting can occur when the tree grows very large and produces overly specific rules that do not generalize well to new data.
www.decisiontrees.net – a recommended website for help with this.