# Machine Learning – Decision Trees and Entropy

January 28, 2010 2 Comments

# Decision Trees

If anyone requires further explanation, please get in touch with me (or anyone else for that matter)!

In machine learning, it can be desirable to come up with meaningful **if-then** rules in order to predict what will occur in the future. For example, “**if **this **and if** that** then** this will probably happen”. Decision trees can be built automatically, which can then be used to come up with these **if-then** rules.

A decision tree is used to investigate huge amounts of data and come up with the most probable outcomes. Each condition is an **internal** node on the tree. Each outcome is an **external** node.

The concept of **Information Gain** – pieces of information acquired can effect the level of certainty in a particular prediction. If one knows more information about a particular event that is about to occur then one can be more certain of what that event will be. For example, if you know that Apple has decorated its product announcement conference venue with splashes of paint, you can make a wild guess that they will be releasing a tablet-like device. If you then find out that an executive for an ebook store claims they’ve been working closely with Apple on ebooks then you now have a greater degree of certainty that the product they’re going to announce will be a tablet, and so on.

This concept of information gain can be used to make machine learning predictions.

The level of certainty of a particular decision can be measured as a number from 1 (completely uncertain) to 0 (completely certain).

**Information Theory** (developed by Claude Shannon 1948) defines this value of uncertainty as **entropy**, a probability-based measure used to calculate the amount of uncertainty.

For example, if an event has a 50/50 probability, the entropy is 1. If the probability is 25/75 then the entropy is a little lower.

The goal in machine learning is to get a very low entropy in order to make the most accurate decisions and classifications. In order to calculate the entropy, the following formula is used:

**H = – SUM_OF_ALL_THE_EVENTS(the probability of the event * log2(the probability of the event).**

H stands for the entropy. The minus sign is used to create a positive value for the entropy. The logarithm is used to make more compact and efficient decision trees.

The uncertainty (entropy) of the decision event for various situations can now be calculated. The entropy can change when the information is narrowed down (i.e. when the dataset is split on a particular piece of information). The branches at that information split (i.e. when a new pieces of information is being considered) that lower the entropy can be calculated. The difference between the initial entropy and the new entropy after following a branch gives the amount of information gained.

Thus, we have the following notations and observations:

H(B) is the entropy of the original decision.

H(B | A = C) is the entropy of the original decision given the some information I provided is C.

H(B | A) is the weighted average of all H(B | A = Ci) where i is the number of different alternatives for A.

H(B) – H(B | D) is the information gain.

A continuous piece of information can be make discrete using a threshold (or thresholds).

### Building the Tree

Decision trees can be constructed using the ID3 algorithm that splits the data by the feature with the maximum information gain recursively for each branch.

maketree ( featurelist, examples ) returns tree

{

BASE CASE: **if** featurelist is empty, or entropy = 0

**return** an empty tree with leaf = majority answer in examples

RECURSION:

**find** the feature **X** with the largest information gain,

list_subset = remove **X** from the featurelist

**create** an empty tree T

**for each** possible value ‘x’ of feature **X**

data_subset = get all examples where **X** = ‘x’

t = maketree( list_subset, data_subset )

**add** t as a new sub-branch to T

**endfor**

**return** T

}

ID3 Algorithm in English:

The algorithm looks at each feature within the featurelist and determines which will provide the largest information gain (**X**). Once **X** is found it can be removed from the list of candidates to be considered. A **newfeaturelist** and a **newdata_subset** are created which are subsets of the original **featurelist** and **newdata_subset** respectively (exluding feature **X**). Each possible value of the feature **X** is recursively called with the **newfeaturelist **and the narrowed down examples of **newdata_subset**, so the algorithm will continue performing the steps indicated. The base case is reached when a **featurelist** is provided that has no features in it (so the features have been exhausted), or where the entropy is equal to 0 (there’s complete certainty). For these cases, the algorithm returns a leaf node consisting of the most probable outcome.

Other Notes

Overfitting can occur when the tree is very large and results in very obscure rules.

Round two! – The whole lecture again!

www.decisiontrees.net – a recomended website for help with this.

Jaaaames, in you entropy above, you said that if entropy is 1 then then you’re completely certain of the outcome, and if 0 then completely uncertain. Thats the wrong way around :D It should be 1 is completely uncertain and 0 is completely certain ;)

meep meep – changed!