Machine Learning – Support Vector Machines

Hopefully all of this should now be complete, if not tell me, hehe

Say we have the following graph with set of plotted points:

The way that we would normally classify such a data set is by using the perceptron learning rule. We use this rule to learn the weights w, and the bias b. This all comes to the equation of:

where if you remember ‘x’ was the inputs.

Where do we put our decision boundary? Well, the answer is anywhere! In fact, there’s loads of possible positions that our decision boundary can go, so we end up with something like this – which is just horrible!

With the given problem, which is the best decision boundary to select for generalisation? The answer to this is to use a margin.

The margin of a linear classifier is the width that the decision boundary could be increased by before “hitting” any of the data points. The best margin is the maximum margin. This is the margin with the widest width to the data points.

Another note to jot down is what a support vector is. A support vector is a data point that the margin pushes up against, and it is possible to have more than 2 support vectors 🙂

So now this is what a decision boundary with a margin pushing up against support vectors looks like:

Here we have the support vectors, which are the objects shaded in orange. And the margin for the decision boundary is the faded out yellow like box thing, with the line in the middle being the main decision boundary that we have selected.

Before I continue, the way that we calculate the position of the decision boundary is the same way that we calculate it for a perceptron. This means that SVM’s are also linear.

Margin

As mentioned above, the margin is the width that the decision boundary could be increased by before “hitting” any of the data points.

The way that we calculate the width of the margin is by using the weights that we have used to calculate the position of the boundary. This equation is:

Support Vectors

Do you remember what a support vector is? 😀 Well, i’ll re-jog your memory anyway, hehe. A support vector is the data points that the margin pushes up against, and USUALLY there are always 2 – or more – support vectors.

But, how do we know if a given data point is a support vector? Well the answer is that we use the following equation:

Here we have the following aspects:

• Ysv is the label of the data point, so either (usually) –1 or +1
• W are the weights
• Xsv is the values within the training set. There will be equal number of values to the weights.
• b is the value of the bias.

The reason we make the whole equation equal ‘1’, is because all support vectors are the points that are being pushed up against by the margin. Any value between 1 and 0 is a data point within the margin (which shouldn’t happen… just saying), So that we have 1 being the edge of the margin, which makes that point a support vector for the margin 🙂

The Learning Rule is no longer simple like that of a perceptron as we need to search the space of weights and biases to find the widest margin that matches all the data points or support vectors.

The way we do this is by using a Quadratic Algorithm. However this is too fiddly and difficult to learn here 😉

Kernel

A Kernel function is some function that corresponds to an inner product in the feature space.

A linear kernel is a kernel, or an SVM that can separate linearly separable data points, like the above.

But what happens if we have something that requires, for example, a rounded decision boundary, such that the data points aren’t linearly separable? Well the answer to this is to take the data points to a higher dimension! So for example take 2D to 3D, or 3D to 4D, or 255D to 256D…. yeah, i’ll stop now, lol 😀

This kind of Kernel is known as a polynomial kernel of order ‘p’, where ‘p’ (i think, if not please tell me!) is the dimension.

Kernel “Trick”

The Kernel trick is when we extend a linear SVM to a non-linear SVM. This is when the data points are mapped onto a higher dimensional feature space (like mentioned above) so that the data points are linearly separable in that space.

Example

Now for the moment you’ve all being waiting for 😀

Okay, say we were given a set of 3D input data, and we need to apply the SVM learning algorithm to it to achieve an optimal decision plane of:

And we were also given three examples in the training set and we need to say whether or not if they are support vectors. These examples are:

1. A: (-1, –1, –2) with label -1
2. B: (2, –2, –1) with label -1
3. C: (-2, –2, 2) with label 1

The first thing that we need to do is get the weights. This part is fairly simple, look over at the H(x) equation, the set of weights are the values next to the ‘x’s, so we have the set of weights as:

• w: (1, 2, 2)

as well as a bias value of +3.

Margin

Now we need to calculate the margin. And remembering from above that the weights are related to the margin, we get the margin value of:

• M = 2 / [ sqrt(1^2 + 2^2 + 2^"2) ]    =     2 / [ sqrt(9) ]     =     2/3

Support Vectors

Remembering again from above the way that we calculate the way to see if a data point is a support vector, we substitute all the values given into the equation. (hint: its the one where we check to see if we get a value equal to 1 😉 )

Also, i best point out, that ‘y’ in the equation is equal to the value of the labels for each data point, so for example ‘y’ is equal to ‘-1’ for data point A.

• For the data point A we get:

(-1) * [ 1*(-1)  +  2*(-1)  +  2*(-2)  +  3 ]  =  4

• For the data point B we get:

(-1) * [ 1*2  +  2*(-2)  +  2*(-1)  +  3 ]  =  1

• For the data point C we get:

1 * [ 1*(-2)  +  2*(-2)  +  2*2  +  3 ]  =  1

So as we can see, data points B and C are support vectors, whereas A is not.