Introduction to Support Vector Machine

Support Vectors and Hyperplane

Before diving deep, let’s first undertand “What is a Hyperplane?”. A hyperplane is a flat subspace having dimensions one less than the dimensions of co-ordinate system it is represented in.In a 2-D space, hyperplane is a line of the form (A_0) + (A_1)(X_1) + (A_2)(X_2) = 0 and in a m-D space, hyperplane is of the form (A_0) + (A_1)(X_1) + (A_2)(X_2) + …. + (A_m)(X_m) = 0

Support Vector machines have some special data points which we call “Support Vectors” and a separating hyperplane which is known as “Support Vector Machine”. So, essentially SVM is a frontier that best segregates the classes.Support Vectors are the data points nearest to the hyperplane, the points of our data set which if removed, would alter the position of the dividing hyperplane. As we can see that there can be many hyperplanes which can segregate the two classes, the hyperplane that we would choose is the one with the highest margin.

The Kernel Trick

The Kernel Trick

We are not always lucky to have a dataset which is lineraly separable by a hyperplane. Fortunately, SVM is capable of fitting non-inear boundaries using a simple and elegant method known as kernel trick. In simple words, it projects the data into higher dimension where it can be separated by a hyperplane and then project back to lower dimensions.

Here, we can imagine an extra feature ‘z’ for each data point “(x,y)” where (z^{2} = x^{2}+y^{2})We have in-built kernels like rbf, poly, etc. which projects the data into higher dimensions and save us the hard work.

SVM objective

SVM objective

Support Vector Machine try to achieve the following two classification goals simultaneously:

Maximize the margin (see fig)
Correctly classify the data points.

There is a loss function which takes into account the loss due to both, ‘a diminishing margin’ and ‘in-correctly classified data point’. There are hyperparameters which can be set for a trade off between the two.Hyperparameters in case of SVM are:

Kernel – “Linear”, “rbf” (default), “poly”, etc. “rbf” and “poly” are mainly for non- linear hyper-plane.
C(error rate) – Penalty for wrongly classified data points. It controls the trade off between a smoother decision boundary and conformance to test data.
Gamma – Kernel coefficient for kernels (‘rbf’, ‘poly’, etc.). Higher values results in overfitting.

Note: Explaining the maths behind the algortihm is beyond the scope of this post.

Some examples of SVM classification

Some examples of SVM classification

A is the best hyperplane.

Fitting non-linear boundary using Kernel trick.

Trade off between smooth booundary and correct classification.