 

Custom kernels for SVM, when to apply them?

I am new to the machine learning field and am currently trying to get a grasp of how the most common learning algorithms work and when to apply each of them. At the moment I am learning how Support Vector Machines work and have a question about custom kernel functions.
There is plenty of information on the web about the more standard (linear, RBF, polynomial) kernels for SVMs. I would, however, like to understand when it is reasonable to go for a custom kernel function. My questions are:

1) What are other possible kernels for SVMs?
2) In which situations would one apply custom kernels?
3) Can a custom kernel substantially improve the prediction quality of an SVM?

kroonike asked May 26 '16 18:05

People also ask

Which kernel should I use for SVM?

Different SVM algorithms use different kinds of kernel functions, for instance linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid. The most commonly used kernel function is RBF, because it is localized and has a finite response across the entire real axis.

Why do we use kernels in SVM?

A "kernel" is one of a set of mathematical functions that give a Support Vector Machine a window through which to manipulate the data. The kernel function transforms the training data so that a non-linear decision surface can be expressed as a linear boundary in a higher-dimensional space.

How can you specify your own kernel function in the SVM?

You can define your own kernels either by passing the kernel as a function or by precomputing the Gram matrix. In the first case you write a function that, given two sets of samples, returns their Gram matrix; in the second you compute that matrix yourself and hand it to the SVM directly.
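
As a concrete sketch (assuming scikit-learn's SVC, which the question does not name but which supports both options; the rbf_kernel helper below is just an illustrative custom kernel):

    import numpy as np
    from sklearn.svm import SVC

    # Toy 2-D data with two classes.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 3])
    y = np.array([0] * 20 + [1] * 20)

    def rbf_kernel(A, B, gamma=0.5):
        # A custom kernel must accept two sample matrices of shapes (n_a, d) and (n_b, d)
        # and return the (n_a, n_b) Gram matrix.
        sq_dists = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq_dists)

    # Option 1: pass the kernel as a callable.
    clf = SVC(kernel=rbf_kernel).fit(X, y)

    # Option 2: precompute the Gram matrix and declare the kernel as 'precomputed'.
    clf2 = SVC(kernel='precomputed').fit(rbf_kernel(X, X), y)

    # At prediction time the precomputed variant needs the (n_test, n_train) Gram matrix.
    X_test = rng.randn(5, 2) + 1.5
    print(clf.predict(X_test))
    print(clf2.predict(rbf_kernel(X_test, X)))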

Why do we apply kernel trick?

The ultimate benefit of the kernel trick is that the objective function we optimize to fit the higher-dimensional decision boundary only involves dot products of the transformed feature vectors. Therefore, we can simply substitute these dot-product terms with the kernel function and never compute ϕ(x) explicitly.
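
A small numerical check (my own illustration, using a degree-2 polynomial kernel as the example) makes this concrete: the kernel value K(x, y) = (xᵀy)² equals the dot product of the explicitly mapped features ϕ(x) and ϕ(y), so ϕ never has to be materialized:

    import numpy as np

    def phi(v):
        # Explicit degree-2 feature map for a 2-D vector (x1, x2):
        # phi(v) = (x1^2, x2^2, sqrt(2) * x1 * x2)
        x1, x2 = v
        return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    def poly_kernel(u, v):
        # Degree-2 polynomial kernel, computed directly in the original space.
        return (u @ v) ** 2

    x = np.array([1.0, 2.0])
    y = np.array([3.0, -1.0])

    print(phi(x) @ phi(y))    # 1.0: dot product in the mapped space
    print(poly_kernel(x, y))  # 1.0: same value, without ever computing phi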


1 Answer

1) What are other possible kernels for SVMs?

There are infinitely many of them; see for example the list of kernels implemented in pykernels (which is far from exhaustive):

https://github.com/gmum/pykernels

  • Linear
  • Polynomial
  • RBF
  • Cosine similarity
  • Exponential
  • Laplacian
  • Rational quadratic
  • Inverse multiquadratic
  • Cauchy
  • T-Student
  • ANOVA
  • Additive Chi^2
  • Chi^2
  • MinMax
  • Min/Histogram intersection
  • Generalized histogram intersection
  • Spline
  • Sorensen
  • Tanimoto
  • Wavelet
  • Fourier
  • Log (CPD)
  • Power (CPD)

2) In which situations would one apply custom kernels?

Basically in two cases:

  • "simple" ones give very bad results
  • data is specific in some sense and so - in order to apply traditional kernels one has to degenerate it. For example if your data is in a graph format, you cannot apply RBF kernel, as graph is not a constant-size vector, thus you need a graph kernel to work with this object without some kind of information-loosing projection. also sometimes you have an insight into data, you know about some underlying structure, which might help classifier. One such example is a periodicity, you know that there is a kind of recuring effect in your data - then it might be worth looking for a specific kernel etc.
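
As an illustration of the periodicity case above (my own sketch, not part of the original answer; the period and length-scale values are arbitrary), an exp-sine-squared "periodic" kernel can be passed to scikit-learn's SVC as a callable:

    import numpy as np
    from sklearn.svm import SVC

    def periodic_kernel(A, B, period=1.0, length_scale=0.5):
        # Exp-sine-squared kernel: K(x, y) = exp(-2 * sin^2(pi * |x - y| / period) / length_scale^2).
        # A valid (positive semi-definite) kernel, useful when labels follow a repeating pattern.
        dists = np.abs(A[:, None, 0] - B[None, :, 0])  # pairwise |x - y| for 1-D inputs
        return np.exp(-2.0 * np.sin(np.pi * dists / period) ** 2 / length_scale ** 2)

    # Toy 1-D data whose label repeats with period 1: class 1 on the first half of each period.
    rng = np.random.RandomState(0)
    X = rng.uniform(0, 5, size=(200, 1))
    y = (X[:, 0] % 1.0 < 0.5).astype(int)

    clf = SVC(kernel=periodic_kernel).fit(X, y)
    print(clf.score(X, y))  # training accuracy; should be high because the kernel matches the structure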

3) Can a custom kernel substantially improve the prediction quality of an SVM?

Yes. In particular, there always exists a (hypothetical) Bayes-optimal kernel, defined as:

K(x, y) = 1 iff argmax_l P(l|x) == argmax_l P(l|y), and 0 otherwise

In other words, if one has the true probability P(l|x) of label l being assigned to a point x, then one can create a kernel that essentially maps the data points onto one-hot encodings of their most probable labels, thus leading to Bayes-optimal classification (as it attains the Bayes risk).
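
Purely to illustrate the definition (this is a hypothetical construction: most_probable_label below stands in for argmax_l P(l|x), which in reality is exactly what we are trying to learn), one can build the Gram matrix of this kernel on a toy problem and feed it to an SVM as a precomputed kernel:

    import numpy as np
    from sklearn.svm import SVC

    # Toy 1-D problem where the most probable label of x is simply 1 iff x > 0.
    rng = np.random.RandomState(0)
    X = rng.uniform(-1, 1, size=(100, 1))
    y = (X[:, 0] > 0).astype(int)

    def most_probable_label(x):
        # Stand-in for argmax_l P(l | x); knowing this already solves the problem.
        return int(x[0] > 0)

    def bayes_optimal_gram(A, B):
        # K(x, y) = 1 iff the most probable labels of x and y agree, else 0.
        la = np.array([most_probable_label(a) for a in A])
        lb = np.array([most_probable_label(b) for b in B])
        return (la[:, None] == lb[None, :]).astype(float)

    clf = SVC(kernel='precomputed').fit(bayes_optimal_gram(X, X), y)
    print(clf.score(bayes_optimal_gram(X, X), y))  # 1.0: the kernel already encodes the answer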

In practice it is of course impossible to get such a kernel, as it would mean that you had already solved your problem. Still, it shows that there is a notion of an "optimal kernel", and obviously none of the classical ones is of this type (unless your data comes from a very simple distribution). Furthermore, every kernel is a kind of prior over decision functions: the closer the induced family of functions is to the true one, the more likely you are to get a reasonable classifier with an SVM.

lejlot answered Sep 20 '22 22:09