 

Practicing the kernel trick in SVM

I am reading the theory of SVMs. What I understand about the kernel trick is: if we have data that is not linearly separable in its original n dimensions, we use a kernel to map the data to a higher-dimensional space where it is linearly separable (we have to choose the right kernel for the data set, etc.). However, when I watched this video by Andrew Ng on kernel SVMs, what I understood is that we can map the original data into a smaller space, which confused me. Any explanation?

Could you explain, with a concrete example, how the RBF kernel maps each original data sample x1 = (x11, x12, x13, ..., x1n) to a higher-dimensional space (of dimension m) as X1 = (X11, X12, X13, ..., X1m)? Also, what I understand is that the kernel computes the inner product of the transformed data (so there is another transformation before the RBF, which means the RBF implicitly transforms the data to a higher space, but how?).

Another thing: the kernel is a function k(x, x1): (R^n)^2 -> R with k(x, x1) = g(x) . g(x1), where g is a transformation function. How is g defined in the case of the RBF kernel?

Suppose we are on the test set. What I understand is that x is the sample to be classified and x1 is a support vector (because only the support vectors are used to compute the hyperplane). In the case of the RBF kernel, k(x, x1) = exp(-||x - x1||^2 / (2 sigma^2)), so where is the transformation?

Last question: granted that the RBF kernel maps to a higher dimension m, is it possible to exhibit this m? I want to see the theoretical reality.

I want to implement an SVM with the RBF kernel. What is m here, and how do I choose it? How is the kernel trick implemented in practice?

asked Nov 20 '25 by Jeanne


1 Answer

Could you explain, with a concrete example, how the RBF kernel maps each original data sample x1 = (x11, x12, x13, ..., x1n) to a higher-dimensional space (of dimension m) as X1 = (X11, X12, X13, ..., X1m)? Also, what I understand is that the kernel computes the inner product of the transformed data (so there is another transformation before the RBF, which means the RBF implicitly transforms the data to a higher space, but how?).

Exactly as you said - the kernel is the inner product in the projected space, not the projection itself. The whole trick is that you never actually transform your data, because doing so would be computationally too expensive.
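
To see what "kernel = inner product in the projected space" means concretely, here is a minimal sketch (my addition, not part of the original answer). It uses the degree-2 polynomial kernel rather than the RBF, because its feature map is finite-dimensional and can be written out explicitly; the kernel value equals the inner product of the explicitly mapped points.

import numpy as np

def poly2_kernel(x, y):
    # Degree-2 homogeneous polynomial kernel: k(x, y) = (x . y)^2.
    return np.dot(x, y) ** 2

def phi(x):
    # Explicit feature map for this kernel on R^2: R^2 -> R^3.
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Both print 1.0: the kernel computes the feature-space inner product
# without ever constructing phi(x) or phi(y).
print(poly2_kernel(x, y))
print(np.dot(phi(x), phi(y)))

The RBF kernel plays the same game, except that its feature space is infinite-dimensional, as explained below.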

Another thing: the kernel is a function k(x, x1): (R^n)^2 -> R with k(x, x1) = g(x) . g(x1), where g is a transformation function. How is g defined in the case of the RBF kernel?

For the RBF kernel, g is actually a mapping from R^n into a space of continuous functions (L2), and each point is mapped to an unnormalized Gaussian density with mean x and variance sigma^2. Thus (up to a normalizing constant A that we will drop)

g(x) = N(x, sigma^2)[z] / A # notice this is not a number but a function of z!

and the inner product in this space of functions is the integral of the product over the whole domain, thus

K(x, y) = <g(x), g(y)> 
        = INT_{R^n} N(x, sigma^2)[z] N(y, sigma^2)[z] / A^2 dz 
        = B exp(-||x-y||^2 / (4*sigma^2))

where B is a constant (normalization) factor depending solely on sigma^2, so we can drop it for computational simplicity (scaling does not really matter here). Note that the bandwidth comes out as 4*sigma^2, not 2*sigma^2: mapping points to Gaussians of variance sigma^2 yields an RBF kernel with parameter sqrt(2)*sigma, and relabeling sigma (or mapping to Gaussians of variance sigma^2/2) recovers the usual exp(-||x-y||^2 / (2*sigma^2)) form.
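
A quick 1-D numerical check of this identity (my addition; the points x, y and the value of sigma are purely illustrative):

import numpy as np

sigma = 0.8
x, y = 0.3, 1.1                      # two 1-D input points

# g(x) maps a point to an unnormalized Gaussian bump centred at x.
z = np.linspace(-20.0, 20.0, 200_001)
dz = z[1] - z[0]
gx = np.exp(-(z - x) ** 2 / (2 * sigma ** 2))
gy = np.exp(-(z - y) ** 2 / (2 * sigma ** 2))

# L2 inner product <g(x), g(y)>: integrate the pointwise product.
inner = np.sum(gx * gy) * dz

# Closed form: B * exp(-(x - y)^2 / (4 sigma^2)), with B = sqrt(pi) * sigma in 1-D.
closed = np.sqrt(np.pi) * sigma * np.exp(-(x - y) ** 2 / (4 * sigma ** 2))

print(inner, closed)                  # the two values agree to numerical precision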

Suppose we are on the test set. What I understand is that x is the sample to be classified and x1 is a support vector (because only the support vectors are used to compute the hyperplane). In the case of the RBF kernel, k(x, x1) = exp(-||x - x1||^2 / (2 sigma^2)), so where is the transformation?

As said before - the transformation is never used explicitly. You simply show that the inner product of your hyperplane with the transformed point can again be expressed as inner products with the support vectors, so you never transform anything; you just use kernels:

<w, g(x)> = < SUM_{i=1}^N alpha_i y_i g(sv_i), g(x)> 
          = SUM_{i=1}^N alpha_i y_i <g(sv_i), g(x)>
          = SUM_{i=1}^N alpha_i y_i K(sv_i, x)

where sv_i is the i-th support vector, alpha_i is the per-sample weight (Lagrange multiplier) found during the optimization process, and y_i is the label of the i-th support vector.
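
Here is an illustrative sketch of exactly this identity (my addition, assuming scikit-learn and a toy two-moons dataset): a fitted SVC stores alpha_i * y_i in dual_coef_ and the sv_i in support_vectors_, so the kernel-only sum reproduces decision_function.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 0.5                                   # gamma = 1 / (2 sigma^2)
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

x_new = np.array([[0.5, 0.25]])               # a point to classify

# SUM_i alpha_i y_i K(sv_i, x) + b, computed by hand from the stored
# support vectors; no feature-space transformation anywhere.
K = rbf_kernel(clf.support_vectors_, x_new, gamma=gamma)
manual = clf.dual_coef_ @ K + clf.intercept_

print(manual.ravel(), clf.decision_function(x_new))  # identical values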

Last question: granted that the RBF kernel maps to a higher dimension m, is it possible to exhibit this m? I want to see the theoretical reality.

In this case m is infinite, as your new space is the space of continuous functions R^n -> R; a single vector (a function) is defined by a continuum of values - one for each possible input coming from R^n (it is a simple set-theoretic result that R^n, for any positive n, has the cardinality of the continuum). Thus in terms of pure mathematics m = |R|, which in set theory is the so-called Beth_1 (https://en.wikipedia.org/wiki/Beth_number).
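
A standard textbook way to glimpse this infinite dimensionality (my addition; shown for sigma^2 = 1) is to expand the kernel with the Taylor series of exp:

K(x, y) = exp(-||x - y||^2 / 2)
        = exp(-||x||^2 / 2) exp(-||y||^2 / 2) exp(<x, y>)
        = exp(-||x||^2 / 2) exp(-||y||^2 / 2) SUM_{k=0}^inf <x, y>^k / k!

Each term <x, y>^k / k! is a polynomial kernel of degree k with a finite-dimensional feature map, so g(x) can also be written as an infinite stack of polynomial features weighted by exp(-||x||^2 / 2) - already infinitely many coordinates, with the function-space view above giving the full continuum picture.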

I want to implement an SVM with the RBF kernel. What is m here, and how do I choose it? How is the kernel trick implemented in practice?

You do not choose m; it is defined by the kernel itself. Implementing the kernel trick in practice requires expressing all of your optimization routines in a form where training points appear solely inside inner products, and then replacing those inner products with kernel calls. That is far too long to describe in full in a Stack Overflow answer.
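
That said, the idea is easy to see with an off-the-shelf solver. A minimal sketch (my addition, assuming scikit-learn): with kernel="precomputed" the optimizer receives only the Gram matrix of pairwise kernel values, never any feature-space coordinates.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The solver only ever sees K[i, j] = K(x_i, x_j); the data itself
# is never mapped into the (infinite-dimensional) feature space.
gamma = 1.0
K_train = rbf_kernel(X_tr, X_tr, gamma=gamma)
clf = SVC(kernel="precomputed").fit(K_train, y_tr)

# At test time, only kernels between test and *training* points are needed.
K_test = rbf_kernel(X_te, X_tr, gamma=gamma)
print(clf.score(K_test, y_te))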

answered Nov 23 '25 by lejlot


