 

Difference between a linear problem and a non-linear problem? Essence of the dot product and the kernel trick

The kernel trick maps a non-linear problem into a linear problem.

My questions are:
1. What is the main difference between a linear and a non-linear problem? What is the intuition behind the difference between these two classes of problems? And how does the kernel trick help us use linear classifiers on a non-linear problem?
2. Why is the dot product so important in both cases?

Thanks.

asked Jul 18 '09 by unj2




3 Answers

When people say linear problem with respect to a classification problem, they usually mean linearly separable problem. Linearly separable means that there is some function that can separate the two classes and that is a linear combination of the input variables. For example, if you have two input variables, x1 and x2, there are some numbers theta1 and theta2 such that the function theta1*x1 + theta2*x2, compared against a threshold, is sufficient to predict the output. In two dimensions this corresponds to a straight line, in 3D it becomes a plane, and in higher-dimensional spaces it becomes a hyperplane.
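As a minimal sketch of what such a decision rule looks like in code (the weights and threshold below are made-up illustrative values, not anything from the original answer):

    # A hypothetical linear decision rule in 2D: classify a point by which side of
    # the line theta1*x1 + theta2*x2 = threshold it falls on.
    theta1, theta2, threshold = 2.0, -1.0, 0.5

    def predict(x1, x2):
        score = theta1 * x1 + theta2 * x2
        return 1 if score > threshold else 0

    print(predict(1.0, 0.0))  # score = 2.0  -> class 1
    print(predict(0.0, 1.0))  # score = -1.0 -> class 0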

You can get some kind of intuition about these concepts by thinking about points and lines in 2D/3D. Here's a very contrived pair of examples...

[Image: 2D scatter plot]

This is a plot of a linearly inseparable problem. There is no straight line that can separate the red and blue points.

[Image: 3D scatter plot]

However, if we give each point an extra coordinate (specifically 1 - sqrt(x*x + y*y)... I told you it was contrived), then the problem becomes linearly separable since the red and blue points can be separated by a 2-dimensional plane going through z=0.
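Here is a small sketch of that contrived example (it only assumes NumPy; the sampling ranges are made up so that the red points land inside the unit circle and the blue points outside):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_ring(n, r_min, r_max):
        # Sample n points whose distance from the origin lies in [r_min, r_max].
        r = rng.uniform(r_min, r_max, n)
        angle = rng.uniform(0, 2 * np.pi, n)
        return np.column_stack([r * np.cos(angle), r * np.sin(angle)])

    red = sample_ring(100, 0.0, 0.8)   # inside the unit circle
    blue = sample_ring(100, 1.2, 2.0)  # outside the unit circle

    def lift(points):
        # Add the extra coordinate z = 1 - sqrt(x*x + y*y).
        z = 1 - np.sqrt((points ** 2).sum(axis=1))
        return np.column_stack([points, z])

    # After the lift, the plane z = 0 separates the two classes perfectly.
    print((lift(red)[:, 2] > 0).all())   # True: every red point sits above z = 0
    print((lift(blue)[:, 2] < 0).all())  # True: every blue point sits below z = 0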

Hopefully, these examples demonstrate part of the idea behind the kernel trick:

Mapping a problem into a space with a larger number of dimensions makes it more likely that the problem will become linearly separable.

The second idea behind the kernel trick (and the reason why it is so tricky) is that it is usually very awkward and computationally expensive to work in a very high-dimensional space. However, if an algorithm only uses the dot products between points (which you can think of as distances), then you only have to work with a matrix of scalars. You can implicitly perform the calculations in the higher-dimensional space without ever actually having to do the mapping or handle the higher-dimensional data.
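For example (a rough sketch, assuming scikit-learn and NumPy are installed; the toy data and labels are invented), scikit-learn's SVC can be handed that matrix of scalars directly via its precomputed-kernel mode, so the classifier never touches anything but pairwise dot products:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(20, 2))
    y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # a made-up labelling

    gram_train = X_train @ X_train.T   # all pairwise dot products <x_i, x_j>
    clf = SVC(kernel="precomputed").fit(gram_train, y_train)

    X_test = rng.normal(size=(5, 2))
    gram_test = X_test @ X_train.T     # dot products between test and training points
    print(clf.predict(gram_test))

Swapping X_train @ X_train.T for any kernel function K(x_i, x_j) is all it takes to work in the corresponding higher-dimensional space implicitly.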

answered Sep 18 '22 by Stompchicken


Many classifiers, among them the linear Support Vector Machine (SVM), can only solve problems that are linearly separable, i.e. where the points belonging to class 1 can be separated from the points belonging to class 2 by a hyperplane.

In many cases, a problem that is not linearly separable can be solved by applying a transform phi() to the data points; this transform is said to transform the points to feature space. The hope is that, in feature space, the points will be linearly separable. (Note: This is not the kernel trick yet... stay tuned.)

It can be shown that, the higher the dimension of the feature space, the greater the number of problems that are linearly separable in that space. Therefore, one would ideally want the feature space to be as high-dimensional as possible.

Unfortunately, as the dimension of feature space increases, so does the amount of computation required. This is where the kernel trick comes in. Many machine learning algorithms (among them the SVM) can be formulated in such a way that the only operation they perform on the data points is a scalar product between two data points. (I will denote a scalar product between x1 and x2 by <x1, x2>.)

If we transform our points to feature space, the scalar product now looks like this:

<phi(x1), phi(x2)>

The key insight is that there exists a class of functions called kernels that can be used to optimize the computation of this scalar product. A kernel is a function K(x1, x2) that has the property that

K(x1, x2) = <phi(x1), phi(x2)>

for some function phi(). In other words: We can evaluate the scalar product in the low-dimensional data space (where x1 and x2 "live") without having to transform to the high-dimensional feature space (where phi(x1) and phi(x2) "live") -- but we still get the benefits of transforming to the high-dimensional feature space. This is called the kernel trick.
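To make this concrete, here is a small numerical check (assuming NumPy; the degree-2 polynomial kernel is used here just as one example of such a K):

    import numpy as np

    def phi(x):
        # Explicit feature map for the degree-2 polynomial kernel in 2D:
        # phi([a, b]) = [a^2, sqrt(2)*a*b, b^2]
        a, b = x
        return np.array([a * a, np.sqrt(2) * a * b, b * b])

    def K(x1, x2):
        # Kernel evaluated entirely in the original 2D data space.
        return np.dot(x1, x2) ** 2

    x1 = np.array([1.0, 2.0])
    x2 = np.array([3.0, -1.0])

    print(np.dot(phi(x1), phi(x2)))  # 1.0
    print(K(x1, x2))                 # 1.0 -- same value, no trip to feature space needed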

Many popular kernels, such as the Gaussian kernel, actually correspond to a transform phi() that transforms into an infinite-dimensional feature space. The kernel trick allows us to compute scalar products in this space without having to represent points in this space explicitly (which, obviously, is impossible on computers with finite amounts of memory).
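For instance (a sketch assuming NumPy; gamma is an arbitrary width parameter), the Gaussian kernel is evaluated entirely in the original data space, with no phi() anywhere:

    import numpy as np

    def gaussian_kernel(x1, x2, gamma=0.5):
        # K(x1, x2) = exp(-gamma * ||x1 - x2||^2), the scalar product in an
        # infinite-dimensional feature space that is never constructed explicitly.
        return np.exp(-gamma * np.sum((x1 - x2) ** 2))

    print(gaussian_kernel(np.array([1.0, 2.0]), np.array([2.0, 0.0])))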

answered Sep 19 '22 by Martin B


The main difference (for practical purposes) is: a linear problem either has a solution (and then it's easily found), or you get a definite answer that there is no solution at all. You know this much before you even look at the particular problem. As long as it's linear, you'll get an answer, and quickly.

The intuition behind this is the fact that if you have two straight lines in some space, it's pretty easy to see whether they intersect or not, and if they do, it's easy to know where.

If the problem is not linear -- well, it can be anything, and you know just about nothing.

The dot product of two vectors just means the following: The sum of the products of the corresponding elements. So if your problem is

c1 * x1 + c2 * x2 + c3 * x3 = 0

(where you usually know the coefficients c, and you're looking for the variables x), the left hand side is the dot product of the vectors (c1,c2,c3) and (x1,x2,x3).

The above equation is (pretty much) the very definition of a linear problem, so there's your connection between the dot product and linear problems.
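As a tiny sketch (assuming NumPy; the numbers are arbitrary), that is literally all the dot product is:

    import numpy as np

    c = np.array([2.0, -1.0, 3.0])  # known coefficients (c1, c2, c3)
    x = np.array([1.0, 5.0, 1.0])   # a candidate solution (x1, x2, x3)

    lhs = np.dot(c, x)              # c1*x1 + c2*x2 + c3*x3
    print(lhs)                      # 0.0 -> this x satisfies the linear equation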

answered Sep 21 '22 by balpha