 

Difference between a linear problem and a non-linear problem? Essence of the dot product and the kernel trick

The kernel trick maps a non-linear problem into a linear problem.

My questions are:
1. What is the main difference between a linear and a non-linear problem? What is the intuition behind the difference between these two classes of problems? And how does the kernel trick help us use linear classifiers on a non-linear problem?
2. Why is the dot product so important in both cases?

Thanks.

asked Jul 18 '09 by unj2




3 Answers

When people say linear problem with respect to a classification problem, they usually mean linearly separable problem. Linearly separable means that there is some function that can separate the two classes and that is a linear combination of the input variables. For example, if you have two input variables, x1 and x2, there are some numbers theta1 and theta2 such that the function theta1*x1 + theta2*x2, compared against a threshold, is sufficient to predict the output. In two dimensions this corresponds to a straight line, in 3D it becomes a plane, and in higher-dimensional spaces it becomes a hyperplane.
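As a minimal sketch of what such a decision rule looks like in code (the weights and threshold below are made-up illustrative values, not anything from the original answer):

    # A hypothetical linear decision rule in 2D: classify a point by which side of
    # the line theta1*x1 + theta2*x2 = threshold it falls on.
    theta1, theta2, threshold = 2.0, -1.0, 0.5

    def predict(x1, x2):
        score = theta1 * x1 + theta2 * x2
        return 1 if score > threshold else 0

    print(predict(1.0, 0.0))  # score = 2.0  -> class 1
    print(predict(0.0, 1.0))  # score = -1.0 -> class 0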

You can get some kind of intuition about these concepts by thinking about points and lines in 2D/3D. Here's a very contrived pair of examples...

[Image: 2D scatter plot]

This is a plot of a linearly inseparable problem. There is no straight line that can separate the red and blue points.

[Image: 3D scatter plot]

However, if we give each point an extra coordinate (specifically 1 - sqrt(x*x + y*y)... I told you it was contrived), then the problem becomes linearly separable since the red and blue points can be separated by a 2-dimensional plane going through z=0.
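Here is a small sketch of that contrived example (it only assumes NumPy; the sampling ranges are made up so that the red points land inside the unit circle and the blue points outside):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_ring(n, r_min, r_max):
        # Sample n points whose distance from the origin lies in [r_min, r_max].
        r = rng.uniform(r_min, r_max, n)
        angle = rng.uniform(0, 2 * np.pi, n)
        return np.column_stack([r * np.cos(angle), r * np.sin(angle)])

    red = sample_ring(100, 0.0, 0.8)   # inside the unit circle
    blue = sample_ring(100, 1.2, 2.0)  # outside the unit circle

    def lift(points):
        # Add the extra coordinate z = 1 - sqrt(x*x + y*y).
        z = 1 - np.sqrt((points ** 2).sum(axis=1))
        return np.column_stack([points, z])

    # After the lift, the plane z = 0 separates the two classes perfectly.
    print((lift(red)[:, 2] > 0).all())   # True: every red point sits above z = 0
    print((lift(blue)[:, 2] < 0).all())  # True: every blue point sits below z = 0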

Hopefully, these examples demonstrate part of the idea behind the kernel trick:

Mapping a problem into a space with a larger number of dimensions makes it more likely that the problem will become linearly separable.

The second idea behind the kernel trick (and the reason why it is so tricky) is that it is usually very awkward and computationally expensive to work in a very high-dimensional space. However, if an algorithm only uses the dot products between points (which you can think of as distances), then you only have to work with a matrix of scalars. You can implicitly perform the calculations in the higher-dimensional space without ever actually having to do the mapping or handle the higher-dimensional data.
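For example (a rough sketch, assuming scikit-learn and NumPy are installed; the toy data and labels are invented), scikit-learn's SVC can be handed that matrix of scalars directly via its precomputed-kernel mode, so the classifier never touches anything but pairwise dot products:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(20, 2))
    y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # a made-up labelling

    gram_train = X_train @ X_train.T   # all pairwise dot products <x_i, x_j>
    clf = SVC(kernel="precomputed").fit(gram_train, y_train)

    X_test = rng.normal(size=(5, 2))
    gram_test = X_test @ X_train.T     # dot products between test and training points
    print(clf.predict(gram_test))

Swapping X_train @ X_train.T for any kernel function K(x_i, x_j) is all it takes to work in the corresponding higher-dimensional space implicitly.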

answered Sep 18 '22 by Stompchicken


Many classifiers, among them the linear Support Vector Machine (SVM), can only solve problems that are linearly separable, i.e. where the points belonging to class 1 can be separated from the points belonging to class 2 by a hyperplane.

In many cases, a problem that is not linearly separable can be solved by applying a transform phi() to the data points; this transform is said to transform the points to feature space. The hope is that, in feature space, the points will be linearly separable. (Note: This is not the kernel trick yet... stay tuned.)

It can be shown that, the higher the dimension of the feature space, the greater the number of problems that are linearly separable in that space. Therefore, one would ideally want the feature space to be as high-dimensional as possible.

Unfortunately, as the dimension of feature space increases, so does the amount of computation required. This is where the kernel trick comes in. Many machine learning algorithms (among them the SVM) can be formulated in such a way that the only operation they perform on the data points is a scalar product between two data points. (I will denote a scalar product between x1 and x2 by <x1, x2>.)

If we transform our points to feature space, the scalar product now looks like this:

<phi(x1), phi(x2)>

The key insight is that there exists a class of functions called kernels that can be used to optimize the computation of this scalar product. A kernel is a function K(x1, x2) that has the property that

K(x1, x2) = <phi(x1), phi(x2)>

for some function phi(). In other words: We can evaluate the scalar product in the low-dimensional data space (where x1 and x2 "live") without having to transform to the high-dimensional feature space (where phi(x1) and phi(x2) "live") -- but we still get the benefits of transforming to the high-dimensional feature space. This is called the kernel trick.
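To make this concrete, here is a small numerical check (assuming NumPy; the degree-2 polynomial kernel is used here just as one example of such a K):

    import numpy as np

    def phi(x):
        # Explicit feature map for the degree-2 polynomial kernel in 2D:
        # phi([a, b]) = [a^2, sqrt(2)*a*b, b^2]
        a, b = x
        return np.array([a * a, np.sqrt(2) * a * b, b * b])

    def K(x1, x2):
        # Kernel evaluated entirely in the original 2D data space.
        return np.dot(x1, x2) ** 2

    x1 = np.array([1.0, 2.0])
    x2 = np.array([3.0, -1.0])

    print(np.dot(phi(x1), phi(x2)))  # 1.0
    print(K(x1, x2))                 # 1.0 -- same value, no trip to feature space needed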

Many popular kernels, such as the Gaussian kernel, actually correspond to a transform phi() that transforms into an infinite-dimensional feature space. The kernel trick allows us to compute scalar products in this space without having to represent points in this space explicitly (which, obviously, is impossible on computers with finite amounts of memory).
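For instance (a sketch assuming NumPy; gamma is an arbitrary width parameter), the Gaussian kernel is evaluated entirely in the original data space, with no phi() anywhere:

    import numpy as np

    def gaussian_kernel(x1, x2, gamma=0.5):
        # K(x1, x2) = exp(-gamma * ||x1 - x2||^2), the scalar product in an
        # infinite-dimensional feature space that is never constructed explicitly.
        return np.exp(-gamma * np.sum((x1 - x2) ** 2))

    print(gaussian_kernel(np.array([1.0, 2.0]), np.array([2.0, 0.0])))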

answered Sep 19 '22 by Martin B


The main difference (for practical purposes) is: a linear problem either has a solution (and then it's easily found), or you get a definite answer that there is no solution at all. You know this much before you even look at the particular problem. As long as it's linear, you'll get an answer, and quickly.

The intuition behind this is the fact that if you have two straight lines in some space, it's pretty easy to see whether they intersect or not, and if they do, it's easy to know where.

If the problem is not linear -- well, it can be anything, and you know just about nothing.

The dot product of two vectors just means the following: The sum of the products of the corresponding elements. So if your problem is

c1 * x1 + c2 * x2 + c3 * x3 = 0

(where you usually know the coefficients c, and you're looking for the variables x), the left hand side is the dot product of the vectors (c1,c2,c3) and (x1,x2,x3).

The above equation is (pretty much) the very definition of a linear problem, so there's your connection between the dot product and linear problems.
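As a tiny sketch (assuming NumPy; the numbers are arbitrary), that is literally all the dot product is:

    import numpy as np

    c = np.array([2.0, -1.0, 3.0])  # known coefficients (c1, c2, c3)
    x = np.array([1.0, 5.0, 1.0])   # a candidate solution (x1, x2, x3)

    lhs = np.dot(c, x)              # c1*x1 + c2*x2 + c3*x3
    print(lhs)                      # 0.0 -> this x satisfies the linear equation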

answered Sep 21 '22 by balpha