Firstly, I am a beginner with Support Vector Machines, so I'm sorry if I am going about this problem in the wrong way. I am trying to implement a very simple SVM from scratch that uses the identity (i.e. linear) kernel function to classify linearly separable data into one of two classes. As an example of the sort of data I will be using, consider the plot in this document.
Using the points (1, 0), (3, 1) and (3, -1) as support vectors, we know that the following holds when calculating the decision plane (screenshotted from the same document):
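(In case the screenshot does not render: under the usual augmented-vector formulation, which is consistent with the multipliers quoted below, each support vector gets a bias entry of 1 appended, i.e. s1 = (1, 0, 1), s2 = (3, 1, 1), s3 = (3, -1, 1), and the multipliers α1, α2, α3 must satisfy

$$
\begin{aligned}
\alpha_1\,\tilde{s}_1\cdot\tilde{s}_1 + \alpha_2\,\tilde{s}_2\cdot\tilde{s}_1 + \alpha_3\,\tilde{s}_3\cdot\tilde{s}_1 &= -1\\
\alpha_1\,\tilde{s}_1\cdot\tilde{s}_2 + \alpha_2\,\tilde{s}_2\cdot\tilde{s}_2 + \alpha_3\,\tilde{s}_3\cdot\tilde{s}_2 &= +1\\
\alpha_1\,\tilde{s}_1\cdot\tilde{s}_3 + \alpha_2\,\tilde{s}_2\cdot\tilde{s}_3 + \alpha_3\,\tilde{s}_3\cdot\tilde{s}_3 &= +1
\end{aligned}
$$

where (1, 0) is the negative-class point and (3, 1), (3, -1) are the positive-class points. Plugging in the dot products gives the linear system 2α1 + 4α2 + 4α3 = -1, 4α1 + 11α2 + 9α3 = 1 and 4α1 + 9α2 + 11α3 = 1.)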
Fiddling with and rearranging these equations gives Lagrange multipliers of -3.5, 0.75 and 0.75 respectively.
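For this particular three-support-vector example, the multipliers can be recovered simply by solving that 3×3 linear system. A minimal sketch in Java (plain Gaussian elimination, no external library; the class name and the hard-coded Gram matrix are just for illustration):

```java
public class TinySvmExample {

    /** Solve A x = b with naive Gaussian elimination and partial pivoting. */
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            // Pivot: swap in the row with the largest entry in this column.
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            }
            double[] tmpRow = a[col]; a[col] = a[pivot]; a[pivot] = tmpRow;
            double tmpB = b[col]; b[col] = b[pivot]; b[pivot] = tmpB;

            // Eliminate the entries below the pivot.
            for (int r = col + 1; r < n; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c < n; c++) a[r][c] -= factor * a[col][c];
                b[r] -= factor * b[col];
            }
        }
        // Back-substitution.
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double sum = b[r];
            for (int c = r + 1; c < n; c++) sum -= a[r][c] * x[c];
            x[r] = sum / a[r][r];
        }
        return x;
    }

    public static void main(String[] args) {
        // Gram matrix of the augmented support vectors (1,0,1), (3,1,1), (3,-1,1).
        double[][] gram = {
            {2,  4,  4},
            {4, 11,  9},
            {4,  9, 11}
        };
        // Right-hand side: the class labels (-1 for (1,0), +1 for the other two).
        double[] labels = {-1, 1, 1};

        double[] alphas = solve(gram, labels);
        // Prints roughly: [-3.5, 0.75, 0.75]
        System.out.println(java.util.Arrays.toString(alphas));
    }
}
```

For a system this small, any linear-algebra routine (or a library such as Apache Commons Math) would do equally well; the catch is that solving a square system like this only works once you already know which points are the support vectors, which is the hard part (see the edit below).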
I understand how this algebra works on paper, but I am unsure of the best approach when it comes to implementation. So my question is as follows: how are an SVM's Lagrange multipliers calculated in practice? Is there an algorithm I am missing that can determine these values for arbitrary linearly separable support vectors? Should I use a standard maths library to solve the linear equations (I am implementing the SVM in Java)? Would such a maths library be slow for large-scale learning? Note that this is a learning exercise, so I'm not just looking for a ready-made SVM library.
Any other advice would be much appreciated!
EDIT 1: LutzL made a good point that half the problem is actually determining which points are to be used as the support vectors, so to keep things simple assume for the purpose of this question that they have already been computed.
The support vector machine is designed to discriminate between data points belonging to two different classes. One set of points is labelled +1 and is called the positive class; the other set is labelled -1 and is called the negative class.
Lagrange's multiplier method: let (x0, y0, z0) ∈ S := {(x, y, z) : g(x, y, z) = 0} with ∇g(x0, y0, z0) ≠ 0. To find the local extrema of a function f on S, solve the system ∇f(x, y, z) = λ∇g(x, y, z) together with g(x, y, z) = 0. Solving these equations gives the values of the unknown variables x, y, z and λ, and the local extremum points are found among those solutions.
For the SVM we want to maximize the Lagrangian with respect to the multipliers α. Each α acts as a penalty: whenever a constraint is not satisfied, the corresponding term increases the Lagrangian, so maximizing L over the α's while minimizing it over w and b drives the solution towards the constrained optimum.
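To make that concrete, here is the standard hard-margin Lagrangian and the dual problem that the multipliers solve (standard textbook form, written out for reference rather than taken from the post above):

$$
L(\mathbf{w}, b, \boldsymbol{\alpha}) = \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{i=1}^{m} \alpha_i\bigl[y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1\bigr], \qquad \alpha_i \ge 0.
$$

Setting the derivatives with respect to w and b to zero gives w = Σᵢ αᵢ yᵢ xᵢ and Σᵢ αᵢ yᵢ = 0; substituting back leaves the dual problem in the α's alone:

$$
\max_{\boldsymbol{\alpha}}\; \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i\alpha_j\, y_i y_j\, \mathbf{x}_i\cdot\mathbf{x}_j
\quad \text{subject to} \quad \alpha_i \ge 0,\ \ \sum_{i=1}^{m} \alpha_i y_i = 0.
$$

Note that in this parameterization the αᵢ are non-negative and the label yᵢ carries the sign, so the -3.5 in the worked example above corresponds to α = 3.5 paired with y = -1.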
Independent of the kernel function, determining the coefficients leads to a quadratic optimization problem with linear positivity constraints. This has horrendous complexity if implemented naively by testing all boundary components, so you cannot avoid more advanced optimization algorithms such as barrier or trust-region methods.
There are also heuristic approaches that try to keep the optimization problem in low dimension by searching for point sets close to the separation line and eliminating points that are most probably far away from it.
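The working-set idea above is essentially what Sequential Minimal Optimization (SMO) does: it repeatedly picks a pair of multipliers and optimizes the dual over just that pair in closed form, which is why it (or a variant of it) sits inside most practical SVM implementations such as LIBSVM. Below is a heavily simplified SMO sketch in Java, modelled on the "simplified SMO" commonly used for teaching; it is meant to show the shape of the computation rather than be production quality, and the random partner selection and stopping rule are deliberately naive:

```java
import java.util.Random;

/** Simplified SMO for a linear (identity) kernel: a teaching sketch, not production code. */
public class SimpleSmo {
    final double[][] x;   // training points (assumes at least two of them)
    final double[] y;     // labels, +1 or -1
    final double[] alpha; // Lagrange multipliers, one per training point
    double b = 0.0;       // bias term
    final double c;       // box constraint (use a large value for a nearly hard margin)
    final double tol;     // KKT violation tolerance
    final Random rng = new Random(42);

    SimpleSmo(double[][] x, double[] y, double c, double tol) {
        this.x = x; this.y = y; this.c = c; this.tol = tol;
        this.alpha = new double[x.length];
    }

    double kernel(double[] u, double[] v) {        // linear kernel: plain dot product
        double s = 0;
        for (int k = 0; k < u.length; k++) s += u[k] * v[k];
        return s;
    }

    double decision(double[] p) {                  // f(p) = sum_i alpha_i * y_i * K(x_i, p) + b
        double s = b;
        for (int i = 0; i < x.length; i++) s += alpha[i] * y[i] * kernel(x[i], p);
        return s;
    }

    void train(int maxPasses) {
        int passes = 0, m = x.length;
        while (passes < maxPasses) {
            int changed = 0;
            for (int i = 0; i < m; i++) {
                double ei = decision(x[i]) - y[i];
                // Only touch alpha_i if it violates its KKT condition by more than tol.
                if ((y[i] * ei < -tol && alpha[i] < c) || (y[i] * ei > tol && alpha[i] > 0)) {
                    int j = rng.nextInt(m - 1); if (j >= i) j++;   // random partner j != i
                    double ej = decision(x[j]) - y[j];
                    double ai = alpha[i], aj = alpha[j];
                    // Bounds keeping alpha_i, alpha_j feasible for sum_i alpha_i * y_i = 0.
                    double lo = (y[i] != y[j]) ? Math.max(0, aj - ai) : Math.max(0, ai + aj - c);
                    double hi = (y[i] != y[j]) ? Math.min(c, c + aj - ai) : Math.min(c, ai + aj);
                    if (lo == hi) continue;
                    double eta = 2 * kernel(x[i], x[j]) - kernel(x[i], x[i]) - kernel(x[j], x[j]);
                    if (eta >= 0) continue;
                    // Closed-form optimum for alpha_j along the constraint line, then clip.
                    alpha[j] = aj - y[j] * (ei - ej) / eta;
                    alpha[j] = Math.max(lo, Math.min(hi, alpha[j]));
                    if (Math.abs(alpha[j] - aj) < 1e-5) continue;
                    alpha[i] = ai + y[i] * y[j] * (aj - alpha[j]);
                    // Update the bias so a free support vector sits exactly on its margin.
                    double b1 = b - ei - y[i] * (alpha[i] - ai) * kernel(x[i], x[i])
                                       - y[j] * (alpha[j] - aj) * kernel(x[i], x[j]);
                    double b2 = b - ej - y[i] * (alpha[i] - ai) * kernel(x[i], x[j])
                                       - y[j] * (alpha[j] - aj) * kernel(x[j], x[j]);
                    if (alpha[i] > 0 && alpha[i] < c) b = b1;
                    else if (alpha[j] > 0 && alpha[j] < c) b = b2;
                    else b = (b1 + b2) / 2;
                    changed++;
                }
            }
            passes = (changed == 0) ? passes + 1 : 0;
        }
    }
}
```

After train() finishes, the points with alpha[i] > 0 are the support vectors, the weight vector can be recovered as w = Σᵢ αᵢ yᵢ xᵢ, and new points are classified by the sign of decision(p). For linearly separable data like the example in the question, choose a large C so the box constraint effectively never binds.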