
Is it required to shuffle the training data for SVM multi-classification? [closed]

I am using the Python interface to OpenCV's SVM to classify data into 4 categories. When the training data was ordered by label (all samples of label 1, then label 2, label 3 and label 4), accuracy was low, only about 50%. When I shuffled the training data, the result was reasonable, about 90% correct. So my question is: does the order of the training data affect the final result, and do I need to shuffle the data before training?

Mickey Shine asked Dec 26 '22 15:12


2 Answers

No, the ordering does not change the SVM training itself, although some parameter-tuning methods used in your code can depend on it. For example, if you use cross-validation without randomization, then an ordered set is much harder to handle (each consecutive fold can even have 0 samples of some classes!).

In short:

  • SVM training does not depend on the data ordering
  • Some library-based tools used as "additional methods" (such as parameter tuning) can depend on it
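
To illustrate the cross-validation point, here is a minimal sketch using only the Python standard library. It is not OpenCV's own cross-validation code; the helper names (`contiguous_folds`, `training_class_counts`) and the toy data (4 classes, 25 samples each) are assumptions made for the example. It splits the samples into consecutive blocks, as an unshuffled k-fold would, and reports which classes are absent from the training part of each split.

```python
import random
from collections import Counter

def contiguous_folds(n_samples, k):
    """Split indices 0..n_samples-1 into k consecutive blocks (unshuffled k-fold)."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def training_class_counts(labels, folds):
    """For each held-out fold, count the classes that remain in the training part."""
    counts = []
    for held_out in folds:
        held = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held]
        counts.append(Counter(train))
    return counts

def missing_classes(labels, k):
    """List, per split, the classes that never appear in the training part."""
    all_classes = set(labels)
    counts = training_class_counts(labels, contiguous_folds(len(labels), k))
    return [sorted(all_classes - set(c)) for c in counts]

# Hypothetical data: 4 classes, 25 samples each, stored in label order.
ordered = [c for c in (1, 2, 3, 4) for _ in range(25)]

# Ordered data: every held-out block is exactly one class, so each model
# is tested on a class it never saw during training.
missing_ordered = missing_classes(ordered, 4)
print(missing_ordered)  # [[1], [2], [3], [4]]

# Same labels after a seeded shuffle: the classes are spread across the folds.
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)
missing_shuffled = missing_classes(shuffled, 4)
print(missing_shuffled)
```

With the ordered labels, each split trains on three classes and tests on the fourth, which is the kind of situation that can make tuned parameters (and the reported accuracy) look much worse than they are. Shuffling first, or using a stratified split, avoids it.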
lejlot answered Dec 28 '22 10:12


My answer is No. Based on this page:

Unlike the backpropagation learning algorithm for artificial neural networks, a given SVM will always deterministically converge to the same solution for a given data set, regardless of the initial conditions. For training sets containing less than approximately 5000 points, gradient descent provides an efficient solution to this optimization problem [Campbell and Cristianini, 1999].

First, make sure the feature vectors correspond to their proper labels after shuffling. Also make sure every label has plenty of feature vectors in both of your cases.
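
One way to keep features and labels aligned is to shuffle a list of indices once and apply it to both. The sketch below is a standard-library illustration under assumed names (`shuffle_together` and the toy data are hypothetical, not part of OpenCV); with NumPy arrays the same idea would apply through a single shared index permutation.

```python
import random
from collections import Counter

def shuffle_together(features, labels, seed=None):
    """Return shuffled copies of features and labels using one shared permutation."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)
    return [features[i] for i in order], [labels[i] for i in order]

# Hypothetical toy data: the first feature encodes the label,
# so any feature/label mismatch after shuffling is easy to detect.
features = [[float(label), float(i)] for label in (1, 2, 3, 4) for i in range(3)]
labels = [label for label in (1, 2, 3, 4) for _ in range(3)]

shuffled_features, shuffled_labels = shuffle_together(features, labels, seed=42)

# Check 1: every feature vector still carries its own label.
pairs_intact = all(int(f[0]) == y for f, y in zip(shuffled_features, shuffled_labels))

# Check 2: the number of samples per label is unchanged.
counts_before = Counter(labels)
counts_after = Counter(shuffled_labels)

print(pairs_intact, counts_before == counts_after)
```

Shuffling the two lists separately (two independent calls to a shuffle function) would break the pairing, which is a common cause of accuracy that changes unexpectedly between runs.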

Secondly, you can run your training repeatedly to observe whether the SVM changes. Use exactly the same data set in the same order, without shuffling. In theory it won't change, since a convex optimization problem of this kind should have a unique optimum.

Thirdly, if your training converges very slowly, it may be hitting the maximum number of iterations. Early termination can then cause some apparent randomness in the results.

Last but not least, although the primal solution of an SVM is mathematically unique, the dual solution may be non-unique. This mainly depends on the choice of the bound parameter C. This article analyzes the possible uniqueness of the primal and dual solutions.

lennon310 answered Dec 28 '22 11:12