Flink SVM 90% misclassification

I am trying to do binary classification with the flink-ml SVM implementation. When I evaluated the classifier I got an error rate of roughly 85% on the training dataset. I plotted the 3D data and it looked like the data could be separated quite well with a hyperplane.

When I tried to get the weight vector out of the SVM, I only saw the option to get the weight vector without the intercept of the hyperplane, i.e. a hyperplane passing through (0,0,0).

I don't have any clue where the error could be and would appreciate any hints.

import org.apache.flink.api.scala._
import org.apache.flink.ml.classification.SVM
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.preprocessing.Splitter

val env = ExecutionEnvironment.getExecutionEnvironment

// read the CSV; the Boolean is the label, the last three columns are the features
val input: DataSet[(Int, Int, Boolean, Double, Double, Double)] =
  env.readCsvFile(filepathTraining, ignoreFirstLine = true, fieldDelimiter = ";")

// map to LabeledVector with labels +1.0 / -1.0 as the SVM expects
val inputLV = input.map(
  t => LabeledVector(if (t._3) 1.0 else -1.0, DenseVector(Array(t._4, t._5, t._6)))
)

// 80/20 train/test split
val trainTestDataSet = Splitter.trainTestSplit(inputLV, 0.8, precise = true, seed = 100)
val trainLV = trainTestDataSet.training
val testLV = trainTestDataSet.testing

val svm = SVM()
svm.fit(trainLV)

// evaluate expects (vector, trueLabel) pairs and emits (trueLabel, prediction)
val testVD = testLV.map(lv => (lv.vector, lv.label))
val evalSet = svm.evaluate(testVD)

// count the (trueLabel, prediction) combinations:
// false negatives, false positives, true negatives, true positives
evalSet.map(t => (t._1, t._2, 1))
  .groupBy(0, 1)
  .reduce((x1, x2) => (x1._1, x1._2, x1._3 + x2._3))
  .print()

The plotted data is shown here: [Plot of the data]

asked Dec 01 '17 by hucko

1 Answer

The SVM classifier doesn't give you the distance to the origin (a.k.a. bias or threshold) because that is a parameter of the predictor, not of the trained model. Different values of the threshold result in different precision and recall metrics, and the optimum is use-case specific. Usually we use a ROC (Receiver Operating Characteristic) curve to find it.

The relevant parameters on SVM are (from the Flink docs; see the sketch below for how to set them):

  • ThresholdValue - sets the threshold for testing / predicting. Outputs below this value are classified as negative and outputs above as positive. Default is 0.
  • OutputDecisionFunction - set this to true to output the distance to the separating hyperplane instead of the binary class label.
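
A minimal sketch of wiring these up (assuming the setter names setThreshold and setOutputDecisionFunction from the Flink ML SVM parameter list):

  val svm = SVM()
    .setThreshold(0.0)                // classify at distance 0 from the hyperplane
    .setOutputDecisionFunction(true)  // emit raw distances instead of +/-1 labels

  svm.fit(trainLV)

  // with OutputDecisionFunction enabled, evaluate yields (trueLabel, distance)
  val distances = svm.evaluate(testLV.map(lv => (lv.vector, lv.label)))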

ROC Curve

How to find the optimum threshold is an art in itself. Without knowing anything more about the problem, you can always plot the ROC curve (the True Positive Rate against the False Positive Rate) for different values of the threshold and look for the point with the greatest distance from random guessing (the diagonal line TPR = FPR). But ultimately the choice of threshold also depends on the cost of a false positive vs. the cost of a false negative in your domain. Wikipedia has an example ROC curve comparing three different classifiers.

To choose an initial threshold, you could average the difference between the label and the raw prediction (w · x) over the training data (or a sample of it):

  // weightsOption is a DataSet containing the single weight vector
  val weights = svm.weightsOption.get.collect().head

  // average (label - w · x) over the training set: sum the residuals and count them
  val initialThreshold = trainLV.map { lv =>
    (lv.label - (weights dot lv.vector), 1L)
  }.reduce { (avg1, avg2) =>
    (avg1._1 + avg2._1, avg1._2 + avg2._2)
  }.collect() match { case Seq((sum, count)) =>
    sum / count
  }

and then vary it in a loop, measuring the TPR and FPR on the test data.
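
A minimal sketch of that sweep, collecting the raw distances to the driver (fine for a reasonably small test set; the threshold grid here is an arbitrary choice):

  // assumes OutputDecisionFunction was set to true, so evaluate
  // emits (trueLabel, distance) pairs
  val scored = svm.evaluate(testLV.map(lv => (lv.vector, lv.label))).collect()

  val thresholds = (-20 to 20).map(_ / 10.0)
  val rocPoints = thresholds.map { t =>
    val tp = scored.count { case (label, d) => label > 0 && d > t }
    val fn = scored.count { case (label, d) => label > 0 && d <= t }
    val fp = scored.count { case (label, d) => label < 0 && d > t }
    val tn = scored.count { case (label, d) => label < 0 && d <= t }
    // assumes both classes are present in the test set
    (t, tp.toDouble / (tp + fn), fp.toDouble / (fp + tn)) // (threshold, TPR, FPR)
  }
  rocPoints.foreach { case (t, tpr, fpr) => println(s"t=$t TPR=$tpr FPR=$fpr") }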

Other Hyperparameters

Note that the SVM trainer also has parameters (so-called hyperparameters) that need to be tuned for optimal prediction performance. There are many techniques for doing that, and listing them would make this post too long; I just wanted to bring it to your attention. If you're feeling lazy, here's a link on Wikipedia: Hyperparameter optimization.
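
For reference, a sketch of the knobs the Flink SVM trainer exposes (the values here are arbitrary placeholders, not recommendations):

  val svm = SVM()
    .setBlocks(env.getParallelism) // number of data blocks for the CoCoA solver
    .setIterations(100)            // outer iterations over the blocks
    .setLocalIterations(100)       // SDCA iterations within each block
    .setRegularization(0.001)      // regularization constant
    .setStepsize(0.1)              // scales the weight updates
    .setSeed(42)                   // RNG seed for reproducibility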

Other Dimensions?

There is (somewhat of) a hack if you don't want to deal with the threshold right now. You can jam the bias into another dimension of the feature vector like so:

val bias = 10.0 // choose a large value
val inputLV = input.map { t =>
  LabeledVector(
    if (t._3) 1.0 else -1.0,
    // append the constant bias as an extra feature dimension
    DenseVector(Array(t._4, t._5, t._6, bias)))
}
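
With this trick, the weight learned for the extra dimension encodes the intercept: the decision value becomes w1*x1 + w2*x2 + w3*x3 + w4*bias, so the effective offset of the hyperplane is w4 * bias. A sketch of reading it back (assuming weightsOption now holds the 4-dimensional weight vector learned above):

val w = svm.weightsOption.get.collect().head
val intercept = w(3) * bias // offset of the separating hyperplane from the origin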

There are good discussions elsewhere of why you should NOT do this. Basically, the problem is that the bias then participates in regularization, so it gets shrunk toward zero along with the real weights. But in machine learning there are no absolute truths.

answered Oct 19 '22 by g.krastev