
What is the "random" or non-deterministic factor in SVM probability predictions in e1071 in R?

Tags: r, probability, svm

I'm new to SVM and e1071. I found that the results are different every time I run the exact same code.

For example:

data(iris)
library(e1071)

model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
pred <- predict(model, iris[150,-5], probability = TRUE)
result1 <- as.data.frame(attr(pred, "probabilities"))

model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
pred <- predict(model, iris[150,-5], probability = TRUE)
result2 <- as.data.frame(attr(pred, "probabilities"))

then I got result1 as:

         setosa versicolor virginica
150 0.009704854  0.1903696 0.7999255

and result2 as:

        setosa versicolor virginica
150 0.01006306  0.1749947 0.8149423

and the result keeps changing on every run.

Here I'm using the first 149 rows as the training set and the last row as the test set. The probabilities for each class in result1 and result2 are not exactly the same. I'm guessing there is some step in the process that is "random". How is this happening?

I'm aware that the predicted probabilities can be fixed if I call set.seed() with the same number before each call. I'm not "aiming" for a fixed prediction result; I'm just curious why this happens and what steps are taken to generate the predicted probabilities.

The slight difference doesn't have a big impact on the iris data, since the last sample would still be predicted as "virginica". But when my data (with two classes, A and B) is not that "good", and an unknown sample is predicted to be class A with probability 0.489 in one run and 0.521 in the next, the result is confusing.

Thanks!

Yan asked Oct 19 '22 09:10


1 Answer

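When probability = TRUE, the libsvm code behind svm() runs an internal 5-fold cross-validation to develop the probability estimates, and it shuffles the training rows at random before splitting them into folds. That shuffle draws from R's random number generator (unif_rand()), so each unseeded fit produces slightly different probabilities. The source code for that step starts with: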

// Cross-validation decision values for probability estimates
static void svm_binary_svc_probability(
    const svm_problem *prob, const svm_parameter *param,
    double Cp, double Cn, double& probA, double& probB)
{
    int i;
    int nr_fold = 5;
    int *perm = Malloc(int,prob->l);
    double *dec_values = Malloc(double,prob->l);

    // random shuffle
    GetRNGstate();
    for(i=0;i<prob->l;i++) perm[i]=i;
    for(i=0;i<prob->l;i++)
    {
        int j = i+((int) (unif_rand() * (prob->l-i))) % (prob->l-i);
        swap(perm[i],perm[j]);
    }
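To see what that loop does without reading C++, here is a minimal base-R sketch of the same kind of shuffle. The helper shuffle_indices() is hypothetical (it is not part of e1071), and splitting the shuffled order into contiguous blocks is my assumption about how the folds are formed, since the excerpt stops before that point.

```r
# Hypothetical helper mimicking the shuffle loop in the excerpt above.
# It is an illustration only, not code from e1071 or libsvm.
shuffle_indices <- function(n) {
  perm <- seq_len(n)
  for (i in seq_len(n)) {
    remaining <- n - i + 1
    # pick a position j in i..n using a uniform random draw
    j <- i + floor(runif(1) * remaining) %% remaining
    tmp <- perm[i]; perm[i] <- perm[j]; perm[j] <- tmp
  }
  perm
}

n <- 149       # number of training rows in the question
nr_fold <- 5   # value of nr_fold in the excerpt

set.seed(1)
perm_a <- shuffle_indices(n)
set.seed(2)
perm_b <- shuffle_indices(n)
set.seed(1)
perm_c <- shuffle_indices(n)

# Assumption: folds are contiguous blocks of the shuffled order.
fold_of_position <- ceiling(seq_len(n) * nr_fold / n)
folds_a <- split(perm_a, fold_of_position)
folds_b <- split(perm_b, fold_of_position)

identical(perm_a, perm_c)  # same seed, same shuffle
identical(perm_a, perm_b)  # different seed, different shuffle
```

Different fold memberships mean the probability model is fitted on different cross-validated decision values, which is why the reported probabilities move a little from one unseeded run to the next.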

You can create "predictability" by setting the random seed just before each call to svm(), which is where the shuffle runs:

> data(iris)
> library(e1071)
> set.seed(123)
> model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
> pred <- predict(model, iris[150,-5], probability = TRUE)
> result1 <- as.data.frame(attr(pred, "probabilities"))
> set.seed(123)
> model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
> pred <- predict(model, iris[150,-5], probability = TRUE)
> result2 <- as.data.frame(attr(pred, "probabilities"))
> result1
         setosa versicolor virginica
150 0.009114718  0.1734126 0.8174727
> result2
         setosa versicolor virginica
150 0.009114718  0.1734126 0.8174727
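If the wobble near 0.5 is the worry, it may help to look at the decision values as well. The sketch below assumes, as far as I understand the fitting procedure, that only the probability step depends on the random shuffle, so the decision values should agree between unseeded fits while the probabilities drift. fit_once() is a hypothetical helper written for this illustration.

```r
library(e1071)
data(iris)

train <- iris[-150, ]
test  <- iris[150, -5]

# Hypothetical helper: fit once and return both kinds of output.
fit_once <- function() {
  model <- svm(Species ~ ., data = train, probability = TRUE)
  pred  <- predict(model, test, decision.values = TRUE, probability = TRUE)
  list(decision = attr(pred, "decision.values"),
       prob     = attr(pred, "probabilities"))
}

run1 <- fit_once()
run2 <- fit_once()

# Assumption being checked: decision values do not depend on the shuffle.
all.equal(run1$decision, run2$decision)

# The probabilities are where the run-to-run differences show up.
run1$prob - run2$prob
```

The decision values give a stable basis for ranking or thresholding when a reproducible answer matters more than a calibrated probability.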

But I am reminded of the epigram from Emerson: "A foolish consistency is the hobgoblin of little minds."

IRTFM answered Oct 21 '22 04:10