
Scikit-learn SVC always giving accuracy 0 on random data cross validation

In the following code I create a random sample set of size 50, with 20 features each. I then generate a random target vector composed of half True and half False values.

All of the values are stored in Pandas objects, since this simulates a real scenario in which the data will be given in that way.

I then perform a manual leave-one-out inside a loop: each time I select an index, drop its row from the data, fit a default SVC on the remaining samples, and finally run a prediction on the left-out sample.

import random
import numpy as np
import pandas as pd
from sklearn.svm import SVC

n_samp = 50
m_features = 20

# Random features, stored in a DataFrame to simulate the real scenario
X_val = np.random.rand(n_samp, m_features)
X = pd.DataFrame(X_val, index=range(n_samp))
# print(X_val)

# Perfectly balanced random target: half True, half False
y_val = [True] * (n_samp // 2) + [False] * (n_samp // 2)
random.shuffle(y_val)
y = pd.Series(y_val, index=range(n_samp))
# print(y_val)

success_count = 0
for idx in y.index:
    clf = SVC()  # Can be inside or outside loop. Result is the same.

    # Leave-one-out for the fitting phase
    loo_X = X.drop(idx)
    loo_y = y.drop(idx)
    clf.fit(loo_X.values, loo_y.values)

    # Make a prediction on the sample that was left out
    pred_X = X.loc[idx:idx]
    pred_result = clf.predict(pred_X.values)
    print(y.loc[idx], pred_result[0])  # Actual value vs. predicted value - always opposite!
    is_success = y.loc[idx] == pred_result[0]
    success_count += 1 if is_success else 0

print('\nSuccess Count:', success_count)  # Almost always 0!

Now here's the strange part: I expect to get an accuracy of about 50%, since this is random data, but instead I almost always get exactly 0! I say almost always because about once every 10 runs of this exact code I get a few correct hits.

What's really crazy to me is that if I choose the answers opposite to those predicted, I will get 100% accuracy. On random data!
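For what it's worth, the same behaviour can be reproduced with scikit-learn's built-in leave-one-out tooling. This is only a minimal sketch, assuming a scikit-learn version that provides sklearn.model_selection, but it should give the same near-zero score:

# Minimal sketch: the same experiment via scikit-learn's LeaveOneOut splitter
# (assumes the sklearn.model_selection API is available).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

n_samp, m_features = 50, 20
X_val = np.random.rand(n_samp, m_features)
y_val = np.array([True] * (n_samp // 2) + [False] * (n_samp // 2))
np.random.shuffle(y_val)

# One accuracy score (0 or 1) per left-out sample; the mean is the LOO accuracy
scores = cross_val_score(SVC(), X_val, y_val, cv=LeaveOneOut())
print('Mean LOO accuracy:', scores.mean())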

What am I missing here?

asked Apr 26 '16 by Shovalt
1 Answer

Ok, I think I just figured it out! It all comes down to our old machine learning foe - the majority class.

In more detail: I chose a target comprising 25 True and 25 False values - perfectly balanced. When performing the leave-one-out, dropping a sample always removes one instance of its own class, creating a class imbalance of, say, 24 True and 25 False. Since the SVC was run with default parameters on random data, it apparently couldn't do better than predicting the majority class - which, by construction, is always the opposite of the left-out sample's class. So every single prediction is wrong, giving 0% accuracy, and flipping the predictions gives 100%.
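As a quick sanity check (a sketch, not part of the original experiment - it reuses the X, y and SVC from the question), you can compare each prediction against the majority class of the corresponding training fold. If the explanation above holds, they should match on essentially every iteration:

from sklearn.svm import SVC

# How often does the fitted SVC simply return the majority class of the
# leave-one-out training fold? (X and y are the DataFrame/Series from the question.)
matches = 0
for idx in y.index:
    loo_X = X.drop(idx)
    loo_y = y.drop(idx)
    clf = SVC().fit(loo_X.values, loo_y.values)

    majority = loo_y.value_counts().idxmax()  # class left in the majority
    pred = clf.predict(X.loc[idx:idx].values)[0]
    matches += int(pred == majority)

print('Predictions equal to the training-fold majority:', matches, 'of', len(y))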

All in all - a good lesson in machine learning, and an excellent mathematical riddle to share with your friends :)

answered Sep 28 '22 by Shovalt