Scikit-Learn: Label not x is present in all training examples

I'm trying to do multilabel classification with SVM. I have nearly 8k features, and each sample's y vector has length nearly 400. My Y vectors are already binarized, so I didn't use MultiLabelBinarizer(); but when I run it on the raw form of my Y data instead, I still get the same result.

I'm running this code:

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.genfromtxt('data_X', delimiter=";")
Y = np.genfromtxt('data_y', delimiter=";")

# First 2600 rows for training, one held-out row for testing.
training_X = X[:2600, :]
training_y = Y[:2600, :]

test_sample = X[2600:2601, :]
test_result = Y[2600:2601, :]

classif = OneVsRestClassifier(SVC(kernel='rbf'))
classif.fit(training_X, training_y)
print(classif.predict(test_sample))
print(test_result)

After the fitting process, when it comes to the prediction part, it warns "Label not x is present in all training examples" (x is a few different numbers in the range of my y vector length, which is 400). It then outputs a predicted y vector that is always the zero vector of length 400. I'm new to scikit-learn and to machine learning in general, and I can't figure out the problem here. What's wrong, and what should I do to fix it? Thanks.

asked Jan 02 '16 by malisit


1 Answer

There are 2 problems here:

1) The missing label warning
2) You are getting all 0's for predictions

The warning means that some of your classes are missing from the training data. This is a common problem. If you have 400 classes, then some of them must only occur very rarely, and on any split of the data, some classes may be missing from one side of the split. There may also be classes that simply don't occur in your data at all. You could try Y.sum(axis=0).all() and if that is False, then some classes do not occur even in Y. This all sounds horrible, but realistically, you aren't going to be able to correctly predict classes that occur 0, 1, or any very small number of times anyway, so predicting 0 for those is probably about the best you can do.
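That diagnostic can be sketched on a toy indicator matrix (hypothetical data, not the asker's):

```python
import numpy as np

# Toy multilabel indicator matrix: 4 samples, 5 classes.
# Classes 1 and 4 have no positive example anywhere.
Y = np.array([
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
])

# False here, because some column sums are zero.
print(bool(Y.sum(axis=0).all()))

# Which classes never occur at all:
missing = np.where(Y.sum(axis=0) == 0)[0]
print(missing)
```

If `missing` is non-empty even on your full Y, no classifier can learn those classes from this data.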

As for the all-0 predictions, I'll point out that with 400 classes, probably all of your classes occur much less than half the time. You could check Y.mean(axis=0).max() to get the highest label frequency. With 400 classes, it might only be a few percent. If so, a binary classifier that has to make a 0-1 prediction for each class will probably pick 0 for all classes on all instances. This isn't really an error, it is just because all of the class frequencies are low.
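The frequency check looks like this on a small made-up matrix:

```python
import numpy as np

# Toy indicator matrix: even the most common class here occurs only
# half the time. With 400 real classes the max is likely far lower,
# which is why per-class 0-1 classifiers can default to all zeros.
Y = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])
print(Y.mean(axis=0))        # per-class frequency: [0.5, 0.0, 0.25, 0.0]
print(Y.mean(axis=0).max())  # 0.5
```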

If you know that each instance has a positive label (at least one), you could get the decision values (clf.decision_function) and pick the class with the highest one for each instance. You'll have to write some code to do that, though.
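A minimal sketch of that idea, on synthetic data (the variable names and the 0.15 label rate are illustrative, not from the question): wherever the classifier predicts no label at all, force on the class with the highest decision value.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic multilabel problem with deliberately rare labels.
rng = np.random.RandomState(0)
X = rng.randn(60, 8)
Y = (rng.rand(60, 5) < 0.15).astype(int)

clf = OneVsRestClassifier(SVC(kernel='rbf'))
clf.fit(X, Y)

pred = clf.predict(X)
scores = clf.decision_function(X)   # shape (n_samples, n_classes)

# For rows predicted all-zero, turn on the highest-scoring class.
empty = pred.sum(axis=1) == 0
pred[empty, scores[empty].argmax(axis=1)] = 1
```

After the fix-up, every instance has at least one positive label.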

I once had a top-10 finish in a Kaggle contest that was similar to this. It was a multilabel problem with ~200 classes, none of which occurred with even a 10% frequency, and we needed 0-1 predictions. In that case I got the decision values and took the highest one, plus anything that was above a threshold. I chose the threshold that worked the best on a holdout set. The code for that entry is on Github: Kaggle Greek Media code. You might take a look at it.
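The argmax-plus-threshold rule can be sketched like this (a hypothetical helper, not the actual contest code; in practice you would sweep the threshold and keep the one that scores best on a holdout set):

```python
import numpy as np

def predict_with_threshold(scores, threshold):
    """Predict every class above the threshold, plus the argmax class,
    so each instance gets at least one positive label."""
    pred = (scores > threshold).astype(int)
    pred[np.arange(len(scores)), scores.argmax(axis=1)] = 1
    return pred

# Two instances, three classes.
scores = np.array([[-1.2,  0.3, -0.5],
                   [-0.9, -0.4, -0.1]])
print(predict_with_threshold(scores, 0.0))
# [[0 1 0]
#  [0 0 1]]   <- second row clears no threshold, so only its argmax is on
```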

If you made it this far, thanks for reading. Hope that helps.

answered Oct 22 '22 by Dthal