Scikit-Learn: Label not x is present in all training examples

I'm trying to do multilabel classification with SVM. I have nearly 8k features, and each sample's y vector has length nearly 400. My Y vectors are already binarized, so I didn't use MultiLabelBinarizer(); but when I run it on the raw form of my Y data instead, I still get the same result.

I'm running this code:

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.genfromtxt('data_X', delimiter=";")
Y = np.genfromtxt('data_y', delimiter=";")

# First 2600 rows for training, one held-out row for testing.
training_X = X[:2600, :]
training_y = Y[:2600, :]

test_sample = X[2600:2601, :]
test_result = Y[2600:2601, :]

classif = OneVsRestClassifier(SVC(kernel='rbf'))
classif.fit(training_X, training_y)
print(classif.predict(test_sample))
print(test_result)

After the fitting process, when it comes to the prediction part, it warns "Label not x is present in all training examples" (x is a few different numbers in the range of my y vector length, which is 400). It then outputs a predicted y vector that is always the zero vector of length 400. I'm new to scikit-learn and to machine learning in general, and I can't figure out the problem here. What's wrong, and what should I do to fix it? Thanks.

asked Jan 02 '16 by malisit


1 Answer

There are 2 problems here:

1) The missing label warning
2) You are getting all 0's for predictions

The warning means that some of your classes are missing from the training data. This is a common problem. If you have 400 classes, then some of them must only occur very rarely, and on any split of the data, some classes may be missing from one side of the split. There may also be classes that simply don't occur in your data at all. You could try Y.sum(axis=0).all() and if that is False, then some classes do not occur even in Y. This all sounds horrible, but realistically, you aren't going to be able to correctly predict classes that occur 0, 1, or any very small number of times anyway, so predicting 0 for those is probably about the best you can do.
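That diagnostic can be sketched on a toy indicator matrix (hypothetical data, not the asker's):

```python
import numpy as np

# Toy multilabel indicator matrix: 4 samples, 5 classes.
# Classes 1 and 4 have no positive example anywhere.
Y = np.array([
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
])

# False here, because some column sums are zero.
print(bool(Y.sum(axis=0).all()))

# Which classes never occur at all:
missing = np.where(Y.sum(axis=0) == 0)[0]
print(missing)
```

If `missing` is non-empty even on your full Y, no classifier can learn those classes from this data.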

As for the all-0 predictions, I'll point out that with 400 classes, probably all of your classes occur much less than half the time. You could check Y.mean(axis=0).max() to get the highest label frequency. With 400 classes, it might only be a few percent. If so, a binary classifier that has to make a 0-1 prediction for each class will probably pick 0 for all classes on all instances. This isn't really an error, it is just because all of the class frequencies are low.
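The frequency check looks like this on a small made-up matrix:

```python
import numpy as np

# Toy indicator matrix: even the most common class here occurs only
# half the time. With 400 real classes the max is likely far lower,
# which is why per-class 0-1 classifiers can default to all zeros.
Y = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])
print(Y.mean(axis=0))        # per-class frequency: [0.5, 0.0, 0.25, 0.0]
print(Y.mean(axis=0).max())  # 0.5
```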

If you know that each instance has a positive label (at least one), you could get the decision values (clf.decision_function) and pick the class with the highest one for each instance. You'll have to write some code to do that, though.
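A minimal sketch of that idea, on synthetic data (the variable names and the 0.15 label rate are illustrative, not from the question): wherever the classifier predicts no label at all, force on the class with the highest decision value.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic multilabel problem with deliberately rare labels.
rng = np.random.RandomState(0)
X = rng.randn(60, 8)
Y = (rng.rand(60, 5) < 0.15).astype(int)

clf = OneVsRestClassifier(SVC(kernel='rbf'))
clf.fit(X, Y)

pred = clf.predict(X)
scores = clf.decision_function(X)   # shape (n_samples, n_classes)

# For rows predicted all-zero, turn on the highest-scoring class.
empty = pred.sum(axis=1) == 0
pred[empty, scores[empty].argmax(axis=1)] = 1
```

After the fix-up, every instance has at least one positive label.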

I once had a top-10 finish in a Kaggle contest that was similar to this. It was a multilabel problem with ~200 classes, none of which occurred with even a 10% frequency, and we needed 0-1 predictions. In that case I got the decision values and took the highest one, plus anything that was above a threshold. I chose the threshold that worked the best on a holdout set. The code for that entry is on Github: Kaggle Greek Media code. You might take a look at it.
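The argmax-plus-threshold rule can be sketched like this (a hypothetical helper, not the actual contest code; in practice you would sweep the threshold and keep the one that scores best on a holdout set):

```python
import numpy as np

def predict_with_threshold(scores, threshold):
    """Predict every class above the threshold, plus the argmax class,
    so each instance gets at least one positive label."""
    pred = (scores > threshold).astype(int)
    pred[np.arange(len(scores)), scores.argmax(axis=1)] = 1
    return pred

# Two instances, three classes.
scores = np.array([[-1.2,  0.3, -0.5],
                   [-0.9, -0.4, -0.1]])
print(predict_with_threshold(scores, 0.0))
# [[0 1 0]
#  [0 0 1]]   <- second row clears no threshold, so only its argmax is on
```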

If you made it this far, thanks for reading. Hope that helps.

answered Oct 22 '22 by Dthal