
UserWarning: Label not :NUMBER: is present in all training examples

I am doing multilabel classification, where I try to predict correct labels for each document and here is my code:

mlb = MultiLabelBinarizer()
X = dataframe['body'].values 
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True, 
                                   stop_words='english', 
                                   max_df = 0.8, 
                                   min_df = 10)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

predicted = cross_val_predict(classifier, X, y)

When running my code I get multiple warnings:

UserWarning: Label not :NUMBER: is present in all training examples.

When I print out the predicted and true labels, about half of all documents have empty label predictions.

Why is this happening, is it related to warnings it prints out while training is running? How can I avoid those empty predictions?


EDIT01: This is also happening when using other estimators than LinearSVC().

I've tried RandomForestClassifier() and it gives empty predictions as well. The strange thing is that when I use cross_val_predict(classifier, X, y, method='predict_proba') to predict probabilities for each label instead of binary 0/1 decisions, there is always at least one label per predicted set with probability > 0 for the given document. So I don't know why is this label not chosen with binary decisioning? Or is binary decisioning evaluated in different way than probabilities?

EDIT02: I have found an old post where OP was dealing with similar problem. Is this the same case?

PeterB asked Mar 15 '17 21:03



2 Answers

Why is this happening, is it related to warnings it prints out while training is running?

The issue is likely that some tags occur in only a few documents (check out this thread for details). When you split the dataset into train and test sets to validate your model, some tags may be missing from the training data. Let train_indices be an array with the indices of the training samples. If a particular tag (of index k) does not occur in the training sample, all the elements in the k-th column of the indicator matrix y[train_indices] are zeros.
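The all-zeros column described above can be checked directly. The following is a minimal sketch, not part of the original code: missing_labels_per_fold is a hypothetical helper, and it assumes contiguous, unshuffled folds, which may differ from the splitter your call to cross_val_predict actually uses. Note that the warning text reads as "the label 'not k' is present in all training examples", i.e. tag k is absent from that training split.

```python
import numpy as np

def missing_labels_per_fold(y, n_folds=3):
    """For each fold, list the label columns that are all zeros in the training rows.

    Hypothetical helper: folds are contiguous and unshuffled (an assumption).
    """
    n_samples = y.shape[0]
    folds = np.array_split(np.arange(n_samples), n_folds)
    missing = []
    for test_indices in folds:
        train_mask = np.ones(n_samples, dtype=bool)
        train_mask[test_indices] = False
        # A label is missing when no training row has a 1 in its column.
        absent = np.flatnonzero(y[train_mask].sum(axis=0) == 0)
        missing.append(absent.tolist())
    return missing

# Made-up indicator matrix: 6 documents, 3 labels.
# Label 2 occurs only in the last document.
y = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 1]])

print(missing_labels_per_fold(y, n_folds=3))  # [[], [], [2]]
```

In the last fold the only document carrying label 2 is held out for testing, so the training column for label 2 is all zeros and a classifier fitted on that fold never sees a positive example of it.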

How can I avoid those empty predictions?

In the scenario described above the classifier will not be able to reliably predict the k-th tag in the test documents (more on this in the next paragraph). Therefore you cannot trust the predictions made by clf.predict and you need to implement the prediction function on your own, for example by using the decision values returned by clf.decision_function as suggested in this answer.

So I don't know why is this label not chosen with binary decisioning? Or is binary decisioning evaluated in different way than probabilities?

In datasets containing many labels, the occurrence frequency of most labels tends to be rather low. If such rarely occurring labels are fed to a binary classifier (i.e. a classifier that makes a 0-1 prediction), it is highly probable that the classifier will pick 0 for all tags on all documents.
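To see how rare the labels are, you can count the positives per column of the indicator matrix. This is only a sketch. The matrix below is written out by hand to mirror the tags of the toy questions in the demo further down (columns in alphabetical tag order, rows in the listed order), and the threshold of one occurrence is an arbitrary choice for illustration.

```python
import numpy as np

# Columns: decorator, matlab, matlab-oop, naming-conventions, oop, performance, python
y = np.array([[0, 0, 0, 0, 0, 0, 1],   # python
              [0, 0, 0, 0, 1, 0, 0],   # oop
              [0, 0, 0, 0, 0, 0, 1],   # python
              [1, 0, 0, 0, 0, 0, 1],   # python, decorator
              [0, 1, 0, 1, 0, 0, 0],   # matlab, naming-conventions
              [0, 1, 0, 0, 0, 0, 0],   # matlab
              [0, 0, 0, 0, 0, 1, 0],   # performance
              [0, 0, 1, 0, 0, 0, 0]])  # matlab-oop

label_counts = y.sum(axis=0)             # number of documents per label
label_freq = label_counts / y.shape[0]   # fraction of documents per label

# Labels that occur at most once (illustrative threshold).
rare = np.flatnonzero(label_counts <= 1)

print(label_counts)  # [1 2 1 1 1 1 3]
print(rare)          # [0 2 3 4 5]
```

A label that occurs in a single document is necessarily absent from the training data in the fold where that document is held out, which is why singleton labels are the ones that trigger the warning.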

I have found an old post where OP was dealing with similar problem. Is this the same case?

Yes, absolutely. That guy is facing exactly the same problem as you and his code is pretty similar to yours.


Demo

To further explain the issue, I have put together a simple toy example using mock data.

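import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

Q = {'What does the "yield" keyword do in Python?': ['python'],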
     'What is a metaclass in Python?': ['oop'],
     'How do I check whether a file exists using Python?': ['python'],
     'How to make a chain of function decorators?': ['python', 'decorator'],
     'Using i and j as variables in Matlab': ['matlab', 'naming-conventions'],
     'MATLAB: get variable type': ['matlab'],
     'Why is MATLAB so fast in matrix multiplication?': ['performance'],
     'Is MATLAB OOP slow or am I doing something wrong?': ['matlab-oop'],
    }
dataframe = pd.DataFrame({'body': list(Q.keys()), 'tag': list(Q.values())})

mlb = MultiLabelBinarizer()
X = dataframe['body'].values 
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True, 
                                   stop_words='english', 
                                   max_df=0.8, 
                                   min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

Please notice that I have set min_df=1 since my dataset is much smaller than yours. When I run the following statement:

predicted = cross_val_predict(classifier, X, y)

I get a bunch of warnings:

C:\...\multiclass.py:76: UserWarning: Label not 4 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 0 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 3 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 5 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 2 is present in all training examples.
  str(classes[c]))

and the following prediction:

In [5]: np.set_printoptions(precision=2, threshold=1000)    

In [6]: predicted
Out[6]: 
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])

Those rows whose entries are all 0 indicate that no tag is predicted for the corresponding document.


Workaround

For the sake of the analysis, let us validate the model manually rather than through cross_val_predict.

import warnings
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=1, test_size=.5, random_state=0)
train_indices, test_indices = next(rs.split(X))

with warnings.catch_warnings(record=True) as received_warnings:
    warnings.simplefilter("always")
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    classifier.fit(X_train, y_train)
    predicted_test = classifier.predict(X_test)
    for w in received_warnings:
        print(w.message)

When the snippet above is executed, two warnings are issued (I used a context manager to make sure the warnings are caught):

Label not 2 is present in all training examples.
Label not 4 is present in all training examples.

This is consistent with the fact that tags of indices 2 and 4 are missing from the training samples:

In [40]: y_train
Out[40]: 
array([[0, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1]])

For some documents, the prediction is empty (those documents corresponding to the rows with all zeros in predicted_test):

In [42]: predicted_test
Out[42]: 
array([[0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0]])

To overcome that issue, you could implement your own prediction function like this:

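def get_best_tags(clf, X, lb, n_tags=3):
    # Confidence score of every tag for every document
    decfun = clf.decision_function(X)
    # argsort is ascending, so take the last n_tags columns in reverse order
    best_tags = np.argsort(decfun)[:, :-(n_tags+1): -1]
    # Map column indices back to tag names
    return lb.classes_[best_tags]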

By doing so, each document is always assigned the n_tags tags with the highest confidence score:

In [59]: mlb.inverse_transform(predicted_test)
Out[59]: [('matlab',), (), (), ('matlab', 'naming-conventions')]

In [60]: get_best_tags(classifier, X_test, mlb)
Out[60]: 
array([['matlab', 'oop', 'matlab-oop'],
       ['oop', 'matlab-oop', 'matlab'],
       ['oop', 'matlab-oop', 'matlab'],
       ['matlab', 'naming-conventions', 'oop']], dtype=object)
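If you would rather keep the binary predictions and only repair the empty rows, one possible variation (not the function above, just a sketch) is to switch on the single highest-scoring tag wherever no tag was predicted. Here fill_empty_predictions is a hypothetical helper and the score values are made up; in practice predicted would come from classifier.predict(X_test) and scores from classifier.decision_function(X_test).

```python
import numpy as np

def fill_empty_predictions(predicted, scores):
    """Return a copy of `predicted` where every all-zero row gets its top-scoring label.

    Hypothetical helper: `predicted` is a 0/1 indicator matrix and `scores`
    holds one decision value per (document, label) pair, same shape.
    """
    filled = predicted.copy()
    empty_rows = np.flatnonzero(filled.sum(axis=1) == 0)
    # For each empty row, set the label with the highest score to 1
    # (ties go to the lowest column index, as np.argmax does).
    filled[empty_rows, scores[empty_rows].argmax(axis=1)] = 1
    return filled

# Made-up values for illustration only.
predicted = np.array([[0, 1, 0],
                      [0, 0, 0],
                      [0, 0, 0]])
scores = np.array([[-0.5,  0.3, -0.9],
                   [-0.2, -0.7, -0.4],
                   [-1.0, -0.8, -0.1]])

print(fill_empty_predictions(predicted, scores))
# [[0 1 0]
#  [1 0 0]
#  [0 0 1]]
```

Rows that already have at least one predicted tag are left untouched, so this changes fewer predictions than always returning a fixed number of tags per document.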
Tonechas answered Sep 26 '22 16:09


I too had the same error. Then I used LabelEncoder() instead of MultiLabelBinarizer() to encode the labels.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(Labels)

I am not getting that error anymore.

Vidya P V answered Sep 25 '22 16:09