UserWarning: Label not :NUMBER: is present in all training examples

I am doing multilabel classification, trying to predict the correct labels for each document. Here is my code:

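from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   stop_words='english',
                                   max_df=0.8,
                                   min_df=10)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

predicted = cross_val_predict(classifier, X, y)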

When running my code I get multiple warnings:

UserWarning: Label not :NUMBER: is present in all training examples.

When I print out the predicted and true labels, about half of all documents have empty label predictions.

Why is this happening? Is it related to the warnings printed while training is running? How can I avoid those empty predictions?

EDIT01: This is also happening when using other estimators than LinearSVC().

I've tried RandomForestClassifier() and it gives empty predictions as well. The strange thing is that when I use cross_val_predict(classifier, X, y, method='predict_proba') to predict probabilities for each label instead of binary 0/1 decisions, there is always at least one label with probability > 0 for a given document. So why is this label not chosen by the binary decision? Or is the binary decision evaluated in a different way than the probabilities?

EDIT02: I have found an old post where the OP was dealing with a similar problem. Is this the same case?

263

asked Mar 15 '17 21:03

PeterB

2 Answers

Why is this happening, is it related to warnings it prints out while training is running?

The issue is likely to be that some tags occur in only a few documents (check out this thread for details). When you split the dataset into train and test sets to validate your model, it may happen that some tags are missing from the training data. Let train_indices be an array with the indices of the training samples. If a particular tag (of index k) does not occur in the training sample, all the elements in the k-th column of the indicator matrix y[train_indices] are zeros.
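
To check whether this is what happens in a given fold, one can count the positive examples per column of the indicator matrix. The following is a minimal sketch using NumPy only; the matrix y and the array train_indices are made-up illustration data, and labels_missing_from is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np

def labels_missing_from(y, indices):
    """Return the column indices of y that are all zeros in the given rows."""
    counts = y[indices].sum(axis=0)   # number of positives per label in the subset
    return np.flatnonzero(counts == 0)

# Made-up indicator matrix: 5 documents, 4 labels.
y = np.array([[1, 0, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1]])
train_indices = np.array([0, 1, 3])   # hypothetical training fold

missing = labels_missing_from(y, train_indices)
print(missing)   # labels that no training document carries
```

Every index reported here corresponds to a tag for which the fold contains no positive example, which is the situation the warning complains about.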

How can I avoid those empty predictions?

In the scenario described above the classifier will not be able to reliably predict the k-th tag in the test documents (more on this in the next paragraph). Therefore you cannot trust the predictions made by clf.predict and you need to implement the prediction function on your own, for example by using the decision values returned by clf.decision_function as suggested in this answer.

So I don't know why is this label not chosen with binary decisioning? Or is binary decisioning evaluated in different way than probabilities?

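In datasets containing many labels, most labels tend to occur in only a small fraction of the documents. Each binary classifier of the one-vs-rest scheme (i.e. a classifier that makes a 0-1 prediction for a single tag) is therefore trained mostly on negative examples, so it is highly probable that it will pick 0 for its tag on most documents. As far as I know, predict decides each label independently against a fixed cut-off (a decision value above 0, or a probability above 0.5), so a label can have the highest probability for a document and still not be chosen.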
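
The gap between a per-label binary decision and a ranking by probability can be seen on plain numbers. The sketch below uses made-up probabilities and assumes 0.5 as the cut-off for the binary decision; it does not call scikit-learn at all.

```python
import numpy as np

# Made-up probabilities for 3 documents and 4 labels.
proba = np.array([[0.10, 0.45, 0.05, 0.20],
                  [0.70, 0.10, 0.60, 0.05],
                  [0.30, 0.02, 0.01, 0.25]])

# Independent per-label decision: a label is kept only if it clears the cut-off.
binary = (proba > 0.5).astype(int)

# Ranking: the most probable label of each row, whatever its actual value.
top_label = proba.argmax(axis=1)

print(binary)      # first and last rows contain no 1 at all
print(top_label)   # yet every row still has a most probable label
```

Rows whose best probability stays below the cut-off end up empty in the binary output, even though a most probable label always exists.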

I have found an old post where OP was dealing with similar problem. Is this the same case?

Yes, absolutely. The author of that post is facing exactly the same problem as you, and the code is very similar to yours.

Demo

To further explain the issue, I have put together a simple toy example using mock data.

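import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

Q = {'What does the "yield" keyword do in Python?': ['python'],
     'What is a metaclass in Python?': ['oop'],
     'How do I check whether a file exists using Python?': ['python'],
     'How to make a chain of function decorators?': ['python', 'decorator'],
     'Using i and j as variables in Matlab': ['matlab', 'naming-conventions'],
     'MATLAB: get variable type': ['matlab'],
     'Why is MATLAB so fast in matrix multiplication?': ['performance'],
     'Is MATLAB OOP slow or am I doing something wrong?': ['matlab-oop'],
    }
# explicit lists so the columns are built the same way on Python 2 and 3
dataframe = pd.DataFrame({'body': list(Q.keys()), 'tag': list(Q.values())})

mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   stop_words='english',
                                   max_df=0.8,
                                   min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])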

Please note that I have set min_df=1 since my dataset is much smaller than yours. When I run the following statement:

predicted = cross_val_predict(classifier, X, y)

I get a bunch of warnings:

C:\...\multiclass.py:76: UserWarning: Label not 4 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 0 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 3 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 5 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 2 is present in all training examples.
  str(classes[c]))

and the following prediction:

In [5]: np.set_printoptions(precision=2, threshold=1000)    

In [6]: predicted
Out[6]: 
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])

Those rows whose entries are all 0 indicate that no tag is predicted for the corresponding document.
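
Such rows can also be located programmatically. A short sketch, with the array shown above copied in by hand so that it runs on its own:

```python
import numpy as np

# The cross-validated predictions printed above, copied verbatim.
predicted = np.array([[0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 1],
                      [0, 0, 0, 0, 0, 0, 1],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 1, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0]])

# A document has an empty prediction when its row sums to zero.
empty_rows = np.flatnonzero(predicted.sum(axis=1) == 0)
empty_fraction = len(empty_rows) / float(len(predicted))

print(empty_rows)       # indices of the documents without any predicted tag
print(empty_fraction)   # share of documents affected
```

Here five of the eight documents receive no tag at all.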

Workaround

For the sake of the analysis, let us validate the model manually rather than through cross_val_predict.

import warnings
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=1, test_size=.5, random_state=0)
# split() returns a generator; take its first (and only) split
train_indices, test_indices = next(rs.split(X))

with warnings.catch_warnings(record=True) as received_warnings:
    warnings.simplefilter("always")
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    classifier.fit(X_train, y_train)
    predicted_test = classifier.predict(X_test)
    for w in received_warnings:
        print(w.message)

When the snippet above is executed, two warnings are issued (I used a context manager to make sure the warnings are caught):

Label not 2 is present in all training examples.
Label not 4 is present in all training examples.

This is consistent with the fact that tags of indices 2 and 4 are missing from the training samples:

In [40]: y_train
Out[40]: 
array([[0, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1]])

For some documents, the prediction is empty (those documents corresponding to the rows with all zeros in predicted_test):

In [42]: predicted_test
Out[42]: 
array([[0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0]])

To overcome that issue, you could implement your own prediction function like this:

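import numpy as np

def get_best_tags(clf, X, lb, n_tags=3):
    decfun = clf.decision_function(X)  # one confidence score per (document, tag)
    # column indices of the n_tags highest scores in each row, best first
    best_tags = np.argsort(decfun)[:, :-(n_tags+1): -1]
    return lb.classes_[best_tags]  # map indices back to tag names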

By doing so, each document is always assigned the n_tags tags with the highest confidence scores:

In [59]: mlb.inverse_transform(predicted_test)
Out[59]: [('matlab',), (), (), ('matlab', 'naming-conventions')]

In [60]: get_best_tags(classifier, X_test, mlb)
Out[60]: 
array([['matlab', 'oop', 'matlab-oop'],
       ['oop', 'matlab-oop', 'matlab'],
       ['oop', 'matlab-oop', 'matlab'],
       ['matlab', 'naming-conventions', 'oop']], dtype=object)
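
A side effect of always returning a fixed number of tags is that documents which already had a confident prediction also receive extra, low-confidence tags. A possible variant, sketched below with NumPy only, keeps the thresholded predictions and falls back to the single best-scoring tag only for rows that would otherwise be empty. It can also exclude tags that never occurred in the training data, since no real classifier was fitted for them. The name predict_with_fallback, the seen_mask parameter and the scores are hypothetical; in practice the scores would come from clf.decision_function(X).

```python
import numpy as np

def predict_with_fallback(scores, threshold=0.0, seen_mask=None):
    """Threshold the scores, then give each empty row its single best label."""
    scores = np.asarray(scores, dtype=float)
    if seen_mask is not None:
        # Labels unseen in training get -inf so they can never be selected.
        scores = np.where(seen_mask, scores, -np.inf)
    pred = (scores > threshold).astype(int)
    empty = pred.sum(axis=1) == 0            # rows with no label above threshold
    best = scores[empty].argmax(axis=1)      # best label for each empty row
    pred[np.flatnonzero(empty), best] = 1
    return pred

# Made-up decision values: 3 documents, 3 labels.
scores = np.array([[ 0.8, -0.3, -0.9],
                   [-0.6, -0.2,  0.0],
                   [-0.4, -0.1, -0.7]])
seen = np.array([True, True, False])         # third label absent from training

result = predict_with_fallback(scores, seen_mask=seen)
print(result)
```

The first row keeps its thresholded prediction unchanged, while the other two rows, which would have been empty, each receive exactly one tag among those seen in training.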

147

answered Sep 26 '22 16:09

Tonechas

I too had the same error. Then I used LabelEncoder() instead of MultiLabelBinarizer() to encode the labels.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(Labels)

I am not getting that error anymore.

answered Sep 25 '22 16:09

Vidya P V

Tags: python, classification, scikit-learn, text-classification, multilabel-classification
