How can a machine learning model handle unseen data and unseen labels?

I am trying to solve a text classification problem. I have a limited set of labels that capture the category of my text data; if incoming text doesn't fit any of them, it should be tagged as 'Other'. In the example below, I built a classifier that labels text as 'breakfast' or 'italian'. In the test set I included a couple of samples that fit neither training label, and that is the challenge I'm facing: ideally, I want the model to say 'Other' for 'i like hiking' and 'everyone should understand maths'. How can I do this?

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer

X_train = np.array(["coffee is my favorite drink",
                    "i like to have tea in the morning",
                    "i like to eat italian food for dinner",
                    "i had pasta at this restaurant and it was amazing",
                    "pizza at this restaurant is the best in nyc",
                    "people like italian food these days",
                    "i like to have bagels for breakfast",
                    "olive oil is commonly used in italian cooking",
                    "sometimes simple bread and butter works for breakfast",
                    "i liked spaghetti pasta at this italian restaurant"])

y_train_text = ["breakfast","breakfast","italian","italian","italian",
                "italian","breakfast","italian","breakfast","italian"]

X_test = np.array(['this is an amazing italian place. i can go there every day',
                   'i like this place. i get great coffee and tea in the morning',
                   'bagels are great here',
                   'i like hiking',
                   'everyone should understand maths'])

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),  # bag-of-words token counts
    ('tfidf', TfidfTransformer()),      # reweight counts by tf-idf
    ('clf', MultinomialNB())])          # multinomial Naive Bayes

classifier.fit(X_train, y_train_text)
predicted = classifier.predict(X_test)
proba = classifier.predict_proba(X_test)
print(predicted)
print(proba)

['italian' 'breakfast' 'breakfast' 'italian' 'italian']
[[0.25099411 0.74900589]
 [0.52943091 0.47056909]
 [0.52669142 0.47330858]
 [0.42787443 0.57212557]
 [0.4        0.6       ]]

I consider the 'Other' category to be noise, so I cannot model this category.

asked Sep 17 '18 by Prasanth Regupathy


4 Answers

I think Kalsi might have suggested this, but it was not clear to me. You could define a confidence threshold for your classes: if the predicted probability does not reach the threshold for any class ('italian' and 'breakfast' in your example), the sample could not be classified with confidence and you assign it the 'other' "class".

I say "class" because other is not exactly a class. You probably don't want your classifier to be good at predicting "other", so this confidence threshold might be a good approach.
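
A minimal sketch of that idea, reusing the pipeline from the question (the 0.7 cutoff is an arbitrary placeholder you would tune on held-out data):

import numpy as np

threshold = 0.7  # assumed cutoff; tune on validation data
proba = classifier.predict_proba(X_test)

# Predict normally, but fall back to 'other' when no class is confident enough
best = proba.argmax(axis=1)
predicted = np.where(proba.max(axis=1) >= threshold,
                     classifier.classes_[best],
                     'other')
print(predicted)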

answered by Fabio Picchi


You cannot do that.

You have trained the model to predict only two labels, i.e. breakfast or italian, so the model has no idea about a third or fourth label.

You and I know that "i like hiking" is neither breakfast nor italian, but how would the model know that? It only knows breakfast and italian. So there has to be a way to tell the model: if you get confused between breakfast and italian, predict the label other.

You could achieve this by training a model that has other as a label, with some texts like "i like hiking" etc.

But in your case, a little hack can be done as follows.


So what does it mean when a model predicts a label with a probability of 0.5 (or approximately 0.5)? It means the model is confused between the labels breakfast and italian, and here you can take advantage of that.

You can take all the predicted probability values and assign the label other whenever the probability falls between 0.45 and 0.55. In this way you can predict the other label (obviously with some errors) without the model ever knowing there is a label called other.
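
A short sketch of that hack on top of the question's pipeline. With two classes the probabilities sum to 1, so checking whether the larger one stays below 0.55 covers the whole 0.45–0.55 band (the band is the suggestion above, not a tuned value):

import numpy as np

proba = classifier.predict_proba(X_test)
top = proba.max(axis=1)

# A near-even split (both classes around 0.5) is treated as 'other'
predicted = np.where(top < 0.55,
                     'other',
                     classifier.classes_[proba.argmax(axis=1)])
print(predicted)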

answered by Kalsi


You can try setting class priors when creating the MultinomialNB. You could create a dummy "Other" training example and then set the prior for Other high enough that instances default to Other when there isn't enough evidence to select the other classes.
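
A rough sketch of that idea, reusing the training data from the question. The dummy text and the prior values are placeholders to experiment with, and class_prior must follow the sorted label order, here ['Other', 'breakfast', 'italian']:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical dummy example standing in for the 'Other' class
X_train_ext = np.append(X_train, "something entirely unrelated")
y_train_ext = y_train_text + ["Other"]

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # Priors in sorted label order: 'Other', 'breakfast', 'italian'
    ('clf', MultinomialNB(class_prior=[0.5, 0.25, 0.25]))])

classifier.fit(X_train_ext, y_train_ext)
print(classifier.predict(X_test))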

answered by Pascal Soucy


No, you cannot do that.

You have to define a third category, "other" (or whatever name suits you), and give your model some data for that category, as sketched below. Make sure the number of training examples for all three categories is roughly equal; otherwise "other", being a very broad category, could skew your model towards the "other" category.
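
For example, a minimal sketch with made-up 'other' examples; in practice you would collect many more, roughly balanced with the two existing classes:

import numpy as np

# Hypothetical out-of-domain examples for the 'other' category
X_other = np.array(["i went hiking last weekend",
                    "maths is a beautiful subject",
                    "the weather has been cold this week"])
y_other = ["other"] * len(X_other)

classifier.fit(np.concatenate([X_train, X_other]),
               y_train_text + y_other)
print(classifier.predict(X_test))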

Another way to approach this is to extract noun phrases from all your sentences, for every category including other, and feed those into the model; think of it as a feature-selection step for your machine learning model (see the sketch below). This removes the noise added by irrelevant words and can perform better than plain tf-idf.
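
A small sketch of that preprocessing step using spaCy, which is one possible noun-phrase extractor (this assumes the en_core_web_sm model is installed; the choice of library is mine, not the answer's):

import spacy

nlp = spacy.load("en_core_web_sm")

def noun_phrases(text):
    # Keep only the noun chunks, e.g. "italian food", "this restaurant"
    return " ".join(chunk.text for chunk in nlp(text).noun_chunks)

# Train on the reduced representation instead of the raw sentences
X_train_np = [noun_phrases(t) for t in X_train]
classifier.fit(X_train_np, y_train_text)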

If you have huge amounts of data, go for deep learning models, which do feature selection automatically.

Don't go with the approach of manipulating probabilities yourself. A 50-50 probability means the model is confused between the two classes which you have defined; it has no idea about a third "other" class. Let's say the sentence is "I want italian breakfast": the model will be confused about whether this sentence belongs to the "italian" or the "breakfast" category, but that doesn't mean it belongs to the "other" category.

answered by GraphicalDot