I have a classifier multiclass, trained using the LinearSVC model provided by Sklearn library.
This model provides a decision_function method, which I use with numpy library functions to interpret correctly the result set.
But, I don't understand why this method always tries to distribute the total of probabilities (which in my case is 1) into between each one of the possibles classes.
I expected a different behavior of my classifier.
I mean, for example, suppose that I have a short piece of text like this:
"There are a lot of types of virus and bacterias that cause disease."
But my classifier was trained with three types of texts, let say "maths", "history" and "technology".
So, I think it has very sense that each of the three subjects has a probability very closed to zero (and therefore far to sum 1) when I try to classify that.
Is there a more appropriate method or model to obtain the results that I just described?
Do I use the wrong way the decision_function?
Sometimes, you may have text that has nothing to do with any of the subjects used to train a classifier or vice versa, it could be a probability about 1 for more than one subject.
I think I need to find some light on these issues (text classification, none binary classification, etc.)
Many thanks in advance for any help!
There are multiple parts to your question I will try to answer as much as I can.
That is the nature of most of the ML models out there, a given example has to be put into some class, and every model has some mechanism to compute the probability that a given data point belongs to a class and whichever class has the highest probability you will be predicting the corresponding class.
To address your problem i.e. the existence of examples doesn't belong to any of the classes you could always create a pseudo-class called others when you train the model, in this way even if your data point doesn't correspond to any of your actual classes e.g.maths, history and technology as per your example it will be binned to the other class.
This is what typically multi-label classification is used for.
What you are looking for is Multi-label classification model. Refer here to know understanding multi-label classification and the list of models that support multi-label classification task.
Simple example to demonstrate multi-label classification:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.preprocessing import OneHotEncoder
categories = ['sci.electronics', 'sci.space', 'talk.religion.misc',]
newsgroups_train = fetch_20newsgroups(subset='all',
remove=('headers', 'footers', 'quotes'),
categories=categories)
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
X, y = newsgroups_train.data, OneHotEncoder(sparse=False)\
.fit_transform([[newsgroups_train.target_names[i]]
for i in newsgroups_train.target])
model = make_pipeline(TfidfVectorizer(stop_words='english'),
MultiOutputClassifier(LinearSVC()))
model.fit(X, y)
print(newsgroups_train.target_names)
# ['sci.electronics', 'sci.space', 'talk.religion.misc']
print(model.predict(['religion followers of jesus']))
# [[0. 0. 1.]]
print(model.predict(['Upper Atmosphere Satellite Research ']))
# [[0. 1. 0.]]
print(model.predict(['There are a lot of types of virus and bacterias that cause disease.']))
# [[0. 0. 0.]]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With