SkLearn model for text classification

Question

I have a multiclass classifier trained with the LinearSVC model from the sklearn library. The model provides a decision_function method, whose output I process with numpy functions to interpret the results.

But I don't understand why this method always distributes the total probability (which in my case is 1) among all of the possible classes.

I expected a different behavior of my classifier.

I mean, for example, suppose that I have a short piece of text like this:

"There are a lot of types of virus and bacterias that cause disease."

But my classifier was trained on three types of text, let's say "maths", "history" and "technology".

So I think it would make sense for each of the three subjects to have a probability very close to zero (and therefore a sum far from 1) when I try to classify that text.

Is there a more appropriate method or model to obtain the results that I just described?

Am I using decision_function the wrong way?

Sometimes a text has nothing to do with any of the subjects used to train the classifier; conversely, the probability could be about 1 for more than one subject.

I think I need some guidance on these issues (text classification, non-binary classification, etc.).

Many thanks in advance for any help!

Parthasarathy Subburaj · Accepted Answer

There are multiple parts to your question; I will try to answer as much as I can.

I don't understand why this method always tries to distribute the total of probabilities?

That is the nature of most ML classifiers: a given example has to be put into some class. Every model has some mechanism for computing a score or probability that a data point belongs to each class, and it predicts whichever class scores highest.
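To make the "highest score wins" rule concrete, here is a minimal sketch on a tiny corpus invented purely for illustration (the texts and labels are assumptions, not your data). Note that LinearSVC's decision_function returns one signed score per class rather than probabilities, so the values need not sum to 1; predict simply takes the class with the largest score.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus, invented for illustration only.
texts = [
    "algebra equations and geometry proofs",
    "calculus integrals and derivatives",
    "ancient rome and the roman empire",
    "the french revolution and napoleon",
    "computers software and programming",
    "smartphones networks and the internet",
]
labels = ["maths", "maths", "history", "history", "technology", "technology"]

vectorizer = TfidfVectorizer()
clf = LinearSVC().fit(vectorizer.fit_transform(texts), labels)

query = vectorizer.transform(["types of virus and bacteria that cause disease"])
scores = clf.decision_function(query)            # shape (1, 3): one signed score per class
best = clf.classes_[np.argmax(scores, axis=1)]   # class with the largest score

print(dict(zip(clf.classes_, scores[0].round(3))))
print(best, clf.predict(query))                  # both name the same class
```

Even for a text about an unrelated subject, one of the three classes still comes out on top, which is the behaviour you observed.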

To address your problem, i.e. examples that don't belong to any of the classes, you could create a pseudo-class called "other" when you train the model. That way, even if a data point doesn't correspond to any of your actual classes (maths, history and technology in your example), it will be binned into the "other" class.
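A rough sketch of the pseudo-class idea follows. The corpus is invented for illustration, and the label name "other" is just a convention; in practice the "other" examples would need to be numerous and varied enough to represent everything outside your real subjects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus, invented for illustration only; the last two texts are the
# catch-all examples that belong to none of the real subjects.
texts = [
    "algebra equations and geometry proofs",
    "calculus integrals and derivatives",
    "ancient rome and the roman empire",
    "the french revolution and napoleon",
    "computers software and programming",
    "smartphones networks and the internet",
    "bacteria and virus infections cause disease",
    "recipes for cooking pasta and baking bread",
]
labels = ["maths", "maths", "history", "history",
          "technology", "technology", "other", "other"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

pred = model.predict(
    ["There are a lot of types of virus and bacterias that cause disease."]
)
print(model.named_steps["linearsvc"].classes_)  # now includes 'other'
print(pred)
```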

As for the problem that a data point could belong to multiple classes:

This is what multi-label classification is typically used for.
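As a minimal sketch of what multi-label training data looks like (the corpus and tag sets are invented for illustration), each text carries a set of tags, MultiLabelBinarizer turns those sets into a 0/1 indicator matrix, and OneVsRestClassifier fits one binary LinearSVC per tag:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy corpus, invented for illustration only. Each text may have several tags.
texts = [
    "algebra equations and geometry proofs",
    "ancient rome and the roman empire",
    "computers software and programming",
    "numerical algebra software for computers",
    "the history of the internet and early computers",
    "ancient greek geometry in the time of euclid",
]
tags = [
    {"maths"},
    {"history"},
    {"technology"},
    {"maths", "technology"},
    {"history", "technology"},
    {"maths", "history"},
]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)   # indicator matrix, one column per tag

model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, Y)

pred = model.predict(["the history of computer algebra"])
print(binarizer.classes_)                  # column order of the indicator matrix
print(binarizer.inverse_transform(pred))   # predicted tags: possibly none, one or several
```

Because every tag is decided independently, a row of the prediction may contain no 1s at all or more than one.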

Venkatachalam · Answer

What you are looking for is a multi-label classification model. The scikit-learn user guide covers multi-label classification and lists the models that support it.

A simple example to demonstrate multi-label classification:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

categories = ['sci.electronics', 'sci.space', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='all',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

# One-hot encode the topic names so that each class becomes its own binary label.
# .toarray() gives a dense matrix on any scikit-learn version (the
# sparse / sparse_output keyword differs between releases).
label_names = [[newsgroups_train.target_names[i]]
               for i in newsgroups_train.target]
X = newsgroups_train.data
y = OneHotEncoder().fit_transform(label_names).toarray()

# One independent LinearSVC per label, so a text can match zero, one or several labels.
model = make_pipeline(TfidfVectorizer(stop_words='english'),
                      MultiOutputClassifier(LinearSVC()))

model.fit(X, y)

print(newsgroups_train.target_names)
# ['sci.electronics', 'sci.space', 'talk.religion.misc']


print(model.predict(['religion followers of jesus']))
# [[0. 0. 1.]]


print(model.predict(['Upper Atmosphere Satellite Research ']))
# [[0. 1. 0.]]


print(model.predict(['There are a lot of types of virus and bacterias that cause disease.']))
# [[0. 0. 0.]]
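
One way to read an all-zero row like the last one: every per-label binary classifier scored the text on the negative side of its hyperplane. The sketch below illustrates this on a small invented corpus rather than the 20 newsgroups data, with OneVsRestClassifier swapped in because it exposes a single decision_function covering all labels. Treat it as an illustration under those assumptions, not as part of the example above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn.svm import LinearSVC

# Toy corpus, invented for illustration only.
texts = [
    "algebra equations and geometry proofs",
    "calculus integrals and derivatives",
    "ancient rome and the roman empire",
    "the french revolution and napoleon",
    "computers software and programming",
    "smartphones networks and the internet",
]
labels = ["maths", "maths", "history", "history", "technology", "technology"]

# A 2-D 0/1 target puts OneVsRestClassifier in multi-label mode,
# where each label is thresholded independently instead of taking the argmax.
Y = LabelBinarizer().fit_transform(labels)

vectorizer = TfidfVectorizer()
clf = OneVsRestClassifier(LinearSVC()).fit(vectorizer.fit_transform(texts), Y)

query = vectorizer.transform(["virus bacteria disease"])
scores = clf.decision_function(query)   # shape (1, 3): one signed score per label
pred = clf.predict(query)               # a label is 1 exactly when its score is > 0

print(scores.round(3))
print(pred)
print("no label matched" if not pred.any() else "at least one label matched")
```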

Tags: python, artificial-intelligence, machine-learning, scikit-learn, text-classification

Alexis Alfonso
