What is the theoretical foundation for scikit-learn's dummy classifier?

From the documentation I understand that a dummy classifier can be used as a baseline to test against a real classification algorithm:

This classifier is useful as a simple baseline to compare with other (real) classifiers. Do not use it for real problems.

What does the dummy classifier do when it uses the stratified approach? I know that the documentation says that it:

generates predictions by respecting the training set’s class distribution.

Could anybody give me a more theoretical explanation of why this serves as a benchmark for the performance of a classifier?

asked Apr 04 '15 by john doe
People also ask

What is dummy classifier in Sklearn?

A dummy classifier is a type of classifier which does not generate any insight about the data and classifies the given data using only simple rules.

What is the purpose of dummy classifier?

DummyClassifier makes predictions that ignore the input features. This classifier serves as a simple baseline to compare against other more complex classifiers. The specific behavior of the baseline is selected with the strategy parameter.
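For concreteness, here is a minimal sketch of the strategy parameter in action. The strategy names are from the scikit-learn API; the toy data is made up for illustration:

from sklearn.dummy import DummyClassifier
import numpy as np

X = np.zeros((6, 2))               # features are ignored entirely
y = np.array([0, 0, 0, 0, 1, 1])   # class 0 is the majority

for strategy in ("most_frequent", "prior", "stratified", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X, y)
    print(strategy, clf.predict(X))

# "constant" additionally requires the constant value to predict
clf = DummyClassifier(strategy="constant", constant=1).fit(X, y)
print("constant", clf.predict(X))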

How does training a dummy classifier help in building a model?

A dummy classifier is exactly what it sounds like! It is a classifier model that makes predictions without trying to find patterns in the data. The default model essentially looks at what label is most frequent in the training dataset and makes predictions based on that label.

What classification methods does scikit-learn provide?

This section will introduce three popular classification techniques: Logistic Regression, Discriminant Analysis, and Nearest Neighbor.
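To make that concrete, here is a minimal sketch comparing those three techniques against a DummyClassifier baseline. The synthetic dataset and class weights are assumptions chosen only for illustration:

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# a synthetic, mildly imbalanced binary problem (illustrative only)
X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)

models = {
    "dummy baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "discriminant analysis": LinearDiscriminantAnalysis(),
    "nearest neighbors": KNeighborsClassifier(),
}

# each real classifier should beat the ~0.8 accuracy of the baseline
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))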


2 Answers

The dummy classifier gives you a measure of "baseline" performance--i.e. the success rate one should expect to achieve even if simply guessing.

Suppose you wish to determine whether a given object possesses or lacks a certain property. If you have analyzed a large number of those objects and have found that 90% contain the target property, then guessing that every future instance possesses it gives you a 90% likelihood of guessing correctly. Structuring your guesses this way is equivalent to using the most_frequent strategy in the documentation you cite.

Because many machine learning tasks attempt to increase the success rate of (e.g.) classification tasks, evaluating the baseline success rate establishes a floor that one's classifier should out-perform. In the hypothetical discussed above, you would want your classifier to achieve more than 90% accuracy, because 90% is available even to "dummy" classifiers.

If one trains a dummy classifier with the stratified strategy on the data discussed above, that classifier will predict class 1 with 90% probability for each object it encounters. This is different from training a dummy classifier with the most_frequent strategy, as the latter would guess that all future objects possess the target property. Here's some code to illustrate:

from sklearn.dummy import DummyClassifier
import numpy as np

two_dimensional_values = []
class_labels = []

# build a training set in which 90% of examples have the target property
for i in range(90):
    two_dimensional_values.append([1, 1])
    class_labels.append(1)

for i in range(10):
    two_dimensional_values.append([0, 0])
    class_labels.append(0)

X = np.array(two_dimensional_values)
y = np.array(class_labels)

# train a dummy classifier that always predicts the most frequent class value
dummy_classifier = DummyClassifier(strategy="most_frequent")
dummy_classifier.fit(X, y)

# this produces 100 predictions that say "1"
for row in two_dimensional_values:
    print(dummy_classifier.predict([row]))

# train a dummy classifier that samples from the training class distribution
new_dummy_classifier = DummyClassifier(strategy="stratified")
new_dummy_classifier.fit(X, y)

# this produces roughly 90 guesses of "1" and roughly 10 guesses of "0"
for row in two_dimensional_values:
    print(new_dummy_classifier.predict([row]))
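As a follow-up, note that on accuracy the stratified strategy is actually expected to score below most_frequent: if the positive class has proportion p, a guess drawn from the class distribution matches the true label with probability p^2 + (1 - p)^2, which is 0.9^2 + 0.1^2 = 0.82 here, versus 0.9 for always guessing the majority class. A minimal sketch to check this empirically, continuing from the code above (accuracy_score is standard scikit-learn; the expected values are from the arithmetic just stated):

from sklearn.metrics import accuracy_score

# most_frequent always predicts the majority class: accuracy 0.9 here
print(accuracy_score(y, dummy_classifier.predict(X)))       # 0.9

# stratified guesses "1" with probability 0.9, so a guess matches the
# true label with probability 0.9**2 + 0.1**2 = 0.82 in expectation;
# individual runs will fluctuate around that value
print(accuracy_score(y, new_dummy_classifier.predict(X)))   # roughly 0.82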
answered Oct 24 '22 by duhaime

A major motivation for the dummy classifier is the F-score, when the positive class is in the minority (i.e. imbalanced classes). This classifier is used as a sanity test for an actual classifier: a dummy classifier completely ignores the input data. With the 'most_frequent' strategy, it simply predicts whichever label occurs most often in the training data.
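One hedged way to illustrate that sanity test: on an imbalanced problem, compare the F-score of the dummy baseline with that of a real classifier. The synthetic dataset below is an assumption chosen only for illustration:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic data where the positive class is a ~10% minority
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# most_frequent never predicts the minority class, so its F-score is 0
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("dummy F1:", f1_score(y_test, dummy.predict(X_test)))

# any useful classifier should clear that sanity bar comfortably
real = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("logistic F1:", f1_score(y_test, real.predict(X_test)))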

answered Oct 24 '22 by Avi