What is the theoretical foundation for scikit-learn's dummy classifier?

From the documentation I understand that a dummy classifier can be used as a baseline to test against a real classification algorithm:

This classifier is useful as a simple baseline to compare with other (real) classifiers. Do not use it for real problems.

What does the dummy classifier do when it uses the stratified approach? I know that the documentation says that it:

generates predictions by respecting the training set’s class distribution.

Could anybody give me a more theoretical explanation of why this serves as a benchmark for the performance of a classifier?

asked Apr 04 '15 by john doe
People also ask

What is dummy classifier in Sklearn?

A dummy classifier is a type of classifier which does not generate any insight about the data and classifies the given data using only simple rules.

What is the purpose of dummy classifier?

DummyClassifier makes predictions that ignore the input features. This classifier serves as a simple baseline to compare against other more complex classifiers. The specific behavior of the baseline is selected with the strategy parameter.
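For concreteness, here is a minimal sketch of the strategy parameter in action. The strategy names are from the scikit-learn API; the toy data is made up for illustration:

from sklearn.dummy import DummyClassifier
import numpy as np

X = np.zeros((6, 2))               # features are ignored entirely
y = np.array([0, 0, 0, 0, 1, 1])   # class 0 is the majority

for strategy in ("most_frequent", "prior", "stratified", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X, y)
    print(strategy, clf.predict(X))

# "constant" additionally requires the constant value to predict
clf = DummyClassifier(strategy="constant", constant=1).fit(X, y)
print("constant", clf.predict(X))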

How does training a dummy classifier help in building a model?

A dummy classifier is exactly what it sounds like! It is a classifier model that makes predictions without trying to find patterns in the data. The default model essentially looks at what label is most frequent in the training dataset and makes predictions based on that label.

What classification methods does scikit-learn provide?

This section will introduce three popular classification techniques: Logistic Regression, Discriminant Analysis, and Nearest Neighbor.
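To make that concrete, here is a minimal sketch comparing those three techniques against a DummyClassifier baseline. The synthetic dataset and class weights are assumptions chosen only for illustration:

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# a synthetic, mildly imbalanced binary problem (illustrative only)
X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)

models = {
    "dummy baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "discriminant analysis": LinearDiscriminantAnalysis(),
    "nearest neighbors": KNeighborsClassifier(),
}

# each real classifier should beat the ~0.8 accuracy of the baseline
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))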


2 Answers

The dummy classifier gives you a measure of "baseline" performance--i.e. the success rate one should expect to achieve even if simply guessing.

Suppose you wish to determine whether a given object possesses or lacks a certain property. If you have analyzed a large number of those objects and have found that 90% contain the target property, then guessing that every future instance possesses it gives you a 90% likelihood of guessing correctly. Structuring your guesses this way is equivalent to using the most_frequent strategy in the documentation you cite.

Because many machine learning tasks attempt to increase the success rate of (e.g.) classification tasks, evaluating the baseline success rate establishes a floor that one's classifier should out-perform. In the hypothetical discussed above, you would want your classifier to achieve more than 90% accuracy, because 90% is available even to "dummy" classifiers.

If one trains a dummy classifier with the stratified strategy on the data discussed above, that classifier will predict class 1 with 90% probability for each object it encounters. This is different from training a dummy classifier with the most_frequent strategy, as the latter would guess that all future objects possess the target property. Here's some code to illustrate:

from sklearn.dummy import DummyClassifier
import numpy as np

two_dimensional_values = []
class_labels = []

# build a training set in which 90% of examples have the target property
for i in range(90):
    two_dimensional_values.append([1, 1])
    class_labels.append(1)

for i in range(10):
    two_dimensional_values.append([0, 0])
    class_labels.append(0)

X = np.array(two_dimensional_values)
y = np.array(class_labels)

# train a dummy classifier that always predicts the most frequent class value
dummy_classifier = DummyClassifier(strategy="most_frequent")
dummy_classifier.fit(X, y)

# this produces 100 predictions that say "1"
for row in two_dimensional_values:
    print(dummy_classifier.predict([row]))

# train a dummy classifier that samples from the training class distribution
new_dummy_classifier = DummyClassifier(strategy="stratified")
new_dummy_classifier.fit(X, y)

# this produces roughly 90 guesses of "1" and roughly 10 guesses of "0"
for row in two_dimensional_values:
    print(new_dummy_classifier.predict([row]))
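As a follow-up, note that on accuracy the stratified strategy is actually expected to score below most_frequent: if the positive class has proportion p, a guess drawn from the class distribution matches the true label with probability p^2 + (1 - p)^2, which is 0.9^2 + 0.1^2 = 0.82 here, versus 0.9 for always guessing the majority class. A minimal sketch to check this empirically, continuing from the code above (accuracy_score is standard scikit-learn; the expected values are from the arithmetic just stated):

from sklearn.metrics import accuracy_score

# most_frequent always predicts the majority class: accuracy 0.9 here
print(accuracy_score(y, dummy_classifier.predict(X)))       # 0.9

# stratified guesses "1" with probability 0.9, so a guess matches the
# true label with probability 0.9**2 + 0.1**2 = 0.82 in expectation;
# individual runs will fluctuate around that value
print(accuracy_score(y, new_dummy_classifier.predict(X)))   # roughly 0.82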
answered Oct 24 '22 by duhaime

A major motivation for the dummy classifier is the F-score, when the positive class is in the minority (i.e. imbalanced classes). This classifier is used as a sanity test for an actual classifier: a dummy classifier completely ignores the input data. With the 'most_frequent' strategy, it simply predicts whichever label occurs most often in the training data.
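One hedged way to illustrate that sanity test: on an imbalanced problem, compare the F-score of the dummy baseline with that of a real classifier. The synthetic dataset below is an assumption chosen only for illustration:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic data where the positive class is a ~10% minority
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# most_frequent never predicts the minority class, so its F-score is 0
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("dummy F1:", f1_score(y_test, dummy.predict(X_test)))

# any useful classifier should clear that sanity bar comfortably
real = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("logistic F1:", f1_score(y_test, real.predict(X_test)))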

answered Oct 24 '22 by Avi