Why does the decision tree return different solutions for the exact same training data?

I was trying out an ML example and it worked for the most part, but when I ran the code consecutively, Python started spitting out different prediction results. Now I'm no ML expert, but that seems wack?

# Example file from Google Developers: "Hello World - Machine Learning Recipes": YouTube: https://youtu.be/cKxRvEZd3Mw
# Category: Supervised Learning
# January 14, 2018
from sklearn import tree

# Declarations: Texture
bumpy = 0
smooth = 1

# Declarations: Labels
apple = 0
orange = 1

# Step(1): Collect training data
# Features: [Weight, Texture]
features = [[140, smooth], [130, smooth], [150, bumpy], [170, bumpy]]

# labels[i] is the label for the feature row features[i]
labels = [apple, apple, orange, orange]

# Step(2): Train Classifier: Decision Tree
# Create the decision tree object, then fit it to find patterns in the features and labels
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

# Step(3): Make Predictions
# the predict method returns the best-fit label from the decision tree
result = clf.predict([[150, bumpy], [130, smooth], [125.5, bumpy], [110, smooth]])
# result = clf.predict([[150, bumpy]])
print("Step(3): Make Predictions: ")
for x in result:
    if x == 0:
        print("Apple")
    elif x == 1:
        print("Orange")

PrimeTime asked Oct 20 '25

1 Answer

There's an element of randomness in most decision tree implementations, and your training set is very small, which can exaggerate the effect. In scikit-learn, the candidate features are randomly permuted at each split, so when two splits separate the classes equally well (as a weight split and a texture split do with your data), the tie can be broken differently from run to run, giving you a different tree.
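To see this concretely, here's a minimal sketch (my own naming, reusing the question's toy data) that refits the classifier several times; export_text is available in scikit-learn 0.21+ and prints which split each fitted tree chose:

# Sketch, not from the original post: a split on weight and a split on
# texture separate these classes equally well, so the tie may be broken
# differently on each fit.
from sklearn import tree
from sklearn.tree import export_text

features = [[140, 1], [130, 1], [150, 0], [170, 0]]  # [weight, texture]
labels = [0, 0, 1, 1]                                # 0 = apple, 1 = orange

for run in range(5):
    clf = tree.DecisionTreeClassifier()  # no random_state: nondeterministic
    clf.fit(features, labels)
    # [125.5, bumpy] falls on the apple side of a weight split but on the
    # orange side of a texture split, so its prediction may flip per run.
    print(run, clf.predict([[125.5, 0]]))
    print(export_text(clf, feature_names=["weight", "texture"]))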

Try setting random_state to some fixed integer when you create the DecisionTreeClassifier. If you want a repeatable result for testing, you'll need to use the same seed value each time. The scikit-learn example docs use a random seed of zero:

clf = tree.DecisionTreeClassifier(random_state=0)
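Applied to the script in the question (a sketch reusing its variable names), fixing the seed makes the fitted tree, and therefore the printed predictions, identical on every run:

clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(features, labels)
result = clf.predict([[150, bumpy], [130, smooth], [125.5, bumpy], [110, smooth]])
print(result)  # same output on every execution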
Taylor Wood answered Oct 23 '25