I am using a classification tree from sklearn
and when I train the model twice using the same data and then predict with the same test data, I get different results. I tried reproducing this on the smaller iris data set and it worked as expected. Here is some code:
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)
clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)
r1 and r2 are the same for this small example, but when I run on my own, much larger data set I get differing results. Is there a reason why this would occur?
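One way to provoke the same symptom on a bigger data set is sketched below; make_classification is used purely as a stand-in for the real data, which is an assumption on my part, and the output may vary from run to run:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a larger data set (the original data is not shown).
X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)
p1 = clf.predict_proba(X_test)

clf.fit(X_train, y_train)
p2 = clf.predict_proba(X_test)

# Often prints False: when several candidate splits are equally good, the tie
# is broken using the classifier's (unseeded) random_state, so the two trees
# can differ and give different probabilities on unseen data.
print(np.array_equal(p1, p2))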
EDIT: After looking into some documentation I see that DecisionTreeClassifier has an input random_state which controls the starting point. By setting this value to a constant I get rid of the problem I was previously having. However, now I'm concerned that my model is not as optimal as it could be. What is the recommended method for choosing it? Try some values at random? Or are all results expected to be about the same?
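For reference, pinning the seed looks like this (a minimal sketch; the value 0 is arbitrary, and load_iris stands in for whatever data is actually being used):

from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# Any fixed integer makes repeated fits deterministic; the value itself is arbitrary.
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)

clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)

# r1 and r2 are now identical on every run.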
The DecisionTreeClassifier works by repeatedly splitting the training data based on the value of some feature. The scikit-learn implementation lets you choose between a few splitting algorithms by providing a value to the splitter keyword argument (the two options are compared in the snippet after these descriptions).
"best" randomly chooses a feature and finds the 'best' possible split for it, according to some criterion (which you can also choose; see the methods signature and the criterion
argument). It looks like the code does this N_feature times, so it's actually quite like a bootstrap.
"random" chooses the feature to consider at random, as above. However, it also then tests randomly-generated thresholds on that feature (random, subject to the constraint that it's between its minimum and maximum values). This may help avoid 'quantization' errors on the tree where the threshold is strongly influenced by the exact values in the training data.
Both of these randomization methods can improve the trees' performance. There are some relevant experimental results in Liu, Ting, and Fan's (2005) KDD paper.
If you absolutely must have an identical tree every time, then I'd re-use the same random_state. Otherwise, I'd expect the trees to end up more or less equivalent every time and, in the absence of a ton of held-out data, I'm not sure how you'd decide which random tree is best.
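If you want some reassurance that the seed isn't changing the quality of the model, one option (a sketch, not an official recommendation) is to cross-validate under a handful of seeds and check that the scores are all close:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Fit the same model under several seeds; if the mean CV scores are close,
# the randomness is not meaningfully changing the tree's quality.
for seed in (0, 1, 2, 3, 4):
    clf = DecisionTreeClassifier(random_state=seed)
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    print(seed, round(scores.mean(), 3))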
See also: Source code for the splitter