I am using a classification tree from sklearn
and when I train the model twice using the same data and then predict with the same test data, I get different results. I tried reproducing this on the smaller iris data set and it worked as expected. Here is some code:
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)
clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)
r1 and r2 are the same for this small example, but when I run on my own, much larger data set I get differing results. Is there a reason why this would occur?
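One way to provoke the same symptom on a bigger data set is sketched below; make_classification is used purely as a stand-in for the real data, which is an assumption on my part, and the output may vary from run to run:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a larger data set (the original data is not shown).
X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)
p1 = clf.predict_proba(X_test)

clf.fit(X_train, y_train)
p2 = clf.predict_proba(X_test)

# Often prints False: when several candidate splits are equally good, the tie
# is broken using the classifier's (unseeded) random_state, so the two trees
# can differ and give different probabilities on unseen data.
print(np.array_equal(p1, p2))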
EDIT: After looking into some documentation I see that DecisionTreeClassifier has an input random_state which controls the starting point. By setting this value to a constant I get rid of the problem I was previously having. However, now I'm concerned that my model is not as optimal as it could be. What is the recommended method for choosing it? Try some values at random? Or are all results expected to be about the same?
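For reference, pinning the seed looks like this (a minimal sketch; the value 0 is arbitrary, and load_iris stands in for whatever data is actually being used):

from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# Any fixed integer makes repeated fits deterministic; the value itself is arbitrary.
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)

clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)

# r1 and r2 are now identical on every run.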
The DecisionTreeClassifier works by repeatedly splitting the training data based on the value of some feature. The scikit-learn implementation lets you choose between a few splitting algorithms by providing a value to the splitter keyword argument (the two options are compared in the snippet after these descriptions).
"best" randomly chooses a feature and finds the 'best' possible split for it, according to some criterion (which you can also choose; see the methods signature and the criterion
argument). It looks like the code does this N_feature times, so it's actually quite like a bootstrap.
"random" chooses the feature to consider at random, as above. However, it also then tests randomly-generated thresholds on that feature (random, subject to the constraint that it's between its minimum and maximum values). This may help avoid 'quantization' errors on the tree where the threshold is strongly influenced by the exact values in the training data.
Both of these randomization methods can improve the trees' performance. There are some relevant experimental results in Liu, Ting, and Fan's (2005) KDD paper.
If you absolutely must have an identical tree every time, then I'd re-use the same random_state. Otherwise, I'd expect the trees to end up more or less equivalent every time and, in the absence of a ton of held-out data, I'm not sure how you'd decide which random tree is best.
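If you want some reassurance that the seed isn't changing the quality of the model, one option (a sketch, not an official recommendation) is to cross-validate under a handful of seeds and check that the scores are all close:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Fit the same model under several seeds; if the mean CV scores are close,
# the randomness is not meaningfully changing the tree's quality.
for seed in (0, 1, 2, 3, 4):
    clf = DecisionTreeClassifier(random_state=seed)
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    print(seed, round(scores.mean(), 3))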
See also: Source code for the splitter