
sklearn DecisionTreeClassifier more depth less accuracy?

I have two trained sklearn.tree.DecisionTreeClassifier models. Both were fitted on the same training data, but with different maximum depths: 6 for the decision_tree_model and 2 for the small_model. Besides max_depth, no other parameters were specified.

When I compute both models' accuracy on the training data like this:

small_model_accuracy = small_model.score(training_data_sparse_matrix, training_data_labels)
decision_tree_model_accuracy = decision_tree_model.score(training_data_sparse_matrix, training_data_labels)

Surprisingly, the output is:

small_model accuracy: 0.61170212766
decision_tree_model accuracy: 0.422496238986

How is this even possible? Shouldn't a tree with a higher maximum depth always have a higher accuracy on the training data when trained on the same training data? Or does that score function perhaps return 1 - accuracy or something like that?
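For what it's worth, a quick sketch (on made-up synthetic data, not the data from the question) confirms that DecisionTreeClassifier.score returns the plain mean accuracy, identical to sklearn.metrics.accuracy_score, not 1 - accuracy:

```python
# Sketch with synthetic data: score() is the fraction of correctly
# classified samples, i.e. the same value accuracy_score computes.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

assert model.score(X, y) == accuracy_score(y, model.predict(X))
assert 0.0 <= model.score(X, y) <= 1.0
```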

EDIT:

  • I just tested it with an even higher maximum depth. The returned value becomes even lower. This hints at it being 1 - accuracy or something like that.

EDIT#2:

It seems to be a mistake I made when preparing the training data. I thought about the whole thing again and concluded: "If the depth is higher, the tree itself shouldn't be the reason. What else is there? The training data. But I used the same data! Or did I do something to the training data in between?" I checked again, and there is indeed a difference in how I use the training data: I need to transform it from an SFrame into a scipy matrix (it might have to be sparse, too). So I redid the accuracy calculation right after fitting the two models. This time it results in 61% accuracy for the small_model and 64% for the decision_tree_model. That's only 3% more, which is still somewhat surprising, but at least it's possible.
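The intuition in the original question is in fact sound: on the *same* training data, a deeper tree's training accuracy can never drop below a shallower one's, because each additional split can only keep or reduce the number of misclassified training samples. A minimal sketch with synthetic data (not the SFrame data from the question) checking this:

```python
# Sketch with synthetic data: when both trees are fitted on identical
# training data, the deeper tree never scores lower on that data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_informative=5, random_state=0)

small = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X, y)

# Training accuracy is monotone non-decreasing in max_depth.
assert deep.score(X, y) >= small.score(X, y)
```

So once the data-handling mistake is fixed, the depth-6 tree scoring at least as high as the depth-2 tree on the training data is exactly what should happen.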

EDIT#3:

The problem is resolved. I had handled the training data incorrectly, which resulted in the two models being fitted on different data.

Here is the plot of accuracy after fixing the mistakes:

[Plot: decision tree accuracy over maximum depth]

This looks correct and would also explain why the assignment creators chose 6 as the maximum depth.

Zelphir Kaltstahl asked Feb 11 '26 09:02


1 Answer

Shouldn't a tree with a higher maximum depth always have a higher accuracy when learned with the same training data?

No, definitely not always. The problem is that you're overfitting your model to the training data by fitting a more complex tree. Hence the lower score as you increase the maximum depth.
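(Strictly speaking, overfitting shows up as lower *held-out* accuracy, not lower training accuracy; the latter turned out to be the data-handling bug above. The overfitting effect itself is easy to demonstrate on synthetic noisy data, a sketch, not the question's data:)

```python
# Sketch with synthetic noisy labels: an unconstrained tree memorizes
# the training set but generalizes worse to held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)

# Memorizes the (noisy) training labels...
assert deep.score(X_tr, y_tr) == 1.0
# ...but does worse on held-out data.
assert deep.score(X_te, y_te) < deep.score(X_tr, y_tr)
```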

Anthony E answered Feb 16 '26 16:02


