Help Understanding Cross Validation and Decision Trees

Tags: algorithm, machine-learning, decision-tree

I've been reading up on Decision Trees and Cross Validation, and I understand both concepts. However, I'm having trouble understanding Cross Validation as it pertains to Decision Trees. Essentially, Cross Validation lets you rotate which part of the data is used for training and which for testing, so that a relatively small dataset still yields a reliable error estimate. A very simple algorithm goes something like this:

1. Decide on the number of folds you want (k).
2. Subdivide your dataset into k folds.
3. Use k-1 folds as a training set to build a tree.
4. Use the remaining fold as the testing set to estimate statistics about the error in your tree.
5. Save your results for later.
6. Repeat steps 3-5 k times, leaving out a different fold for your test set each time.
7. Average the errors across your iterations to predict the overall error.
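The steps above can be sketched roughly as follows. This is only a minimal illustration in plain Python, not a reference implementation: `train_stump` is a hypothetical one-split stand-in for a real tree learner, and the toy `rows` data, the helper names, and the fixed shuffle seed are all assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_stump(rows):
    """Hypothetical stand-in for a tree learner: a single split on one numeric feature.
    rows are (x, label) pairs with labels 0 or 1."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        for left_label in (0, 1):
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in rows
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label

def error_rate(model, rows):
    """Fraction of rows the model misclassifies."""
    return sum(model(x) != y for x, y in rows) / len(rows)

def cross_validate(rows, k, train=train_stump):
    """Steps 2-7: split into k folds, train on k-1, test on the one left out, average."""
    fold_errors = []
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        test_rows = [rows[i] for i in test_idx]
        model = train(train_rows)                            # step 3
        fold_errors.append(error_rate(model, test_rows))     # steps 4-5
    return fold_errors, sum(fold_errors) / k                 # step 7

# Toy data (an assumption for illustration): label is 1 when x >= 10.
rows = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
fold_errors, mean_error = cross_validate(rows, k=5)
print(fold_errors, mean_error)
```

Note that the k models built inside the loop are thrown away; only the per-fold errors are kept.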

The problem I can't figure out is that at the end you'll have k Decision Trees that could all be slightly different because they might not split the same way, etc. Which tree do you pick? One idea I had was to pick the one with the lowest error (although that doesn't make it optimal, just that it performed best on the fold it was given; maybe stratification would help, but everything I've read says it only helps a little bit).

As I understand cross validation, the point is to compute in-node statistics that can later be used for pruning. So really each node in the tree will have statistics calculated for it based on the test set given to it. What's important are these in-node stats, but if you're averaging your error, how do you merge these stats within each node across k trees when each tree could vary in what it chooses to split on, etc.?

What's the point of calculating the overall error across each iteration? That's not something that could be used during pruning.

Any help with this little wrinkle would be much appreciated.

asked Feb 22 '10 22:02

chubbsondubs

1 Answer

> The problem I can't figure out is at the end you'll have k Decision trees that could all be slightly different because they might not split the same way, etc. Which tree do you pick?

The purpose of cross validation is not to help select a particular instance of the classifier (or decision tree, or whatever automatic learning application) but rather to qualify the model, i.e. to provide metrics such as the average error rate and the deviation around that average, which are useful for assessing the level of accuracy one can expect from the application. One of the things cross validation can help assess is whether the training data is big enough.
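As a small sketch of what "qualifying the model" can look like in practice, the per-fold errors can be reduced to a mean and a spread. The `fold_errors` values below are made-up numbers for illustration only, and `summarize` is a hypothetical helper name; the standard-library `statistics` module does the arithmetic.

```python
from statistics import mean, stdev

def summarize(fold_errors):
    """Reduce per-fold error rates to the summary metrics used to judge the model."""
    return {
        "mean": mean(fold_errors),    # expected error on unseen data
        "stdev": stdev(fold_errors),  # sample standard deviation; needs at least 2 folds
        "min": min(fold_errors),
        "max": max(fold_errors),
    }

# Hypothetical error rates from a 5-fold run (not real results).
fold_errors = [0.10, 0.20, 0.15, 0.25, 0.05]
summary = summarize(fold_errors)
print(summary)
```

A large spread relative to the mean suggests the error estimate is sensitive to which rows end up in the training set, which is one hint that more training data may be needed.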

With regard to selecting a particular tree, you should instead run one more training pass on 100% of the available training data, as this will typically produce a better tree. (The downside of the Cross Validation approach is that we need to divide the [typically small] amount of training data into "folds", and as you hint in the question, this can lead to trees that are either overfit or underfit for particular data instances.)
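Put as code, the workflow might look like the sketch below: cross validation produces only the error estimate, and the model that is kept comes from a separate final fit on all rows. Everything here is illustrative and assumed for the example: `train_threshold` is a hypothetical toy learner standing in for a tree builder, the rows are assumed to be in random order already, and every training split is assumed to contain both classes.

```python
def fit_with_cv_estimate(rows, k, train, error):
    """Return (final_model, estimated_error).

    The k models trained inside the loop exist only to measure error and are
    discarded; the returned model is trained on every row.
    """
    fold_errors = []
    for i in range(k):
        test_rows = rows[i::k]  # every k-th row, starting at i, is held out
        train_rows = [r for j, r in enumerate(rows) if j % k != i]
        fold_errors.append(error(train(train_rows), test_rows))
    estimate = sum(fold_errors) / k
    final_model = train(rows)   # one more training pass on 100% of the data
    return final_model, estimate

def train_threshold(rows):
    """Hypothetical toy learner: cut halfway between the two class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > cut)

def error_rate(model, rows):
    """Fraction of rows the model misclassifies."""
    return sum(model(x) != y for x, y in rows) / len(rows)

# Toy, cleanly separable data (an assumption for illustration).
rows = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
        (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
model, estimate = fit_with_cv_estimate(rows, k=4, train=train_threshold, error=error_rate)
print(estimate, model(0.5), model(9.5))
```

The estimate describes the learning procedure rather than any one of the fold models, which is why none of the k trees needs to be picked.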

In the case of decision trees, I'm not sure what your reference to statistics gathered in the nodes and used to prune the tree pertains to. Maybe a particular use of cross-validation-related techniques?

answered Sep 21 '22 12:09

mjv
