Why does the C4.5 algorithm use pruning in order to reduce the decision tree, and how does pruning affect the prediction accuracy?

I have searched on Google about this issue and I can't find anything that explains this algorithm in a simple yet detailed way.

For instance, I know the ID3 algorithm doesn't use pruning at all, so if you have a continuous attribute, the prediction success rate will be very low.

So C4.5 uses pruning in order to support continuous attributes, but is this the only reason?

Also, I can't really understand how exactly the confidence factor in the WEKA application affects the quality of the predictions. The smaller the confidence factor, the more pruning the algorithm will do; but what is the correlation between pruning and prediction accuracy? The more you prune, the better the predictions, or the worse?

Thanks

asked Jun 02 '12 by ksm001


People also ask

Why we use the pruning technique in the decision tree?

Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.

What is the C4.5 decision tree algorithm?

The C4.5 algorithm is used in data mining as a decision tree classifier which can be employed to generate a decision based on a certain sample of data (univariate or multivariate predictors).

What do you understand by pruning in a decision tree and why do we require pruning in decision trees explain?

Pruning is a technique that is used to reduce overfitting. Pruning also simplifies a decision tree by removing the weakest rules.

Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of tuples to evaluate pruning?

When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning methods address this problem of overfitting the data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data.


1 Answer

Pruning is a way of reducing the size of the decision tree. This will reduce the accuracy on the training data, but (in general) increase the accuracy on unseen data. It is used to mitigate overfitting, where you would achieve perfect accuracy on training data, but the model (i.e. the decision tree) you learn is so specific that it doesn't apply to anything but that training data.
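
To see this concretely, here is a minimal sketch using WEKA's Java API (J48 is WEKA's C4.5 implementation). It builds an unpruned and a pruned tree on the same data and compares accuracy on the training data against 10-fold cross-validation accuracy; "mydata.arff" is a placeholder for any ARFF dataset with a nominal class.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PruningDemo {
        public static void main(String[] args) throws Exception {
            // "mydata.arff" is a placeholder; use any ARFF file with a nominal class.
            Instances data = DataSource.read("mydata.arff");
            data.setClassIndex(data.numAttributes() - 1);

            for (boolean unpruned : new boolean[] {true, false}) {
                // Accuracy on the data the tree was trained on.
                J48 tree = new J48();
                tree.setUnpruned(unpruned); // true = grow the full tree, no pruning
                tree.buildClassifier(data);
                Evaluation onTrain = new Evaluation(data);
                onTrain.evaluateModel(tree, data);

                // Accuracy estimated by 10-fold cross-validation (a proxy for unseen data).
                J48 cvTree = new J48();
                cvTree.setUnpruned(unpruned);
                Evaluation cv = new Evaluation(data);
                cv.crossValidateModel(cvTree, data, 10, new Random(1));

                System.out.printf("unpruned=%-5b train=%6.2f%% cross-val=%6.2f%%%n",
                        unpruned, onTrain.pctCorrect(), cv.pctCorrect());
            }
        }
    }

Typically the unpruned tree scores higher on the training data but lower under cross-validation; that gap is the overfitting described above.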

In general, if you increase pruning, the accuracy on the training set will be lower. WEKA does, however, offer various ways to estimate the accuracy on unseen data, namely a training/test split or cross-validation. If you use cross-validation, for example, you'll discover a "sweet spot" for the pruning confidence factor, where it prunes enough to make the learned decision tree sufficiently accurate on test data but doesn't sacrifice too much accuracy on the training data. Where this sweet spot lies, however, will depend on your actual problem, and the only way to determine it reliably is to try.
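
As a rough sketch of how you might search for that sweet spot programmatically: sweep J48's confidence factor (the -C option on the command line, setConfidenceFactor in code) and compare cross-validated accuracy for each value. The file name and the particular grid of values below are just illustrative choices.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ConfidenceFactorSweep {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff"); // placeholder file name
            data.setClassIndex(data.numAttributes() - 1);

            // Smaller confidence factor -> more aggressive pruning (WEKA's default is 0.25).
            for (float cf : new float[] {0.05f, 0.10f, 0.25f, 0.50f}) {
                J48 tree = new J48();
                tree.setConfidenceFactor(cf);
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(tree, data, 10, new Random(1));
                System.out.printf("C = %.2f  cross-val accuracy = %.2f%%%n",
                        cf, eval.pctCorrect());
            }
        }
    }

Whichever value scores best under cross-validation is your candidate sweet spot for that particular dataset.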

answered Sep 30 '22 by Lars Kotthoff