I have searched on google about this issue and I can't find something that explains this algorithm in a simple yet detailed way.
For instance, I know the id3 algorithm doesn't use pruning at all, so if you have a continuous characteristic, the prediction success rates will be very low.
So the C4.5 in order to support continuous characteristics it uses pruning, but is this the only reason?
Also I can't really understand in the WEKA application, how exactly the confidence factor affects the efficiency of the predictions. The smaller the confidence factor the more pruning the algorithm will do, however what is the correlation between pruning and the prediction's accuracy? The more you prune, the better the predictions or the worse?
Thanks
Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.
The C4. 5 algorithm is used in Data Mining as a Decision Tree Classifier which can be employed to generate a decision, based on a certain sample of data (univariate or multivariate predictors).
Pruning is a technique that is used to reduce overfitting. Pruning also simplifies a decision tree by removing the weakest rules.
When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning methods address this problem of overfittingthe data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data.
Pruning is a way of reducing the size of the decision tree. This will reduce the accuracy on the training data, but (in general) increase the accuracy on unseen data. It is used to mitigate overfitting, where you would achieve perfect accuracy on training data, but the model (i.e. the decision tree) you learn is so specific that it doesn't apply to anything but that training data.
In general, if you increase pruning, the accuracy on the training set will be lower. WEKA does however offer various things to estimate the accuracy better, namely training/test split or cross-validation. If you use cross-validation for example, you'll discover a "sweet spot" of the pruning confidence factor somewhere where it prunes enough to make the learned decision tree sufficiently accurate on test data, but doesn't sacrifice too much accuracy on the training data. Where this sweet spot lies however will depend on your actual problem and the only way to determine it reliably is to try.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With