Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Different decision tree algorithms with comparison of complexity or performance

I am doing research on data mining and more precisely, decision trees.

I would like to know if there are multiple algorithms to build a decision trees (or just one?), and which is better, based on criteria such as

  • Performance
  • Complexity
  • Errors in decision making
  • and more.
like image 766
Y2theZ Avatar asked Apr 02 '12 15:04

Y2theZ


People also ask

What are the different decision tree algorithms?

There are 4 popular types of decision tree algorithms: ID3, CART (Classification and Regression Trees), Chi-Square and Reduction in Variance.

Which is the best decision tree algorithm?

The ID3 algorithm builds decision trees using a top-down greedy search approach through the space of possible branches with no backtracking. A greedy algorithm, as the name suggests, always makes the choice that seems to be the best at that moment.

What is the complexity of decision tree?

The complexity of the tree is the number of queries on the worst-case input and worst-case outcome of the coin ips. A second way to define a randomized decision tree is as a probability distribution over deterministic decision trees.

What are the decision tree 3 types of nodes?

There are three different types of nodes: chance nodes, decision nodes, and end nodes. A chance node, represented by a circle, shows the probabilities of certain results. A decision node, represented by a square, shows a decision to be made, and an end node shows the final outcome of a decision path.


1 Answers

Decision Tree implementations differ primarily along these axes:

  • the splitting criterion (i.e., how "variance" is calculated)

  • whether it builds models for regression (continuous variables, e.g., a score) as well as classification (discrete variables, e.g., a class label)

  • technique to eliminate/reduce over-fitting

  • whether it can handle incomplete data


The major Decision Tree implementations are:

  • ID3, or Iterative Dichotomizer, was the first of three Decision Tree implementations developed by Ross Quinlan (Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106.)

  • CART, or Classification And Regression Trees is often used as a generic acronym for the term Decision Tree, though it apparently has a more specific meaning. In sum, the CART implementation is very similar to C4.5; the one notable difference is that CART constructs the tree based on a numerical splitting criterion recursively applied to the data, whereas C4.5 includes the intermediate step of constructing rule sets.

  • C4.5, Quinlan's next iteration. The new features (versus ID3) are: (i) accepts both continuous and discrete features; (ii) handles incomplete data points; (iii) solves over-fitting problem by (very clever) bottom-up technique usually known as "pruning"; and (iv) different weights can be applied the features that comprise the training data. Of these, the first three are very important--and I would suggest that any DT implementation you choose has all three. The fourth (differential weighting) is much less important

  • C5.0, the most recent Quinlan iteration. This implementation is covered by a patent and probably, as a result, is rarely implemented (outside of commercial software packages). I have never coded a C5.0 implementation myself (I have never even seen the source code) so I can't offer an informed comparison of C5.0 versus C4.5. I have always been skeptical about the improvements claimed by its inventor (Ross Quinlan)--for instance, he claims it is "several orders of magnitude" faster than C4.5. Other claims are similarly broad ("significantly more memory efficient") and so forth. I'll just point you to studies which report the result of comparison of the two techniques and you can decide for yourself.

  • CHAID (chi-square automatic interaction detector) actually predates the original ID3 implementation by about six years (published in a Ph.D. thesis by Gordon Kass in 1980). I know every little about this technique.The R Platform has a Package called CHAID which includes excellent documentation

  • MARS (multi-adaptive regression splines) is actually a term trademarked by the original inventor of MARS, Salford Systems. As a result, MARS clones in libraries not sold by Salford are named something other than MARS--e.g., in R, the relevant function is polymars in the poly-spline library. Matlab and Statistica also have implementations with MARS-functionality

I would recommend CART or C4.5 (though again, I have no direct experience with C5.0 or with CHAID, though I am familiar with their feature sets).

C4.5 is the Decision Tree flavor implemented in Orange; CART is the flavor in sklearn--both excellent implementations in excellent ML libraries.

C4.5 is a major step beyond ID3--both in terms of range (C4.5 has a far broader use case spectrum because it can handle continuous variables in the training data) and in terms of model quality.

Perhaps the most significant claimed improvement of C5.0 versus C4.5 is support for boosted trees. Ensemble support for DTs--boosted trees and Random Forests--has been included in the DT implementation in Orange; here, ensemble support was added to a C4.5 algorithm. sklearn also features a range of random forest and boosting methods.

like image 166
doug Avatar answered Oct 04 '22 08:10

doug