I am trying to find out how the C4.5 algorithm determines the threshold value for numeric attributes. I have researched this but cannot understand it; in most places I've found this information:
The training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v1, v2, …, vm}. Any threshold value lying between vi and vi+1 will have the same effect of dividing the cases into those whose value of the attribute Y lies in {v1, v2, …, vi} and those whose value is in {vi+1, vi+2, …, vm}. There are thus only m-1 possible splits on Y, all of which should be examined systematically to obtain an optimal split.
It is usual to choose the midpoint of each interval, (vi + vi+1)/2, as the representative threshold. C4.5 instead chooses the smaller value vi of each interval {vi, vi+1} as the threshold, rather than the midpoint itself.
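To check my reading of that passage, here is a small sketch (my own code, not C4.5's) of how the m-1 candidate thresholds would be enumerated:

```python
def candidate_thresholds(values):
    """Yield (C4.5 threshold, midpoint) for each interval of sorted distinct values."""
    vs = sorted(set(values))
    for vi, vnext in zip(vs, vs[1:]):
        # The usual representative is the midpoint; C4.5 keeps the
        # smaller endpoint vi so the threshold occurs in the data.
        yield vi, (vi + vnext) / 2

for vi, mid in candidate_thresholds([70, 85, 90, 95]):
    print(f"interval starting at {vi}: midpoint {mid}, C4.5 threshold {vi}")
```

That gives candidate thresholds 70, 85, and 90, none of which is 75, which is exactly what confuses me.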
I am studying the Play/Don't Play example (value table) and do not understand how the number 75 in the generated tree is obtained for the humidity attribute when the state is sunny, because the humidity values for the sunny state are {70, 85, 90, 95}.
Does anyone know?
C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (a difference in entropy); the attribute with the highest normalized information gain is chosen to make the decision.
The C4.5 algorithm is used in data mining as a decision tree classifier, which can be employed to generate a decision based on a certain sample of data (univariate or multivariate predictors).
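For reference, the "normalized information gain" here is C4.5's gain ratio: the information gain of a split divided by its split information. A minimal sketch, with helper names of my own:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, subsets):
    """Information gain of a split, normalized by the split information."""
    n = len(labels)
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)
    split_info = -sum(len(s) / n * log2(len(s) / n) for s in subsets if s)
    return gain / split_info if split_info else 0.0

# Example: the split {yes, yes} | {no, no, no} on five samples
print(gain_ratio(["yes", "yes", "no", "no", "no"],
                 [["yes", "yes"], ["no", "no", "no"]]))  # 1.0
```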
The selected threshold value may not reflect the generalization capability of the continuous attribute, so the C4.5 algorithm can end up splitting the data on continuous attributes with low generalization performance. Basing decisions on those attributes increases the tree size and decreases the model accuracy.
The J48 implementation of the C4.5 algorithm has many additional features, including handling of missing values, decision tree pruning, continuous attribute value ranges, derivation of rules, etc. In the WEKA data mining tool, J48 is an open-source Java implementation of the C4.5 algorithm.
As your generated tree image implies, the attributes are considered in order: your 75 example belongs to the outlook = sunny branch. If you filter your data on outlook = sunny, you get the following table.
outlook  temperature  humidity  windy  play
sunny    69           70        FALSE  yes
sunny    75           70        TRUE   yes
sunny    85           85        FALSE  no
sunny    80           90        TRUE   no
sunny    72           95        FALSE  no
As you can see, the two "yes" rows have humidity 70 and the three "no" rows have humidity 85, 90, and 95, so for this branch the best boundary lies between 70 and 85, with midpoint (70 + 85) / 2 = 77.5. Weka's J48 then lowers the split point to the largest value observed anywhere in the full training data that does not exceed that midpoint, which is 75; hence the "humidity <= 75" test in your tree.
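A small sketch of that adjustment; the humidity column below is assumed to be the full 14-row weather dataset this example usually comes from, so check it against your own data:

```python
# Humidity values of all 14 training rows (assumed standard weather data).
all_humidity = [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91]

# Within the outlook = sunny subset, the best boundary is between 70 and 85.
midpoint = (70 + 85) / 2  # 77.5

# Weka's J48 snaps the split point down to the largest value seen
# anywhere in the training data that does not exceed the midpoint.
threshold = max(v for v in all_humidity if v <= midpoint)
print(threshold)  # 75
```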
J48 implements C4.5, a successor to the ID3 algorithm. It uses information gain and entropy to decide on the best split. According to Wikipedia:
The attribute with the smallest entropy is used to split the set on this iteration. The higher the entropy, the higher the potential to improve the classification here.
I'm not entirely sure about J48, but assuming it's based on C4.5, it would compute the gain for all possible splits (i.e., based on the possible values of the feature). For each split, it computes the information gain and chooses the split with the highest information gain. In the case of {70, 85, 90, 95} it would compute the information gain for {70 | 85, 90, 95} vs. {70, 85 | 90, 95} vs. {70, 85, 90 | 95} and choose the best one.
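A compact sketch of that search over the sunny subset (helper names are mine; note the subset actually contains humidity 70 twice):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder

# outlook = sunny subset: (humidity, class)
rows = [(70, "yes"), (70, "yes"), (85, "no"), (90, "no"), (95, "no")]
labels = [c for _, c in rows]

# One candidate split per boundary between distinct sorted values:
# {70 | 85, 90, 95}, {70, 85 | 90, 95}, {70, 85, 90 | 95}
for vi in sorted({v for v, _ in rows})[:-1]:
    left = [c for v, c in rows if v <= vi]
    right = [c for v, c in rows if v > vi]
    print(f"split after {vi}: gain = {info_gain(labels, left, right):.3f}")
```

This prints gains of roughly 0.971, 0.420, and 0.171, so the first split wins and the boundary falls between 70 and 85; the reported threshold of 75 then comes from how the implementation picks a representative value inside that interval.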
Quinlan's book on C4.5 is a good starting point (https://goo.gl/J2SsPf). See page 25 in particular.