I am trying to find out how the C4.5 algorithm determines the threshold value for numeric attributes. I have researched this but cannot understand it; in most places I've found this information:
The training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v1, v2, …, vm}. Any threshold value lying between vi and vi+1 will have the same effect of dividing the cases into those whose value of the attribute Y lies in {v1, v2, …, vi} and those whose value is in {vi+1, vi+2, …, vm}. There are thus only m-1 possible splits on Y, all of which should be examined systematically to obtain an optimal split.
It is usual to choose the midpoint of each interval, (vi + vi+1)/2, as the representative threshold. C4.5 instead chooses the smaller value vi of each interval {vi, vi+1} as the threshold, rather than the midpoint itself.
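To check my reading of that passage, here is a small sketch (my own code, not C4.5's) of how the m-1 candidate thresholds would be enumerated:

```python
def candidate_thresholds(values):
    """Yield (C4.5 threshold, midpoint) for each interval of sorted distinct values."""
    vs = sorted(set(values))
    for vi, vnext in zip(vs, vs[1:]):
        # The usual representative is the midpoint; C4.5 keeps the
        # smaller endpoint vi so the threshold occurs in the data.
        yield vi, (vi + vnext) / 2

for vi, mid in candidate_thresholds([70, 85, 90, 95]):
    print(f"interval starting at {vi}: midpoint {mid}, C4.5 threshold {vi}")
```

That gives candidate thresholds 70, 85, and 90, none of which is 75, which is exactly what confuses me.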
I am studying the Play/Don't Play example (value table) and do not understand how the number 75 in the generated tree is obtained for the humidity attribute when the state is sunny, because the humidity values for the sunny state are {70, 85, 90, 95}.
Does anyone know?
C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (a difference in entropy); the attribute with the highest normalized information gain is chosen to make the decision.
The C4.5 algorithm is used in data mining as a decision tree classifier, which can be employed to generate a decision based on a certain sample of data (univariate or multivariate predictors).
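For reference, the "normalized information gain" here is C4.5's gain ratio: the information gain of a split divided by its split information. A minimal sketch, with helper names of my own:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, subsets):
    """Information gain of a split, normalized by the split information."""
    n = len(labels)
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)
    split_info = -sum(len(s) / n * log2(len(s) / n) for s in subsets if s)
    return gain / split_info if split_info else 0.0

# Example: the split {yes, yes} | {no, no, no} on five samples
print(gain_ratio(["yes", "yes", "no", "no", "no"],
                 [["yes", "yes"], ["no", "no", "no"]]))  # 1.0
```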
The selected threshold value may not reflect the generalization capability of the continuous attribute, so the C4.5 algorithm can end up splitting the data on continuous attributes with low generalization performance. Basing decisions on those attributes increases the tree size and decreases the model accuracy.
The J48 implementation of the C4.5 algorithm has many additional features, including handling of missing values, decision tree pruning, continuous attribute value ranges, derivation of rules, etc. In the WEKA data mining tool, J48 is an open-source Java implementation of the C4.5 algorithm.
As your generated tree image implies, the attributes are considered in order: your 75 example belongs to the outlook = sunny branch. If you filter your data on outlook = sunny, you get the following table.
outlook  temperature  humidity  windy  play
sunny    69           70        FALSE  yes
sunny    75           70        TRUE   yes
sunny    85           85        FALSE  no
sunny    80           90        TRUE   no
sunny    72           95        FALSE  no
As you can see, the two "yes" rows have humidity 70 and the three "no" rows have humidity 85, 90, and 95, so for this branch the best boundary lies between 70 and 85, with midpoint (70 + 85) / 2 = 77.5. Weka's J48 then lowers the split point to the largest value observed anywhere in the full training data that does not exceed that midpoint, which is 75; hence the "humidity <= 75" test in your tree.
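A small sketch of that adjustment; the humidity column below is assumed to be the full 14-row weather dataset this example usually comes from, so check it against your own data:

```python
# Humidity values of all 14 training rows (assumed standard weather data).
all_humidity = [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91]

# Within the outlook = sunny subset, the best boundary is between 70 and 85.
midpoint = (70 + 85) / 2  # 77.5

# Weka's J48 snaps the split point down to the largest value seen
# anywhere in the training data that does not exceed the midpoint.
threshold = max(v for v in all_humidity if v <= midpoint)
print(threshold)  # 75
```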
J48 implements C4.5, a successor to the ID3 algorithm. It uses information gain and entropy to decide on the best split. According to Wikipedia:
The attribute with the smallest entropy is used to split the set on this iteration. The higher the entropy, the higher the potential to improve the classification here.
I'm not entirely sure about J48, but assuming it's based on C4.5, it would compute the gain for all possible splits (i.e., based on the possible values of the feature). For each split, it computes the information gain and chooses the split with the highest information gain. In the case of {70, 85, 90, 95} it would compute the information gain for {70 | 85, 90, 95} vs. {70, 85 | 90, 95} vs. {70, 85, 90 | 95} and choose the best one.
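A compact sketch of that search over the sunny subset (helper names are mine; note the subset actually contains humidity 70 twice):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder

# outlook = sunny subset: (humidity, class)
rows = [(70, "yes"), (70, "yes"), (85, "no"), (90, "no"), (95, "no")]
labels = [c for _, c in rows]

# One candidate split per boundary between distinct sorted values:
# {70 | 85, 90, 95}, {70, 85 | 90, 95}, {70, 85, 90 | 95}
for vi in sorted({v for v, _ in rows})[:-1]:
    left = [c for v, c in rows if v <= vi]
    right = [c for v, c in rows if v > vi]
    print(f"split after {vi}: gain = {info_gain(labels, left, right):.3f}")
```

This prints gains of roughly 0.971, 0.420, and 0.171, so the first split wins and the boundary falls between 70 and 85; the reported threshold of 75 then comes from how the implementation picks a representative value inside that interval.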
Quinlan's book on C4.5 is a good starting point (https://goo.gl/J2SsPf). See page 25 in particular.