Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reptree (WEKA), Only sorts values for numeric attributes once

I am using Reptree algorithm from weka. There is no detail docs for this algorithm, only:

Fast decision tree learner. Builds a decision/regression tree using information gain/variance reduction and prunes it using reduced-error pruning (with backfitting). Only sorts values for numeric attributes once. Missing values are dealt with by splitting the corresponding instances into pieces (i.e. as in C4.5).

Can anyone please explain me, what is meant by: "Only sorts values for numeric attributes once."

I am trying re-implement this algorithm, but still getting not even close results.

Thank you

Lubomir

like image 804
gugatr0n1c Avatar asked Oct 02 '22 01:10

gugatr0n1c


1 Answers

I have been looking for the same answers about Weka's RepTree myself lately amd stumbled upon this post, I was not able to find sufficient answers online, so I opened the source-code and dove in.

"Only sorts values for numeric attributes once." means that the algorithm sorts all numeric fields in the data-set once, at the start of the run, and then uses the sorted lists to calculate the right splits in each tree node.

From reading the source code I was able to figure out, in general, how the algorithm works:

1) For classification of non-numeric (discrete) it uses regular decision tree with reduced-error pruning [http://www.stat.cmu.edu/~ryantibs/datamining/lectures/23-tree.pdf] It maximizes entropy values, you can see the exact method implementation in the source code under weka.core.ContingencyTables - entropyConditionedOnRows(double[][] matrix)

2) For classification of numeric values the change is that it minimizes total variance.

Also, for me, it was very confusing figuring out the output tree, specifically, in numeric classification, what does this leaf output mean:

years_since >= 9.03 : 4.64 (8/0.68) [8/6.76]

1) In non-numeric (discrete) classification the first number after ':' is the classification value of the tree, the first number in the () brackets is the amount of correctly classified instances from the training set under that leaf, the second number is the amount of instances which were under the leaf but had a different classification value, while in the [] brackets the first number is the amount of correct classifications from the pruning set and the second number is the wrong classifications. [http://weka.8497.n7.nabble.com/about-REPtree-td15188.html]

2) In numeric classification, this output means something else. The first number after the ':' is the mean (x_mean) of the samples from the training set which belong that that leaf, the first number in the () brackets is the weight -k- of the values which belong to that leaf (weight means the number of samples from what I figured out from the code) while the second value is the variance of the sample, meaning sum_i_from_1_to_k((x_i-x_mean)^2)/k. In the [] brackets the first is the amount of samples from the pruning -L- set which belong to that leaf while the second number is the variance relative to the x_mean from the training set, meaning sum_i_from_1_to_L((y_i-x_mean)^2)/L

Hope this helps

like image 65
PeterN Avatar answered Oct 05 '22 10:10

PeterN