Minbucket not working when producing trees with CHAID package

Tags: r, tree

I am trying to ensure that a classification tree fitted with the CHAID algorithm, as implemented in the CHAID package, has terminal nodes (leaves) with at least minbucket observations each. According to the documentation of the chaid function, this can be done by passing a control object created with chaid_control:

chaid_control(alpha2 = 0.05, alpha3 = -1, alpha4 = 0.05,
              minsplit = 20, minbucket = 7, minprob = 0.01,
              stump = FALSE, maxheight = -1)

This is similar to how tree growth is controlled in the rpart package (via rpart.control).

Nevertheless, setting the minbucket parameter seems not to have any influence on the final shape of the resulting tree. Here is an example:

library("CHAID")
set.seed(290875)
USvoteS <- USvote[sample(1:nrow(USvote), 1000),]
chaid(vote3 ~ ., data = USvoteS)

Model formula:
vote3 ~ gender + ager + empstat + educr + marstat

Fitted party:
[1] root
|   [2] marstat in married
|   |   [3] educr in <HS, HS, >HS: Gore (n = 311, err = 49.5%)
|   |   [4] educr in College, Post Coll: Bush (n = 249, err = 35.3%)
|   [5] marstat in widowed, divorced, never married
|   |   [6] gender in male: Gore (n = 159, err = 47.8%)
|   |   [7] gender in female
|   |   |   [8] ager in 18-24, 25-34, 35-44, 45-54: Gore (n = 127, err = 22.0%)
|   |   |   [9] ager in 55-64, 65+: Gore (n = 115, err = 40.9%)

Number of inner nodes:    4
Number of terminal nodes: 5

The terminal nodes 3, 4, 6, 8, and 9 contain 311, 249, 159, 127, and 115 observations, respectively. To require at least 200 observations in every terminal node, I would expect the following to work:

ctrl <- chaid_control(minbucket = 200)

Nevertheless, invoking

chaid(vote3 ~ ., data = USvoteS, control = ctrl)

yields the same tree as before (instead of a tree whose terminal nodes each contain at least 200 observations).
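To make the mismatch concrete, here is a small check that uses only the node sizes printed above. The vector sizes is typed in by hand from that output (it is not extracted from the fitted object), and the variable names are illustrative only.

```r
# Terminal node sizes copied by hand from the printed tree (node id = name).
sizes <- c("3" = 311, "4" = 249, "6" = 159, "8" = 127, "9" = 115)

# The bound requested via chaid_control(minbucket = 200).
minbucket <- 200

# Ids of the terminal nodes that fall below the requested minimum.
too_small <- names(sizes)[sizes < minbucket]
too_small
```

Nodes 6, 8, and 9 are smaller than 200, so the constraint is evidently not being applied in the way I expected.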

I am not sure whether I am making a mistake or whether something is missing in the implementation of chaid.

Kamil Kosiński asked Dec 01 '14 22:12


1 Answer

The minimum number of observations in each terminal node is controlled by both minbucket and minprob. The former gives an absolute number of observations, the latter a relative frequency (relative to the sample size of the current node). Internally, the minimum of the two quantities is used in each node. In your example the default minprob = 0.01 gives a bound of about 10 observations at the root (1000 observations), which is smaller than minbucket = 200, so minbucket has no effect. I also found this counterintuitive, as I would have expected the maximum to be used, but I didn't check whether the original CHAID algorithm is described in this way.

If you want to make sure that only minbucket controls the minimum node size, then set minbucket = 200, minprob = 1.
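The rule described above can be sketched in a few lines of plain R. Note that effective_min_size is a hypothetical helper written for illustration, not a function from the CHAID package, and I am assuming no rounding is applied to the relative bound; the package internals may differ in such details.

```r
# Hypothetical helper, not part of the CHAID package: it mirrors the rule
# described above (the smaller of the absolute and the relative bound).
effective_min_size <- function(minbucket, minprob, n_node) {
  min(minbucket, minprob * n_node)
}

n_root <- 1000  # size of the USvoteS sample in the question

# Default minprob = 0.01: the relative bound is the smaller one,
# so minbucket = 200 is effectively ignored.
default_bound <- effective_min_size(minbucket = 200, minprob = 0.01, n_node = n_root)

# minprob = 1: the relative bound equals the node size,
# so minbucket is the binding constraint.
strict_bound <- effective_min_size(minbucket = 200, minprob = 1, n_node = n_root)

c(default = default_bound, strict = strict_bound)
```

With the CHAID package itself, the corresponding call would be chaid(vote3 ~ ., data = USvoteS, control = chaid_control(minbucket = 200, minprob = 1)).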

Achim Zeileis answered Nov 04 '22 21:11