Minbucket not working when producing trees with CHAID package

Question

I have been trying to make sure that a classification tree fitted with the CHAID algorithm, as implemented in the CHAID package, has terminal nodes (leaves) with at least minbucket observations each. According to the documentation of the chaid procedure, this can be done by passing a control object created with chaid_control:

chaid_control(alpha2 = 0.05, alpha3 = -1, alpha4 = 0.05,
              minsplit = 20, minbucket = 7, minprob = 0.01,
              stump = FALSE, maxheight = -1)

This is similar to how trees are controlled in the rpart package.

Nevertheless, setting the minbucket parameter seems to have no influence on the shape of the resulting tree. Here is an example:

library("CHAID")
set.seed(290875)
USvoteS <- USvote[sample(1:nrow(USvote), 1000),]
chaid(vote3 ~ ., data = USvoteS)

Model formula:
vote3 ~ gender + ager + empstat + educr + marstat

Fitted party:
[1] root
|   [2] marstat in married
|   |   [3] educr in <HS, HS, >HS: Gore (n = 311, err = 49.5%)
|   |   [4] educr in College, Post Coll: Bush (n = 249, err = 35.3%)
|   [5] marstat in widowed, divorced, never married
|   |   [6] gender in male: Gore (n = 159, err = 47.8%)
|   |   [7] gender in female
|   |   |   [8] ager in 18-24, 25-34, 35-44, 45-54: Gore (n = 127, err = 22.0%)
|   |   |   [9] ager in 55-64, 65+: Gore (n = 115, err = 40.9%)

Number of inner nodes:    4
Number of terminal nodes: 5

The terminal nodes 3, 4, 6, 8, and 9 contain 311, 249, 159, 127, and 115 observations, respectively. To constrain the minimum number of observations per terminal node, one would normally proceed as follows:

ctrl <- chaid_control(minbucket = 200)

Nevertheless, invoking

chaid(vote3 ~ ., data = USvoteS, control = ctrl)

yields the same tree as before (instead of a tree whose terminal nodes each contain at least 200 observations).

I am not sure whether I am making a mistake or whether something is missing in the implementation of the chaid procedure.

Achim Zeileis · Accepted Answer

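The minimum number of observations in each terminal node is controlled by both minbucket and minprob. The former gives an absolute number of observations, the latter a relative frequency (relative to the sample size of the current node). Internally, the minimum of both quantities is used in each node. In your example minprob keeps its default of 0.01, so the relative limit is only about 10 observations at the root, and minbucket = 200 never becomes the binding constraint. This was also counterintuitive for me, as I would have expected the maximum to be used, but I didn't check whether the original CHAID algorithm is described this way.

If you want only minbucket to control the minimum node size, set minbucket = 200, minprob = 1. The relative limit then equals the size of the current node, so the minimum of the two is minbucket whenever the node has at least 200 observations.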
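To make the rule concrete, here is a minimal base-R sketch of the arithmetic described above. The helper effective_min_size is hypothetical and not part of the CHAID package; it assumes the limit is simply the smaller of minbucket and minprob times the node size, and it ignores any rounding the package may apply. The node sizes are obtained by summing the terminal-node counts printed in the question.

```r
# Hypothetical helper (not part of CHAID): assumes the rule from the answer,
# i.e. the effective limit is the minimum of the absolute limit (minbucket)
# and the relative limit (minprob * size of the current node).
effective_min_size <- function(n_node, minbucket, minprob) {
  pmin(minprob * n_node, minbucket)
}

# Inner-node sizes, summed from the terminal-node counts in the printed tree
n_nodes <- c(root = 311 + 249 + 159 + 127 + 115,  # node 1
             node2 = 311 + 249,                   # marstat in married
             node5 = 159 + 127 + 115,             # widowed, divorced, never married
             node7 = 127 + 115)                   # gender in female

# Settings from the question: minbucket = 200, minprob left at 0.01
loose <- effective_min_size(n_nodes, minbucket = 200, minprob = 0.01)

# Settings from the answer: minbucket = 200, minprob = 1
strict <- effective_min_size(n_nodes, minbucket = 200, minprob = 1)

print(rbind(n_nodes, loose, strict))
```

Under the question's settings the limit stays below 10 at every inner node, so all splits in the printed tree pass. With minprob = 1 the limit is 200 at each of these nodes, which is the behavior the question was after.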

Tags: r, tree

Kamil Kosiński
