I have been trying to ensure that a classification tree built with the CHAID algorithm, as implemented in the CHAID package, has terminal nodes (leaves) with at least minbucket observations. According to the documentation of the chaid procedure, this can be done through the chaid_control function:
chaid_control(alpha2 = 0.05, alpha3 = -1, alpha4 = 0.05,
minsplit = 20, minbucket = 7, minprob = 0.01,
stump = FALSE, maxheight = -1)
This is similar to how trees are controlled in the rpart package. Nevertheless, setting the minbucket parameter seems to have no influence on the shape of the resulting tree. Here is an example:
library("CHAID")
set.seed(290875)
USvoteS <- USvote[sample(1:nrow(USvote), 1000),]
chaid(vote3 ~ ., data = USvoteS)
Model formula:
vote3 ~ gender + ager + empstat + educr + marstat
Fitted party:
[1] root
| [2] marstat in married
| | [3] educr in <HS, HS, >HS: Gore (n = 311, err = 49.5%)
| | [4] educr in College, Post Coll: Bush (n = 249, err = 35.3%)
| [5] marstat in widowed, divorced, never married
| | [6] gender in male: Gore (n = 159, err = 47.8%)
| | [7] gender in female
| | | [8] ager in 18-24, 25-34, 35-44, 45-54: Gore (n = 127, err = 22.0%)
| | | [9] ager in 55-64, 65+: Gore (n = 115, err = 40.9%)
Number of inner nodes: 4
Number of terminal nodes: 5
The terminal nodes 3, 4, 6, 8, and 9 consist of 311, 249, 159, 127, and 115 observations, respectively. To constrain the minimum number of observations per terminal node, one would normally proceed as follows:
ctrl <- chaid_control(minbucket = 200)
Nevertheless, invoking
chaid(vote3 ~ ., data = USvoteS, control = ctrl)
yields the same tree as before (instead of a tree with nodes with at least 200 observations).
I am not sure whether I am making a mistake or whether something is missing in the implementation of the chaid procedure.
The minimum number of observations in each terminal node is controlled by minbucket and minprob. The former gives the absolute number of observations, the latter the relative frequency (relative to the sample size of the current node). Internally, the minimum of both quantities is used in each node. This was also counterintuitive for me, as I would have expected the maximum to be used, but I didn't check whether the original CHAID algorithm is described this way.
If you want to make sure that only minbucket controls the minimum node size, set minbucket = 200, minprob = 1.
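To see why minbucket = 200 alone has no effect, here is a rough base-R sketch of the rule described above. The helper effective_min_size is hypothetical and is not taken from the CHAID source; it simply takes the minimum of the absolute and the relative threshold, and it ignores any rounding the package may apply.

```r
# Hypothetical helper (not part of the CHAID package): the effective
# minimum node size, assuming the smaller of the absolute threshold
# (minbucket) and the relative one (minprob * node size) is used.
effective_min_size <- function(n_node, minbucket, minprob) {
  min(minbucket, minprob * n_node)
}

n_root <- 1000  # size of the USvoteS sample at the root node

# Default minprob = 0.01: the relative threshold (10) wins,
# so minbucket = 200 never becomes binding.
default_case <- effective_min_size(n_root, minbucket = 200, minprob = 0.01)

# minprob = 1: the relative threshold equals the node size,
# so minbucket = 200 is the smaller value and takes effect.
strict_case <- effective_min_size(n_root, minbucket = 200, minprob = 1)

print(default_case)  # 10
print(strict_case)   # 200
```

With the question's data, the corresponding call would be chaid(vote3 ~ ., data = USvoteS, control = chaid_control(minbucket = 200, minprob = 1)).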