Understanding of minbucket function in CART model using R

Assume the training data is "fruit", which I am going to use for prediction with a CART model in R:

> library(rpart)       # provides rpart()
> library(rpart.plot)  # provides prp()

> fruit=data.frame(
                   color=c("red",   "red",  "red",  "yellow", "red","yellow",
                           "orange","green","pink", "red",    "red"),
                   isApple=c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE,
                             FALSE,FALSE,FALSE,FALSE,TRUE))

> mod = rpart(isApple ~ color, data=fruit, method="class", minbucket=1)

> prp(mod)

Could anyone explain exactly what role minbucket plays in the CART tree that gets built and plotted for this example if we use minbucket = 2, 3, 4, 5?

I have two variables, color and isApple. The color variable takes the values green, yellow, pink, orange and red. The isApple variable is either TRUE or FALSE. In the last example, red has three TRUE and two FALSE values mapped to it, so red appears five times. If I give minbucket = 1, 2 or 3, the tree splits. If I give minbucket = 4 or 5, no split occurs, even though red appears five times.

GBOT asked Apr 14 '15 06:04


People also ask

What is minbucket in R?

The minbucket option sets the smallest number of observations allowed in a terminal node. If a candidate split would produce a node with fewer than minbucket observations, it is not accepted. The minsplit parameter is the smallest number of observations a parent node must contain for a split to be attempted.

How does rpart work in R?

The rpart algorithm works by splitting the dataset recursively, which means that the subsets that arise from a split are further split until a predetermined termination criterion is reached.

What are minsplit and minbucket?

minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. minbucket: the minimum number of observations in any terminal <leaf> node.

What is rpart in decision tree?

rpart is an R package for building classification and regression trees. It implements recursive partitioning and is very easy to use.


1 Answer

From the documentation for the rpart package:

minbucket

the minimum number of observations in any terminal node. If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3, as appropriate.

Setting minbucket to 1 imposes no real restriction, since each leaf node will (by definition) have at least one observation in it. If you set it to a higher value, say 3, then every leaf node must contain at least 3 observations.
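The documentation note about minsplit being derived from minbucket looks like the likely reason the tree stops splitting at minbucket = 4 in the question. Counting rows in the data as posted, red occurs six times (five TRUE, one FALSE) and the other five rows are all FALSE, so the red versus non-red split leaves 6 and 5 observations in its two leaves. That satisfies minbucket up to 5, but minbucket = 4 implies minsplit = 12, which is more than the 11 rows available. The sketch below checks this reading. The helper count_leaves is a hypothetical name introduced here, and the columns are converted to factors explicitly as an assumption to keep the example independent of the R version.

```r
library(rpart)

# Data from the question; both columns are made factors explicitly.
fruit <- data.frame(
  color = factor(c("red", "red", "red", "yellow", "red", "yellow",
                   "orange", "green", "pink", "red", "red")),
  isApple = factor(c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE,
                     FALSE, FALSE, FALSE, FALSE, TRUE))
)

# Hypothetical helper: number of terminal nodes in a fitted rpart tree.
# A tree that never splits has exactly one leaf (the root).
count_leaves <- function(mod) sum(mod$frame$var == "<leaf>")

# Case 1: only minbucket is supplied, so minsplit is derived as minbucket * 3.
only_minbucket <- sapply(1:5, function(mb) {
  mod <- rpart(isApple ~ color, data = fruit, method = "class",
               minbucket = mb)
  c(minbucket = mb,
    minsplit  = mod$control$minsplit,
    leaves    = count_leaves(mod))
})
print(t(only_minbucket))

# Case 2: minsplit is pinned to the number of rows, so the root is always
# eligible for a split and only minbucket can block it.
pinned_minsplit <- sapply(4:5, function(mb) {
  mod <- rpart(isApple ~ color, data = fruit, method = "class",
               minbucket = mb, minsplit = nrow(fruit))
  c(minbucket = mb,
    minsplit  = mod$control$minsplit,
    leaves    = count_leaves(mod))
})
print(t(pinned_minsplit))
```

In the first table the derived minsplit should be 3, 6, 9, 12, 15, with two leaves for minbucket = 1, 2, 3 and a single leaf for 4 and 5. In the second table, with minsplit fixed at 11, minbucket = 4 and 5 should both give two leaves, which suggests that the implied minsplit, not minbucket itself, is what stopped the split.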

The smaller the value of minbucket, the more closely your CART model can fit the training data. By setting minbucket to too small a value, such as 1, you run the risk of overfitting your model.

Tim Biegeleisen answered Oct 11 '22 08:10