
Tree sizes given by CP table in rpart

In the R package rpart, what determines the sizes of the trees listed in the CP table for a decision tree? In the example below, the CP table by default lists only trees with 1, 2, and 5 terminal nodes (nsplit = 0, 1, and 4 respectively).

    library(rpart)
    fit <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
    printcp(fit)

    Classification tree:
    rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, 
        method = "class")

    Variables actually used in tree construction:
    [1] Age   Start

    Root node error: 17/81 = 0.20988

    n= 81 

            CP nsplit rel error  xerror    xstd
    1 0.176471      0   1.00000 1.00000 0.21559
    2 0.019608      1   0.82353 0.94118 0.21078
    3 0.010000      4   0.76471 0.94118 0.21078

Is there an inherent rule that rpart() uses to determine which tree sizes to present? And is it possible to force printcp() to return cross-validation statistics for all possible tree sizes, i.e. for the above example, to also include rows for trees with 3 and 4 terminal nodes (nsplit = 2, 3)?

asked Jan 09 '15 14:01 by alopex


2 Answers

The rpart() function is controlled through rpart.control(). Its parameters include minsplit, the minimum number of observations a node must contain before a split is attempted, and cp, which tells the function to attempt a split only if it decreases the overall lack of fit by a factor of cp. Calling summary(fit) on your example prints the details of every split in the fitted tree. To get more rows to print from printcp(fit), you need to choose appropriate values of cp and minsplit when calling rpart() in the first place.
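
As a minimal sketch of that suggestion, the question's model can be refit with the complexity threshold switched off and the two CP tables compared. The object name fit_full and the seed are illustrative choices, not part of the original post, and only cp is changed here; minsplit is left at its default.

```r
library(rpart)

set.seed(1)  # cross-validation folds are random; fix them for repeatability

# Default controls (cp = 0.01, minsplit = 20), as in the question
fit <- rpart(Kyphosis ~ Age + Number + Start, method = "class",
             data = kyphosis)

# Same model, but with cp = 0 so that no split is discarded
# for improving the fit too little (fit_full is a hypothetical name)
fit_full <- rpart(Kyphosis ~ Age + Number + Start, method = "class",
                  data = kyphosis,
                  control = rpart.control(cp = 0))

printcp(fit)       # CP table under the defaults
printcp(fit_full)  # CP table with cp = 0

# The tree sizes (number of splits) that appear in each table
fit$cptable[, "nsplit"]
fit_full$cptable[, "nsplit"]
```

Comparing the two nsplit vectors shows whether relaxing cp alone adds rows for this data set. Lowering minsplit in the same rpart.control() call is the other knob mentioned above.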

answered Oct 07 '22 15:10 by Kevin


The rpart documentation on CRAN (http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf) mentions adding the option cp = 0 to the rpart() call. It also mentions other options that can be passed to rpart(), for example to control the number of splits.

    dfit <- rpart(y ~ x, method = 'class',
                  control = rpart.control(xval = 10, minbucket = 2, cp = 0))
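
The call above uses placeholder names y and x. A sketch of the same control settings applied to the kyphosis data from the question might look like the following; the variable names cp_tab, sizes, and missing_sizes are illustrative assumptions, and the seed is only there to make the cross-validation columns repeatable.

```r
library(rpart)

set.seed(1)  # xerror and xstd depend on random cross-validation folds

dfit <- rpart(Kyphosis ~ Age + Number + Start, method = "class",
              data = kyphosis,
              control = rpart.control(xval = 10, minbucket = 2, cp = 0))

# cptable is a numeric matrix holding the rows that printcp() displays
cp_tab <- dfit$cptable
sizes  <- cp_tab[, "nsplit"]

# Split counts between 0 and the largest tree that have no row of their own
missing_sizes <- setdiff(seq(0, max(sizes)), sizes)

print(cp_tab)
print(missing_sizes)
```

If missing_sizes is non-empty, the table is still skipping some tree sizes even with cp = 0. As far as I understand the vignette, that is expected: the rows correspond to the nested sequence of subtrees produced by cost-complexity pruning, not to every possible number of splits.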

answered Oct 07 '22 15:10 by Amrita Sawant