
format split labels in rpart.plot

Tags: r, rpart

I am plotting a tree with rpart.plot::prp(), much like:

library("rpart.plot")
data("ptitanic")
data <- ptitanic
data$sibsp <- as.integer(data$sibsp) # just to show that these are integers
data$age <- as.integer(data$age) # just to show that these are integers
tree <- rpart(survived~., data=data, cp=.02)
prp(tree, fallen.leaves = FALSE, type=4, extra=1, varlen=0, faclen=0, yesno.yshift=-1)


Even though certain variables are integers (age and sibsp), rpart picks split points such as 2.5, which look arbitrary and confuse the viewer. Nobody has 2.5 siblings/spouses aboard; the logical split is sibsp >= 3.

I have looked at split.fun in this excellent tutorial and ?prp. Other than using a regex to capture the number, format it properly, and replace it in the label string, I can't think of any solutions within prp.

A workaround I am considering is to pass a modified tree (object of class rpart) where the contents have been rounded. Is it possible to do this by modifying tree$splits?

Any other ideas?

C8H10N4O2 asked May 20 '26 00:05


2 Answers

1) ordered factors. I think age is OK as a continuous variable, but to handle sibsp and parch, make them into ordered factors:

data <- transform(data, sibsp = ordered(sibsp), parch = ordered(parch))
tree <- rpart(survived~., data=data, cp=.02)
prp(tree, fallen.leaves = FALSE, type=4, extra=1, varlen=0, faclen=0, yesno.yshift=-1)

screenshot

2) split.fun. Another approach is to specify our own split.fun like this:

# next 4 lines are same as in question
data <- ptitanic
data$sibsp <- as.integer(data$sibsp) # just to show that these are integers
data$age <- as.integer(data$age) # just to show that these are integers
tree <- rpart(survived~., data=data, cp=.02)

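split.labs <- function(x, labs, digits, varlen, faclen) {
   # round the threshold up in every label that contains ">=" or "<"
   sapply(labs, function(lab) 
      if (grepl(">=|<", lab)) {
         rhs <- sub(".* ", "", lab)   # text after the last space, e.g. "2.5"
         # fixed = TRUE so the "." in rhs is matched literally, not as a regex
         sub(rhs, ceiling(as.numeric(rhs)), lab, fixed = TRUE)
      } else lab)
} 
prp(tree, fallen.leaves = FALSE, type=4, extra=1, varlen=0, faclen=0, yesno.yshift=-1, 
   split.fun = split.labs) # same as in question except for split.fun= arg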

This gives:

screenshot
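To see what the label rewriting in (2) does without drawing a tree, here is a minimal base-R sketch. The label strings are hypothetical examples in the "variable operator value" form that the regex assumes; the labels prp actually passes to split.fun may be formatted differently. The helper name round_up_label is made up for this illustration.

```r
# Hypothetical labels (assumption: real prp labels look roughly like these)
labs <- c("sibsp >= 2.5", "age < 9.5", "sex = male")

# Illustrative helper, same idea as the inner function of split.labs
round_up_label <- function(lab) {
  if (grepl(">=|<", lab)) {
    rhs <- sub(".* ", "", lab)   # text after the last space, e.g. "2.5"
    # replace the threshold literally with its ceiling
    sub(rhs, ceiling(as.numeric(rhs)), lab, fixed = TRUE)
  } else {
    lab                          # leave non-numeric splits untouched
  }
}

new_labs <- vapply(labs, round_up_label, character(1), USE.NAMES = FALSE)
print(new_labs)
```

The two numeric labels come back as "sibsp >= 3" and "age < 10", while the factor label is returned unchanged.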

(2a) A variation of (2) which gives slightly more control, i.e. one can specify precisely which variables to modify, is the following:

# next 4 lines are same as in question
data <- ptitanic
data$sibsp <- as.integer(data$sibsp) # just to show that these are integers
data$age <- as.integer(data$age) # just to show that these are integers
tree <- rpart(survived~., data=data, cp=.02)

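split.labs2 <- function(x, labs, digits, varlen, faclen) {
    # round the threshold up only in labels that mention the listed variables
    sapply(labs, function(lab) 
        if (grepl("age|sibsp|parch", lab)) {
            rhs <- sub(".* ", "", lab)   # text after the last space, e.g. "2.5"
            # fixed = TRUE so the "." in rhs is matched literally, not as a regex
            sub(rhs, ceiling(as.numeric(rhs)), lab, fixed = TRUE)
        } else lab)
} 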

# similar to (2) except we use clip.right.labs = FALSE and split.labs2

prp(tree, type = 4, fallen.leaves = FALSE, extra=1, varlen=0, faclen=0, 
   yesno.yshift=-1, clip.right.labs = FALSE, split.fun = split.labs2)

screenshot
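One likely reason to restrict the rounding to named variables, as in (2a), is that rounding the threshold up only preserves the split when the predictor takes integer values. The sketch below checks this with two made-up vectors (x and y are hypothetical, not columns of ptitanic):

```r
x <- 0:8                 # hypothetical integer-valued predictor (counts such as sibsp)
y <- c(2.6, 7.25, 31.0)  # hypothetical non-integer predictor (something like a fare)

# For integer data, "x >= 2.5" and "x >= 3" select exactly the same rows,
# and likewise for "<", so the rounded label describes the same split.
int_same <- identical(x >= 2.5, x >= ceiling(2.5)) &&
            identical(x <  2.5, x <  ceiling(2.5))

# For non-integer data the rounded threshold is a different split:
# 2.6 satisfies ">= 2.5" but not ">= 3".
real_same <- identical(y >= 2.5, y >= ceiling(2.5))

print(c(integer = int_same, non_integer = real_same))
```

So a blanket ceiling, as in (2), would mislabel splits on a continuous predictor, whereas (2a) only touches the variables known to be integer-valued.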

G. Grothendieck answered May 23 '26 03:05


Version 3.0.0 of the rpart.plot package (July 2018) treats predictors with integer values specially to automatically get the results you want.

So rpart.plot now automatically prints sibsp >= 3 instead of sibsp >= 2.5, since it sees that in the training data all values of sibsp are integral.

Section 4.1 of the vignette for the rpart.plot package has an example.

Stephen Milborrow answered May 23 '26 03:05


