
partykit: Displaying terminal node percentile values above terminal node boxplots

I'm trying to plot a regression tree generated with rpart using partykit. Suppose the formula used is y ~ x1 + x2 + x3 + ... + xn. I would like a tree with boxplots in the terminal nodes, each topped with a label listing the 10th, 50th, and 90th percentiles of the y values of the observations assigned to that node. That is, above the boxplot for each terminal node, I would like to display a label like "10th percentile = $200, mean = $247, 90th percentile = $292."

The code below generates the desired tree:

library("rpart")
fit <- rpart(Price ~ Mileage + Type + Country, cu.summary)
library("partykit")
tree.2 <- as.party(fit)

The following code generates the terminal plots but without the desired labels on the terminal nodes:

plot(tree.2, type = "simple", terminal_panel = node_boxplot(tree.2,
  col = "black", fill = "lightgray", width = 0.5, yscale = NULL,
  ylines = 3, cex = 0.5, id = TRUE))

If I can display a mean y-value for a node, then it should be easy enough to augment the label with percentiles, so my first step is to display, above each terminal node, just its mean y-value.

I know I can retrieve the mean y-value within a node (here node #12) with code such as this:

colMeans(tree.2[12]$fitted[2])

So I tried to write a function and pass it as the mainlab argument of the boxplot panel-generating function, to generate a label containing this mean:

labf <- function(node) colMeans(node$fitted[2])
plot(tree.2, type = "simple", terminal_panel = node_boxplot(tree.2,
  col = "black", fill = "lightgray", width = 0.5, yscale = NULL,
  ylines = 3, cex = 0.5, id = TRUE, mainlab = labf))

Unfortunately, this generates the error message:

Error in mainlab(names(obj)[nid], sum(wn)) : unused argument (sum(wn)).

But it seems this is on the right track, since if I use:

plot(tree.2, type = "simple", terminal_panel = node_boxplot(tree.2,
  col = "black", fill = "lightgray", width = 0.5, yscale = NULL,
  ylines = 3, cex = 0.5, id = TRUE, mainlab = colMeans(tree.2$fitted[2])))

then the correct mean y-value for the root node is displayed. I would appreciate help fixing the error described above so that I can show the mean y-value of each terminal node separately. From there, it should be easy to add the other percentiles and format things nicely.

djr99 asked Oct 24 '15 03:10


1 Answer

In principle, you are on the right track. However, if mainlab is a function, it must be a function of id and nobs rather than of the node; see ?node_boxplot. You can also compute the table of means (or some quantiles) for all terminal nodes more easily from the fitted data for the whole tree:

tab <- tapply(tree.2$fitted[["(response)"]],
  factor(tree.2$fitted[["(fitted)"]], levels = 1:length(tree.2)),
  FUN = mean)

Then you can prepare this for plotting by rounding/formatting:

tab <- format(round(tab, digits = 3))
tab
##           1           2           3           4           5           6 
## "       NA" "       NA" "       NA" " 7629.048" "       NA" "12241.552" 
##           7           8           9          10          11          12 
## "14846.895" "22317.727" "       NA" "       NA" "17607.444" "21499.714" 
##          13 
## "27646.000" 
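
The NA entries belong to inner nodes, which never appear in the "(fitted)" column because observations are only assigned to terminal nodes. As a minimal sketch (the names term_ids and tab_term are my own, not from the question), the table can be restricted to the terminal nodes by using nodeids() for the factor levels:

```r
library("rpart")
library("partykit")

# Same tree as in the question
fit <- rpart(Price ~ Mileage + Type + Country, data = cu.summary)
tree.2 <- as.party(fit)

# Ids of the terminal nodes only
term_ids <- nodeids(tree.2, terminal = TRUE)

# Mean response per terminal node; names are the node ids as strings
tab_term <- tapply(tree.2$fitted[["(response)"]],
  factor(tree.2$fitted[["(fitted)"]], levels = term_ids),
  FUN = mean)
tab_term
```

Because the result is still named by node id, indexing it with a character id works the same way as with the full table.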

To add this to the display, write your own helper function for mainlab:

mlab <- function(id, nobs) paste("Mean =", tab[id])
plot(tree.2, tp_args = list(mainlab = mlab))
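
The same approach should extend to the percentiles asked for in the question. The following is a sketch under the assumption that a short one-line label is acceptable; qtab and qlab are names I made up for illustration, and the label format is only one possibility:

```r
library("rpart")
library("partykit")

# Same tree as in the question
fit <- rpart(Price ~ Mileage + Type + Country, data = cu.summary)
tree.2 <- as.party(fit)

# Response values and terminal node ids of the learning sample
y  <- tree.2$fitted[["(response)"]]
nd <- tree.2$fitted[["(fitted)"]]

# One row per terminal node (row names are node ids),
# columns are the 10th, 50th, and 90th percentiles
qtab <- t(sapply(split(y, nd), quantile, probs = c(0.1, 0.5, 0.9)))

# mainlab is called with the node id (a character string) and
# the number of observations; nobs is unused in this sketch
qlab <- function(id, nobs) {
  q <- format(round(qtab[id, ]), big.mark = ",", trim = TRUE)
  paste0("10/50/90%: ", paste(q, collapse = " / "))
}

plot(tree.2, tp_args = list(mainlab = qlab))
```

With several terminal nodes side by side, long labels may overlap, so the text may need to be shortened or the plotting device widened.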


Achim Zeileis answered Sep 18 '22 15:09