Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

custom split rule with partykit

Tags:

split

r

tree

party

this post follows this question : https://stackoverflow.com/questions/31234329/rpart-user-defined-implementation

I'm very interested in tools which could handle tree growing with customized criteria, such that I could test different model.

I tried to use the partykit R package to grow a tree for which the split rule is given by the negative log-likelihood of a Cox model (which is log-quasi-likelihood in case of the Cox model) and a Cox model is fitted in each leaf.

As I understood reading the vignette about the MOB function, there are two way to implement my own split criteria, namely to get the fit function return either a list or a model object.

For my purpose, I tried the two solutions but I failed to make it work.

Solution 1 : return a list object :

I take as an example the "breast cancer dataset" as in the "mob" vignette.

I tried this :

cox1 = function(y,x, start = NULL, weights = NULL, offset = NULL, ...,
           estfun = FALSE, object = TRUE){
  res_cox = coxph(formula = y ~ x )
list(
  coefficients = res_cox$coefficients,
  objfun = - res_cox$loglik[2],
  object = res_cox)
}


mob(formula = Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade + progrec +
  estrec + menostat , 
    data = GBSG2 ,
    fit = cox1,
    control = mob_control(alpha = 0.0001) )

There is a warning about the singularity of the X matrix, and the mob function a tree with a single node (even with smaller values for alpha).

Note that there is no singularity problem with the X matrix when running the coxph function :

res_cox = coxph( formula = Surv(time, cens) ~ horTh + pnodes  ,
             data = GBSG2 )

Solution 2 : Return a coxph.object :

I tried this :

cox2 = function(y,x, start = NULL, weights = NULL, offset = NULL, ... ){
  res_cox = coxph(formula = y ~ x )
}

logLik.cox2 <- function(object, ...)
  structure( - object$loglik[2], class = "logLik")

mob(formula = Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade + progrec +
  estrec + menostat , 
    data = GBSG2 ,
    fit = cox2,
    control = mob_control(alpha = 0.0001 ) )

So this time I get a split along the "progrec" variable :

Model-based recursive partitioning (cox2)

Model formula:
Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade + 
progrec + estrec + menostat

Fitted party:
[1] root
|   [2] progrec <= 21: n = 281
|         xhorThno  xhorThyes    xpnodes 
|       0.19306661         NA 0.07832756 
|   [3] progrec > 21: n = 405
|         xhorThno  xhorThyes    xpnodes 
|       0.64810352         NA 0.04482348 

Number of inner nodes:    1
Number of terminal nodes: 2
Number of parameters per node: 3
Objective function: 1531.132
Warning message:
In coxph(formula = y ~ x) : X matrix deemed to be singular; variable 2

I would like to know what's wrong with my Solution 1.

I also tried a similar thing for a regression problem and get the same result, ending with a single leaf :

data("BostonHousing", package = "mlbench")

BostonHousing <- transform(BostonHousing,
                       chas = factor(chas, levels = 0:1, labels = c("no", "yes")),
                       rad = factor(rad, ordered = TRUE))


linear_reg = function(y,x, start = NULL, weights = NULL, offset = NULL, ...,
                  estfun = FALSE, object = TRUE){
  res_lm = glm(formula = y ~ x , family = "gaussian")
  list(
    coefficients = res_lm$coefficients,
    objfun = res_lm$deviance,
    object = res_lm )
}

mob( formula = medv ~ log(lstat) + I(rm^2) | zn + indus + chas + nox +
   + age + dis + rad + tax + crim + b + ptratio, 
     data = BostonHousing ,
     fit = linear_reg)

Also I would like to know if there is no problem using a variable for both "fit the model in a node" and "make a split".

Thank you in advance.

I will probably have other questions about partykit functioning.

like image 793
Ooona Avatar asked Mar 14 '16 15:03

Ooona


1 Answers

The problem with the cox1() and linear_reg() functions you have set up are that you do not supply the estimating functions aka score contributions. As these are the basis for the inference that selects the splitting variable, the algorithm does not split at all if these are not provided. See this recent answer for some discussion of this issues.

But for coxph() objects (unlike the fitdistr() example in the discussion linked above) it is very easy to obtain these estimating functions or scores because there is an estfun() method available. So your cox2() approach is the easier route to go here.

The reason that the latter doesn't work correctly is due to the special handling of intercepts in coxph(). Internally, this always forces the intercept into the model but then omits the first column from the design matrix. When interfacing this through mob() you need to be careful not to mess this up because mob() sets up its own model matrix. And because you exclude the intercept, mob() thinks that it can estimate both levels of horTh. But this is not the case because the intercept is not identified in the Cox-PH model.

The best solution in this case (IMO) is the following: You let mob() set up an intercept but then exclude it again when passing the model matrix to coxph(). Because there are coef(), logLik(), and estfun() methods for the resulting objects, one can use the simple setup of your cox2() function.

Packages and data:

library("partykit")
library("survival")
data("GBSG2", package = "TH.data")

Fitting function:

cox <- function(y, x, start = NULL, weights = NULL, offset = NULL, ... ) {
  x <- x[, -1]
  coxph(formula = y ~ 0 + x)
}

Fitting of the MOB tree to the GBSG2 data:

mb <- mob(formula = Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec + estrec + menostat, 
  data = GBSG2, fit = cox)
mb
## Model-based recursive partitioning (cox)
## 
## Model formula:
## Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec + 
##     estrec + menostat
## 
## Fitted party:
## [1] root: n = 686
##       xhorThyes     xpnodes 
##     -0.35701115  0.05768026  
## 
## Number of inner nodes:    0
## Number of terminal nodes: 1
## Number of parameters per node: 2
## Objective function: 1758.86
like image 182
Achim Zeileis Avatar answered Oct 22 '22 05:10

Achim Zeileis