 

rpart: Computational time for categorical vs continuous regressors

I am currently using the rpart package to fit a regression tree to data with relatively few observations and several thousand categorical predictors, each taking two possible values.

From testing the package on smaller data I know that in this instance it doesn't matter whether I declare the regressors as categorical (i.e. factors) or leave them as they are (they are coded as +/-1).

However, I would still like to understand why passing my explanatory variables as factors slows the algorithm down so significantly (not least because I shall soon get new data where the response takes 3 different values, and treating it as continuous would no longer be an option). Surely it should be the other way round?

Here is some sample code emulating my data:

library(rpart)

x <- as.data.frame(matrix(sample(c(-1, +1), 50 * 3000, replace = TRUE), nrow = 50))
y <- rnorm(50)

x.fac <- as.data.frame(lapply(x, factor))

Now compare:

system.time(rpart( y ~ ., data = x, method = 'anova'))

   user  system elapsed 
   1.62    0.21    1.85 

system.time(rpart( y ~ ., data = x.fac, method = 'anova'))

   user  system elapsed 
 246.87  165.91  412.92 

Dealing with only one potential split per variable (factors) should be simpler and faster than dealing with a whole range of potential splits (for continuous variables), so I am most confused by the rpart behaviour. Any clarifications/suggestions would be much appreciated.
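
(For reference, a hypothetical sketch of the future 3-valued case mentioned above: the response becomes a factor and the tree is fitted with method = 'class' instead.)

# hypothetical: a 3-class response fitted as a classification tree
y3 <- factor(sample(1:3, 50, replace = TRUE))
rpart(y3 ~ ., data = x.fac, method = 'class')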

Asked Jun 19 '13 by stas g




2 Answers

You need to profile the code to be sure, but I would be surprised if the timing difference didn't come from R having to turn each factor variable into indicator (dummy) variables as it prepares the model matrix.
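
(For illustration only, on a toy two-level factor: this is the kind of expansion meant.)

# model.matrix() turns a two-level factor into an intercept
# plus a single 0/1 dummy column
f <- factor(c(-1, 1, 1, -1))
model.matrix(~ f)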

Try

Rprof("rpartProfile.Rprof")
rpart( y ~ ., data = x.fac, method = 'anova')
Rprof()

summaryRprof("rpartProfile.Rprof")

and look to see where the time is being spent. I have now done this:

> summaryRprof("rpartProfile.Rprof")
$by.self
                          self.time self.pct total.time total.pct
"[[<-.data.frame"            786.46    72.45     786.56     72.46
"rpart.matrix"               294.26    27.11    1081.78     99.66
"model.frame.default"          1.04     0.10       3.00      0.28
"terms.formula"                0.96     0.09       0.96      0.09
"as.list.data.frame"           0.46     0.04       0.46      0.04
"makepredictcall.default"      0.46     0.04       0.46      0.04
"rpart"                        0.44     0.04    1085.38     99.99
"[[.data.frame"                0.16     0.01       0.42      0.04
"<Anonymous>"                  0.16     0.01       0.18      0.02
"match"                        0.14     0.01       0.22      0.02
"print"                        0.12     0.01       0.12      0.01
"model.matrix.default"         0.10     0.01       0.44      0.04
....

$by.total
                          total.time total.pct self.time self.pct
"rpart"                      1085.38     99.99      0.44     0.04
"rpart.matrix"               1081.78     99.66    294.26    27.11
"[[<-"                        786.62     72.47      0.06     0.01
"[[<-.data.frame"             786.56     72.46    786.46    72.45
"model.frame.default"           3.00      0.28      1.04     0.10
"eval"                          3.00      0.28      0.04     0.00
"eval.parent"                   3.00      0.28      0.00     0.00
"model.frame"                   3.00      0.28      0.00     0.00
"terms.formula"                 0.96      0.09      0.96     0.09
"terms"                         0.96      0.09      0.00     0.00
"makepredictcall"               0.50      0.05      0.04     0.00
"as.list.data.frame"            0.46      0.04      0.46     0.04
"makepredictcall.default"       0.46      0.04      0.46     0.04
"as.list"                       0.46      0.04      0.00     0.00
"vapply"                        0.46      0.04      0.00     0.00
"model.matrix.default"          0.44      0.04      0.10     0.01
"[["                            0.44      0.04      0.02     0.00
"model.matrix"                  0.44      0.04      0.00     0.00
....

$sample.interval
[1] 0.02

$sampling.time
[1] 1085.5

Note from the above that a big chunk of time is spent in function rpart.matrix:

> rpart:::rpart.matrix
function (frame) 
{
    if (!inherits(frame, "data.frame") || is.null(attr(frame, 
        "terms"))) 
        return(as.matrix(frame))
    for (i in 1:ncol(frame)) {
        if (is.character(frame[[i]])) 
            frame[[i]] <- as.numeric(factor(frame[[i]]))
        else if (!is.numeric(frame[[i]])) 
            frame[[i]] <- as.numeric(frame[[i]])
    }
    X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
    colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
    class(X) <- c("rpart.matrix", class(X))
    X
}

But it is the for loop in that function where most of the time is spent, essentially converting each column and assigning it back into the data frame.
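
(A minimal sketch, reusing x.fac from the question, that isolates this pattern: replacing columns one at a time dispatches [[<-.data.frame, the top entry in the profile above, once per column.)

# replicate the hot loop on a copy of the 50 x 3000 factor data frame;
# each iteration dispatches `[[<-.data.frame` (slow on the R versions
# current when this was asked)
d <- x.fac
system.time(for (i in seq_along(d)) d[[i]] <- as.numeric(d[[i]]))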

Answered by Gavin Simpson


Just building on @Gavin Simpson's discovery above, I decided to hack around with rpart.matrix to see if I could do something about that excessive execution time.

The problem boils down to the use of a for loop. Normally I'm agnostic about for compared to [sl]apply; the latter is generally considered more elegant, but I'm not going to replace a for that's working fine just for that. In particular, I think the performance benefits of *apply are sometimes overhyped; for has improved significantly in speed and memory use since the old S-PLUS days.

Not in this case, though. Simply replacing the for with lapply cuts the run time for this example by more than 2 orders of magnitude. It would be good to see if others can confirm this.

# build a model frame (with a "terms" attribute) from the factor data
m <- model.frame(x.fac)


# call rpart.matrix
system.time(mm <- rpart:::rpart.matrix(m))
   user  system elapsed 
 208.25   88.03  296.99 


# exactly the same as rpart.matrix, but with for replaced by lapply
f <- function(frame)
{
    if (!inherits(frame, "data.frame") || is.null(attr(frame, 
        "terms"))) 
        return(as.matrix(frame))
    frame[] <- lapply(frame, function(x) {
        if (is.character(x))
            as.numeric(factor(x))
        else if(!is.numeric(x))
            as.numeric(x)
        else x
    })
    X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
    colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
    class(X) <- c("rpart.matrix", class(X))
    X
}

system.time(mm2 <- f(m))
   user  system elapsed 
   0.65    0.04    0.70 


identical(mm, mm2)
[1] TRUE
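
(A follow-up sketch, not part of the original answer: to make rpart itself pick up the faster version, one could hot-patch the package namespace with assignInNamespace() from utils. Patching a namespace like this is fragile and version-dependent, so treat it as an experiment only.)

# experimental: swap the patched function into the rpart namespace
assignInNamespace("rpart.matrix", f, ns = "rpart")
system.time(rpart(y ~ ., data = x.fac, method = 'anova'))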
Answered by Hong Ooi