I am currently using the rpart package to fit a regression tree to data with relatively few observations and several thousand categorical predictors, each taking two possible values.
From testing the package on smaller data I know that in this instance it doesn't matter whether I declare the regressors as categorical (i.e. factors) or leave them as they are (they are coded as +/-1).
However, I would still like to understand why passing my explanatory variables as factors significantly slows the algorithm down (not least because I shall soon get new data where the response takes 3 different values, so treating it as continuous would no longer be an option). Surely it should be the other way round?
Here is some sample code emulating my data:
library(rpart)
## 50 observations, 3000 binary (+/-1) predictors
x <- as.data.frame(matrix(sample(c(-1, +1), 50 * 3000, replace = TRUE), nrow = 50))
y <- rnorm(50)
## the same predictors, declared as factors
x.fac <- as.data.frame(lapply(x, factor))
Now compare:
system.time(rpart( y ~ ., data = x, method = 'anova'))
user system elapsed
1.62 0.21 1.85
system.time(rpart( y ~ ., data = x.fac, method = 'anova'))
user system elapsed
246.87 165.91 412.92
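As an aside, here is a quick sanity check (my own sketch, using the objects defined above; fit.num and fit.fac are names I've introduced) that the two encodings really do produce the same splits:
fit.num <- rpart(y ~ ., data = x, method = 'anova')
fit.fac <- rpart(y ~ ., data = x.fac, method = 'anova')
## the sequence of split variables in the fitted trees should match
identical(as.character(fit.num$frame$var), as.character(fit.fac$frame$var))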
Dealing with only one possible split per variable (two-level factors) should be simpler and faster than scanning a whole range of potential split points (for continuous variables), so I am most confused by the rpart behaviour. Any clarifications/suggestions would be much appreciated.
You need to profile the code to be sure, but I would be surprised if the timing difference did not come from R having to turn each factor variable into binary dummy variables as it prepares the model matrix.
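As a quick illustration of how a factor gets expanded in a model matrix (my own sketch, not part of the profiling below):
## a two-level factor becomes a single 0/1 dummy column under
## the default treatment contrasts
model.matrix(~ f, data.frame(f = factor(c(-1, 1, 1, -1))))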
Try
Rprof("rpartProfile.Rprof")
rpart( y ~ ., data = x.fac, method = 'anova')
Rprof(NULL)  ## stop profiling; Rprof() with no argument would restart it
summaryRprof("rpartProfile.Rprof")
and look to see where the time is being spent. Which I have now done:
> summaryRprof("rpartProfile.Rprof")
$by.self
self.time self.pct total.time total.pct
"[[<-.data.frame" 786.46 72.45 786.56 72.46
"rpart.matrix" 294.26 27.11 1081.78 99.66
"model.frame.default" 1.04 0.10 3.00 0.28
"terms.formula" 0.96 0.09 0.96 0.09
"as.list.data.frame" 0.46 0.04 0.46 0.04
"makepredictcall.default" 0.46 0.04 0.46 0.04
"rpart" 0.44 0.04 1085.38 99.99
"[[.data.frame" 0.16 0.01 0.42 0.04
"<Anonymous>" 0.16 0.01 0.18 0.02
"match" 0.14 0.01 0.22 0.02
"print" 0.12 0.01 0.12 0.01
"model.matrix.default" 0.10 0.01 0.44 0.04
....
$by.total
total.time total.pct self.time self.pct
"rpart" 1085.38 99.99 0.44 0.04
"rpart.matrix" 1081.78 99.66 294.26 27.11
"[[<-" 786.62 72.47 0.06 0.01
"[[<-.data.frame" 786.56 72.46 786.46 72.45
"model.frame.default" 3.00 0.28 1.04 0.10
"eval" 3.00 0.28 0.04 0.00
"eval.parent" 3.00 0.28 0.00 0.00
"model.frame" 3.00 0.28 0.00 0.00
"terms.formula" 0.96 0.09 0.96 0.09
"terms" 0.96 0.09 0.00 0.00
"makepredictcall" 0.50 0.05 0.04 0.00
"as.list.data.frame" 0.46 0.04 0.46 0.04
"makepredictcall.default" 0.46 0.04 0.46 0.04
"as.list" 0.46 0.04 0.00 0.00
"vapply" 0.46 0.04 0.00 0.00
"model.matrix.default" 0.44 0.04 0.10 0.01
"[[" 0.44 0.04 0.02 0.00
"model.matrix" 0.44 0.04 0.00 0.00
....
$sample.interval
[1] 0.02
$sampling.time
[1] 1085.5
Note from the above that a big chunk of the time is spent in the function rpart.matrix:
> rpart:::rpart.matrix
function (frame)
{
    if (!inherits(frame, "data.frame") || is.null(attr(frame, "terms")))
        return(as.matrix(frame))
    for (i in 1:ncol(frame)) {
        if (is.character(frame[[i]]))
            frame[[i]] <- as.numeric(factor(frame[[i]]))
        else if (!is.numeric(frame[[i]]))
            frame[[i]] <- as.numeric(frame[[i]])
    }
    X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
    colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
    class(X) <- c("rpart.matrix", class(X))
    X
}
But it is the for loop in that function where most of the time is spent, essentially converting each column and assigning it back into the data frame one column at a time.
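To see why that loop is so costly, here is a small toy demo of my own (not from the answer): replacing columns one at a time with [[<- triggers repeated copying of the data frame, which hurts badly with 3000 columns, at least on the R versions current when this was written.
## convert 3000 factor columns to numeric, two ways
d1 <- x.fac; d2 <- x.fac
## column-by-column replacement: each assignment goes through
## [[<-.data.frame and copies the frame
system.time(for (i in 1:ncol(d1)) d1[[i]] <- as.numeric(d1[[i]]))
## one-shot replacement of all columns via lapply
system.time(d2[] <- lapply(d2, as.numeric))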
Just building on @Gavin Simpson's discovery above... I decided to hack around with rpart.matrix, to see if I could do something about that excessive execution time.
The problem boils down to the use of a for loop. Normally I'm agnostic about for compared to [sl]apply; the latter is generally considered more elegant, but I'm not going to replace a for when it's working fine, just for that. In particular, I think the performance benefits of *apply are sometimes overhyped; for has been improved significantly in terms of speed and memory use compared to the old S-Plus days.
Not in this case, though. Simply replacing the for with lapply cuts the run time for this example by more than 2 orders of magnitude. Would be good to see if others can confirm this.
m <- model.frame(x.fac)   # build a model frame (carries the terms attribute rpart.matrix expects)
# call the original rpart.matrix
system.time(mm <- rpart:::rpart.matrix(m))
user system elapsed
208.25 88.03 296.99
# exactly the same as rpart.matrix, but with the for loop replaced by lapply
f <- function(frame)
{
    if (!inherits(frame, "data.frame") || is.null(attr(frame, "terms")))
        return(as.matrix(frame))
    frame[] <- lapply(frame, function(x) {
        if (is.character(x))
            as.numeric(factor(x))
        else if (!is.numeric(x))
            as.numeric(x)
        else x
    })
    X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
    colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
    class(X) <- c("rpart.matrix", class(X))
    X
}
system.time(mm2 <- f(m))
user system elapsed
0.65 0.04 0.70
identical(mm, mm2)
[1] TRUE
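If you actually wanted rpart to pick up the faster version, one fragile, at-your-own-risk option is to patch it into the package namespace; this sketch assumes the f defined above:
environment(f) <- asNamespace('rpart')                     ## give f access to rpart internals
utils::assignInNamespace('rpart.matrix', f, ns = 'rpart')  ## swap in the lapply version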