Why is sapply relatively slow when querying attributes on variables in a data.frame?

Tags: r

Something that kind of surprised me: let's compare two ways of getting the classes for variables in a big data frame with many columns: an sapply solution and a for loop solution.

bigDF <- as.data.frame( matrix( 0, nrow=1E5, ncol=1E3 ) )
library( microbenchmark )

for_soln <- function(x) {
  out <- character( ncol(x) )
  for( i in 1:ncol(x) ) {
    out[i] <- class(x[,i])
  }
  return( out )
}

microbenchmark( times=20,
  sapply( bigDF, class ),
  for_soln( bigDF )
)

gives me, on my machine,

Unit: milliseconds
                  expr       min        lq    median       uq      max
1      for_soln(bigDF)  21.26563  21.58688  26.03969 163.6544 300.6819
2 sapply(bigDF, class) 385.90406 405.04047 444.69212 471.8829 889.6217

Interestingly, if we transform bigDF into a list, sapply is once again nice and speedy.

bigList <- as.list( bigDF )
for_soln2 <- function(x) {
  out <- character( length(x) )
  for( i in 1:length(x) ) {
    out[i] <- class( x[[i]] )
  }
  return( out )
}

microbenchmark( sapply( bigList, class ), for_soln2( bigList ) )

gives me

Unit: milliseconds
                    expr      min       lq   median       uq      max
1     for_soln2(bigList) 1.887353 1.959856 2.010270 2.058968 4.497837
2 sapply(bigList, class) 1.348461 1.386648 1.401706 1.428025 3.825547

Why are these operations, especially sapply, taking so much longer with a data.frame as compared to a list? And is there a more idiomatic solution?

asked Jan 05 '13 by Kevin Ushey


1 Answer

Edit: the previously proposed solution t3 <- sapply(1:ncol(bigDF), function(idx) class(bigDF[,idx])) has been changed to t3 <- sapply(1:ncol(bigDF), function(idx) class(bigDF[[idx]])). It is even faster. Thanks to @Wojciech's comment.
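
(A side note, not from the original answer: extracting a single column with [[ skips most of the argument handling done by the [.data.frame method, which you can check roughly as follows, assuming bigDF from the question is still in scope.)

library( microbenchmark )
# compare the two single-column extractions used in the old and new t3;
# bigDF[[1]] bypasses the extra work done by the [.data.frame method
microbenchmark( times=100,
  class( bigDF[,1] ),
  class( bigDF[[1]] )
)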

The reason I can think of is that sapply() converts the data.frame to a list unnecessarily. In addition, your two results are not identical:

bigDF <- as.data.frame(matrix(0, nrow=1E5, ncol=1E3))
t1 <- sapply(bigDF, class)
t2 <- for_soln(bigDF)

> head(t1)
    V1        V2        V3        V4        V5        V6 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
> head(t2)
[1] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"

> identical(t1, t2)
[1] FALSE

Running Rprof on the sapply call shows that essentially all of the time is spent in as.list.data.frame:

Rprof()
t1 <- sapply(bigDF, class)
Rprof(NULL)
summaryRprof()

$by.self
                     self.time self.pct total.time total.pct
"as.list.data.frame"      1.16      100       1.16       100    

You can speed up the operation by avoiding as.list.data.frame and instead querying the class of each column of the data.frame directly, as shown below. This is exactly equivalent to what the for loop accomplishes.

t3 <- sapply(1:ncol(bigDF), function(idx) class(bigDF[[idx]]))
> identical(t2, t3)
[1] TRUE

microbenchmark(times=20, 
    sapply(bigDF, class),
    for_soln(bigDF),
    sapply(1:ncol(bigDF), function(idx) 
        class(bigDF[[idx]]))
)

Unit: milliseconds
        expr             min        lq       median       uq       max
1   for-soln (t2)     38.31545   39.45940   40.48152   43.05400  313.9484
2   sapply-new (t3)   18.51510   18.82293   19.87947   26.10541  261.5233
3   sapply-orig (t1) 952.94612 1075.38915 1159.49464 1204.52747 1484.1522

The difference is that in t3 sapply() iterates over a list of 1000 elements, each of length 1 (the column indices), whereas in t1 it first builds a list of 1000 elements, each of length 1e5 (the full columns).
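
(An illustrative sketch, not from the original answer, assuming bigDF is still defined: you can inspect the two list shapes directly.)

# what sapply(bigDF, class) iterates over: 1000 elements, each a full column
length( as.list(bigDF) )               # 1000
length( as.list(bigDF)[[1]] )          # 1e5

# what sapply(1:ncol(bigDF), ...) iterates over: 1000 single integer indices
length( as.list(1:ncol(bigDF)) )       # 1000
length( as.list(1:ncol(bigDF))[[1]] )  # 1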

answered Oct 10 '22 by Arun