Here's something that surprised me. Let's compare two ways of getting the classes of the variables in a data frame with many columns: an sapply solution and a for-loop solution.
bigDF <- as.data.frame( matrix( 0, nrow=1E5, ncol=1E3 ) )
library( microbenchmark )
for_soln <- function(x) {
  out <- character( ncol(x) )
  for( i in 1:ncol(x) ) {
    out[i] <- class( x[,i] )
  }
  return( out )
}
microbenchmark( times=20,
sapply( bigDF, class ),
for_soln( bigDF )
)
gives me, on my machine,
Unit: milliseconds
                  expr       min        lq    median       uq      max
1      for_soln(bigDF)  21.26563  21.58688  26.03969 163.6544 300.6819
2 sapply(bigDF, class) 385.90406 405.04047 444.69212 471.8829 889.6217
Interestingly, if we transform bigDF into a list, sapply is once again nice and speedy.
bigList <- as.list( bigDF )
for_soln2 <- function(x) {
  out <- character( length(x) )
  for( i in 1:length(x) ) {
    out[i] <- class( x[[i]] )
  }
  return( out )
}
microbenchmark( sapply( bigList, class ), for_soln2( bigList ) )
gives me
Unit: milliseconds
                    expr      min       lq   median       uq      max
1     for_soln2(bigList) 1.887353 1.959856 2.010270 2.058968 4.497837
2 sapply(bigList, class) 1.348461 1.386648 1.401706 1.428025 3.825547
Why are these operations, especially sapply, taking so much longer with a data.frame as compared to a list? And is there a more idiomatic solution?
edit: The originally proposed solution t3 <- sapply(1:ncol(bigDF), function(idx) class(bigDF[,idx])) has been changed to t3 <- sapply(1:ncol(bigDF), function(idx) class(bigDF[[idx]])). It's even faster. Thanks to @Wojciech's comment.
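A note on why [[ beats [ , ] here (my reading of the mechanics, not guaranteed to be the whole story): bigDF[[idx]] extracts the column vector directly, while bigDF[, idx] dispatches to the [.data.frame method, which does considerably more argument handling. A quick sketch you can run, assuming the same bigDF as above:

microbenchmark( bigDF[[1]], bigDF[, 1] )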
The reason I can think of is that you are converting the data.frame to a list unnecessarily. In addition, the two results are not identical: sapply returns a named character vector, while the for-loop version does not.
bigDF <- as.data.frame(matrix(0, nrow=1E5, ncol=1E3))
t1 <- sapply(bigDF, class)
t2 <- for_soln(bigDF)
> head(t1)
V1 V2 V3 V4 V5 V6
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
> head(t2)
[1] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
> identical(t1, t2)
[1] FALSE
Running Rprof on the sapply call shows that all of the time is spent in as.list.data.frame:
Rprof()
t1 <- sapply(bigDF, class)
Rprof(NULL)
summaryRprof()
$by.self
self.time self.pct total.time total.pct
"as.list.data.frame" 1.16 100 1.16 100
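To confirm that the conversion itself is the bottleneck, you can time the implicit as.list step on its own (a quick check, assuming the same bigDF as above):

microbenchmark( times=20, as.list( bigDF ) )

This should account for essentially all of the sapply(bigDF, class) timing.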
You can speed up the operation by avoiding as.list.data.frame: instead, query the class of each column of the data.frame directly, as shown below. This is exactly equivalent to what the for-loop accomplishes.
t3 <- sapply(1:ncol(bigDF), function(idx) class(bigDF[[idx]]))
> identical(t2, t3)
[1] TRUE
microbenchmark( times=20,
  sapply( bigDF, class ),
  for_soln( bigDF ),
  sapply( 1:ncol(bigDF), function(idx) class( bigDF[[idx]] ) )
)
Unit: milliseconds
              expr        min         lq     median         uq       max
1    for-soln (t2)   38.31545   39.45940   40.48152   43.05400  313.9484
2  sapply-new (t3)   18.51510   18.82293   19.87947   26.10541  261.5233
3 sapply-orig (t1)  952.94612 1075.38915 1159.49464 1204.52747 1484.1522
The difference is that in t3, sapply iterates over 1:ncol(bigDF), a list of 1000 elements, each of length 1. In t1, by contrast, as.list.data.frame produces a list of 1000 columns, each of length 1e5.
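To see this concretely (a sketch using the bigDF defined above):

## What sapply(bigDF, class) iterates over after the implicit conversion:
l1 <- as.list( bigDF )
length( l1 )       # 1000 columns
length( l1[[1]] )  # each of length 1e5

## What sapply(1:ncol(bigDF), ...) iterates over:
l3 <- as.list( 1:ncol(bigDF) )
length( l3 )       # 1000 elements
length( l3[[1]] )  # each of length 1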