Consider the following script, which we will call Foo.r:
set.seed(1)
x=matrix(rnorm(1000*1000),ncol=1000)
x=data.frame(x)
dummy = sapply(1:1000,function(i) sum(x[i,]) )
#dummy = sapply(1:1000,function(i) sum(x[,i]) )
When the first dummy line is commented out, we are summing columns, and the code takes less than a second to run on my machine:
$ time Rscript Foo.r
real 0m0.766s
user 0m0.536s
sys 0m0.080s
When the second dummy line is commented out (and the first is commented in), we are summing rows, and the run time is closer to 30 seconds:
$ time Rscript Foo.r
real 0m30.589s
user 0m30.248s
sys 0m0.104s
Note that I am aware of the standard summing functions rowSums and colSums, but I am using sum only as a simple example of this strange, asymmetric performance behavior.
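For reference, this is roughly what those dedicated functions would look like on the same data frame (just a sketch for comparison, not the point of the question):

rs <- rowSums(x)   # dedicated, fast row sums
cs <- colSums(x)   # dedicated, fast column sums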
This isn't really the result of sapply; rather, it has to do with how data frames are stored and what that implies for extracting rows versus columns. A data frame is stored as a list in which each element is a column, which makes extracting a column much easier than extracting a row.
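You can see the list structure directly (a quick sketch, using the same 1000 x 1000 data frame x as above):

is.list(x)     # TRUE: a data frame is a list of columns
length(x)      # 1000, one list element per column
head(x[[1]])   # a column is a single list element, cheap to pull out
x[1, ]         # a row has to be assembled from element 1 of all 1000 columns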
To demonstrate that this has nothing to do with sapply, consider these two plain loops over the data frame x from the question:
> foo1 <- function(){
+   for (i in 1:1000){
+     tmp <- x[i, ]
+   }
+ }
> foo2 <- function(){
+   for (i in 1:1000){
+     tmp <- x[, i]
+   }
+ }
> system.time(foo2())
   user  system elapsed
  0.029   0.000   0.031
> system.time(foo1())
   user  system elapsed
 15.986   0.074  15.894
If you need to do things row-wise and fast, a data frame is often a bad choice. To operate on a row, R has to extract the corresponding element from every one of the columns (list items); to operate on a column, it only has to pull out that single list element.
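If all the columns are of one type, converting to a matrix removes the asymmetry, since a matrix is a single vector indexed in both directions. A quick sketch with the same data (timings will vary by machine):

xm <- as.matrix(x)     # all columns are numeric, so this is a clean conversion
foo1m <- function(){
  for (i in 1:1000){
    tmp <- xm[i, ]     # extracting a matrix row is cheap
  }
}
system.time(foo1m())   # comparable to the column-wise loop, not ~16 seconds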