Consider the following script, which we will call Foo.r:
set.seed(1)
x=matrix(rnorm(1000*1000),ncol=1000)
x=data.frame(x)
dummy = sapply(1:1000,function(i) sum(x[i,]) )
#dummy = sapply(1:1000,function(i) sum(x[,i]) )
When the first dummy line is commented out, we are summing columns, and the code takes less than a second to run on my machine:
$ time Rscript Foo.r
real 0m0.766s
user 0m0.536s
sys 0m0.080s
When the second dummy line is commented out (and the first is commented in), we are summing rows, and the run time is closer to 30 seconds:
$ time Rscript Foo.r
real 0m30.589s
user 0m30.248s
sys 0m0.104s
Note that I am aware of the standard summing functions rowSums and colSums, but I am using sum only as a simple example of this strange, asymmetric performance behavior.
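For reference, this is roughly what those dedicated functions would look like on the same data frame (just a sketch for comparison, not the point of the question):

rs <- rowSums(x)   # dedicated, fast row sums
cs <- colSums(x)   # dedicated, fast column sums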
This isn't really the result of sapply; rather, it has to do with how data frames are stored and what that implies for extracting rows versus columns. A data frame is stored as a list in which each element is a column, which makes extracting a column much easier than extracting a row.
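You can see the list structure directly (a quick sketch, using the same 1000 x 1000 data frame x as above):

is.list(x)     # TRUE: a data frame is a list of columns
length(x)      # 1000, one list element per column
head(x[[1]])   # a column is a single list element, cheap to pull out
x[1, ]         # a row has to be assembled from element 1 of all 1000 columns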
To demonstrate that this has nothing to do with sapply, consider these two plain loops over the data frame x from the question:
> foo1 <- function(){
+   for (i in 1:1000){
+     tmp <- x[i, ]
+   }
+ }
> foo2 <- function(){
+   for (i in 1:1000){
+     tmp <- x[, i]
+   }
+ }
> system.time(foo2())
   user  system elapsed
  0.029   0.000   0.031
> system.time(foo1())
   user  system elapsed
 15.986   0.074  15.894
If you need to do things row-wise and fast, a data frame is often a bad choice. To operate on a row, R has to extract the corresponding element from every one of the columns (list items); to operate on a column, it only has to pull out that single list element.
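If all the columns are of one type, converting to a matrix removes the asymmetry, since a matrix is a single vector indexed in both directions. A quick sketch with the same data (timings will vary by machine):

xm <- as.matrix(x)     # all columns are numeric, so this is a clean conversion
foo1m <- function(){
  for (i in 1:1000){
    tmp <- xm[i, ]     # extracting a matrix row is cheap
  }
}
system.time(foo1m())   # comparable to the column-wise loop, not ~16 seconds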