
Writing to data frame with many lines is very slow

Tags:

r

Let us consider the three data frames below:

toto.small  <- data.frame(col1=rep(1,850), col2=rep(2,850))
toto.medium <- data.frame(col1=rep(1,85000), col2=rep(2,85000))
toto.big    <- data.frame(col1=rep(1,850000), col2=rep(2,850000))

And the timings below:

system.time(for(i in 1:100) { toto.small[i,2] <- 3 })
user  system elapsed 
0.004   0.000   0.006 

system.time(for(i in 1:100) { toto.medium[i,2] <- 3 })
user  system elapsed 
0.088   0.000   0.087 

system.time(for(i in 1:100) { toto.big[i,2] <- 3 })
user  system elapsed 
2.248   0.000   2.254 

Iterating over the big data frame is more than two orders of magnitude slower than iterating over the small one. Those loops merely write 100 pre-allocated elements in memory; the time should not depend on the total length of the data frame at all.

Does anyone know the reason for this?

I still get similar time differences with data tables, as well as with apply functions.

EDIT 1: R 3.0.2 vs. R 3.1

For those curious, here are the timings for data.table and data.frame with R 3.1 and R 3.0.2 (each measured three times):

R 3.0.2

      type   size time1 time2 time3
data frame  small 0.005 0.005 0.005
data frame medium 0.074 0.077 0.075
data frame    big 3.184 3.373 3.101
data table  small 0.048 0.048 0.047
data table medium 0.073 0.068 0.066
data table    big 0.615 0.621 0.593

R 3.1

      type   size time1 time2 time3
data frame  small 0.004 0.004 0.004
data frame medium 0.021 0.020 0.022
data frame    big 0.221 0.207 0.243
data table  small 0.055 0.055 0.055
data table medium 0.076 0.076 0.076
data table    big 0.705 0.699 0.663

R 3.1 is faster for data frames, but the time still grows with the number of rows; the same holds for data.table.

EDIT 2: using function set

The same measurements on R 3.1.0, using data.table's function "set" instead of the "[<-" operator:

      type   size time1 time2 time3
data frame  small 0.025 0.002 0.001
data frame medium 0.001 0.001 0.001
data frame    big 0.001 0.000 0.001
data table  small 0.001 0.021 0.000
data table medium 0.001 0.001 0.001
data table    big 0.000 0.003 0.001

This completely solves the performance problem.
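The exact call is not shown above, so here is a minimal sketch of what the "set" variant of the loop could look like. It assumes the data.table package is installed and uses its documented set(x, i, j, value) arguments; the guard around the loop is only there so the snippet still runs without the package.

```r
# Sketch only: needs the data.table package; the loop is skipped otherwise.
toto.big <- data.frame(col1 = rep(1, 850000), col2 = rep(2, 850000))

if (requireNamespace("data.table", quietly = TRUE)) {
  # set() updates the column in place: i is the row index, j the column index.
  # The value is a double (3, not 3L) so that it matches the column type.
  for (i in 1:100) {
    data.table::set(toto.big, i = i, j = 2L, value = 3)
  }
}
```

Because set() modifies its argument by reference, nothing is assigned back to toto.big inside the loop.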

Antoine Trouve asked Jul 04 '14 08:07

1 Answer

Your code is slow because the function [<-.data.frame makes a copy of the underlying object each time you modify it.

If you trace the memory usage it becomes clear:

tracemem(toto.big)
system.time({
  for(i in 1:100) { toto.big[i,2] <- 3 }
})


tracemem[0x000000001d416b58 -> 0x000000001e08e9f8]: system.time 
tracemem[0x000000001e08e9f8 -> 0x000000001e08eb10]: [<-.data.frame [<- system.time 
tracemem[0x000000001e08eb10 -> 0x000000001e08ebb8]: [<-.data.frame [<- system.time 
tracemem[0x000000001e08ebb8 -> 0x000000001e08e7c8]: system.time 
tracemem[0x000000001e08e7c8 -> 0x000000001e08e758]: [<-.data.frame [<- system.time 
tracemem[0x000000001e08e758 -> 0x000000001e08e800]: [<-.data.frame [<- system.time 
....
tracemem[0x000000001e08e790 -> 0x000000001e08e838]: system.time 
tracemem[0x000000001e08e838 -> 0x000000001e08eaa0]: [<-.data.frame [<- system.time 
tracemem[0x000000001e08eaa0 -> 0x000000001e08e790]: [<-.data.frame [<- system.time 
   user  system elapsed 
   4.31    1.01    5.29 

To resolve this, your best action is to modify the data frame only once:

untracemem(toto.big)

system.time({
  toto.big[1:100, 2] <- 5
})

   user  system elapsed 
   0.02    0.00    0.02
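As a quick sanity check, the sketch below compares the element-by-element loop with the single vectorised assignment on a smaller data frame (the name toto and its 1000 rows are arbitrary choices for illustration) and confirms that both leave the same values in the column.

```r
# Smaller frame than toto.big so the check runs quickly.
toto <- data.frame(col1 = rep(1, 1000), col2 = rep(2, 1000))

looped <- toto
for (i in 1:100) looped[i, 2] <- 5   # one call to `[<-.data.frame` per element

vectorised <- toto
vectorised[1:100, 2] <- 5            # a single call covering all 100 rows

# TRUE when both approaches produce the same column.
same <- identical(looped$col2, vectorised$col2)
print(same)
```

Only the number of calls to [<-.data.frame differs between the two versions, which is where the time goes.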

In cases where it is more convenient to calculate the values in a loop (or with lapply), you can perform the calculation on a plain vector in the loop and then assign the result to the data frame in a single vectorised assignment:

system.time({
  newvalues <- numeric(100)
  for(i in 1:100) newvalues[i] <- rnorm(1)
  toto.big[1:100, 2] <- newvalues
})

   user  system elapsed 
   0.02    0.00    0.02 
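The same pattern works with the apply family. A minimal sketch, assuming a hypothetical per-row function compute_value that stands in for whatever the loop body would calculate:

```r
toto <- data.frame(col1 = rep(1, 1000), col2 = rep(2, 1000))

# Hypothetical calculation, used here only for illustration.
compute_value <- function(i) i^2

# vapply builds a plain numeric vector; the data frame is not touched here.
newvalues <- vapply(1:100, compute_value, numeric(1))

# One vectorised assignment into the data frame.
toto[1:100, 2] <- newvalues
```

vapply is used rather than lapply so that the result is already a numeric vector of the right length and can be assigned directly.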

You can view the code for [<-.data.frame by typing `[<-.data.frame` (backticks included) into your console.

Andrie answered Oct 13 '22 10:10