Let us consider the three data frames below:
toto.small <- data.frame(col1=rep(1,850), col2=rep(2,850))
toto.medium <- data.frame(col1=rep(1,85000), col2=rep(2,85000))
toto.big <- data.frame(col1=rep(1,850000), col2=rep(2,850000))
And the timings below:
system.time(for(i in 1:100) { toto.small[i,2] <- 3 })
user system elapsed
0.004 0.000 0.006
system.time(for(i in 1:100) { toto.medium[i,2] <- 3 })
user system elapsed
0.088 0.000 0.087
system.time(for(i in 1:100) { toto.big[i,2] <- 3 })
user system elapsed
2.248 0.000 2.254
Iterating over the big data frame is two orders of magnitude slower than iterating over the small one. These loops merely write 100 pre-allocated elements in memory; the time should not even depend on the total length of the data frame.
Does anyone know the reason for this?
I still get similar time differences with data.table, as well as with the apply functions.
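For illustration, the apply-based variant I mention could look roughly like the sketch below; this is a sketch only, not the exact code I benchmarked, and the superassignment is just one way to write the same element-wise update:
# Sketch: element-wise assignment via sapply; <<- reassigns the global
# toto.big, so each call still rewrites the whole data frame.
system.time(sapply(1:100, function(i) toto.big[i, 2] <<- 3))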
EDIT 1: R 3.0.2 vs. R 3.1
For those curious, here are the timings for data.table and data.frame with R 3.1 and R 3.0.2 (each measured three times):
R 3.0.2
type        size    time1  time2  time3
data frame  small   0.005  0.005  0.005
data frame  medium  0.074  0.077  0.075
data frame  big     3.184  3.373  3.101
data table  small   0.048  0.048  0.047
data table  medium  0.073  0.068  0.066
data table  big     0.615  0.621  0.593
R 3.1
type        size    time1  time2  time3
data frame  small   0.004  0.004  0.004
data frame  medium  0.021  0.020  0.022
data frame  big     0.221  0.207  0.243
data table  small   0.055  0.055  0.055
data table  medium  0.076  0.076  0.076
data table  big     0.705  0.699  0.663
R 3.1 is faster, but there is still a slow-down; the same holds for data.table.
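The data.table benchmark code is not shown above; a minimal sketch of what it might look like is below, using element-wise := assignment in a loop (the object name toto.big.dt is illustrative, not the exact code behind the numbers):
library(data.table)
toto.big.dt <- data.table(col1=rep(1,850000), col2=rep(2,850000))
# := updates the column by reference, but each [.data.table call still has
# per-call overhead, so a loop of 100 single-row updates is not free.
system.time(for(i in 1:100) toto.big.dt[i, col2 := 3])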
EDIT 2: using the function set
The same measurements on R 3.1.0, using the function set (from the data.table package) instead of the [<- assignment operator:
type        size    time1         time2         time3
data frame  small   0.0249999999  0.0020000000  0.0009999999
data frame  medium  0.0010000000  0.0009999999  0.0010000000
data frame  big     0.0010000000  0.0000000000  0.0009999999
data table  small   0.0009999999  0.0209999999  0.0000000000
data table  medium  0.0009999999  0.0009999999  0.0010000000
data table  big     0.0000000000  0.0029999999  0.0009999999
This completely solves the performance problem.
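A minimal sketch of the set-based loop (set is exported by the data.table package and also accepts plain data frames; this is a sketch, not necessarily the exact code behind the numbers above):
library(data.table)
toto.big <- data.frame(col1=rep(1,850000), col2=rep(2,850000))
# set(x, i, j, value) writes value into column j at rows i by reference,
# so no copy of the data frame is made on each iteration.
system.time(for(i in 1:100) set(toto.big, i, 2L, 3))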
Your code is slow because the function [<-.data.frame
makes a copy of the underlying data frame each time you modify it.
If you trace the memory usage it becomes clear:
tracemem(toto.big)
system.time({
  for(i in 1:100) { toto.big[i, 2] <- 3 }
})
tracemem[0x000000001d416b58 -> 0x000000001e08e9f8]: system.time
tracemem[0x000000001e08e9f8 -> 0x000000001e08eb10]: [<-.data.frame [<- system.time
tracemem[0x000000001e08eb10 -> 0x000000001e08ebb8]: [<-.data.frame [<- system.time
tracemem[0x000000001e08ebb8 -> 0x000000001e08e7c8]: system.time
tracemem[0x000000001e08e7c8 -> 0x000000001e08e758]: [<-.data.frame [<- system.time
tracemem[0x000000001e08e758 -> 0x000000001e08e800]: [<-.data.frame [<- system.time
....
tracemem[0x000000001e08e790 -> 0x000000001e08e838]: system.time
tracemem[0x000000001e08e838 -> 0x000000001e08eaa0]: [<-.data.frame [<- system.time
tracemem[0x000000001e08eaa0 -> 0x000000001e08e790]: [<-.data.frame [<- system.time
user system elapsed
4.31 1.01 5.29
The best way to resolve this is to modify the data frame only once:
untracemem(toto.big)
system.time({
  toto.big[1:100, 2] <- 5
})
user system elapsed
0.02 0.00 0.02
In cases where it is more convenient to calculate values in a loop (or via lapply), you can fill a vector inside the loop and then write it into the data frame in a single vectorised assignment:
system.time({
  newvalues <- numeric(100)
  for(i in 1:100) newvalues[i] <- rnorm(1)
  toto.big[1:100, 2] <- newvalues
})
user system elapsed
0.02 0.00 0.02
You can view the code for [<-.data.frame by typing `[<-.data.frame` (including the backticks) into your console.