
Why is this so slow? (loop in a DF row vs. a standalone vector)

Tags: performance, r

I have a piece of code whose total elapsed time is around 30 seconds, of which the following line accounts for around 27 seconds. I narrowed the offending code down to this:

d$dis300[i] <- h

So I changed it to the second version below, and it now runs really fast (as expected).

My question is why the first version is so slow compared with the second. The datos DF is around 7500 rows x 18 variables.

First: (27 sec elapsed)

d$dis300 <- 0
for (i in 1:netot) {
  h <- aaa[d$ent[i], d$dis[i]]
  if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[i], d$dis[i]))
  d$dis300[i] <- h
}

Second: (0.2 sec elapsed)

d$dis300 <- 0
for (i in 1:netot) {
  h <- aaa[d$ent[i], d$dis[i]]
  if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[i], d$dis[i]))
  foo[i] <- h
}
d$foo <- foo

You can see both are the "same", but the offending one assigns into the DF instead of into a single vector.

Any comments are really appreciated. I come from other kinds of languages and this drove me nuts for a while. At least I have a solution, but I would like to prevent this kind of issue in the future.

Thanks for your time,

asked Apr 24 '12 23:04 by notuo


2 Answers

The reason is that d$dis300[i] <- h calls $<-.data.frame.

It's a rather complex function as you can see:

`$<-.data.frame`

You don't say what foo is, but if it is an atomic vector, the assignment foo[i] <- h uses the `[<-` function, which is implemented in C for speed.
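A minimal sketch of the difference, using made-up data (the names n, df and vec are illustrative and not from the question). The first loop dispatches to `$<-.data.frame` on every iteration; the second only assigns into a plain numeric vector and touches the data frame once at the end. Timings will vary by machine and R version.

```r
# Illustrative data only; n, df and vec are hypothetical names.
n <- 5000
df <- data.frame(x = numeric(n), y = seq_len(n))

# Slow path: every iteration goes through `$<-.data.frame`.
t_df <- system.time(
  for (i in seq_len(n)) df$x[i] <- i
)

# Fast path: fill a pre-allocated vector, then assign the column once.
vec <- numeric(n)
t_vec <- system.time(
  for (i in seq_len(n)) vec[i] <- i
)
df$x2 <- vec

# Both approaches produce the same values.
print(identical(df$x, df$x2))

# Elapsed seconds for each approach.
print(c(data_frame = t_df[["elapsed"]], vector = t_vec[["elapsed"]]))
```

The gap between the two timings should widen as the data frame gets more rows and columns.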

Still, I hope you declare foo as follows:

foo <- numeric(netot)

This will ensure you don't need to reallocate the vector for each assignment in the loop:

foo <- 0 # BAD!
system.time( for(i in 1:5e4) foo[i] <- 0 ) # 4.40 secs
foo <- numeric(5e4) # Pre-allocate
system.time( for(i in 1:5e4) foo[i] <- 0 ) # 0.09 secs

If you use the *apply family instead, you don't need to worry about that:

d$foo <- vapply(1:netot, function(i, aaa, ent, dis) {
  h <- aaa[ent[i], dis[i]]
  if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", ent[i], dis[i]))
  h
}, numeric(1), aaa=aaa, ent=d$ent, dis=d$dis)

...here I also extracted d$ent and d$dis outside the loop, which should improve things a bit too. I can't run it myself, though, since you didn't give reproducible data. But here's a similar example:

d <- data.frame(x=1)
system.time( vapply(1:1e6, function(i) d$x, numeric(1)) )         # 3.20 secs
system.time( vapply(1:1e6, function(i, x) x, numeric(1), x=d$x) ) # 0.56 secs

... but finally it seems it can all be reduced to (barring your error detection code):

d$foo <- aaa[cbind(d$ent, d$dis)]
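
A small sketch of how the matrix-indexing one-liner works, on invented data (the contents of aaa, ent and dis here are made up for illustration). Indexing a matrix with a two-column matrix picks one element per row of the index, and the error report from the loop can then be done afterwards in vectorised form:

```r
# Invented stand-ins for the question's objects.
aaa <- matrix(c(11,  0, 13,
                21, 22, 23), nrow = 2, byrow = TRUE)
d <- data.frame(ent = c(1L, 2L, 1L), dis = c(3L, 1L, 2L))

# Each row of the index matrix is a (row, column) pair into aaa.
d$foo <- aaa[cbind(d$ent, d$dis)]

# Vectorised version of the loop's error report.
bad <- which(d$foo == 0)
if (length(bad) > 0) {
  writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[bad], d$dis[bad]))
}
```

Because sprintf() is vectorised, a single call prints one line per offending row.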

answered Nov 07 '22 09:11 by Tommy


Tommy's is the best answer. This was too big for a comment, so I'm adding it as an answer...

This is how you can see the copies (of the whole of DF, as joran commented) yourself:

> DF = data.frame(a=1:3,b=4:6)
> tracemem(DF)
[1] "<0x0000000003104800>"
> for (i in 1:3) {DF$b[i] <- i; .Internal(inspect(DF))}
tracemem[0000000003104800 -> 000000000396EAD8]: 
tracemem[000000000396EAD8 -> 000000000396E4F0]: $<-.data.frame $<- 
tracemem[000000000396E4F0 -> 000000000399CDC8]: $<-.data.frame $<- 
@000000000399CDC8 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
  @000000000399CD90 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
  @000000000399CCE8 13 INTSXP g0c2 [] (len=3, tl=0) 1,5,6
ATTRIB: # .. snip ..

tracemem[000000000399CDC8 -> 000000000399CC40]: 
tracemem[000000000399CC40 -> 000000000399CAB8]: $<-.data.frame $<- 
tracemem[000000000399CAB8 -> 000000000399C9A0]: $<-.data.frame $<- 
@000000000399C9A0 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
  @000000000399C968 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
  @000000000399C888 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,6
ATTRIB: # .. snip ..

tracemem[000000000399C9A0 -> 000000000399C7E0]: 
tracemem[000000000399C7E0 -> 000000000399C700]: $<-.data.frame $<- 
tracemem[000000000399C700 -> 00000000039C78D8]: $<-.data.frame $<- 
@00000000039C78D8 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
  @00000000039C78A0 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
  @0000000003E07890 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
ATTRIB: # .. snip ..
> DF
  a b
1 1 1
2 2 2
3 3 3

Each of those tracemem[] lines corresponds to a copy of the whole object. You can see the hex addresses of the a column vector changing each time, too, despite it not being modified by the assignment to b.

AFAIK, without dropping into C code yourself, the only way (currently) in R to modify an item of a data.frame with no copy of any memory at all is to use the := operator or the set() function, both in package data.table. There are 17 questions about assigning by reference using := here on Stack Overflow.
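
A minimal sketch of both by-reference forms, assuming the data.table package is installed; the table DT and its values are invented for illustration:

```r
library(data.table)  # assumes the data.table package is installed

# Invented example table.
DT <- data.table(a = 1:3, b = 4:6)

# set() assigns by reference: row i of column "b" is overwritten in place.
for (i in 1:3) set(DT, i = i, j = "b", value = i)

# := does the same inside [ ], here for a single row of column a.
DT[2L, a := 10L]

print(DT)
```

set() is intended for use inside loops like this one, since it avoids the overhead of calling `[.data.table` on every iteration.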

But in this case Tommy's one-liner is definitely best, as you don't even need a loop at all.

answered Nov 07 '22 11:11 by Matt Dowle