I have a piece of code whose total elapsed time is around 30 secs, of which the following part takes around 27 secs. I narrowed the offending code down to this:
d$dis300[i] <- h
So I changed it to this other piece, and it now runs really fast (as expected).
My question is why the first version is so much slower than the second. The datos DF is around 7500 rows x 18 vars.
First: (27 sec elapsed)
d$dis300 <- 0
for (i in 1:netot) {
h <- aaa[d$ent[i], d$dis[i]]
if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[i], d$dis[i]))
d$dis300[i] <- h
}
Second: (0.2 sec elapsed)
d$dis300 <- 0
for (i in 1:netot) {
h <- aaa[d$ent[i], d$dis[i]]
if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[i], d$dis[i]))
foo[i] <- h
}
d$foo <- foo
You can see both are the "same", but the offending one assigns into the DF instead of into a single vector.
Any comment is really appreciated. I come from other kinds of languages and this drove me nuts for a while. At least I have a solution, but I'd like to prevent this kind of issue in the future.
Thanks for your time,
The reason is that d$dis300[i] <- h calls $<-.data.frame.
It's a rather complex function as you can see:
`$<-.data.frame`
You don't say what foo is, but if it is an atomic vector, the [<- function is implemented in C for speed.
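To see the difference for yourself, here is a minimal timing sketch. The data frame and the value i * 2 are made-up stand-ins (not the asker's data), sized to roughly match the 7500 rows mentioned in the question; the actual timings will depend on your machine and R version.

```r
# Hypothetical stand-in for the question's data frame (about 7500 rows)
n <- 7500
d <- data.frame(ent = seq_len(n), dis = seq_len(n))

# Slow pattern: every iteration goes through `$<-.data.frame`
d$slow <- 0
t_df <- system.time(
  for (i in seq_len(n)) d$slow[i] <- i * 2
)

# Fast pattern: fill a pre-allocated plain vector, then assign it once
fast <- numeric(n)
t_vec <- system.time(
  for (i in seq_len(n)) fast[i] <- i * 2
)
d$fast <- fast

# Elapsed seconds for each approach
print(c(data_frame = t_df[["elapsed"]], vector = t_vec[["elapsed"]]))
```

Both columns end up with the same values; only the route taken to fill them differs.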
Still, I hope you declare foo as follows:
foo <- numeric(netot)
This will ensure you don't need to reallocate the vector for each assignment in the loop:
foo <- 0 # BAD!
system.time( for(i in 1:5e4) foo[i] <- 0 ) # 4.40 secs
foo <- numeric(5e4) # Pre-allocate
system.time( for(i in 1:5e4) foo[i] <- 0 ) # 0.09 secs
Using the *apply family instead, you don't need to worry about that:
d$foo <- vapply(1:netot, function(i, aaa, ent, dis) {
h <- aaa[ent[i], dis[i]]
if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", ent[i], dis[i]))
h
}, numeric(1), aaa=aaa, ent=d$ent, dis=d$dis)
...here I also extracted d$ent and d$dis outside the loop, which should improve things a bit too. I can't run it myself, though, since you didn't give reproducible data. But here's a similar example:
d <- data.frame(x=1)
system.time( vapply(1:1e6, function(i) d$x, numeric(1)) ) # 3.20 secs
system.time( vapply(1:1e6, function(i, x) x, numeric(1), x=d$x) ) # 0.56 secs
... but finally it seems it can all be reduced to (barring your error detection code):
d$foo <- aaa[cbind(d$ent, d$dis)]
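If you still want the error detection, it can probably be vectorised as well. The sketch below uses a tiny made-up aaa matrix and d data frame (assumptions for illustration, not your data) to show how a two-column index matrix picks one element per row, followed by a single check for zeros.

```r
# Hypothetical stand-ins for the question's objects
aaa <- matrix(c(10, 20, 0, 40, 50, 60), nrow = 2)   # 2 x 3 lookup table
d <- data.frame(ent = c(1L, 2L, 1L, 2L), dis = c(1L, 3L, 2L, 2L))

# Each row of the index matrix is one (row, column) pair into aaa
d$foo <- aaa[cbind(d$ent, d$dis)]

# Vectorised version of the zero check from the loop
bad <- which(d$foo == 0)
if (length(bad) > 0) {
  writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[bad], d$dis[bad]))
}
```

Here the third row looks up aaa[1, 2], which is 0 in this toy matrix, so it is the only row reported.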
Tommy's is the best answer. This was too big for a comment, so I'm adding it as an answer...
This is how you can see the copies (of the whole of DF, as joran commented) yourself:
> DF = data.frame(a=1:3,b=4:6)
> tracemem(DF)
[1] "<0x0000000003104800>"
> for (i in 1:3) {DF$b[i] <- i; .Internal(inspect(DF))}
tracemem[0000000003104800 -> 000000000396EAD8]:
tracemem[000000000396EAD8 -> 000000000396E4F0]: $<-.data.frame $<-
tracemem[000000000396E4F0 -> 000000000399CDC8]: $<-.data.frame $<-
@000000000399CDC8 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
@000000000399CD90 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
@000000000399CCE8 13 INTSXP g0c2 [] (len=3, tl=0) 1,5,6
ATTRIB: # .. snip ..
tracemem[000000000399CDC8 -> 000000000399CC40]:
tracemem[000000000399CC40 -> 000000000399CAB8]: $<-.data.frame $<-
tracemem[000000000399CAB8 -> 000000000399C9A0]: $<-.data.frame $<-
@000000000399C9A0 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
@000000000399C968 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
@000000000399C888 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,6
ATTRIB: # .. snip ..
tracemem[000000000399C9A0 -> 000000000399C7E0]:
tracemem[000000000399C7E0 -> 000000000399C700]: $<-.data.frame $<-
tracemem[000000000399C700 -> 00000000039C78D8]: $<-.data.frame $<-
@00000000039C78D8 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
@00000000039C78A0 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
@0000000003E07890 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
ATTRIB: # .. snip ..
> DF
  a b
1 1 1
2 2 2
3 3 3
Each of those tracemem[] lines corresponds to a copy of the whole object. You can see the hex addresses of the a column vector changing each time, too, despite it not being modified by the assignment to b.
AFAIK, without dropping into C code yourself, the only way (currently) in R to modify an item of a data.frame with no copy of any memory at all is the := operator and the set() function, both in package data.table. There are 17 questions about assigning by reference using := here on Stack Overflow.
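As a rough sketch of what that looks like, assuming the data.table package is installed, the same toy table from the tracemem example can be updated by reference like this. The column names and values are only for illustration.

```r
library(data.table)

# Same toy data as above, but as a data.table
DT <- data.table(a = 1:3, b = 4:6)

# set() assigns one element of column b per iteration, by reference
for (i in seq_len(nrow(DT))) {
  set(DT, i = i, j = "b", value = i)
}

# := does the same kind of by-reference update inside [ ]
DT[2L, a := 10L]

print(DT)
```

Note that the value assigned has the same type as the column (integer here), which keeps the update in place.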
But in this case Tommy's one-liner is definitely best, as you don't even need a loop at all.