I needed to assign a "second" id to group some values inside my original id
. this is my sample data:
dt<-structure(list(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
period = c("start", "end", "start", "end", "start", "end"),
date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))),
class = c("data.table", "data.frame"),
.Names = c("id", "period", "date"),
sorted = "id")
> dt
id period date
1: aaaa start 2012-03-02
2: aaaa end 2012-03-05
3: aaas start 2012-08-21
4: aaas end 2013-02-25
5: bbbb start 2012-03-31
6: bbbb end 2013-02-11
column id
needs to be grouped (using the same value in say id2
) according to this list:
> groups
[[1]]
[1] "aaaa" "aaas"
[[2]]
[1] "bbbb"
I used the following code, which seems to work by gives the following warning
:
> dt[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
Warning message:
In `[.data.table`(dt, , `:=`(id2, which(vapply(groups, function(x, :
Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table has
been copied by R (or been created manually using structure() or similar). Avoid key<-,
names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use
set* syntax instead to avoid copying: setkey(), setnames() and setattr(). Also,
list (DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.
> dt
id period date id2
1: aaaa start 2012-03-02 1
2: aaaa end 2012-03-02 1
3: aaas start 2012-08-29 1
4: aaas end 2013-02-26 1
5: bbbb start 2012-03-31 2
6: bbbb end 2013-02-11 2
could someone briefly explain the nature of this warning and any eventual implication in the final results (if any)? thanks
EDIT:
the follwing code is actually showing when dt
is created and how is passed to the function that gives the warning:
f.main <- function(){
f2 <- function(x){
groups <- list(c("aaaa", "aaas"), "bbbb") # actually generated depending on the similarity between values of x$id
x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
return(x)
}
x <- f1()
if(!is.null(x[["res"]])){
x <- f2(x[["res"]])
return(x)
} else {
# something else
}
}
f1 <- function(){
dt<-data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
period = c("start", "end", "start", "end", "start", "end"),
date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date")))
return(list(res=dt, other_results=""))
}
> f.main()
id period date id2
1: aaaa start 2012-03-02 1
2: aaaa end 2012-03-02 1
3: aaas start 2012-08-29 1
4: aaas end 2013-02-26 1
5: bbbb start 2012-03-31 2
6: bbbb end 2013-02-11 2
Warning message:
In `[.data.table`(x, , `:=`(id2, which(vapply(groups, function(x, :
Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table
has been copied by R (or been created manually using structure() or similar).
Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole
data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr().
Also, list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.
Yes, the problem is the list. Here is a simple example:
DT <- data.table(1:5)
mylist1 <- list(DT,"a")
mylist1[[1]][,id:=.I]
#warning
mylist2 <- list(data.table(1:5),"a")
mylist2[[1]][,id:=.I]
#no warning
You should avoid copying a data.table into a list (and to be on the safe side I would avoid having a DT in a list at all). Try this:
f1 <- function(){
mylist <- list(res=data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
period = c("start", "end", "start", "end", "start", "end"),
date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))))
other_results <- ""
mylist$other_results <- other_results
mylist
}
You could "shallow copy" while creating the list, so that 1) you don't do full memory copy (speed isn't affected) and 2) you don't get internal ref error (thanks to @mnel for this trick).
set.seed(45)
ss <- function() {
tt <- sample(1:10, 1e6, replace=TRUE)
}
tt <- replicate(100, ss(), simplify=FALSE)
tt <- as.data.table(tt)
system.time( {
ll <- list(d1 = { # shallow copy here...
data.table:::settruelength(tt, 0)
invisible(alloc.col(tt))
}, "a")
})
user system elapsed
0 0 0
> system.time(tt[, bla := 2])
user system elapsed
0.012 0.000 0.013
> system.time(ll[[1]][, bla :=2 ])
user system elapsed
0.008 0.000 0.010
So you don't compromise in speed and you don't get a warning followed by a full copy. Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With