I am trying to figure out the data.table 'invalid .selfref' error that I am getting with the code below.
library(data.table)
library(dplyr)
DT <- data.table(aa=1:100, bb=rnorm(n=100), dd=gl(2,100))
DT <- DT %.% group_by(dd, aa) %.% summarize(m=mean(bb))
DT <- DT[, ee := 3]
The last line throws the error. Elsewhere there is the suggestion to just write the last line as DT$ee <- 3, but it doesn't really explain why that works (and the := doesn't), and to a beginner data.table user it also doesn't feel like the proper data.table idiom.
It IS related to the dplyr line in there, which obviously changes the DT data table. But when I change that line (and those following) into DDT <- DT %.% group_by() ..., I still get the selfref error from the DT[, ee := 3] line.
I have been checking the various sources, but none of the information there really gets to the point, so I am still confused.
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
[5] LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] graphics grDevices utils datasets stats methods base
other attached packages:
[1] dplyr_0.2 data.table_1.9.2 ggplot2_1.0.0
loaded via a namespace (and not attached):
[1] assertthat_0.1 colorspace_1.2-4 digest_0.6.4 grid_3.1.0
[5] gtable_0.1.2 MASS_7.3-31 munsell_0.4.2 parallel_3.1.0
[9] plyr_1.8.1 proto_0.3-10 Rcpp_0.11.2 reshape2_1.4
[13] scales_0.2.4 stringr_0.6.2 tools_3.1.0
I just ran your code, and I see the problem. data.table over-allocates the vector of column pointers (so that columns can be added efficiently by reference later on), and this warning occurs when an operation (most likely inadvertently) removes that over-allocation.
Let me try to explain over-allocation using slide 45 from Matt's useR 2014 presentation. The (blue and yellow) boxes on the top correspond to the vector of column pointers and the arrow shows the data each pointer is pointing to.
The figure on the left depicts how adding (or cbinding) a column to a data.frame works. cbinding a column basically results in a (deep or shallow) copy, which gives a new location for the vector of column pointers (shown in yellow) and for the data (which now has one more column).
The figure on the right shows the data.table way, where there are more than 3 blue boxes to begin with, due to over-allocation during data.table creation. And by using :=, not even a shallow copy is made. The vector of column pointers stays where it is, and the next unused over-allocated box is used to assign your new column.
That is the difference, and it is what over-allocation means here.
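The two pictures can be checked from the console. The following is only a minimal sketch, assuming a current data.table where truelength() and address() are exported; the toy values are made up for illustration, and the exact number of spare slots depends on the datatable.alloccol option.

```r
library(data.table)

# A plain data.frame: adding a column produces a new object.
df <- data.frame(aa = 1:3, bb = c(0.1, 0.2, 0.3))
df_before <- address(df)
df$ee <- 3
df_after <- address(df)

# A data.table: spare column-pointer slots are allocated at creation.
DT <- data.table(aa = 1:3, bb = c(0.1, 0.2, 0.3))
used_slots <- length(DT)           # slots holding actual columns
allocated_slots <- truelength(DT)  # slots allocated, including the spare ones
dt_before <- address(DT)
DT[, ee := 3]                      # add a column by reference
dt_after <- address(DT)

print(c(used = used_slots, allocated = allocated_slots))
print(identical(df_before, df_after))  # did the data.frame stay in place?
print(identical(dt_before, dt_after))  # did the data.table stay in place?
```

If over-allocation is intact, the allocated count should exceed the used count, and only the data.table should keep its address after the new column is added.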
Now the warning tells you that whatever operation you did has removed this over-allocation, meaning the extra blue boxes are gone! So we can't add columns by reference anymore until we over-allocate again (which is unnecessary and should be avoided, but since it's already gone, we do the next best thing).
My guess is that your dplyr syntax somehow removes this over-allocation, which is caught in the next step when you use :=, and data.table over-allocates once again before adding the new column by reference (which will result in a shallow copy).
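If the over-allocation has already been lost, it can also be restored explicitly before using :=. The sketch below is an assumption-laden illustration, not a diagnosis of dplyr: it uses a serialize round trip purely as a stand-in for "some operation dropped the spare slots", and then calls setDT() to over-allocate again.

```r
library(data.table)

DT <- data.table(aa = 1:3, bb = c(0.1, 0.2, 0.3))

# Stand-in for an operation that drops the spare slots (an assumption
# for illustration; not a claim about what dplyr does internally).
DT2 <- unserialize(serialize(DT, NULL))
slots_lost <- truelength(DT2)

# setDT() on an existing data.table re-allocates the spare slots
# when the self-reference is no longer valid.
setDT(DT2)
slots_restored <- truelength(DT2)

# Adding a column by reference now has spare slots to use.
DT2[, ee := 3]

print(c(lost = slots_lost, restored = slots_restored))
```

Comparing the two printed values with length(DT2) shows whether spare slots were available at each point.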
If I do it the data.table way:
DT <- DT[, list(m=mean(bb)), by=list(dd,aa)]
DT[, ee := 3]
it works just fine.
I don't have the time to look into dplyr right now to verify or find out what's doing this.