
Understanding data.table invalid .selfref warning

Tags: r, data.table

I am trying to figure out the data.table 'invalid .selfref' warning that I am getting with the code below.

library(data.table) 
library(dplyr)
DT <- data.table(aa=1:100, bb=rnorm(n=100), dd=gl(2,100))
DT <- DT %.% group_by(dd, aa) %.% summarize(m=mean(bb))
DT <- DT[, ee := 3]

The last line triggers the warning. There is a suggestion here to just write the last line as DT$ee <- 3, but it doesn't really explain why that works (and := doesn't), and as a beginner data.table user it also doesn't feel like the proper data.table idiom.

It IS related to the dplyr line, which obviously changes the DT data table. But even when I change that line (and those following) into DDT <- DT %.% group_by() ..., I still get the selfref warning from the DT[, ee := 3] line.

I have been checking the various sources, but the information there hasn't really sunk in, so I am still confused.

R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252   
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C                      
[5] LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] graphics  grDevices utils     datasets  stats     methods   base     

other attached packages:
[1] dplyr_0.2        data.table_1.9.2 ggplot2_1.0.0   

loaded via a namespace (and not attached):
 [1] assertthat_0.1   colorspace_1.2-4 digest_0.6.4     grid_3.1.0      
 [5] gtable_0.1.2     MASS_7.3-31      munsell_0.4.2    parallel_3.1.0  
 [9] plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2      reshape2_1.4    
[13] scales_0.2.4     stringr_0.6.2    tools_3.1.0     
Paul Lemmens asked Jun 28 '14 21:06


1 Answer

I just ran your code, and I see the problem. data.table over-allocates the vector of column pointers (so that columns can be added efficiently by reference later on), and this warning occurs when an operation (most likely inadvertently) removes that over-allocation.

Let me try to explain over-allocation using slide 45 from Matt's useR 2014 presentation. The (blue and yellow) boxes on the top correspond to the vector of column pointers and the arrow shows the data each pointer is pointing to.

The figure on the left shows how adding (or cbinding) a column to a data.frame works. cbinding a column results in a (deep or shallow) copy, which means a new location for the vector of column pointers (shown in yellow) and the data (which now has one more column).

The figure on the right shows the data.table way, where there are more than 3 blue boxes to begin with, because of over-allocation when the data.table is created. With :=, not even a shallow copy is made. The vector of column pointers stays where it is, and the next unused over-allocated box is used for your new column.

That is the difference, and it is what over-allocation means here.
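The over-allocation described above can be inspected directly. The following is a minimal sketch, assuming a recent data.table version where truelength() and address() are exported; the small example table is made up for illustration and is not the one from the question.

```r
library(data.table)

# Small illustrative table (hypothetical data, not the question's DT)
DT <- data.table(aa = 1:3, bb = c(0.1, 0.2, 0.3), dd = c("x", "y", "z"))

n_before <- length(DT)   # column slots in use (the boxes pointing at data)
tl <- truelength(DT)     # column slots allocated, including the spare ones
tl > n_before            # spare slots exist because of over-allocation

before <- address(DT)    # memory address of the column-pointer vector
DT[, ee := 3]            # fills the next spare slot by reference
after <- address(DT)

identical(before, after) # the pointer vector did not move, so no copy was made
```

If the address were to change after :=, that would indicate the column-pointer vector had to be reallocated first.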

Now, the warning tells you that whatever operation you did has removed this over-allocation, meaning the extra blue boxes are gone! So we can't add columns by reference anymore until we over-allocate again (which is unnecessary and should be avoided, but since the over-allocation is already gone, we do the next best thing).

My guess is that your dplyr syntax somehow removes this over-allocation. This is caught in the next step, when you use :=, and data.table over-allocates once again before adding the new column by reference (which results in a shallow copy).
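One way to restore the over-allocation explicitly, rather than relying on := to repair it with a warning, is sketched below. It assumes setDT() is available in your data.table version, and it uses a plain data.frame with made-up values as a stand-in for a result that has no spare column slots. Whether the dplyr 0.2 output behaves exactly like this stand-in is not verified here.

```r
library(data.table)

# Stand-in for a result without over-allocation:
# a plain data.frame with hypothetical values.
res <- data.frame(dd = c("1", "2"), m = c(0.5, -0.2))

setDT(res)                      # converts to data.table by reference and over-allocates
truelength(res) > length(res)   # spare column slots are available again

res[, ee := 3]                  # adds the column by reference into a spare slot
```

Because setDT() modifies its argument in place, res does not need to be reassigned.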

If I do it the data.table way:

DT <- DT[, list(m=mean(bb)), by=list(dd,aa)]
DT[, ee := 3]

it works just fine.

I don't have the time to look into dplyr right now to verify or find out what's doing this.

Update: Have suggested necessary changes as a pull request here.

Arun answered Nov 08 '22 23:11