Confusing issue on multi-key subsetting data.table within function

Tags: r, data.table

I'm trying to write a function that uses its arguments to subset a keyed data.table and then takes a mean over that subset. I pass the function single values for two of the keys and several values for the third. Inside a function this goes wrong, but the same operation works exactly as expected outside of one.

Here's an example that captures what I'm trying to do. It returns an incorrect result, whereas my actual code produces an error:

library(data.table)

set.seed(12345)
dt<-data.table(yr=rep(2000:2005,each=20),
               id=paste0(rep(rep(1:10,each=2),6)),
               deg=paste0(rep(1:2,60)),
               var=rnorm(120),
               key=c("yr","id","deg"))

fcn <- function(yr,ids,deg){
  dt[.(yr,ids,deg),mean(var)]
}

fcn(2004,paste0(1:3),"1")

This returns an answer, but the wrong one (more on that in a second). Doing the same subset by hand works fine:

> fcn(2004,paste0(1:3),"1")
[1] 0.1262586
> dt[yr==2004&id %in% paste0(1:3)&deg=="1",mean(var)]
[1] 0.4374115
> dt[.(2004,paste0(1:3),"1"),mean(var)]
[1] 0.4374115

To see what's going on, I changed fcn to return the subset itself:

fcn <- function(yr,ids,deg){
  dt[.(yr,ids,deg),]
}

Which yields:

> fcn(2004,paste0(1:3),"1")
       yr id deg        var
  1: 2000  1   1  0.5855288
  2: 2000  2   2 -0.4534972
  3: 2000  3   1  0.6058875
  4: 2000  1   2  0.7094660
  5: 2000  2   1 -0.1093033
 ---                       
116: 2005  2   2 -1.3247553
117: 2005  3   1  0.1410843
118: 2005  1   2 -1.1562233
119: 2005  2   1  0.4224185
120: 2005  3   2 -0.5360480

Basically, fcn has done no subsetting! Why is this happening? Really frustrated.

If I pass only one id instead of three, dt subsets on the middle key (id) only. Weird:

> fcn(2004,"1","1")
       yr id deg        var
  1: 2000  1   1  0.5855288
  2: 2000  1   2  0.7094660
  3: 2000  1   1  0.5855288
  4: 2000  1   2  0.7094660
  5: 2000  1   1  0.5855288
 ---                       
116: 2005  1   2 -1.1562233
117: 2005  1   1  0.2239254
118: 2005  1   2 -1.1562233
119: 2005  1   1  0.2239254
120: 2005  1   2 -1.1562233

But if I hard-code yr and deg and pass only the ids (the middle key) to the function, it works fine:

fcn <- function(ids){
  dt[.(2004,ids,"1")]
}
> fcn(paste0(1:3))
     yr id deg        var
1: 2004  1   1  0.6453831
2: 2004  2   1 -0.3043691
3: 2004  3   1  0.9712207

Final edit: problem solved, but it would still be nice to know what exactly was going wrong:

Rename the arguments:

fcn <- function(yyr,ids,ddeg){
  dt[.(yyr,ids,ddeg),mean(var)]
}

Reusing the column names as argument names seems to have caused the issue, but I still don't fully understand what went wrong.

MichaelChirico asked Apr 28 '15 21:04


1 Answer

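The problem is that you're using column names inside your i-expression but expecting them to refer to variables outside the data.table. An expression in i such as .(yr,ids,deg) is evaluated with dt's columns in scope first, so yr and deg resolve to the columns dt$yr and dt$deg rather than to your function arguments; only ids, which is not a column name, resolves to the argument. You can either rename the arguments in your function, or construct the join data.table outside and then rely on the fact that when i is a single name, data.table always looks it up in the outside environment: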

fcn <- function(yr,ids,deg){
  tmp = data.table(yr, ids, deg)
  dt[tmp, mean(var)]
}

fcn(2004, paste0(1:3), "1")
#[1] 0.4374115
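
A variation on the same idea, offered as a sketch rather than the definitive fix: build the lookup table with column names that match the key and join explicitly with on=. This assumes a data.table version that supports the on= argument (1.9.6 or later); fcn_on and lookup are illustrative names that do not appear in the question.

```r
library(data.table)

# Same data as in the question
set.seed(12345)
dt <- data.table(yr  = rep(2000:2005, each = 20),
                 id  = paste0(rep(rep(1:10, each = 2), 6)),
                 deg = paste0(rep(1:2, 60)),
                 var = rnorm(120),
                 key = c("yr", "id", "deg"))

# Hypothetical helper: the lookup table is built from the function's
# arguments before dt[...] is entered, so dt's columns cannot shadow them.
fcn_on <- function(yr, ids, deg) {
  lookup <- data.table(yr = yr, id = ids, deg = deg)  # length-1 args are recycled
  dt[lookup, mean(var), on = c("yr", "id", "deg")]
}

# Reference value from a plain vector scan, as done by hand in the question
by_hand <- dt[yr == 2004 & id %in% paste0(1:3) & deg == "1", mean(var)]

fcn_on(2004, paste0(1:3), "1")
```

Naming the join columns in on= makes the function independent of the order of the key columns, at the cost of being slightly more verbose than the positional join above.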

See FAQ 2.12-2.13.
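
The scoping behaviour can be seen on a toy table. This is a minimal sketch: toy, f_shadow, and f_renamed are made-up names for illustration and are not part of the question or the answer above.

```r
library(data.table)

# Hypothetical two-row table, keyed on yr
toy <- data.table(yr = c(2000L, 2001L), val = c(10, 20), key = "yr")

# The argument shares its name with a column: inside .(), `yr` is found
# among toy's columns first, so the function argument is never consulted.
f_shadow <- function(yr) toy[.(yr), val]

# A name that is not a column falls through to the function's environment.
f_renamed <- function(yyr) toy[.(yyr), val]

f_shadow(2001L)   # joins toy to its own yr column: every row comes back
f_renamed(2001L)  # joins on the argument: only the 2001 row comes back
```

This mirrors the question: with arguments named yr and deg, the join used the columns themselves, which is why no real subsetting took place.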

eddi answered Nov 02 '22 14:11