I'm trying to use function arguments to subset a data.table (and take the mean of that subset). Basically, I pass the function single values for two of the keys and several values for the third key; this seems to confuse R, although the same operation works exactly as expected outside of a function environment.
Here's an example that basically gets what I'm trying to do; it returns a solution that is incorrect, while my own code produces an error (text pasted below):
library(data.table)
set.seed(12345)
dt<-data.table(yr=rep(2000:2005,each=20),
id=paste0(rep(rep(1:10,each=2),6)),
deg=paste0(rep(1:2,60)),
var=rnorm(120),
key=c("yr","id","deg"))
fcn <- function(yr,ids,deg){
dt[.(yr,ids,deg),mean(var)]
}
fcn(2004,paste0(1:3),"1")
This is giving an answer, but it's totally wrong (more on that in a second). If I do this by hand, there's no problem:
> fcn(2004,paste0(1:3),"1")
[1] 0.1262586
> dt[yr==2004&id %in% paste0(1:3)&deg=="1",mean(var)]
[1] 0.4374115
> dt[.(2004,paste0(1:3),"1"),mean(var)]
[1] 0.4374115
To work out what's going on, I changed the fcn code to:
fcn <- function(yr,ids,deg){
dt[.(yr,ids,deg),]
}
Which yields:
> fcn(2004,paste0(1:3),"1")
yr id deg var
1: 2000 1 1 0.5855288
2: 2000 2 2 -0.4534972
3: 2000 3 1 0.6058875
4: 2000 1 2 0.7094660
5: 2000 2 1 -0.1093033
---
116: 2005 2 2 -1.3247553
117: 2005 3 1 0.1410843
118: 2005 1 2 -1.1562233
119: 2005 2 1 0.4224185
120: 2005 3 2 -0.5360480
Basically, fcn has done no subsetting! Why is this happening? Really frustrated.
If I pass only one id instead of three, dt subsets on the middle key only. Weird:
> fcn(2004,"1","1")
yr id deg var
1: 2000 1 1 0.5855288
2: 2000 1 2 0.7094660
3: 2000 1 1 0.5855288
4: 2000 1 2 0.7094660
5: 2000 1 1 0.5855288
---
116: 2005 1 2 -1.1562233
117: 2005 1 1 0.2239254
118: 2005 1 2 -1.1562233
119: 2005 1 1 0.2239254
120: 2005 1 2 -1.1562233
But if I pass only the middle keys to the function, it works fine:
fcn <- function(ids){
dt[.(2004,ids,"1")]
}
> fcn(paste0(1:3))
yr id deg var
1: 2004 1 1 0.6453831
2: 2004 2 1 -0.3043691
3: 2004 3 1 0.9712207
Final edit: problem solved, but it would still be nice to know what exactly was going wrong:
Rename the arguments:
fcn <- function(yyr,ids,ddeg){
dt[.(yyr,ids,ddeg),mean(var)]
}
Something about re-using the column names as argument names seems to have caused the issue, but I still don't fully understand what went wrong.
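The problem is that you're using names of columns inside your i-expression, but expecting them to refer to the variables outside of the data.table. Inside dt[...], yr and deg resolve to the columns of dt, not to the function's arguments; only ids, which is not a column name, is taken from the function. You can either rename the arguments of your function, or construct the join data.table outside and then use the fact that when i is a single name, data.table will always look it up in the outside environment: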
fcn <- function(yr,ids,deg){
tmp = data.table(yr, ids, deg)
dt[tmp, mean(var)]
}
fcn(2004, paste0(1:3), "1")
#[1] 0.4374115
See FAQ 2.12-2.13.
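To see the name clash in isolation, here is a minimal sketch on a tiny table. The table contents and the names shadowed, renamed, yr_val and deg_val are made up for illustration; they are not from the question. It assumes the data.table package is installed.

```r
library(data.table)

# Tiny keyed table (hypothetical data, chosen so the means are easy to check)
dt <- data.table(yr  = rep(2000:2001, each = 2),
                 deg = rep(c("1", "2"), 2),
                 var = c(10, 20, 30, 40),
                 key = c("yr", "deg"))

# Argument names collide with column names: inside dt[...], `yr` and `deg`
# are found in dt first, so the join table is built from the key columns
# themselves and every row matches.
shadowed <- function(yr, deg) {
  dt[.(yr, deg), mean(var)]
}

# Distinct argument names are not columns of dt, so they are looked up in
# the function's environment and the join uses the values passed in.
renamed <- function(yr_val, deg_val) {
  dt[.(yr_val, deg_val), mean(var)]
}

shadowed(2001L, "1")  # mean over all four rows
renamed(2001L, "1")   # mean over the single matching row
```

In shadowed() the arguments are effectively ignored, which is the same behaviour as the 120-row result in the question; renamed() behaves like the call made by hand at the console.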