I have two data.tables. For each row of the first, I'd like to assign an element of the second at random, drawn from the rows whose keys match. The way I'm doing this right now is quite slow.
Let's get specific; here's some sample data:
library(data.table)
dt1<-data.table(id=sample(letters[1:5],500,replace=T),var1=rnorm(500),key="id")
dt2<-data.table(id=c(rep("a",4),rep("b",8),rep("c",2),rep("d",5),rep("e",7)),
                place=paste(sample(c("Park","Pool","Rec Center","Library"),
                                   26,replace=T),
                            sample(26)),key="id")
I want to add two randomly chosen places to dt1 for each observation, but the places have to match on id.
Here's what I'm doing now:
# for each id, do a keyed lookup into dt2 and sample one matching place per element
get_place<-function(xx) sapply(xx,function(x) dt2[.(x),sample(place,1)])
dt1[,paste0("place",1:2):=list(get_place(id),get_place(id))]
This works, but it's quite slow--took 66 seconds to run on my computer, basically an eon.
One issue seems to be that I can't take proper advantage of keying:
Something like dt2[.(dt1$id),mult="random"] would be perfect, but it doesn't appear to be possible. 
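For context, mult only accepts "all" (the default), "first", and "last", so the closest built-in option is a deterministic pick:
dt2[.(dt1$id),mult="first"]   # first matching place per id; there is no mult="random"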
Any suggestions?
A simple answer
dt2[.(dt1),as.list(c(
  place=sample(place,size=2,replace=TRUE)
)),by=.EACHI,allow.cartesian=TRUE]
This approach is simple and illustrates data.table features like Cartesian joins and by=.EACHI, but is very slow because for each row of dt1 it (i) samples and (ii) coerces the result to a list.
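If you do want the sampled columns attached to dt1, one way (a sketch; it assumes the sampled columns come out named place1 and place2, and that the result has one row per row of dt1 in dt1's key order) is:
res <- dt2[.(dt1),as.list(c(
  place=sample(place,size=2,replace=TRUE)
)),by=.EACHI,allow.cartesian=TRUE]
dt1[,paste0("place",1:2):=res[,.(place1,place2)]]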
A faster answer
nsamp <- 2
# for each id, record where its block of rows starts in dt2 (i0, zero-based offset)
# and how many rows it spans (N); dt2 is keyed by id, so each block is contiguous
dt3   <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
# join those offsets onto dt1 and, per id, index dt2$place directly with
# random positions drawn from within that id's block
dt1[.(dt3),paste0("place",1:nsamp):=
  replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
,by=.EACHI]
Using replicate with simplify=FALSE (as also in @bgoldst's answer) makes the most sense: it returns a list of vectors, which is the format data.table requires when making new columns. replicate is the standard R function for repeated simulations.
Benchmarks
We should look at varying several features and not modify dt1 as we go along:
library(microbenchmark)
# candidate functions
frank2 <- function(){
  dt3   <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
  dt1[.(dt3),
    replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
  ,by=.EACHI]
}
david2 <- function(){
  indx <- dt1[,.N, id]
  sim <- dt2[.(indx),
    replicate(2,sample(place,size=N,replace=TRUE),simplify=FALSE)
  ,by=.EACHI]
  dt1[, sim[,-1,with=FALSE]]
}
bgoldst<-function(){
  dt1[,
    replicate(2,ave(id,id,FUN=function(x) 
      sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),simplify=F)
  ]
}
# simulation
size <- 1e6   # rows in dt1
nids <- 1e3   # distinct ids
npls <- 2:15  # number of places per id in dt2
dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]
# benchmarking
res <- microbenchmark(frank2(),david2(),bgoldst(),times=10)
print(res,order="cld",unit="relative")
which gives
Unit: relative
      expr      min       lq     mean   median       uq      max neval cld
 bgoldst() 8.246783 8.280276 7.090995 7.142832 6.579406 5.692655    10   b
  frank2() 1.042862 1.107311 1.074722 1.152977 1.092632 0.931651    10  a 
  david2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 
And if we switch around the parameters...
# new simulation
size <- 1e4       # far fewer rows in dt1
nids <- 10        # very few distinct ids
npls <- 1e6:2e6   # a huge number of places per id in dt2
dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]
# new benchmarking
res <- microbenchmark(frank2(),david2(),times=10)
print(res,order="cld",unit="relative")
we see
Unit: relative
     expr    min     lq     mean   median       uq     max neval cld
 david2() 3.3008 3.2842 3.274905 3.286772 3.280362 3.10868    10   b
 frank2() 1.0000 1.0000 1.000000 1.000000 1.000000 1.00000    10  a 
As one might expect, which way is faster -- collapsing dt1 in david2 or collapsing dt2 in frank2 -- depends on how much information is compressed by collapsing.
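A quick way to see how much each collapse buys (illustrative only, not part of the original benchmark):
nrow(dt1); nrow(dt1[,.N,by=id])   # david2() collapses dt1 to one row per id
nrow(dt2); nrow(dt2[,.N,by=id])   # frank2() collapses dt2 to one row per id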
The perfect function for this purpose is ave(): it runs a function on each group of a vector and automatically maps each group's result back onto the elements of that group.
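A toy illustration of that mapping (not from the original answer):
g <- c('a','a','b','a','b');
ave(g,g,FUN=function(x) paste0(x[1],'-',seq_along(x)));
## [1] "a-1" "a-2" "b-1" "a-3" "b-2"
Applied to the problem: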
set.seed(1);
dt1 <- data.table(id=sample(letters[1:5],500,replace=T), var1=rnorm(500), key='id' );
dt2 <- data.table(id=c(rep('a',4),rep('b',8),rep('c',2),rep('d',5),rep('e',7)), place=paste(sample(c('Park','Pool','Rec Center','Library'),26,replace=T), sample(26) ), key='id' );
dt1[,paste0('place',1:2):=replicate(2,ave(id,id,FUN=function(x) sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),simplify=FALSE)]
dt1;
##      id       var1        place1        place2
##   1:  a -0.4252677 Rec Center 23       Park 12
##   2:  a -0.3892372       Park 12    Library 22
##   3:  a  2.6491669       Park 14 Rec Center 23
##   4:  a -2.2891240 Rec Center 23       Park 14
##   5:  a -0.7012317    Library 22       Park 12
##  ---
## 496:  e -1.0624084    Library 16    Library 16
## 497:  e -0.9838209     Library 4    Library 26
## 498:  e  1.1948510    Library 26       Pool 21
## 499:  e -1.3353714       Pool 18    Library 26
## 500:  e  1.8017255       Park 20       Pool 21
This should work with data.frames as well as data.tables.
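For example, a sketch of the same idea on plain data.frames (df1 and df2 here are just data.frame copies of dt1 and dt2 for illustration):
df1 <- as.data.frame(dt1); df2 <- as.data.frame(dt2);
df1[paste0('place',1:2)] <- replicate(2,ave(df1$id,df1$id,FUN=function(x) sample(df2$place[df2$id==x[1]],length(x),replace=T)),simplify=FALSE);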
Edit: Adding benchmarking
This solution seems fastest, at least after having made the correction suggested by Frank below.
frank<-function(){dt2[.(dt1),as.list(c(
  place=sample(place,size=2,replace=TRUE))),
  by=.EACHI,allow.cartesian=TRUE]}
david<-function(){
  # relies on a revised two-argument get_place(id, n) from another answer (not shown here)
  dt1[,paste0("place",1:2):=
        lapply(1:2,function(x) get_place(id,.N)),by=id]}
bgoldst<-function(){dt1[,paste0("place",1:2):=
                          replicate(2,ave(id,id,FUN=function(x) 
                            sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),
                                    simplify=F)]}
microbenchmark(times=1000L,frank(),david(),bgoldst())
Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval cld
   frank() 5.125843 5.353918 6.276879 5.496042 5.772051 15.57155  1000  b 
   david() 6.049172 6.305768 7.172360 6.455687 6.669202 93.06398  1000   c
 bgoldst() 1.421330 1.521046 1.847821 1.570573 1.628424 89.60315  1000 a  