I have two data.table
s; I'd like to assign an element of one to the other at random from among those that match keys. The way I'm doing so right now is quite slow.
Let's get specific; here's some sample data:
dt1<-data.table(id=sample(letters[1:5],500,replace=T),var1=rnorm(500),key="id")
dt2<-data.table(id=c(rep("a",4),rep("b",8),rep("c",2),rep("d",5),rep("e",7)),
place=paste(sample(c("Park","Pool","Rec Center","Library"),
26,replace=T),
sample(26)),key="id")
I want to add two randomly chosen place
s to dt1
for each observation, but the place
s have to match on id
.
Here's what I'm doing now:
get_place<-function(xx) sapply(xx,function(x) dt2[.(x),sample(place,1)])
dt1[,paste0("place",1:2):=list(get_place(id),get_place(id))]
This works, but it's quite slow--took 66 seconds to run on my computer, basically an eon.
One issue seems to be I can't seem to take proper advantage of keying:
Something like dt2[.(dt1$id),mult="random"]
would be perfect, but it doesn't appear to be possible.
Any suggestions?
A simple answer
dt2[.(dt1),as.list(c(
place=sample(place,size=2,replace=TRUE)
)),by=.EACHI,allow.cartesian=TRUE]
This approach is simple and illustrates data.table
features like Cartesian joins and by=.EACHI
, but is very slow because for each row of dt1
it (i) samples and (ii) coerces the result to a list.
A faster answer
nsamp <- 2
dt3 <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
dt1[.(dt3),paste0("place",1:nsamp):=
replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
,by=.EACHI]
Using replicate
with simplify=FALSE
(as also in @bgoldst's answer) makes the most sense:
data.table
requires when making new columns. replicate
is the standard R function for repeated simulations.Benchmarks. We should look at varying several features and not modify dt1
as we go along:
# candidate functions
frank2 <- function(){
dt3 <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
dt1[.(dt3),
replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
,by=.EACHI]
}
david2 <- function(){
indx <- dt1[,.N, id]
sim <- dt2[.(indx),
replicate(2,sample(place,size=N,replace=TRUE),simplify=FALSE)
,by=.EACHI]
dt1[, sim[,-1,with=FALSE]]
}
bgoldst<-function(){
dt1[,
replicate(2,ave(id,id,FUN=function(x)
sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),simplify=F)
]
}
# simulation
size <- 1e6
nids <- 1e3
npls <- 2:15
dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]
# benchmarking
res <- microbenchmark(frank2(),david2(),bgoldst(),times=10)
print(res,order="cld",unit="relative")
which gives
Unit: relative
expr min lq mean median uq max neval cld
bgoldst() 8.246783 8.280276 7.090995 7.142832 6.579406 5.692655 10 b
frank2() 1.042862 1.107311 1.074722 1.152977 1.092632 0.931651 10 a
david2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
And if we switch around the parameters...
# new simulation
size <- 1e4
nids <- 10
npls <- 1e6:2e6
dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]
# new benchmarking
res <- microbenchmark(frank2(),david2(),times=10)
print(res,order="cld",unit="relative")
we see
Unit: relative
expr min lq mean median uq max neval cld
david2() 3.3008 3.2842 3.274905 3.286772 3.280362 3.10868 10 b
frank2() 1.0000 1.0000 1.000000 1.000000 1.000000 1.00000 10 a
As one might expect, which way is faster -- collapsing dt1
in david2
or collapsing dt2
in frank2
-- depends on how much information is compressed by collapsing.
The perfect function for this purpose is ave()
, since it allows running a function for each group of a vector, and automatically maps the return value back to the elements of the group:
set.seed(1);
dt1 <- data.table(id=sample(letters[1:5],500,replace=T), var1=rnorm(500), key='id' );
dt2 <- data.table(id=c(rep('a',4),rep('b',8),rep('c',2),rep('d',5),rep('e',7)), place=paste(sample(c('Park','Pool','Rec Center','Library'),26,replace=T), sample(26) ), key='id' );
dt1[,paste0('place',1:2):=replicate(2,ave(id,id,FUN=function(x) sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),simplify=FALSE)]
dt1;
## id var1 place1 place2
## 1: a -0.4252677 Rec Center 23 Park 12
## 2: a -0.3892372 Park 12 Library 22
## 3: a 2.6491669 Park 14 Rec Center 23
## 4: a -2.2891240 Rec Center 23 Park 14
## 5: a -0.7012317 Library 22 Park 12
## ---
## 496: e -1.0624084 Library 16 Library 16
## 497: e -0.9838209 Library 4 Library 26
## 498: e 1.1948510 Library 26 Pool 21
## 499: e -1.3353714 Pool 18 Library 26
## 500: e 1.8017255 Park 20 Pool 21
This should work with data.frame
s as well as data.table
s.
Edit: Adding benchmarking
This solution seems fastest, at least after having made the correction suggested by Frank below.
frank<-function(){dt2[.(dt1),as.list(c(
place=sample(place,size=2,replace=TRUE))),
by=.EACHI,allow.cartesian=TRUE]}
david<-function(){
dt1[,paste0("place",1:2):=
lapply(1:2,function(x) get_place(id,.N)),by=id]}
bgoldst<-function(){dt1[,paste0("place",1:2):=
replicate(2,ave(id,id,FUN=function(x)
sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),
simplify=F)]}
microbenchmark(times=1000L,frank(),david(),bgoldst())
Unit: milliseconds
expr min lq mean median uq max neval cld
frank() 5.125843 5.353918 6.276879 5.496042 5.772051 15.57155 1000 b
david() 6.049172 6.305768 7.172360 6.455687 6.669202 93.06398 1000 c
bgoldst() 1.421330 1.521046 1.847821 1.570573 1.628424 89.60315 1000 a
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With