If I have the data.tables
DT
and neighbors
:
set.seed(1)
library(data.table)
DT <- data.table(idx=rep(1:10, each=5), x=rnorm(50), y=letters[1:5], ok=rbinom(50, 1, 0.90))
n <- data.table(y=letters[1:5], y1=letters[c(2:5,1)])
n
is a lookup table. Whenever ok == 0
, I want to look up the corresponding y1
in n
and use that value for x
and the given idx
. By way of example, row 4 of DT:
> DT
idx x y ok
1: 1 -0.6264538 a 1
2: 1 0.1836433 b 1
3: 1 -0.8356286 c 1
4: 1 1.5952808 d 0
5: 1 0.3295078 e 1
6: 2 -0.8204684 a 1
The y1
from n
for d
is e
:
> n[y == 'd']
y y1
1: d e
and idx
for row 4 is 1. So I would use:
> DT[idx == 1 & y == 'e', x]
[1] 0.3295078
I want my output to be a data.table
just like DT[ok == 0]
with all the x
values replaced by their appropriate n['y1'] x
value:
> output
idx x y ok
1: 1 0.3295078 d 0
2: 2 -0.3053884 d 0
3: 3 0.3898432 a 0
4: 5 0.7821363 a 0
5: 7 1.3586800 e 0
6: 8 0.7631757 d 0
I can think of a few ways of doing this with base R or with plyr
... and maybe its late on Friday... but whatever the sequences of merges that this would require in data.table
is beyond me!
Great question. Using the functions in the other answers and wrapping Blue's answer into a function blue
, how about the following. The benchmarks include the time to setkey
in all cases.
red = function() {
ans = DT[ok==0]
# Faster than setkey(DT,ok)[J(0)] if the vector scan is just once
# If lots of lookups to "ok" need to be done, then setkey may be worth it
# If DT[,ok:=as.integer(ok)] can be done first, then ok==0L slightly faster
# After extracting ans in the original order of DT, we can now set the key :
setkey(DT,idx,y)
setkey(n,y)
# Now working with the reduced ans ...
ans[,y1:=n[y,y1,mult="first"]]
# Add a new column y1 by reference containing the lookup in n
# mult="first" because we know n's key is unique, for speed (to save looking
# for groups of matches in n). Future version of data.table won't need this.
# Also, mult="first" has the advantage of dropping group columns (so we don't
# need [[2L]]). mult="first"|"last" turns off by-without-by of mult="all".
ans[,x:=DT[ans[,list(idx,y1)],x,mult="first"]]
# Changes the contents of ans$x by reference. The ans[,list(idx,y1)] part is
# how to pick the columns of ans to join to DT's key when they are not the key
# columns of ans and not the first 1:n columns of ans. There is no need to key
# ans, especially since that would change ans's order and not strictly answer
# the question. If idx and y1 were columns 1 and 2 of (unkeyed) ans then we
# wouldn't need that part, just
# ans[,x:=DT[ans,x,mult="first"]]
# would do (relying on DT having 2 columns in its key). That has the advantage
# of not copying the idx and y1 columns into a new data.table to pass as the i
# DT. To save that copy y1 could be moved to column 2 using setcolorder first.
redans <<- ans
}
crdt(1e5)
origDT = copy(DT)
benchmark(blue={DT=copy(origDT); system.time(blue())},
red={DT=copy(origDT); system.time(red())},
fun={DT=copy(origDT); system.time(fun(DT,n))},
replications=3, order="relative")
test replications elapsed relative user.self sys.self user.child sys.child
red 3 1.107 1.000 1.100 0.004 0 0
blue 3 5.797 5.237 5.660 0.120 0 0
fun 3 8.255 7.457 8.041 0.184 0 0
crdt(1e6)
[ .. snip .. ]
test replications elapsed relative user.self sys.self user.child sys.child
red 3 14.647 1.000 14.613 0.000 0 0
blue 3 87.589 5.980 87.197 0.124 0 0
fun 3 197.243 13.466 195.240 0.644 0 0
identical(blueans[,list(idx,x,y,ok,y1)],redans[order(idx,y1)])
# [1] TRUE
The order
is needed in the identical
because red
returns the result in the same order as DT[ok==0]
whereas blue
appears to be ordered by y1
in the case of ties in idx
.
If y1
is unwanted in the result it can be removed instantly (regardless of table size) using ans[,y1:=NULL]
; i.e., this can be included above to produce the exact result requested in question, without affecting the timings at all.
library(data.table)
crdt <- function(i=10){
set.seed(1)
DT <<- data.table(idx=rep(1:i, each=5), x=rnorm(5*i),
y=letters[1:5], ok=rbinom(5*i, 1, 0.90))
n <<- data.table(y=letters[1:5], y1=letters[c(2:5,1)])
}
fun <- function(DT,n){
setkey(DT,ok)
n1 <- merge(n,DT[J(0),list(y,idx)],by="y")
DT[J(0),x:=DT[paste0(y,idx) %in% paste0(n1[,y1],n1[,idx]),x]]
}
crdt(10)
fun(DT,n)[J(0)]
ok idx x y
[1,] 0 1 0.3295078 d
[2,] 0 2 -0.3053884 d
[3,] 0 3 0.3898432 a
[4,] 0 5 0.7821363 a
[5,] 0 7 1.3586796 e
[6,] 0 8 0.7631757 d
But it is still pretty slow for bigger data.tables:
crdt(1e6)
system.time(fun(DT,n)[J(0)])
User System elapsed
4.213 0.162 4.374
crdt(1e7)
system.time(fun(DT,n)[J(0)])
User System elapsed
195.685 3.949 199.592
I'm interested to learn a faster solution.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With