I have two data.table's dt1 and dt2 with id1 and id2 as the unique key column and I perform an unequal many to many join result set dtR with the matching id1 and id2.
library(data.table)
dt1 = data.table(id1 = c(1:12),a = c(1,1,1,1,2,2,2,2,3,3,3,3),key='id1')
dt2 = data.table(id2 = c(13:24), b = c(0,0,1,1,2,2,3,3,4,4,5,5),key='id2')
dtR <- dt2[dt1,.(id1,id2,a,b),on=.(b < a)]
How do I filter dtR such that id1 is unique in dtR$id1 column and id2 is unique in dtR$id2:
data.table(
id1=c(1,2,5,6,9,10),id2=c(13,14,15,16,17,18))
Note that dt1 and dt2 involve millions of rows and high memory, so it should be efficient and copy or loops should be avoided. Thanks!
I tried iterating over id1, creating an additional column with matched id2 and checking for existing id2, adding the next one. Very time consuming. A SQL solution would also be acceptable as dt1 and dt2 reside in DuckDb.
A data.table grouping solution will be pretty fast.
system.time({
u2 <- unique(sort(dtR$id2))
avail <- rep(TRUE, length(u2))
x <- dtR[, m2 := match(id2, u2)][
, .(
id2 = {
m <- m2[which.max(avail[m2])]
if (avail[m]) {
avail[m] <- FALSE
u2[m]
} else NULL
}
), id1
]
})
#> user system elapsed
#> 0 0 0
x
#> id1 id2
#> 1: 1 77
#> 2: 2 188
#> 3: 3 109
#> 4: 4 14
#> 5: 5 47
#> ---
#> 983: 996 9472
#> 984: 997 9036
#> 985: 998 9688
#> 986: 999 8646
#> 987: 1000 9066
uniqueN(x[[1]])
#> [1] 987
uniqueN(x[[2]])
#> [1] 987
Data:
library(data.table)
set.seed(42)
dt1 = data.table(
id1 = c(1:1e3),
a = sample(1:1e2,1e3,replace = T),key='id1')
dt1[,`:=`(alo=a-5,ahi=a+5)]
dt2 = data.table(
id2 = c(1:1e4),
b = sample(1:1e3,1e4,replace = T),key='id2')
dtR <- dt2[dt1,.(id1,id2,a,b),on=.(b < ahi, b>alo)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With