Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select one of many-to-many matches between data.table dt1$id and dt2$id such that dt1$id and dt2$id are unique in the result set

Tags:

join

r

data.table

I have two data.table's dt1 and dt2 with id1 and id2 as the unique key column and I perform an unequal many to many join result set dtR with the matching id1 and id2.

library(data.table)
dt1 = data.table(id1 = c(1:12),a = c(1,1,1,1,2,2,2,2,3,3,3,3),key='id1')
dt2 = data.table(id2 = c(13:24), b = c(0,0,1,1,2,2,3,3,4,4,5,5),key='id2')
dtR <- dt2[dt1,.(id1,id2,a,b),on=.(b < a)]

How do I filter dtR such that id1 is unique in dtR$id1 column and id2 is unique in dtR$id2:

data.table(
  id1=c(1,2,5,6,9,10),id2=c(13,14,15,16,17,18))

Note that dt1 and dt2 involve millions of rows and high memory, so it should be efficient and copy or loops should be avoided. Thanks!

I tried iterating over id1, creating an additional column with matched id2 and checking for existing id2, adding the next one. Very time consuming. A SQL solution would also be acceptable as dt1 and dt2 reside in DuckDb.

like image 303
Jorge Sepulveda Avatar asked Dec 31 '25 19:12

Jorge Sepulveda


1 Answers

A data.table grouping solution will be pretty fast.

system.time({
  u2 <- unique(sort(dtR$id2))
  avail <- rep(TRUE, length(u2))
  x <- dtR[, m2 := match(id2, u2)][
    , .(
      id2 = {
        m <- m2[which.max(avail[m2])]
        if (avail[m]) {
          avail[m] <- FALSE
          u2[m]
        }  else NULL
      }
    ), id1
  ]
})
#>    user  system elapsed 
#>       0       0       0

x
#>       id1  id2
#>   1:    1   77
#>   2:    2  188
#>   3:    3  109
#>   4:    4   14
#>   5:    5   47
#>  ---          
#> 983:  996 9472
#> 984:  997 9036
#> 985:  998 9688
#> 986:  999 8646
#> 987: 1000 9066
uniqueN(x[[1]])
#> [1] 987
uniqueN(x[[2]])
#> [1] 987

Data:

library(data.table)
set.seed(42)
dt1 = data.table(
  id1 = c(1:1e3),
  a = sample(1:1e2,1e3,replace = T),key='id1')
dt1[,`:=`(alo=a-5,ahi=a+5)]
dt2 = data.table(
  id2 = c(1:1e4), 
  b = sample(1:1e3,1e4,replace = T),key='id2')
dtR <- dt2[dt1,.(id1,id2,a,b),on=.(b < ahi, b>alo)]
like image 63
jblood94 Avatar answered Jan 03 '26 18:01

jblood94



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!