Fast subset based on list of ids

I am trying to subset the same data frame repeatedly, once for each element of a list of id tables. I have one data frame of about 50,000 rows and a list of about 1,000 data frames; each data frame in the list has between 100 and 1,000 rows and the same structure (a single id column).

Consider this example:

set.seed(1)
df1 <- data.frame(id = sample(sample(1000:3000, 1000), 50000, TRUE), info = runif(50000, 200, 300))
l <- replicate(1000,
               data.frame(id = sample(1000:3000, sample(400:700, 1), replace = TRUE)),
               simplify = FALSE)

I want to subset df1 by the ids in each element of l. For a single element, I can perform a semi-join or subset with %in%:

library(dplyr)
semi_join(df1, l[[1]], "id")
df1[df1$id %in% l[[1]]$id, ]

I'm looking for a fast solution that scales to a list of thousands of data frames. So far I wrap this in lapply, but there may be faster, vectorized solutions:

lapply(l, \(x) semi_join(df1, x, "id"))

Here's a starting benchmark with the solutions:

library(data.table)

bc <- 
  bench::mark(dplyr = lapply(l, \(x) semi_join(df1, x, "id")),
              baseR = lapply(l, \(x) df1[df1$id %in% x$id, ]),
              unique = lapply(l, \(x) df1[df1$id %in% unique(x$id), ]),
              data.table = {df2 <- setDT(df1); lapply(l, \(x) df2[df2$id %in% unique(x$id), ])},
              iterations = 10, check = FALSE)

bc
#> # A tibble: 4 × 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 dplyr         2.25s    2.43s    0.416     1.64GB    4.03     10    97
#> 2 baseR         9.04s    9.55s    0.105     1.56GB    0.536    10    51
#> 3 unique        10.3s   10.95s    0.0912    1.57GB    0.420    10    46
#> 4 data.table   10.21s    10.9s    0.0916   979.5MB    0.458    10    50
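
check = FALSE is only needed because the methods return objects that differ in class and attributes (data.frames vs. data.tables, reset vs. original row names); the selected rows themselves agree. A quick spot check for one element (my addition, outside the benchmark):

a <- semi_join(df1, l[[1]], "id")
b <- df1[df1$id %in% l[[1]]$id, ]
# same rows in the same order; only attributes such as row names differ
all.equal(a, b, check.attributes = FALSE)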

Any package can be used.

Maël · asked Sep 14 '25

1 Answer

A one-shot data.table join on an rbindlist will be fast.

library(data.table)
library(dplyr)

microbenchmark::microbenchmark(
  dplyr = lapply(l, \(x) semi_join(df1, x, "id")),
  baseR = lapply(l, \(x) df1[df1$id %in% x$id, ]),
  unique = lapply(l, \(x) df1[df1$id %in% unique(x$id), ]),
  data.table = {setDT(df2); lapply(l, \(x) df2[id %in% unique(x$id)])},
  bindJoinSplit = split(
    setDT(df2, key = "id")[unique(rbindlist(l, idcol = "df")), on = "id",
                           allow.cartesian = TRUE, nomatch = 0],
    by = "df", keep.by = FALSE
  ),
  times = 10,
  setup = df2 <- copy(df1)
)
#> Unit: milliseconds
#>           expr       min        lq     mean    median        uq       max neval
#>          dplyr 1555.2198 1569.4590 1630.778 1619.9118 1626.8791 1857.1259    10
#>          baseR 1079.4916 1087.2524 1131.043 1126.8084 1179.5400 1196.9150    10
#>         unique 1171.3705 1214.4915 1253.490 1234.3908 1274.9278 1398.2908    10
#>     data.table 1925.1388 1950.9440 1978.538 1982.5494 1995.3917 2038.3573    10
#>  bindJoinSplit  368.8109  380.4029  412.595  401.4089  437.4869  503.8478    10
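
For clarity, here is the bindJoinSplit one-liner unrolled into steps (a sketch; the intermediate names long, matched, and res are illustrative, not part of the benchmark):

df2 <- copy(df1)                   # work on a copy, since setDT modifies by reference
long <- rbindlist(l, idcol = "df") # stack all lookup tables, tagging rows with their source index
long <- unique(long)               # drop duplicate (df, id) pairs so the join acts as a per-table semi-join
setDT(df2, key = "id")             # key on id for a fast binary-search join
# inner join: all matching rows of df2 for each (df, id) pair;
# allow.cartesian because one id can match many rows of df2
matched <- df2[long, on = "id", allow.cartesian = TRUE, nomatch = 0]
# split back into a list of data.tables, one per original lookup table
res <- split(matched, by = "df", keep.by = FALSE)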

If the binding and splitting could be avoided, this would be much faster and easier. From ?split.data.table:

Be aware that processing list of data.tables will be generally much slower than manipulation in single data.table by group using by argument

Suppose the lookup data had been built as a single data.table rather than a list of data.frames. If we also keep the results in one data.table instead of splitting back into a list, this is faster still.

dt <- rbindlist(l, idcol = "df")

microbenchmark::microbenchmark(
  bindJoinSplit = split(
    setDT(df2, key = "id")[unique(rbindlist(l, idcol = "df")), on = "id",
                           allow.cartesian = TRUE, nomatch = 0],
    by = "df", keep.by = FALSE
  ),
  join = setDT(df2, key = "id")[unique(dt), on = "id", allow.cartesian = TRUE, nomatch = 0],
  times = 10,
  setup = df2 <- copy(df1)
)
#> Unit: milliseconds
#>           expr      min       lq     mean   median       uq      max neval
#>  bindJoinSplit 317.1514 326.3613 343.2588 333.3914 364.1945 398.5467    10
#>           join 188.5818 191.4344 198.7380 194.4094 208.7339 219.9847    10
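
With everything in one keyed table, downstream work can also stay by-group rather than iterating over a list. For instance, assuming a per-table summary of info is the eventual goal (an assumption on my part):

df2 <- copy(df1)
res <- setDT(df2, key = "id")[unique(dt), on = "id",
                              allow.cartesian = TRUE, nomatch = 0]
# summarize per original lookup table without materializing 1000 data.frames
res[, .(n_rows = .N, mean_info = mean(info)), by = df]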

Data:

set.seed(1)
df1 <- data.frame(id = sample(sample(1000:3000, 1000), 50000, TRUE), info = runif(50000, 200, 300))
l <- replicate(1000,
               data.frame(id = sample(1000:3000, sample(400:700, 1), replace = TRUE)),
               simplify = FALSE)
jblood94 · answered Sep 17 '25