
Prevent data.table from coercing a list of vectors of different lengths to a data.table

Tags: r, data.table

The (sensible) default in data.table seems to be to preserve the rectangular nature of vectors within a list where possible, repeating single-element items as necessary, then coercing the result into a data.table.

Suppose you have a data.table foo:

foo <- data.table(
  a = c(1, 2, 3),
  b = c(TRUE, TRUE, FALSE)
)

And what you want is to return a non-data.table named (ragged) list like this:

list(
  a = foo[(b), a],
  n = foo[(b), .N]
)

#$a
#[1] 1 2
#
#$n
#[1] 2

I've been playing around with lists of lists in different ways, but I haven't been able to prevent data.table from coercing what's returned in j into a data.table if it is capable of doing so. For example, this:

foo[(b), {
  .(a = a,
    n = .N) 
}]

Returns a data.table with two rows, column names a and n.
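Concretely, on the three-row foo above, that coercion looks like this (.N is recycled to the length of a):

```r
foo[(b), .(a = a, n = .N)]
#    a n
# 1: 1 2
# 2: 2 2
```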

Nesting this within another list nets me a list back as a column, and without the names:

foo[(b),
  .(.("a" = a,
      "n" = .N))
]

#     V1
# 1: 1,2
# 2:   2

Anyway, I feel like I'm missing something, but I'm running out of ways to search and documentation to read.

I'd also be interested to know where the inflection points may be when it comes to indexing the resulting list, because maybe it doesn't matter. If I have one vector that is length ten million in a list alongside 100 vectors of length one, I assume that the ragged list will be more efficient memory-wise than a data.table. But aside from memory concerns, is one of these two options more computationally efficient:

bar_list$singleInt

# vs

bar_dt[1, singleInt]

Edit: Clarifying last question

library(data.table)

set.seed(123)
foo <- data.table(a = sample(1:100, 1e8, TRUE),
                  b = sample(c(TRUE, FALSE), 1e8, TRUE))

The way I see it, given the answers, I now have three potential objects that can contain the information I want.

There's a ragged list (regardless of how I get there)

bar_list <- list(
  a = foo[(b), a],
  n = foo[(b), .N],
  s = foo[(b), sum(b)],
  m = foo[(b), max(a)]  
)

# List of 4
# $ a: int [1:50005713] 9 84 37 56 63 60 55 74 99 32 ...
# $ n: int 50005713
# $ s: int 50005713
# $ m: int 100

There's this, which is almost like you're transposing the list. You generate a data.table where the columns are the named list elements, as suggested by @jblood94. (This has the unfortunate side effect for the long vectors that they now require list indexing to access them, but is otherwise an attractive option.)

bar_dt   <- foo[(b),
                .(a = .(a),
                  n = .N)]
# Classes ‘data.table’ and 'data.frame':    1 obs. of  2 variables:
#   $ a:List of 1
# ..$ : int  9 84 37 56 63 60 55 74 99 32 ...
# $ n: int 50005713
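To make the extra indexing step concrete, here is a small sketch using the three-row foo from the start of the question (rather than the 1e8-row version):

```r
foo <- data.table(a = c(1, 2, 3), b = c(TRUE, TRUE, FALSE))
bar_dt <- foo[(b), .(a = .(a), n = .N)]

bar_dt$a        # the column itself is a one-element list
# [[1]]
# [1] 1 2

bar_dt$a[[1]]   # [[ indexing recovers the underlying vector
# [1] 1 2
```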

And then there's this, which is allowing data.table to coerce the list into a traditional rectangular form by repeating the single-element vectors to make them the same length as the other vector.

bar_dt2  <- foo[(b),
                .(a = a,
                  n = .N,
                  s = sum(b),
                  m = max(a))]

# Classes ‘data.table’ and 'data.frame':    50005713 obs. of  4 variables:
#   $ a: int  9 84 37 56 63 60 55 74 99 32 ...
# $ n: int  50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 ...
# $ s: int  50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 ...
# $ m: int  100 100 100 100 100 100 100 100 100 100 ...

Benchmarking accessing elements

In terms of memory, bar_list and bar_dt are equivalent (here 190.7 MB), and bar_dt2 is much larger (762.9 MB). We have two general types of indexing we're interested in: indexing the single-element vectors like n, and indexing the long vectors like a. Nothing much changes from your microbenchmark, @jblood94. The fastest method is still the list, which avoids the data.table overhead.

microbenchmark::microbenchmark(
  bar_list$n,
  bar_dt[1, n],
  bar_dt[, n],
  bar_dt$n,
  bar_dt[["n"]],
  bar_dt[[2]],
  bar_dt2[1, n],
  check = "identical"
)

# Unit: nanoseconds
# expr             min     lq   mean median     uq    max neval
# bar_list$n       200    500    940    850   1100   8700   100
# bar_dt[1, n]  237000 270450 371931 324100 465800 898100   100
# bar_dt[, n]   237100 278500 378559 340950 465800 954300   100
# bar_dt$n         800   1700   2961   2700   3250  19700   100
# bar_dt[["n"]]   5000   7750  10832   9850  12100  45200   100
# bar_dt[[2]]     5000   6950  10125   9050  11300  44400   100
# bar_dt2[1, n] 238800 278950 381454 338700 418350 925300   100

It's still the fastest when indexing the longer vector. The only thing the rectangular bar_dt2 has going for it is that it's very slightly faster here than bar_dt, because it doesn't need the extra [[ indexing, but the plain list is still faster.

microbenchmark::microbenchmark(
  bar_list$a,
  bar_dt[, a[[1]]],
  bar_dt$a[[1]],
  bar_dt[["a"]][[1]],
  bar_dt[[1]][[1]],
  bar_dt2[, a],
  bar_dt2$a,
  bar_dt2[["a"]],
  bar_dt2[[1]],
  check = "identical"
)

# Unit: nanoseconds
# expr                   min       lq     mean   median       uq       max neval
# bar_list$a             200      600     1966     1500     2600     19300   100
# bar_dt[, a[[1]]]    271900   420600   559898   537650   675300   1068900   100
# bar_dt$a[[1]]         1100     1900     5285     3700     6700     26500   100
# bar_dt[["a"]][[1]]    6600     8350    16123    10950    16650     70900   100
# bar_dt[[1]][[1]]      6100     8100    18479    11350    23750     91800   100
# bar_dt2[, a]      47781800 57865500 86941905 64417200 71424750 326081400   100
# bar_dt2$a              900     1500     4130     2250     5100     30600   100
# bar_dt2[["a"]]        5900     8100    18919    11550    26550     91700   100
# bar_dt2[[1]]          5800     8100    19484    11950    19300    282600   100

So that answers my original question -- there does seem to be a reason to prefer a ragged list when it comes to the efficiency of using it. It raises a second question, though: which object is the most efficient to create, and by how much? If the creation of bar_dt2 is much more efficient than any of the methods of getting to the ragged list, then the small differences between bar_list$a and bar_dt$a may not matter.

Benchmarking list or data.table creation

Sticking with microbenchmark but only running it once.

microbenchmark::microbenchmark(
  list1 <- with(
    foo[(b),],
    list("a" = a,
         "n" = length(a),
         "s" = sum(b),
         "m" = max(a))),
  list2 <- unlist(
    foo[(b),
        .(a = .(a),
          n = .N,
          s = sum(b),
          m = max(a))], FALSE),
  list3 <- list(
    "a" = foo[(b), a],
    "n" = foo[(b), .N],
    "s" = foo[(b), sum(b)],
    "m" = foo[(b), max(a)]),
  list4   <- foo[(b),
                 .(a = .(a),
                   n = .N,
                   s = sum(b),
                   m = max(a))],
  times = 1
)

# Unit: seconds
#      min       lq     mean   median       uq      max neval
# 1.263952 1.263952 1.263952 1.263952 1.263952 1.263952     1
# 1.928070 1.928070 1.928070 1.928070 1.928070 1.928070     1
# 4.501306 4.501306 4.501306 4.501306 4.501306 4.501306     1
# 1.800121 1.800121 1.800121 1.800121 1.800121 1.800121     1

tl;dr: The with() construction is the fastest by quite a bit, and (as expected) the list built by repeatedly calling foo is by far the slowest. Since with() produces a ragged list automatically, and the ragged list was also the fastest to access, I think this construction is the winner for me. I never would have considered doing it that way, so I'm glad I asked the question.
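For reference, the winning with() construction applied to the small three-row foo from the top of the question:

```r
library(data.table)
foo <- data.table(a = c(1, 2, 3), b = c(TRUE, TRUE, FALSE))

with(foo[(b)], list(a = a, n = length(a), s = sum(b), m = max(a)))
# $a
# [1] 1 2
#
# $n
# [1] 2
#
# $s
# [1] 2
#
# $m
# [1] 2
```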

asked Sep 02 '25 by Danielle McCool


2 Answers

A data.table will return a data.table if the j argument is a list, data.frame, or data.table. But the data.table way to get your ragged list would be to keep it as a data.table:

(bar_dt <- foo[(b),.(a = .(a), n = .N)])
#>      a n
#> 1: 1,2 2

However, data.tables are lists of vectors, so if you really want a non-data.table list:

(bar_list <- unlist(foo[(b),.(a = .(a), n = .N)], FALSE))
#> $a
#> [1] 1 2
#> 
#> $n
#> [1] 2

As for the question about accessing an element from a 1-row data.table vs. a list, data.table adds some additional overhead, so the list access will be faster:

microbenchmark::microbenchmark(
  bar_list$n,
  bar_dt[1, n],
  bar_dt[, n],
  bar_dt$n,
  bar_dt[["n"]],
  bar_dt[[2]],
  check = "identical"
)
#> Unit: nanoseconds
#>           expr    min     lq   mean median     uq     max neval
#>     bar_list$n    200    400    882    800   1100    4200   100
#>   bar_dt[1, n] 274500 299850 347330 311750 358850 1217500   100
#>    bar_dt[, n] 269100 294900 343569 309900 355950  803000   100
#>       bar_dt$n    800   1350   2335   2300   2850    7600   100
#>  bar_dt[["n"]]   5200   6150   9860   8450  11100   64000   100
#>    bar_dt[[2]]   5000   6200   9601   8700  10550   49400   100

This is consistent with the documentation:

DT[["v"]] # same as DT[, v] but much faster

answered Sep 05 '25 by jblood94


foo[(b), .(.("a" = a, "n" = .N)) ]
#        V1
#    <list>
# 1:    1,2
# 2:      2

The column is named V1 because the outer .(..) returns a new list/table, but since you don't name its single argument, data.table assigns a default name. You can change that behavior with

foo[(b), .(quux = .("a" = a, "n" = .N)) ]
#      quux
#    <list>
# 1:    1,2
# 2:      2

I also have (by default) options(datatable.print.class=TRUE) set in my R instance, which allows us to see that this is a list-column (as you may have suspected). This is a list because of the inner .(.) inside the summary.

Unfortunately, data.table defaults to converting any list returned in the j= argument into a data.table (recall that .(.) is a convenience synonym for list(.)). If you want a ragged list, then you need to step outside of the normal data.table semantics.

with(foo[(b),],  list("a" = a, "n" = length(a)))
# $a
# [1] 1 2
# $n
# [1] 2

noting that we can no longer use the .N special symbol, though it's very inexpensive to use length(a) here.
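One variant (my own sketch, not part of the question) that keeps an .N-like count without relying on length() of a particular column is to subset once and reuse the result:

```r
sub <- foo[(b)]                   # subset once
list(a = sub$a, n = nrow(sub))    # nrow() stands in for .N
# $a
# [1] 1 2
#
# $n
# [1] 2
```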

answered Sep 05 '25 by r2evans