The (sensible) default in data.table seems to be to preserve the rectangular nature of vectors within a list where possible, repeating single-element items as necessary, then coercing the result into a data.table.
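For instance, here is a minimal illustration of that recycling behavior (a sketch, assuming data.table is loaded):

```r
library(data.table)

dt <- data.table(x = 1:3)

# .N evaluates to a single value (3), which is recycled
# to the length of x so the result stays rectangular:
dt[, .(x, n = .N)]
#    x n
# 1: 1 3
# 2: 2 3
# 3: 3 3
```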
Suppose you have a data table foo
foo <- data.table(
  a = c(1, 2, 3),
  b = c(TRUE, TRUE, FALSE)
)
And what you want is to return a non-data.table named (ragged) list like this:
list(
  a = foo[(b), a],
  n = foo[(b), .N]
)
# $a
# [1] 1 2
#
# $n
# [1] 2
I've been playing around with lists of lists in different ways, but I haven't been able to prevent data.table from coercing what's returned in j into a data.table if it is capable of doing so. For example, this:
foo[(b), {
  .(a = a,
    n = .N)
}]
Returns a data.table with two rows and column names a and n.
Nesting this within another list nets me a list back as a column, and without the names:
foo[(b),
    .(.("a" = a,
        "n" = .N))
]
# V1
# 1: 1,2
# 2: 2
Anyway, I feel like I'm missing something, but I'm running out of ways to search and documentation to read.
I'd also be interested to know where the inflection points may be when it comes to indexing the resulting list, because maybe it doesn't matter. If I have one vector that is length ten million in a list alongside 100 vectors of length one, I assume that the ragged list will be more efficient memory-wise than a data.table. But aside from memory concerns, is one of these two options more computationally efficient:
bar_list$singleInt
# vs
bar_dt[1, singleInt]
Edit: Clarifying last question
library(data.table)
set.seed(123)
foo <- data.table(a = sample(1:100, 1e8, TRUE),
                  b = sample(c(TRUE, FALSE), 1e8, TRUE))
The way I see it, given the answers, I now have three potential objects that can contain the information I want.
There's a ragged list (regardless of how I get there):
bar_list <- list(
  a = foo[(b), a],
  n = foo[(b), .N],
  s = foo[(b), sum(b)],
  m = foo[(b), max(a)]
)
# List of 4
# $ a: int [1:50005713] 9 84 37 56 63 60 55 74 99 32 ...
# $ n: int 50005713
# $ s: int 50005713
# $ m: int 100
There's this, which is almost like you're transposing the list. You generate a data.table where the columns are the named list elements, as suggested by @jblood94. (This has the unfortunate side effect for the long vectors that they now require list indexing to access them, but is otherwise an attractive option.)
bar_dt <- foo[(b),
              .(a = .(a),
                n = .N)]
# Classes ‘data.table’ and 'data.frame': 1 obs. of 2 variables:
# $ a:List of 1
# ..$ : int 9 84 37 56 63 60 55 74 99 32 ...
# $ n: int 50005713
And then there's this, which allows data.table to coerce the list into a traditional rectangular form by repeating the single-element vectors to make them the same length as the long vector.
bar_dt2 <- foo[(b),
               .(a = a,
                 n = .N,
                 s = sum(b),
                 m = max(a))]
# Classes ‘data.table’ and 'data.frame': 50005713 obs. of 4 variables:
# $ a: int 9 84 37 56 63 60 55 74 99 32 ...
# $ n: int 50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 ...
# $ s: int 50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 50005713 ...
# $ m: int 100 100 100 100 100 100 100 100 100 100 ...
Benchmarking accessing elements
In terms of memory, bar_list and bar_dt are equivalent (here 190.7 MB), and bar_dt2 is much larger (762.9 MB). We have two general types of indexing we are interested in: indexing the single-element vectors like n, and indexing the long vectors like a. Nothing much changes from your microbenchmark, @jblood94. The fastest method is still the list, avoiding the data.table overhead.
microbenchmark::microbenchmark(
  bar_list$n,
  bar_dt[1, n],
  bar_dt[, n],
  bar_dt$n,
  bar_dt[["n"]],
  bar_dt[[2]],
  bar_dt2[1, n],
  check = "identical"
)
# Unit: nanoseconds
# expr min lq mean median uq max neval
# bar_list$n 200 500 940 850 1100 8700 100
# bar_dt[1, n] 237000 270450 371931 324100 465800 898100 100
# bar_dt[, n] 237100 278500 378559 340950 465800 954300 100
# bar_dt$n 800 1700 2961 2700 3250 19700 100
# bar_dt[["n"]] 5000 7750 10832 9850 12100 45200 100
# bar_dt[[2]] 5000 6950 10125 9050 11300 44400 100
# bar_dt2[1, n] 238800 278950 381454 338700 418350 925300 100
The list is still the fastest when indexing the longer vector. The only thing the rectangular bar_dt2 has going for it is that it's very slightly faster here than bar_dt, because it doesn't need an extra [[ to reach inside the list-column, but the plain list is faster still.
microbenchmark::microbenchmark(
  bar_list$a,
  bar_dt[, a[[1]]],
  bar_dt$a[[1]],
  bar_dt[["a"]][[1]],
  bar_dt[[1]][[1]],
  bar_dt2[, a],
  bar_dt2$a,
  bar_dt2[["a"]],
  bar_dt2[[1]],
  check = "identical"
)
# Unit: nanoseconds
# expr min lq mean median uq max neval
# bar_list$a 200 600 1966 1500 2600 19300 100
# bar_dt[, a[[1]]] 271900 420600 559898 537650 675300 1068900 100
# bar_dt$a[[1]] 1100 1900 5285 3700 6700 26500 100
# bar_dt[["a"]][[1]] 6600 8350 16123 10950 16650 70900 100
# bar_dt[[1]][[1]] 6100 8100 18479 11350 23750 91800 100
# bar_dt2[, a] 47781800 57865500 86941905 64417200 71424750 326081400 100
# bar_dt2$a 900 1500 4130 2250 5100 30600 100
# bar_dt2[["a"]] 5900 8100 18919 11550 26550 91700 100
# bar_dt2[[1]] 5800 8100 19484 11950 19300 282600 100
So that answers my original question -- there does seem to be a reason to prefer a ragged list when it comes to the efficiency of using it. It leads to a second question, though: which object is more efficient to create, and by how much? If the creation of bar_dt2 is much more efficient than any of the methods of getting to the ragged list, then the small access-time differences between bar_list$a and bar_dt$a may not matter.
Benchmarking list or data.table creation
Sticking with microbenchmark, but only running it once.
microbenchmark::microbenchmark(
  list1 <- with(
    foo[(b), ],
    list("a" = a,
         "n" = length(a),
         "s" = sum(b),
         "m" = max(a))),
  list2 <- unlist(
    foo[(b),
        .(a = .(a),
          n = .N,
          s = sum(b),
          m = max(a))], FALSE),
  list3 <- list(
    "a" = foo[(b), a],
    "n" = foo[(b), .N],
    "s" = foo[(b), sum(b)],
    "m" = foo[(b), max(a)]),
  list4 <- foo[(b),
               .(a = .(a),
                 n = .N,
                 s = sum(b),
                 m = max(a))],
  times = 1
)
# Unit: seconds
# min lq mean median uq max neval
# 1.263952 1.263952 1.263952 1.263952 1.263952 1.263952 1
# 1.928070 1.928070 1.928070 1.928070 1.928070 1.928070 1
# 4.501306 4.501306 4.501306 4.501306 4.501306 4.501306 1
# 1.800121 1.800121 1.800121 1.800121 1.800121 1.800121 1
tl;dr: It seems like using the with construction is the fastest by quite a bit, with (as expected) the list made by repeatedly calling foo being much slower by far. Since the with construction results in a ragged list automatically, which was also the fastest structure for accessing its contents, I think it's the winner for me. I absolutely never would have considered doing it like that, so I'm glad I asked the question.
A data.table will return a data.table if the j argument is a list, data.frame, or data.table. But the data.table way to get your ragged list would be to keep it as a data.table:
(bar_dt <- foo[(b),.(a = .(a), n = .N)])
#> a n
#> 1: 1,2 2
However, data.tables are lists of vectors, so if you really want a non-data.table list:
(bar_list <- unlist(foo[(b),.(a = .(a), n = .N)], FALSE))
#> $a
#> [1] 1 2
#>
#> $n
#> [1] 2
As for the question about accessing an element from a 1-row data.table vs. a list: data.table adds some additional overhead, so the list access will be faster:
microbenchmark::microbenchmark(
  bar_list$n,
  bar_dt[1, n],
  bar_dt[, n],
  bar_dt$n,
  bar_dt[["n"]],
  bar_dt[[2]],
  check = "identical"
)
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> bar_list$n 200 400 882 800 1100 4200 100
#> bar_dt[1, n] 274500 299850 347330 311750 358850 1217500 100
#> bar_dt[, n] 269100 294900 343569 309900 355950 803000 100
#> bar_dt$n 800 1350 2335 2300 2850 7600 100
#> bar_dt[["n"]] 5200 6150 9860 8450 11100 64000 100
#> bar_dt[[2]] 5000 6200 9601 8700 10550 49400 100
This is consistent with the documentation:
DT[["v"]] # same as DT[, v] but much faster
foo[(b), .(.("a" = a, "n" = .N)) ]
# V1
# <list>
# 1: 1,2
# 2: 2
This has V1 as a name because the outer .(..) means to return a new list/table, but since you don't name its immediately-inner argument, data.table names it itself. You can change that behavior with:
foo[(b), .(quux = .("a" = a, "n" = .N)) ]
# quux
# <list>
# 1: 1,2
# 2: 2
I also have options(datatable.print.class=TRUE) set by default in my R instance, which lets us see that this is a list-column (as you may have suspected). It's a list because of the inner .(.) inside the summary.
Unfortunately, data.table defaults to converting any list return in the j= argument into a data.table (recall that .(.) is a synonym for list(.), for convenience). If you want a ragged list, then you need to go outside of the normal data.table semantics.
with(foo[(b),], list("a" = a, "n" = length(a)))
# $a
# [1] 1 2
# $n
# [1] 2
noting that we can no longer use the .N special symbol, though it's very inexpensive to use length(a) here.
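If you'd rather avoid recomputing anything at all, another sketch (using the same small foo as above) is to materialize the subset once and build the list from it, with nrow() standing in for .N:

```r
library(data.table)

foo <- data.table(a = c(1, 2, 3), b = c(TRUE, TRUE, FALSE))

sub <- foo[(b)]          # materialize the subset once
list(a = sub$a,
     n = nrow(sub))      # nrow() plays the role of .N
# $a
# [1] 1 2
#
# $n
# [1] 2
```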