We have a very large data.table, to which we append columns, mainly via merge. Occasionally, this triggers a "Cannot allocate vector of size xx Gb" error, even though we know that this amount of memory is available on the system. Our suspicion is that the available memory is not part of one contiguous block, so we would like to somehow preallocate a larger chunk of RAM when creating the data.table.
One obvious suggestion is to create, at the outset, all the columns that will eventually be merged into our data.table from another one. However, this isn't necessarily going to work, because merge is designed not to overwrite the columns of DT1 with the same-named columns of DT2, but to rename them so that both can be kept.
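For illustration (this example is not from the original question), here is a minimal sketch of that renaming: when both tables carry a non-key column b, merge keeps both copies under suffixed names instead of overwriting either one.
library(data.table)
x = data.table(a = 1:3, b = 1:3)
y = data.table(a = 1:3, b = 4:6)
merge(x, y, by = "a")  # result has columns a, b.x, b.y: both copies of b are kept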
Is there anything else that can be done?
Minimal example:
x = data.table(a = 1:10, b=2:11)
y = data.table(a = 1:10, c=2:11)
# want this to happen in the most memory-efficient way possible
# and ideally without allocating new memory at all
# (i.e., want to be able to pre-allocate enough memory in x
# in line 1 to be able to do this)
x = merge(x, y, by="a")
Addressing the question from the code block: "want this to happen in the most memory-efficient way possible".
The most memory-efficient approach is to add columns to your x dataset by reference while doing the join. Since data.table v1.9.5 (the development version at the time of writing), you don't have to setkey before the join:
library(data.table)
x = data.table(a = 1:10, b=2:11)
y = data.table(a = 1:10, c=2:11)
x[y, c := i.c, on="a"]
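If y carried several new columns, the same by-reference idiom assigns them all in one join. A small sketch, where the extra column d is illustrative:
library(data.table)
x = data.table(a = 1:10, b = 2:11)
y = data.table(a = 1:10, c = 2:11, d = 12:21)
x[y, `:=`(c = i.c, d = i.d), on = "a"]  # both columns added by reference in one pass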
If you don't have the recent data.table version, you have to setkey in advance:
library(data.table)
x = data.table(a = 1:10, b=2:11, key="a")
y = data.table(a = 1:10, c=2:11, key="a")
x[y, c := i.c]
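As a side note on the preallocation idea itself (an aside, not part of the original answer): := can add columns without copying because data.table over-allocates spare column-pointer slots when a table is created (see ?truelength and options(datatable.alloccol)). If many columns will be joined in, you can reserve more slots up front with alloc.col:
library(data.table)
x = data.table(a = 1:10, b = 2:11)
truelength(x)            # column slots already allocated beyond ncol(x)
x = alloc.col(x, 2048)   # reserve more slots up front for future := columns
address(x)               # note the address of x ...
x[, e := 1L]             # ... then add a column by reference ...
address(x)               # ... the address is unchanged: no copy was made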