
data.table: preallocating memory for future columns

Tags:

r

data.table

We have a very large data.table to which we append columns, mainly via merge (merge.data.table). Occasionally this triggers a "Cannot allocate vector of size xx Gb" error, even though we know that this much memory is available on the system.

Our suspicion is that this is due to the fact that this memory isn't part of a contiguous block, so we would like to somehow preallocate a larger chunk of RAM when creating the data.table.

One obvious suggestion is to create, at the outset, all the columns that will eventually be merged into our data.table from another one. However, this isn't necessarily going to work, because merge is designed not to overwrite the columns of DT1 with those of DT2 that have the same name, but to rename them so that both can be kept.
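To make the renaming behaviour concrete, here is a minimal sketch. The tables and values are made up for illustration; the only assumption is that both tables share a non-key column name (here "b").

```r
library(data.table)

# Illustrative data: both tables contain a non-key column called "b"
x = data.table(a = 1:3, b = 4:6)
y = data.table(a = 1:3, b = 7:9)

# merge() keeps both "b" columns and disambiguates them with suffixes
m = merge(x, y, by = "a")
names(m)
```

So a pre-created placeholder column in x would not be filled in by merge; it would be kept alongside the incoming column under a suffixed name.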

Is there anything else that can be done?

Minimal example:

x = data.table(a = 1:10, b=2:11)
y = data.table(a = 1:10, c=2:11)

# want this to happen in the most memory-efficient way possible 
# and ideally without allocating new memory at all 
# (i.e., want to be able to pre-allocate enough memory in x 
# in line 1 to be able to do this)
x = merge(x, y, by="a")
msp asked Nov 10 '22 12:11



1 Answer

Addressing the question from the code block: "want this to happen in the most memory-efficient way possible".
The most memory-efficient approach is to add the columns to your x dataset by reference while doing the join, rather than building a new merged table.

As of data.table v1.9.5 (the development version at the time of writing), you don't have to call setkey before the join; you can name the join column with the on argument instead.

library(data.table)
x = data.table(a = 1:10, b=2:11)
y = data.table(a = 1:10, c=2:11)
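# update join: match rows on "a" and add y's column c to x by reference
# (the i. prefix refers to columns of y, the table in the i position)
x[y, c := i.c, on="a"]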

If you don't have a recent data.table version, you have to set the key in advance (here via the key argument; calling setkey on each table does the same).

library(data.table)
x = data.table(a = 1:10, b=2:11, key="a")
y = data.table(a = 1:10, c=2:11, key="a")
# with both tables keyed on "a", the join uses the keys and no on argument is needed
x[y, c := i.c]
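As a rough check that the update join really works in place, the sketch below adds two columns at once and compares the address of x before and after. The extra column d is made up for illustration. As far as I know, data.table over-allocates spare column slots when a table is created, which is why := can usually attach new columns without copying the table; the new column vectors themselves still need to be allocated.

```r
library(data.table)

x = data.table(a = 1:10, b = 2:11)
# y carries two extra columns here; "d" is a hypothetical column for illustration
y = data.table(a = 1:10, c = 2:11, d = letters[1:10])

# remember where x lives before the join
addr_before = address(x)

# update join: add both columns to x in place, matching rows on "a"
x[y, `:=`(c = i.c, d = i.d), on = "a"]

# compare addresses: an unchanged address means x was not copied
same_object = identical(addr_before, address(x))
same_object
names(x)
```

Compared with x = merge(x, y, by="a"), this avoids building a second full copy of x, which is where most of the extra memory goes.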
jangorecki answered Nov 15 '22 05:11
