I'm seeing odd memory usage when using assignment by reference by group in a data.table. Here's a simple example to demonstrate (please excuse the triviality of the example):
N <- 1e6
dt <- data.table(id=round(rnorm(N)), value=rnorm(N))
gc()
for (i in seq(100)) {
    dt[, value := value+1, by="id"]
}
gc()
tables()
which produces the following output:
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 303909 16.3 597831 32.0 407500 21.8
Vcells 2442853 18.7 3260814 24.9 2689450 20.6
> for (i in seq(100)) {
+ dt[, value := value+1, by="id"]
+ }
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 315907 16.9 597831 32.0 407500 21.8
Vcells 59966825 457.6 73320781 559.4 69633650 531.3
> tables()
NAME NROW MB COLS KEY
[1,] dt 1,000,000 16 id,value
Total: 16MB
So about 440MB of used Vcells memory was added by the loop. This memory is not accounted for even after removing the data.table from memory:
> rm(dt)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 320888 17.2 597831 32 407500 21.8
Vcells 57977069 442.4 77066820 588 69633650 531.3
> tables()
No objects of class data.table exist in .GlobalEnv
The memory leak seems to disappear when the by=... is removed from the assignment:
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 312955 16.8 597831 32.0 467875 25.0
Vcells 2458890 18.8 3279586 25.1 2704448 20.7
> for (i in seq(100)) {
+ dt[, value := value+1]
+ }
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 322698 17.3 597831 32.0 467875 25.0
Vcells 2478772 19.0 5826337 44.5 5139567 39.3
> tables()
NAME NROW MB COLS KEY
[1,] dt 1,000,000 16 id,value
Total: 16MB
To summarize, two questions: why does memory usage keep growing when assigning by reference with by=..., and why is that memory not released even after the data.table is removed?
For reference, here's the output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.10
loaded via a namespace (and not attached):
[1] tools_3.0.2
UPDATE from Matt - Now fixed in v1.8.11. From NEWS:
Long outstanding (usually small) memory leak in grouping fixed. When the last group is smaller than the largest group, the difference in those sizes was not being released. Most users run a grouping query once and will never have noticed, but anyone looping calls to grouping (such as when running in parallel, or benchmarking) may have suffered, #2648. Test added.
Many thanks to vc273, Y T and others.
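If you're unsure whether your installed version already includes the fix, a quick check in plain base R:
packageVersion("data.table") >= "1.8.11"  # TRUE once the fixed version is installed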
From Arun ...
Why was this happening?
I wish I had come across this post before spending time on this issue. Nevertheless, a nice learning experience. Simon Urbanek summarises the issue pretty succinctly: it's not a memory leak, but bad reporting of memory used/freed. I had the feeling this is what was happening.
What's the reason for this to happen in data.table? This part is about identifying the portion of code in dogroups.c responsible for the apparent memory increase.
Okay, so after some tedious testing, I think I've managed to at least find out what the reason is for this behaviour. Hopefully someone can help me get there from this post. My conclusion is that this is not a memory leak.
The short explanation is that this seems to be an effect of the use of the SETLENGTH function (from R's C interface) in data.table's dogroups.c.
In data.table, when you use by=..., for example,
set.seed(45)
DT <- data.table(x=sample(3, 12, TRUE), id=rep(3:1, c(2,4,6)))
DT[, list(y=mean(x)), by=id]
Corresponding to id=1, the values of "x" (= c(1,2,1,1,2,3)) have to be picked. This means having to allocate memory for .SD (all columns not in by) for each by value.
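One way to see which rows each group's .SD holds is to print it per group (a small illustrative check of my own, not part of dogroups.c):
DT[, { print(.SD); NULL }, by=id]  # prints the .SD subset for each id; returns nothing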
To overcome this allocation for each group in by, data.table accomplishes this cleverly by first allocating .SD with the length of the largest group in by (which here corresponds to id=1, length 6). Then, for each value of id, we can reuse the (overly) allocated data.table, and by using the function SETLENGTH we can just adjust the length to the length of the current group. Note that, by doing this, no memory is actually being allocated, except for the single allocation made for the biggest group.
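The group sizes, and hence the size of that one-time allocation, can be seen with .N (again, my own quick check):
DT[, .N, by=id]
#    id N
# 1:  3 2
# 2:  2 4
# 3:  1 6  <- the largest group; .SD is allocated once with length 6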
But what seems strange is that when all the groups in by have the same number of items, nothing special seems to happen in the gc() output. However, when the group sizes differ, gc() reports increasing usage in Vcells. This is in spite of the fact that no extra memory is allocated in either case.
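Before dropping down to C, this can be checked at the R level; a minimal sketch reusing the question's setup (the names dt_eq and dt_neq are mine):
library(data.table)
N <- 1e6
# equal group sizes: Vcells stay roughly flat over the iterations
dt_eq <- data.table(id=rep(1:100, each=N/100), value=rnorm(N))
gc(); for (i in 1:100) dt_eq[, value := value+1, by="id"]; gc()
# unequal group sizes: gc() reports steadily growing Vcells
dt_neq <- data.table(id=round(rnorm(N)), value=rnorm(N))
gc(); for (i in 1:100) dt_neq[, value := value+1, by="id"]; gc()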
To illustrate this point, I've written some C code that mimics the SETLENGTH function usage in data.table's dogroups.c.
// test.c
#include <R.h>
#define USE_RINTERNALS  // needed for DATAPTR and SETLENGTH
#include <Rinternals.h>
#include <Rdefines.h>
// test function - no checks! vec and SD are assumed to be integer
// vectors, lengths an integer vector of group sizes.
SEXP test(SEXP vec, SEXP SD, SEXP lengths)
{
    R_len_t i;
    char before_address[32], after_address[32];
    SEXP ans;
    PROTECT(ans = allocVector(STRSXP, 2));
    snprintf(before_address, 32, "%p", (void *)SD);
    for (i=0; i<LENGTH(lengths); i++) {
        // copy the first lengths[i] elements of vec into SD ...
        memcpy((char *)DATAPTR(SD), (char *)DATAPTR(vec), INTEGER(lengths)[i] * sizeof(int));
        // ... then adjust SD's length in place, as dogroups.c does per group
        SETLENGTH(SD, INTEGER(lengths)[i]);
        // do some computation here.. ex: mean(SD)
    }
    snprintf(after_address, 32, "%p", (void *)SD);
    // return SD's address before and after, to show SETLENGTH made no copy
    SET_STRING_ELT(ans, 0, mkChar(before_address));
    SET_STRING_ELT(ans, 1, mkChar(after_address));
    UNPROTECT(1);
    return(ans);
}
Here vec is equivalent to any data.table dt, SD is equivalent to .SD, and lengths holds the length of each group. This is just a dummy program. Basically, for each value of lengths, say n, the first n elements are copied from vec on to SD. Then one can compute whatever one wants on this SD (which is not done here). For our purposes, the address of SD before and after the operations using SETLENGTH is returned, to illustrate that SETLENGTH makes no copy.
Save this file as test.c and then compile it from the terminal as follows:
R CMD SHLIB -o test.so test.c
Now, open a new R session, go to the path where test.so exists and then type:
dyn.load("test.so")
require(data.table)
set.seed(45)
max_len <- as.integer(1e6)
lengths <- as.integer(sample(4:(max_len)/10, max_len/10))
gc()
vec <- 1:max_len
for (i in 1:100) {
    SD <- vec[1:max(lengths)]
    bla <- .Call("test", vec, SD, lengths)
    print(gc())
}
Note that for each i here, .SD would be allocated a different memory location, and that's replicated here by assigning SD afresh for each i.
By running this code, you'll find that 1) the two values returned are identical for each i and equal to address(SD), and 2) Vcells used Mb keeps increasing. Now, remove all variables from the workspace with rm(list=ls()) and then do gc(); you'll find that not all memory is being restored/freed.
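(As a quick check of point 1 after the loop, using the value returned by the last .Call:)
identical(bla[1], bla[2])  # TRUE: SD was modified in place, never copied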
Initial:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 332708 17.8 597831 32.0 467875 25.0
Vcells 1033531 7.9 2327578 17.8 2313676 17.7
After 100 runs:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 332912 17.8 597831 32.0 467875 25.0
Vcells 2631370 20.1 4202816 32.1 2765872 21.2
After rm(list=ls())
and gc()
:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 341275 18.3 597831 32.0 467875 25.0
Vcells 2061531 15.8 4202816 32.1 3121469 23.9
If you remove the line SETLENGTH(SD, ...) from the C code and run it again, you'll find that there's no change in the Vcells.
Now, as to why SETLENGTH on grouping with non-identical group lengths has this effect, I'm still trying to understand; check out the link in the edit above.