I'm seeing odd memory usage when using assignment by reference by group in a data.table. Here's a simple example to demonstrate (please excuse the triviality of the example):
N <- 1e6
dt <- data.table(id=round(rnorm(N)), value=rnorm(N))
gc()
for (i in seq(100)) {
    dt[, value := value+1, by="id"]
}
gc()
tables()
which produces the following output:
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 303909 16.3 597831 32.0 407500 21.8
Vcells 2442853 18.7 3260814 24.9 2689450 20.6
> for (i in seq(100)) {
+ dt[, value := value+1, by="id"]
+ }
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 315907 16.9 597831 32.0 407500 21.8
Vcells 59966825 457.6 73320781 559.4 69633650 531.3
> tables()
NAME NROW MB COLS KEY
[1,] dt 1,000,000 16 id,value
Total: 16MB
So about 440MB of used Vcells memory was added by the loop. This memory is not accounted for even after removing the data.table from memory:
> rm(dt)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 320888 17.2 597831 32 407500 21.8
Vcells 57977069 442.4 77066820 588 69633650 531.3
> tables()
No objects of class data.table exist in .GlobalEnv
The memory leak seems to disappear when the by=... is removed from the assignment:
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 312955 16.8 597831 32.0 467875 25.0
Vcells 2458890 18.8 3279586 25.1 2704448 20.7
> for (i in seq(100)) {
+ dt[, value := value+1]
+ }
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 322698 17.3 597831 32.0 467875 25.0
Vcells 2478772 19.0 5826337 44.5 5139567 39.3
> tables()
NAME NROW MB COLS KEY
[1,] dt 1,000,000 16 id,value
Total: 16MB
To summarize, two questions: why does memory usage keep growing when assigning by reference with by=..., and why is that memory not released even after the data.table is removed?
For reference, here's the output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.10
loaded via a namespace (and not attached):
[1] tools_3.0.2
UPDATE from Matt - Now fixed in v1.8.11. From NEWS:
Long outstanding (usually small) memory leak in grouping fixed. When the last group is smaller than the largest group, the difference in those sizes was not being released. Most users run a grouping query once and will never have noticed, but anyone looping calls to grouping (such as when running in parallel, or benchmarking) may have suffered, #2648. Test added.
Many thanks to vc273, Y T and others.
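If you're unsure whether your installed version already includes the fix, a quick check in plain base R:
packageVersion("data.table") >= "1.8.11"  # TRUE once the fixed version is installed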
From Arun ...
Why was this happening?
I wish I had come across this post before spending time on this issue. Nevertheless, a nice learning experience. Simon Urbanek summarises the issue pretty succinctly: it's not a memory leak, but bad reporting of memory used/freed. I had the feeling this is what was happening.
What's the reason for this to happen in data.table? This part is about identifying the portion of code in dogroups.c responsible for the apparent memory increase.
Okay, so after some tedious testing, I think I've managed to at least find out what the reason is for this behaviour. Hopefully someone can help me get there from this post. My conclusion is that this is not a memory leak.
The short explanation is that this seems to be an effect of the use of the SETLENGTH function (from R's C interface) in data.table's dogroups.c.
In data.table, when you use by=..., for example,
set.seed(45)
DT <- data.table(x=sample(3, 12, TRUE), id=rep(3:1, c(2,4,6)))
DT[, list(y=mean(x)), by=id]
Corresponding to id=1, the values of "x" (= c(1,2,1,1,2,3)) have to be picked. This means having to allocate memory for .SD (all columns not in by) for each by value.
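One way to see which rows each group's .SD holds is to print it per group (a small illustrative check of my own, not part of dogroups.c):
DT[, { print(.SD); NULL }, by=id]  # prints the .SD subset for each id; returns nothing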
To overcome this allocation for each group in by, data.table accomplishes this cleverly by first allocating .SD with the length of the largest group in by (which here corresponds to id=1, length 6). Then, for each value of id, we can reuse the (overly) allocated data.table, and by using the function SETLENGTH we can just adjust the length to the length of the current group. Note that, by doing this, no memory is actually being allocated, except for the single allocation made for the biggest group.
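The group sizes, and hence the size of that one-time allocation, can be seen with .N (again, my own quick check):
DT[, .N, by=id]
#    id N
# 1:  3 2
# 2:  2 4
# 3:  1 6  <- the largest group; .SD is allocated once with length 6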
But what seems strange is that when all the groups in by have the same number of items, nothing special seems to happen in the gc() output. However, when the group sizes differ, gc() reports increasing usage in Vcells. This is in spite of the fact that no extra memory is allocated in either case.
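Before dropping down to C, this can be checked at the R level; a minimal sketch reusing the question's setup (the names dt_eq and dt_neq are mine):
library(data.table)
N <- 1e6
# equal group sizes: Vcells stay roughly flat over the iterations
dt_eq <- data.table(id=rep(1:100, each=N/100), value=rnorm(N))
gc(); for (i in 1:100) dt_eq[, value := value+1, by="id"]; gc()
# unequal group sizes: gc() reports steadily growing Vcells
dt_neq <- data.table(id=round(rnorm(N)), value=rnorm(N))
gc(); for (i in 1:100) dt_neq[, value := value+1, by="id"]; gc()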
To illustrate this point, I've written some C code that mimics the SETLENGTH function usage in data.table's dogroups.c.
// test.c
#include <R.h>
#define USE_RINTERNALS  // needed for DATAPTR and SETLENGTH
#include <Rinternals.h>
#include <Rdefines.h>
// test function - no checks! vec and SD are assumed to be integer
// vectors, lengths an integer vector of group sizes.
SEXP test(SEXP vec, SEXP SD, SEXP lengths)
{
    R_len_t i;
    char before_address[32], after_address[32];
    SEXP ans;
    PROTECT(ans = allocVector(STRSXP, 2));
    snprintf(before_address, 32, "%p", (void *)SD);
    for (i=0; i<LENGTH(lengths); i++) {
        // copy the first lengths[i] elements of vec into SD ...
        memcpy((char *)DATAPTR(SD), (char *)DATAPTR(vec), INTEGER(lengths)[i] * sizeof(int));
        // ... then adjust SD's length in place, as dogroups.c does per group
        SETLENGTH(SD, INTEGER(lengths)[i]);
        // do some computation here.. ex: mean(SD)
    }
    snprintf(after_address, 32, "%p", (void *)SD);
    // return SD's address before and after, to show SETLENGTH made no copy
    SET_STRING_ELT(ans, 0, mkChar(before_address));
    SET_STRING_ELT(ans, 1, mkChar(after_address));
    UNPROTECT(1);
    return(ans);
}
Here vec is equivalent to any data.table dt, SD is equivalent to .SD, and lengths holds the length of each group. This is just a dummy program. Basically, for each value of lengths, say n, the first n elements are copied from vec on to SD. Then one can compute whatever one wants on this SD (which is not done here). For our purposes, the address of SD before and after the operations using SETLENGTH is returned, to illustrate that SETLENGTH makes no copy.
Save this file as test.c and then compile it from the terminal as follows:
R CMD SHLIB -o test.so test.c
Now, open a new R session, go to the path where test.so exists and then type:
dyn.load("test.so")
require(data.table)
set.seed(45)
max_len <- as.integer(1e6)
lengths <- as.integer(sample(4:(max_len)/10, max_len/10))
gc()
vec <- 1:max_len
for (i in 1:100) {
    SD <- vec[1:max(lengths)]
    bla <- .Call("test", vec, SD, lengths)
    print(gc())
}
Note that for each i here, .SD would be allocated a different memory location, and that's replicated here by assigning SD afresh for each i.
By running this code, you'll find that 1) the two values returned are identical for each i and equal to address(SD), and 2) Vcells used Mb keeps increasing. Now, remove all variables from the workspace with rm(list=ls()) and then do gc(); you'll find that not all memory is being restored/freed.
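(As a quick check of point 1 after the loop, using the value returned by the last .Call:)
identical(bla[1], bla[2])  # TRUE: SD was modified in place, never copied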
Initial:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 332708 17.8 597831 32.0 467875 25.0
Vcells 1033531 7.9 2327578 17.8 2313676 17.7
After 100 runs:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 332912 17.8 597831 32.0 467875 25.0
Vcells 2631370 20.1 4202816 32.1 2765872 21.2
After rm(list=ls())
and gc()
:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 341275 18.3 597831 32.0 467875 25.0
Vcells 2061531 15.8 4202816 32.1 3121469 23.9
If you remove the line SETLENGTH(SD, ...) from the C code and run it again, you'll find that there's no change in the Vcells.
Now, as to why SETLENGTH on grouping with non-identical group lengths has this effect, I'm still trying to understand; check out the link in the edit above.