Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Do table slices take up memory in R?

If I take a slice of a table using, say the column names, does R allocate memory to hold the slice in a new location? Specifically, I have a table with columns depth1 and depth2, among others. I want to add columns which contain the max and min of the two. I have 2 approaches:

dd = dat[,c("depth1","depth2")]
dat$mindepth = apply(dd,1,min)
dat$maxdepth = apply(dd,1,max)
remove(dd)

or

dat$mindepth = apply(dat[,c("depth1","depth2")],1,min)
dat$maxdepth = apply(dat[,c("depth1","depth2")],1,max)

If I am not using up new memory, I'd rather take the slice only once, otherwise I would like save the reallocation. Which one is better? Memory issues can be critical when dealing with large datasets so please don't downvote this with the root of all evil meme.

like image 534
highBandWidth Avatar asked Mar 16 '11 22:03

highBandWidth


1 Answers

I know this doesn't actually answer the main thrust of the question (@hadley has done that and deserves credit), but there are other options to those you suggest. Here you could use pmin() and pmax() as another solution, and using with() or within() we can do it without explicit subsetting to create a dd.

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> dat <- within(dat, mindepth <- pmin(depth1, depth2))
R> dat <- within(dat, maxdepth <- pmax(depth1, depth2))
R> 
R> dat
       depth1    depth2   mindepth  maxdepth
1  0.26550866 0.2059746 0.20597457 0.2655087
2  0.37212390 0.1765568 0.17655675 0.3721239
3  0.57285336 0.6870228 0.57285336 0.6870228
4  0.90820779 0.3841037 0.38410372 0.9082078
5  0.20168193 0.7698414 0.20168193 0.7698414
6  0.89838968 0.4976992 0.49769924 0.8983897
7  0.94467527 0.7176185 0.71761851 0.9446753
8  0.66079779 0.9919061 0.66079779 0.9919061
9  0.62911404 0.3800352 0.38003518 0.6291140
10 0.06178627 0.7774452 0.06178627 0.7774452

We can look at how much copying goes on with tracemem() but only if your R was compiled with the following configure option activated --enable-memory-profiling.

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x2641cd8>"
R> dat <- within(dat, mindepth <- pmin(depth1, depth2))
tracemem[0x2641cd8 -> 0x2641a00]: within.data.frame within 
tracemem[0x2641a00 -> 0x2641878]: [<-.data.frame [<- within.data.frame within 
R> tracemem(dat)
[1] "<0x2657bc8>"
R> dat <- within(dat, maxdepth <- pmax(depth1, depth2))
tracemem[0x2657bc8 -> 0x2c765d8]: within.data.frame within 
tracemem[0x2c765d8 -> 0x2c764b8]: [<-.data.frame [<- within.data.frame within

So we see that R copied dat twice during each within() call. Compare that with your two suggestions:

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x2e1ddd0>"
R> dd <- dat[,c("depth1","depth2")]
R> tracemem(dd)
[1] "<0x2df01a0>"
R> dat$mindepth = apply(dd,1,min)
tracemem[0x2df01a0 -> 0x2cf97d8]: as.matrix.data.frame as.matrix apply 
tracemem[0x2e1ddd0 -> 0x2cc0ab0]: 
tracemem[0x2cc0ab0 -> 0x2cc0b20]: $<-.data.frame $<- 
tracemem[0x2cc0b20 -> 0x2cc0bc8]: $<-.data.frame $<- 
R> tracemem(dat)
[1] "<0x26b93c8>"
R> dat$maxdepth = apply(dd,1,max)
tracemem[0x2df01a0 -> 0x2cc0e30]: as.matrix.data.frame as.matrix apply 
tracemem[0x26b93c8 -> 0x26742c8]: 
tracemem[0x26742c8 -> 0x2674358]: $<-.data.frame $<- 
tracemem[0x2674358 -> 0x2674478]: $<-.data.frame $<-

Here, dd is copied once in each call to apply because apply() converts dd to a matrix before proceeding. The final three lines in the each block of tracemem output indicates three copies of dat are being made to insert the new column.

What about your second option?

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x268bc88>"
R> dat$mindepth <- apply(dat[,c("depth1","depth2")],1,min)
tracemem[0x268bc88 -> 0x26376b0]: 
tracemem[0x26376b0 -> 0x2637720]: $<-.data.frame $<- 
tracemem[0x2637720 -> 0x2637790]: $<-.data.frame $<- 
R> tracemem(dat)
[1] "<0x2466d40>"
R> dat$maxdepth <- apply(dat[,c("depth1","depth2")],1,max)
tracemem[0x2466d40 -> 0x22ae0d8]: 
tracemem[0x22ae0d8 -> 0x22ae1f8]: $<-.data.frame $<- 
tracemem[0x22ae1f8 -> 0x22ae318]: $<-.data.frame $<-

Here this version avoids the copy involved in setting up dd, but in all other respects is similar to your previous suggestion.

Can we do any better? Yes, and one simple way is to use the within() option I started with but execute both statements to create new mindepth and maxdepth variables in the one call to within():

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x21c4158>"
R> dat <- within(dat, { mindepth <- pmin(depth1, depth2)
+                      maxdepth <- pmax(depth1, depth2) })
tracemem[0x21c4158 -> 0x21c44a0]: within.data.frame within 
tracemem[0x21c44a0 -> 0x21c4628]: [<-.data.frame [<- within.data.frame within

In this version we only invoke two copies of dat compared to the 4 copies of the original within() version.

What about if we coerce dat to a matrix and then do the insertions?

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x1f29c70>"
R> mat <- as.matrix.data.frame(dat)
tracemem[0x1f29c70 -> 0x1f09768]: as.matrix.data.frame 
R> tracemem(mat)
[1] "<0x245ff30>"
R> mat <- cbind(mat, pmin(mat[,1], mat[,2]), pmax(mat[,1], mat[,2]))
R>

That is an improvement as we only incur the cost of the single copy of dat when coercing to a matrix. I cheated a bit by calling the as.matrix.data.frame() method directly. If we'd just used as.matrix() we'd have incurred another copy of mat.

This highlights one of the reasons why matrices are so much faster to use than data frames.

like image 188
Gavin Simpson Avatar answered Nov 18 '22 16:11

Gavin Simpson