If I take a slice of a table using, say, the column names, does R allocate memory to hold the slice in a new location? Specifically, I have a table with columns depth1 and depth2, among others. I want to add columns which contain the max and min of the two. I have two approaches:
dd = dat[,c("depth1","depth2")]
dat$mindepth = apply(dd,1,min)
dat$maxdepth = apply(dd,1,max)
remove(dd)
or
dat$mindepth = apply(dat[,c("depth1","depth2")],1,min)
dat$maxdepth = apply(dat[,c("depth1","depth2")],1,max)
If I am not using up new memory, I'd rather take the slice only once; otherwise I would like to save the reallocation. Which one is better? Memory issues can be critical when dealing with large datasets, so please don't downvote this with the "root of all evil" meme.
I know this doesn't actually answer the main thrust of the question (@hadley has done that and deserves credit), but there are other options to those you suggest. Here you could use pmin() and pmax() as another solution, and using with() or within() we can do it without explicit subsetting to create a dd.
R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> dat <- within(dat, mindepth <- pmin(depth1, depth2))
R> dat <- within(dat, maxdepth <- pmax(depth1, depth2))
R>
R> dat
depth1 depth2 mindepth maxdepth
1 0.26550866 0.2059746 0.20597457 0.2655087
2 0.37212390 0.1765568 0.17655675 0.3721239
3 0.57285336 0.6870228 0.57285336 0.6870228
4 0.90820779 0.3841037 0.38410372 0.9082078
5 0.20168193 0.7698414 0.20168193 0.7698414
6 0.89838968 0.4976992 0.49769924 0.8983897
7 0.94467527 0.7176185 0.71761851 0.9446753
8 0.66079779 0.9919061 0.66079779 0.9919061
9 0.62911404 0.3800352 0.38003518 0.6291140
10 0.06178627 0.7774452 0.06178627 0.7774452
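As a quick sanity check that pmin() really is a drop-in replacement for the row-wise apply() call in the question, something like the following sketch can be used. The names by_apply, by_pmin and same are only illustrative.

```r
set.seed(1)
dat <- data.frame(depth1 = runif(10), depth2 = runif(10))

# Row-wise approach from the question: one call to min() per row
by_apply <- apply(dat[, c("depth1", "depth2")], 1, min)

# Vectorised approach: element-wise minimum of the two columns
by_pmin <- pmin(dat$depth1, dat$depth2)

# unname() guards against any names apply() may attach to its result,
# so that only the values are compared
same <- isTRUE(all.equal(unname(by_apply), by_pmin))
print(same)
```

The same comparison works for max and pmax().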
We can look at how much copying goes on with tracemem(), but only if your R was compiled with the configure option --enable-memory-profiling activated.
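If you are not sure how your R was built, capabilities("profmem") should report whether memory profiling is available. The sketch below checks that first and only then traces a small throwaway vector; the names has_profmem, x and y are just for illustration.

```r
# capabilities("profmem") is TRUE when R was built with memory profiling
has_profmem <- isTRUE(unname(capabilities("profmem")))

if (has_profmem) {
  x <- c(1, 2, 3)
  tracemem(x)      # start reporting copies of x
  y <- x           # no copy yet: x and y share the same memory
  y[1] <- 10       # modifying y forces a copy, which tracemem reports
  untracemem(x)    # stop tracing
} else {
  message("This R build lacks memory profiling; tracemem() is unavailable.")
}
```

The addresses printed will differ from machine to machine and from session to session, so only the number of tracemem lines is worth comparing.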
R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x2641cd8>"
R> dat <- within(dat, mindepth <- pmin(depth1, depth2))
tracemem[0x2641cd8 -> 0x2641a00]: within.data.frame within
tracemem[0x2641a00 -> 0x2641878]: [<-.data.frame [<- within.data.frame within
R> tracemem(dat)
[1] "<0x2657bc8>"
R> dat <- within(dat, maxdepth <- pmax(depth1, depth2))
tracemem[0x2657bc8 -> 0x2c765d8]: within.data.frame within
tracemem[0x2c765d8 -> 0x2c764b8]: [<-.data.frame [<- within.data.frame within
So we see that R copied dat twice during each within() call. Compare that with your two suggestions:
R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x2e1ddd0>"
R> dd <- dat[,c("depth1","depth2")]
R> tracemem(dd)
[1] "<0x2df01a0>"
R> dat$mindepth = apply(dd,1,min)
tracemem[0x2df01a0 -> 0x2cf97d8]: as.matrix.data.frame as.matrix apply
tracemem[0x2e1ddd0 -> 0x2cc0ab0]:
tracemem[0x2cc0ab0 -> 0x2cc0b20]: $<-.data.frame $<-
tracemem[0x2cc0b20 -> 0x2cc0bc8]: $<-.data.frame $<-
R> tracemem(dat)
[1] "<0x26b93c8>"
R> dat$maxdepth = apply(dd,1,max)
tracemem[0x2df01a0 -> 0x2cc0e30]: as.matrix.data.frame as.matrix apply
tracemem[0x26b93c8 -> 0x26742c8]:
tracemem[0x26742c8 -> 0x2674358]: $<-.data.frame $<-
tracemem[0x2674358 -> 0x2674478]: $<-.data.frame $<-
Here, dd is copied once in each call to apply() because apply() converts dd to a matrix before proceeding. The final three lines in each block of tracemem output indicate that three copies of dat are being made to insert the new column.
What about your second option?
R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x268bc88>"
R> dat$mindepth <- apply(dat[,c("depth1","depth2")],1,min)
tracemem[0x268bc88 -> 0x26376b0]:
tracemem[0x26376b0 -> 0x2637720]: $<-.data.frame $<-
tracemem[0x2637720 -> 0x2637790]: $<-.data.frame $<-
R> tracemem(dat)
[1] "<0x2466d40>"
R> dat$maxdepth <- apply(dat[,c("depth1","depth2")],1,max)
tracemem[0x2466d40 -> 0x22ae0d8]:
tracemem[0x22ae0d8 -> 0x22ae1f8]: $<-.data.frame $<-
tracemem[0x22ae1f8 -> 0x22ae318]: $<-.data.frame $<-
This version avoids the copy involved in setting up dd, but in all other respects is similar to your previous suggestion.
Can we do any better? Yes, and one simple way is to use the within() option I started with but execute both statements to create the new mindepth and maxdepth variables in the one call to within():
R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x21c4158>"
R> dat <- within(dat, { mindepth <- pmin(depth1, depth2)
+ maxdepth <- pmax(depth1, depth2) })
tracemem[0x21c4158 -> 0x21c44a0]: within.data.frame within
tracemem[0x21c44a0 -> 0x21c4628]: [<-.data.frame [<- within.data.frame within
In this version we only incur two copies of dat, compared to the four copies of the original within() version.
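A related one-call variant is transform(), which also evaluates its arguments inside the data frame. This is only a sketch: I have not traced it with tracemem(), so I make no claim about how many copies it incurs. The name dat2 is illustrative.

```r
set.seed(1)
dat <- data.frame(depth1 = runif(10), depth2 = runif(10))

# Both new columns are created in a single call; they are appended
# in the order in which they are written
dat2 <- transform(dat,
                  mindepth = pmin(depth1, depth2),
                  maxdepth = pmax(depth1, depth2))

names(dat2)
```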
What about if we coerce dat to a matrix and then do the insertions?
R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x1f29c70>"
R> mat <- as.matrix.data.frame(dat)
tracemem[0x1f29c70 -> 0x1f09768]: as.matrix.data.frame
R> tracemem(mat)
[1] "<0x245ff30>"
R> mat <- cbind(mat, pmin(mat[,1], mat[,2]), pmax(mat[,1], mat[,2]))
R>
That is an improvement as we only incur the cost of the single copy of dat when coercing to a matrix. I cheated a bit by calling the as.matrix.data.frame() method directly. If we'd just used as.matrix() we'd have incurred another copy of mat.
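In the cbind() call above the two new columns end up without names. A small variation, sketched here with the same simulated data, names them and indexes the matrix by column name instead of by position. It assumes every column of dat is numeric; with mixed column types as.matrix() would coerce everything to character.

```r
set.seed(1)
dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
mat <- as.matrix(dat)

# Naming the arguments to cbind() names the new columns
mat <- cbind(mat,
             mindepth = pmin(mat[, "depth1"], mat[, "depth2"]),
             maxdepth = pmax(mat[, "depth1"], mat[, "depth2"]))

colnames(mat)
```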
This highlights one of the reasons why matrices are so much faster to use than data frames.