I was looking at the memory usage of matrices in R when I observed something strange. In a loop, I grew the number of columns of a matrix and computed the object size at each step, like this:
x <- 10
size <- matrix(1:x, x, 2)
for (i in 1:x){
  m <- matrix(1, 2, i)
  size[i, 2] <- object.size(m)
}
Plotting the result:
plot(size[,1], size[,2], xlab="n columns", ylab="memory")
It seems that matrices with 2 rows and 5, 6, 7 or 8 columns use the exact same memory. How can we explain that?
To understand what's going on here, you need to know a little bit about the memory overhead associated with objects in R. Every object, even one with no data, has 40 bytes of memory associated with it:
x0 <- numeric()
object.size(x0)
# 40 bytes
This memory is used to store the type of the object (as returned by typeof()) and other metadata needed for memory management.
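This baseline does not depend on the vector's type. Here is a quick check; note that the 40-byte figure is specific to the R build used in this post, and more recent versions report a larger header:
# Empty vectors of different types all report the same size, because only
# the header (type, length and memory-management metadata) is stored.
for (empty in list(integer(), double(), logical(), complex())) {
  cat(typeof(empty), object.size(empty), "bytes\n")
}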
Ignoring this overhead, you might expect the memory usage of a vector to be proportional to its length. Let's check that with a plot:
sizes <- sapply(0:50, function(n) object.size(seq_len(n)))
plot(c(0, 50), c(0, max(sizes)), xlab = "Length", ylab = "Bytes",
type = "n")
abline(h = 40, col = "grey80")
abline(h = 40 + 128, col = "grey80")
abline(a = 40, b = 4, col = "grey90", lwd = 4)
lines(sizes, type = "s")
It looks like memory usage is roughly proportional to the length of the vector, but there is a big discontinuity at 168 bytes, and small discontinuities every few steps. The big discontinuity occurs because R has two storage pools for vectors: small vectors, managed by R, and big vectors, managed by the OS (a performance optimisation, since allocating lots of small chunks of memory is expensive). Small vectors can only be 8, 16, 32, 48, 64, or 128 bytes long, which, once we remove the 40-byte overhead, is exactly what we see:
sizes - 40
# [1] 0 8 8 16 16 32 32 32 32 48 48 48 48 64 64 64 64 128 128 128 128
# [22] 128 128 128 128 128 128 128 128 128 128 128 128 136 136 144 144 152 152 160 160 168
# [43] 168 176 176 184 184 192 192 200 200
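Another way to see the size classes is to group vector lengths by their data allocation, using the sizes computed above (a quick sketch):
# Lengths that share a storage class are grouped together; for example,
# integer vectors of length 5 to 8 all get a 32-byte data allocation.
split(0:50, sizes - 40)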
The jump from the 64-byte class to the 128-byte class causes the big step; then, once we've crossed into the big vector pool, vectors are allocated in chunks of 8 bytes (memory comes in units of a certain size, and R can't ask for half a unit):
diff(sizes)
# [1] 8 0 8 0 16 0 0 0 16 0 0 0 16 0 0 0 64 0 0 0 0 0 0 0 0 0 0 0
# [29] 0 0 0 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0
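As a sanity check, the whole sequence can be reconstructed from these two rules, the small-vector size classes and 8-byte rounding in the big pool (a sketch that assumes 4-byte integers and the 40-byte header reported on this R build):
# Data size classes for the small-vector pool, as listed above (in bytes).
small_classes <- c(0, 8, 16, 32, 48, 64, 128)

predict_size <- function(n) {
  data_bytes <- 4 * n                      # seq_len() produces 4-byte integers
  if (data_bytes <= 128) {
    # Small pool: round up to the next size class.
    40 + min(small_classes[small_classes >= data_bytes])
  } else {
    # Big pool: round up to the next multiple of 8 bytes.
    40 + 8 * ceiling(data_bytes / 8)
  }
}

all(sapply(0:50, predict_size) == sizes)
# TRUE (on the R build used in this post)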
So how does this behaviour correspond to what you see with matrices? Well, first we need to look at the overhead associated with a matrix:
xv <- numeric()
xm <- matrix(xv)
object.size(xm)
# 200 bytes
object.size(xm) - object.size(xv)
# 160 bytes
So a matrix needs an extra 160 bytes of storage compared to a vector. Why 160 bytes? It's because a matrix has a dim attribute containing two integers, and attributes are stored in a pairlist (an older version of list()):
object.size(pairlist(dims = c(1L, 1L)))
# 160 bytes
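The same 160 bytes appear if you add a dim attribute to an existing vector. This is a quick sketch; the exact figure will differ on other R versions:
v <- seq_len(10)     # plain integer vector
m <- v
dim(m) <- c(2L, 5L)  # same data, now a 2 x 5 matrix with a dim attribute
object.size(m) - object.size(v)
# 160 bytes (on the R build used in this post)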
If we re-draw the previous plot using matrices instead of vectors, and increase all constants on the y-axis by 160, you can see the discontinuity corresponds exactly to the jump from the small vector pool to the big vector pool:
msizes <- sapply(0:50, function(n) object.size(as.matrix(seq_len(n))))
plot(c(0, 50), c(160, max(msizes)), xlab = "Length", ylab = "Bytes",
type = "n")
abline(h = 40 + 160, col = "grey80")
abline(h = 40 + 160 + 128, col = "grey80")
abline(a = 40 + 160, b = 4, col = "grey90", lwd = 4)
lines(msizes, type = "s")
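Putting the pieces together also answers the original question. A 2-row matrix of doubles with 5, 6, 7, or 8 columns holds 80, 96, 112, or 128 bytes of data, and all four of those fall into the same 128-byte small-vector class, so each matrix costs 40 (header) + 160 (dim attribute) + 128 (data) = 328 bytes. A quick check, assuming the same R build as above:
# All four matrices land in the 128-byte small-vector class, so their
# reported sizes are identical.
sapply(5:8, function(i) object.size(matrix(1, 2, i)))
# [1] 328 328 328 328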