
Why is an empty matrix 208 bytes? [duplicate]

I was interested in the memory usage of matrices in R when I observed something strange. In a loop, I grew the number of columns of a matrix and, at each step, recorded the object size like this:

x <- 10
size <- matrix(1:x, x, 2)      # column 1: number of columns, column 2: size in bytes

for (i in 1:x){
  m <- matrix(1, 2, i)         # a 2-row matrix with i columns
  size[i, 2] <- object.size(m)
}

Which gives

plot(size[,1], size[,2], xlab="n columns", ylab="memory")

[Plot: memory usage vs. number of columns]

It seems that matrices with 2 rows and 5, 6, 7 or 8 columns use exactly the same amount of memory. How can we explain that?

asked Nov 17 '22 by DJack

1 Answer

To understand what's going on here, you need to know a little bit about the memory overhead associated with objects in R. Every object, even one that contains no data, carries 40 bytes of overhead:

x0 <- numeric()
object.size(x0)
# 40 bytes

This memory is used to store the type of the object (as returned by typeof()), and other metadata needed for memory management.
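To see that this overhead does not depend on the type of the vector, we can compare empty vectors of a few atomic types (a quick check; the exact constant varies by R version and platform, but it should be the same across types):

# empty vectors of different atomic types report the same base size,
# because the overhead is the object header, not the data
sapply(list(logical(), integer(), numeric(), character()), object.size)
# 40 40 40 40   (on the R version used here)
sapply(list(logical(), integer(), numeric(), character()), typeof)
# "logical"   "integer"   "double"    "character"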

Ignoring this overhead, you might expect the memory usage of a vector to be proportional to its length. Let's check that with a couple of plots:

sizes <- sapply(0:50, function(n) object.size(seq_len(n)))
plot(c(0, 50), c(0, max(sizes)), xlab = "Length", ylab = "Bytes", 
  type = "n")
abline(h = 40, col = "grey80")                    # 40-byte overhead
abline(h = 40 + 128, col = "grey80")              # largest "small vector" size
abline(a = 40, b = 4, col = "grey90", lwd = 4)    # 4 bytes per integer
lines(0:50, sizes, type = "s")                    # x-axis = vector length

[Plot: memory usage of vectors]

It looks like memory usage is roughly proportional to the length of the vector, but there is a big discontinuity at 168 bytes and smaller discontinuities every few steps. The big discontinuity exists because R has two storage pools for vectors: small vectors, managed by R, and big vectors, managed by the OS (this is a performance optimisation, because allocating lots of small chunks of memory is expensive). Small vectors can only be 8, 16, 32, 48, 64 or 128 bytes long, which, once we remove the 40-byte overhead, is exactly what we see:

sizes - 40
#  [1]   0   8   8  16  16  32  32  32  32  48  48  48  48  64  64  64  64 128 128 128 128
# [22] 128 128 128 128 128 128 128 128 128 128 128 128 136 136 144 144 152 152 160 160 168
# [43] 168 176 176 184 184 192 192 200 200

The jump from 64 to 128 bytes causes the big step; then, once we've crossed into the big vector pool, vectors are allocated in chunks of 8 bytes (memory comes in units of a certain size, and R can't ask for half a unit):

diff(sizes)
#  [1]  8  0  8  0 16  0  0  0 16  0  0  0 16  0  0  0 64  0  0  0  0  0  0  0  0  0  0  0
# [29]  0  0  0  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0
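To make the rounding rule concrete, here is a small sketch (not R's actual allocator, just the arithmetic described above) that predicts the payload of an integer vector of length n from the pool classes and reproduces the sizes - 40 sequence:

pools <- c(0, 8, 16, 32, 48, 64, 128)
predict_payload <- function(n) {
  bytes <- 4 * n                      # integers are 4 bytes each
  if (bytes <= 128) {
    min(pools[pools >= bytes])        # round up to the next small-vector pool
  } else {
    8 * ceiling(bytes / 8)            # big pool: round up to an 8-byte chunk
  }
}
all(sapply(0:50, predict_payload) == sizes - 40)
# TRUE   (with the sizes computed above)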

So how does this behaviour correspond to what you see with matrices? Well, first we need to look at the overhead associated with a matrix:

xv <- numeric()
xm <- matrix(xv)

object.size(xm)
# 200 bytes

object.size(xm) - object.size(xv)
# 160 bytes

So a matrix needs an extra 160 bytes of storage compared to a vector. Why 160 bytes? It's because a matrix has a dim attribute containing two integers, and attributes are stored in a pairlist (an older version of list()):

object.size(pairlist(dims = c(1L, 1L)))
# 160 bytes
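As a quick cross-check, dropping the dim attribute with as.vector() gives back the plain-vector size, so the difference isolates the attribute overhead (exact values can differ between R versions):

m <- matrix(1:10, 2, 5)
v <- as.vector(m)                # same data, dim attribute dropped
attributes(m)
# $dim
# [1] 2 5
object.size(m) - object.size(v)
# 160 bytes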

If we re-draw the previous plot using matrices instead of vectors, increasing all the constants on the y-axis by 160, we can see that the discontinuity corresponds exactly to the jump from the small vector pool to the big vector pool:

msizes <- sapply(0:50, function(n) object.size(as.matrix(seq_len(n))))
plot(c(0, 50), c(160, max(msizes)), xlab = "Length", ylab = "Bytes", 
  type = "n")
abline(h = 40 + 160, col = "grey80")                    # overhead + dim attribute
abline(h = 40 + 160 + 128, col = "grey80")              # largest "small vector" size
abline(a = 40 + 160, b = 4, col = "grey90", lwd = 4)    # 4 bytes per integer
lines(0:50, msizes, type = "s")                         # x-axis = vector length

[Plot: memory usage of matrices]
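And indeed, with the constants above, the matrix sizes are just the vector sizes shifted up by the attribute overhead (a quick check; the exact shift depends on your R version):

all(msizes - sizes == 160)
# TRUE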

answered Jan 01 '23 by hadley