Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sizes of integer vectors in R

I had thought that R had a standard overhead for storing objects (24 bytes, it seems, at least for integer vectors), but a simple test revealed that it's more complex than I realized. For instance, taking integer vectors up to length 100 (using random sampling, hoping to avoid any sneaky sequence compression tricks that might be out there), I found that different length vectors could have the same size, as follows:

> N   = 100
> V   = vector(length = 100)
> for(L in 1:N){
+     z = sample(N, L, replace = TRUE)
+     V[L]    = object.size(z)
+ }
> 
> options('width'=88)
> V
  [1]  48  48  56  56  72  72  72  72  88  88  88  88 104 104 104 104 168 168 168 168
 [21] 168 168 168 168 168 168 168 168 168 168 168 168 176 176 184 184 192 192 200 200
 [41] 208 208 216 216 224 224 232 232 240 240 248 248 256 256 264 264 272 272 280 280
 [61] 288 288 296 296 304 304 312 312 320 320 328 328 336 336 344 344 352 352 360 360
 [81] 368 368 376 376 384 384 392 392 400 400 408 408 416 416 424 424 432 432 440 440

I'm very impressed by the 152 values that shows up (observation: 152 = 128 + 24, though 280 = 256 + 24 isn't as prominent). Can someone explain how these allocations arise? I have been unable to find a clear definition in the documentation, though V cells come up.

like image 866
Iterator Avatar asked Aug 16 '11 14:08

Iterator


1 Answers

Even if you try N <- 10000, all values occur exactly twice, except for vectors of length :

  • 5 to 8 (56 bytes)
  • 9 to 12 (72 bytes)
  • 13 to 16 (88 bytes)
  • 17 to 32 (152 bytes)

The fact that the number of bytes occurs twice, comes from the simple fact that the memory is allocated in pieces of 8 bytes (referred to as Vcells in ?gc ) and integers take only 4 bytes.

Next to that, the internal structure of objects in R makes a distinguishment between small and large vectors for allocating memory. Small vectors are allocated in bigger blocks of about 2Kb, whereas larger vectors are allocated individually. The ‘small’ vectors consist of 6 defined classes, based on length, and are able to store vector data of up to 8, 16, 32, 48, 64 and 128 bytes. As an integer takes only 4 bytes, you have 2, 4, 8, 12, 16 and 32 integers you can store in these 6 classes. This explains the pattern you see.

The extra number of bytes is for the header (which forms the Ncells in ?gc). If you're really interested in all this, read through the R Internals manual.

And, as you guessed, the 24 extra bytes are from the headers (or Ncells ). It's in fact a bit more complicated than that, but the exact details can be found in the R internals manual

like image 100
Joris Meys Avatar answered Sep 20 '22 13:09

Joris Meys