For a vector of logical values, why does R allocate 4 bytes, when a bit vector would consume 1 bit per entry? (See this question for examples.)
Now, I realize that R also facilitates storage of NA
values, but couldn't that be done with an additional bit vector? In other words, why isn't it enough to just use a cheap two bit data structure?
For what it's worth, Matlab uses 1 byte for logicals, though it doesn't facilitate NA values. I'm not sure why MathWorks isn't satisfied with one bit functionality, much less a two bit data structure, but they have fancy pants marketers... [I'm gonna milk "two bit" for all it's worth in this question. ;-)]
Update 1. I think that the architecture reasons offered make some sense, but that feels a little ex post facto. I haven't checked 32 bit or 16 bit R to see how large their logicals are - that could lend some support to the idea. It seems, from the R Internals manual that logical vectors (LGLSXP) and integers (INTSXP) are 32 bits on every platform. I can understand a universal size for integers, independent of word size. Similarly, storage of logicals also seems to be independent of word size. But it's so BIG. :)
In addition, if the word size argument is so powerful, it seems strange to me to see Matlab (I think it's a 32 bit Matlab) consume only 1 byte - I wonder if MathWorks chose to be more memory efficient with a tradeoff for programming complexity and some other overhead for finding sub-word objects.
Also, there are certainly other options in are: as Brian Diggs notes, the bit
package facilitates bit vectors, which was very useful for the problem in the question above (an 8X-10X speedup for the task was obtained by converting from 4 byte logical
values to bit vectors). Although speed of accessing memory is important, moving 30-31 extra uninformative bits (from an information theory perspective) is wasteful. For instance, one could use something like the memory tricks used for integers described here - grab a bunch of extra memory (V cells) and then process things at the bit level (a la bit()
). Why not do that and save 30 bits (1 for the value, 1 for NA
) for a long vector?
To the extent that my RAM and computational speed are affected by booleans, I intend to switch over to using bit
, but that's because a 97% savings in space matters in some cases. :)
I think that the answer to this question will come from someone with a deeper understanding of R's design or internals. The best example is that Matlab uses a different size for their logical, and memory word sizes wouldn't be the answer in that case. Python may be similar to R, for what it's worth.
A related way to phrase this might be: why would LGLSXP
be 4 bytes on all platforms? (Is CHARSXP
typically smaller, and wouldn't that work as well? Why not go even smaller, and just over-allocate?) (Updated The idea of using CHARSXP
is likely bogus, because operations on CHARSXP
aren't as fully useful as those for integers, such as sum
. Using the same data structure as characters might save space, but would constrain which existing methods could operate on it. A more appropriate consideration is the use of smaller integers, as discussed below.)
Update 2. There have been some very good and enlightening answers here, especially relative to how one should implement retrieval and processing of booleans for the goals of speed and programming efficiency. I think that Tommy's answer is particularly plausible regarding the why it appears this way in R, which seems to arise from 2 premises:
In order to support addition on a logical vector (note that "logical" is defined by programming language / environment, and is not the same as a boolean), one is best served by reusing code for adding integers. In the case of R, integers consume 4 bytes. In the case of Matlab, the smallest integer is 1 byte (i.e. int8
). This would explain why something different would be a nuisance to write for logicals. [To those not familiar with R, it supports many numerical operations on logicals, such as sum(myVector)
, mean(myVector)
, etc.]
Legacy support makes it exceedingly difficult to do something other than what has been done in R and S-Plus for a long time now. Moreover, I suspect that in the early days of S, S-Plus, and R, if someone was doing a lot of boolean operations, they did them in C, rather than trying to do so much work with logicals in R.
The other answers are fantastic for the purposes of how one might implement better boolean handling - don't naively assume that one can get at any individual bit: it's most efficient to load a word, then mask the bits that are not of interest, as Dervall has described. This is very, very useful advice should one write specialized code for boolean manipulation for R (e.g. my question on cross tabulations): don't iterate over bits, but instead work at the word level.
Thanks to all for a very thorough set of answers and insights.
Why does a System. Boolean take 4 bytes? It just stores one state, either true or false, which could be stored in less space than 4 bytes.
A bool takes in real 1 bit, as you need only 2 different values. However, when you do a sizeof(bool), it returns 1, meaning 1 byte. For practical reasons, the 7 bits remaining are stuffed. you can't store a variable of size less than 1 byte.
Internally, a Boolean variable is a 2-byte value holding –1 (for TRUE) or 0 (for FALSE). Any type of data can be assigned to Boolean variables. When assigning, non-0 values are converted to TRUE , and 0 values are converted to FALSE. When appearing as a structure member, Boolean members require 2 bytes of storage.
When you type a Boolean expression in R, R will output TRUE if the expression is true and FALSE if the expression is false. In R, typing a < b outputs whether a is less than b , typing a > b outputs whether a is greater than b , and typing a == b outputs whether a is equal to b .
Knowing a little something about R and S-Plus, I'd say that R most likely did it to be compatible with S-Plus, and S-Plus most likely did it because it was the easiest thing to do...
Basically, a logical vector is identical to an integer vector, so sum
and other algorithms for integers work pretty much unchanged on logical vectors.
In 64-bit S-Plus, the integers are 64-bit and thus also the logical vectors! That's 8 bytes per logical value...
@Iterator is of course correct that a logical vector should be represented in a more compact form. Since there is already a raw
vector type which is 1-byte, it would seem like a very simple change to use that one for logicals too. And 2 bits per value would of course be even better - I'd probably keep them as two separate bit vectors (TRUE/FALSE and NA/Valid), and the NA bit vector could be NULL if there are no NAs...
Anyway, that's mostly a dream since there are so many RAPI packages (packages that use the R C/FORTRAN APIs) out there that would break...
Without knowing R at all, I suspect for much the same reason as C does, because it's way faster to load a size equal to the processors native word size.
Loading a single bit would be slow, especially from a bitfield since you'd have to mask out the bits that do not apply to your particular query. With a whole word you can just load it in a registry and be done with it. Since the size difference usually is not a problem the default implementation is to use a word sized variable. If the user wants something else there is always the option to do the bit-shifting required manually.
Concerning packing, at least on some processors it will cause a fault to read from a non-aligned address. So while you might declare a structure with a single byte
in it surrounded by two int
the byte
might be padded to be 4 bytes in size regardless. Again, I don't know anything about R in particular, but I suspect the behaviour might be the same for performance reasons.
Addressing a single byte in an array is quite more involved, say you have an array bitfield
and want to address bit x
in it, the code would be something like this:
bit b = (bitfield[x/8] >> (x % 8)) & 1
to obtain either 0 or 1 for the bit you requested. In comparison to the straightforward array addressing of from a boolean array obtaining value number x: bool a = array[x]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With