Why do logicals (booleans) in R require 4 bytes?

Tags:

For a vector of logical values, why does R allocate 4 bytes, when a bit vector would consume 1 bit per entry? (See this question for examples.)

Now, I realize that R also facilitates storage of NA values, but couldn't that be done with an additional bit vector? In other words, why isn't it enough to just use a cheap two bit data structure?

For what it's worth, Matlab uses 1 byte for logicals, though it doesn't facilitate NA values. I'm not sure why MathWorks isn't satisfied with one bit functionality, much less a two bit data structure, but they have fancy pants marketers... [I'm gonna milk "two bit" for all it's worth in this question. ;-)]

Update 1. I think that the architecture reasons offered make some sense, but that feels a little ex post facto. ~~I haven't checked 32 bit or 16 bit R to see how large their logicals are - that could lend some support to the idea.~~ It seems, from the R Internals manual that logical vectors (LGLSXP) and integers (INTSXP) are 32 bits on every platform. I can understand a universal size for integers, independent of word size. Similarly, storage of logicals also seems to be independent of word size. But it's so BIG. :)

In addition, if the word size argument is so powerful, it seems strange to me to see Matlab (I think it's a 32 bit Matlab) consume only 1 byte - I wonder if MathWorks chose to be more memory efficient with a tradeoff for programming complexity and some other overhead for finding sub-word objects.

Also, there are certainly other options in are: as Brian Diggs notes, the bit package facilitates bit vectors, which was very useful for the problem in the question above (an 8X-10X speedup for the task was obtained by converting from 4 byte logical values to bit vectors). Although speed of accessing memory is important, moving 30-31 extra uninformative bits (from an information theory perspective) is wasteful. For instance, one could use something like the memory tricks used for integers described here - grab a bunch of extra memory (V cells) and then process things at the bit level (a la bit()). Why not do that and save 30 bits (1 for the value, 1 for NA) for a long vector?

To the extent that my RAM and computational speed are affected by booleans, I intend to switch over to using bit, but that's because a 97% savings in space matters in some cases. :)

I think that the answer to this question will come from someone with a deeper understanding of R's design or internals. The best example is that Matlab uses a different size for their logical, and memory word sizes wouldn't be the answer in that case. Python may be similar to R, for what it's worth.

A related way to phrase this might be: why would LGLSXP be 4 bytes on all platforms? (Is CHARSXP typically smaller, and wouldn't that work as well? Why not go even smaller, and just over-allocate?) (Updated The idea of using CHARSXP is likely bogus, because operations on CHARSXP aren't as fully useful as those for integers, such as sum. Using the same data structure as characters might save space, but would constrain which existing methods could operate on it. A more appropriate consideration is the use of smaller integers, as discussed below.)

Update 2. There have been some very good and enlightening answers here, especially relative to how one should implement retrieval and processing of booleans for the goals of speed and programming efficiency. I think that Tommy's answer is particularly plausible regarding the why it appears this way in R, which seems to arise from 2 premises:

In order to support addition on a logical vector (note that "logical" is defined by programming language / environment, and is not the same as a boolean), one is best served by reusing code for adding integers. In the case of R, integers consume 4 bytes. In the case of Matlab, the smallest integer is 1 byte (i.e. int8). This would explain why something different would be a nuisance to write for logicals. [To those not familiar with R, it supports many numerical operations on logicals, such as sum(myVector), mean(myVector), etc.]
Legacy support makes it exceedingly difficult to do something other than what has been done in R and S-Plus for a long time now. Moreover, I suspect that in the early days of S, S-Plus, and R, if someone was doing a lot of boolean operations, they did them in C, rather than trying to do so much work with logicals in R.

The other answers are fantastic for the purposes of how one might implement better boolean handling - don't naively assume that one can get at any individual bit: it's most efficient to load a word, then mask the bits that are not of interest, as Dervall has described. This is very, very useful advice should one write specialized code for boolean manipulation for R (e.g. my question on cross tabulations): don't iterate over bits, but instead work at the word level.

Thanks to all for a very thorough set of answers and insights.

355

asked Feb 07 '12 14:02

Iterator

2 Answers

Knowing a little something about R and S-Plus, I'd say that R most likely did it to be compatible with S-Plus, and S-Plus most likely did it because it was the easiest thing to do...

Basically, a logical vector is identical to an integer vector, so sum and other algorithms for integers work pretty much unchanged on logical vectors.

In 64-bit S-Plus, the integers are 64-bit and thus also the logical vectors! That's 8 bytes per logical value...

@Iterator is of course correct that a logical vector should be represented in a more compact form. Since there is already a raw vector type which is 1-byte, it would seem like a very simple change to use that one for logicals too. And 2 bits per value would of course be even better - I'd probably keep them as two separate bit vectors (TRUE/FALSE and NA/Valid), and the NA bit vector could be NULL if there are no NAs...

Anyway, that's mostly a dream since there are so many RAPI packages (packages that use the R C/FORTRAN APIs) out there that would break...

116

answered Oct 08 '22 02:10

Tommy

Without knowing R at all, I suspect for much the same reason as C does, because it's way faster to load a size equal to the processors native word size.

Loading a single bit would be slow, especially from a bitfield since you'd have to mask out the bits that do not apply to your particular query. With a whole word you can just load it in a registry and be done with it. Since the size difference usually is not a problem the default implementation is to use a word sized variable. If the user wants something else there is always the option to do the bit-shifting required manually.

Concerning packing, at least on some processors it will cause a fault to read from a non-aligned address. So while you might declare a structure with a single byte in it surrounded by two int the byte might be padded to be 4 bytes in size regardless. Again, I don't know anything about R in particular, but I suspect the behaviour might be the same for performance reasons.

Addressing a single byte in an array is quite more involved, say you have an array bitfield and want to address bit x in it, the code would be something like this:

bit b = (bitfield[x/8] >> (x % 8)) & 1

to obtain either 0 or 1 for the bit you requested. In comparison to the straightforward array addressing of from a boolean array obtaining value number x: bool a = array[x]

answered Oct 08 '22 01:10

Dervall

Related questions
                            
                                Easier way to plot the cumulative frequency distribution in ggplot?
                            
                                Add error bars to show standard deviation on a plot in R
                            
                                What programming languages are good for statistics? [closed]
                            
                                Removing html tags from a string in R
                            
                                R-Project no applicable method for 'meta' applied to an object of class "character"
                            
                                Safely creating S3 Generics in R
                            
                                Cluster one-dimensional data optimally? [closed]
                            
                                R - plot human body in 2d [closed]
                            
                                Cartogram + choropleth map in R
                            
                                What are productive ways to debug Rcpp compiled code loaded in R (on OS X Mavericks)?
                            
                                Using lapply to apply a function over list of data frames and saving output to files with different names
                            
                                How do you handle R Data internal to a package?
                            
                                Command to see 'R' path that RStudio is using
                            
                                What's the differences between & and &&, | and || in R? [duplicate]
                            
                                Preventing performance regressions in R
                            
                                Why is enquo + !! preferable to substitute + eval
                            
                                Automatically adjust LaTeX table width to fit pdf using knitr and Rstudio
                            
                                Finding the index inside a vector satisfying a condition
                            
                                R: apt-get install r-cran-foo vs. install.packages("foo")
                            
                                find *all* duplicated records in data.table (not all-but-one)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why do logicals (booleans) in R require 4 bytes?

Tags:

memory

r

boolean

Iterator

People also ask

2 Answers

Tommy

Dervall

Recent Activity

Donate For Us