R seems to require four bytes of storage per integer, even for small ones:
> object.size(rep(1L, 10000))
40040 bytes
And, what's more, the same is true even for factors:
> object.size(factor(rep(1L, 10000)))
40456 bytes
I think that, especially in the latter case, this could be handled much better. Is there a solution that would help me reduce the storage requirements for this case to eight or even two bits per row? Perhaps a solution that uses the raw type internally for storage but otherwise behaves like a normal factor. The bit package offers this for bits, but I haven't found anything similar for factors.
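For reference, this is roughly the kind of saving the bit package achieves for logicals (a quick sketch; bit packs 32 values into each integer word, so the printed size is on the order of 1.3 kB rather than the 40,040 bytes of logical(10000)):
> library(bit)
> object.size(bit(10000))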
My data frame with just a few million rows is consuming gigabytes, and that is a huge waste of both memory and run time(!). Compression would reduce the required disk space, but again at the expense of run time.
Since you mention raw (and assuming you have fewer than 256 factor levels), you could do the requisite conversion operations yourself if memory is your bottleneck and CPU time isn't. For example:
f = factor(rep(1L, 1e5))
object.size(f)
# 400456 bytes
f.raw = as.raw(f)  # 1 byte per element instead of 4
object.size(f.raw)
# 100040 bytes
# to go back:
identical(as.factor(as.integer(f.raw)), f)
# [1] TRUE
You can also save the factor levels separately and recover them later if that's something you're interested in doing; as far as grouping and the like goes, you can do it all with raw and never go back to factors (except for presentation).
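A bare-bones version of that save-and-recover round trip (using f and f.raw from above; essentially what the byte.factor class below packages up, and it assumes fewer than 256 levels):
lev = levels(f)                            # save the levels once, separately
f2 = factor(lev[as.integer(f.raw)], lev)   # rebuild the factor from the raw codes
identical(f2, f)
# [1] TRUE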
If you have specific use cases where you run into trouble with this method, please post them; otherwise I think this should work just fine.
Here's a starting point for your byte.factor class:
byte.factor = function(f) {
  res = as.raw(f)                    # 1-byte codes; assumes < 256 levels
  attr(res, "levels") <- levels(f)   # carry the level labels along
  attr(res, "class") <- "byte.factor"
  res
}

as.factor.byte.factor = function(b) {
  # map the 1-byte codes back through the stored levels
  factor(attributes(b)$levels[as.integer(b)], attributes(b)$levels)
}
So you can do things like:
f = factor(c('a','b'), letters)
f
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
b = byte.factor(f)
b
#[1] 01 02
#attr(,"levels")
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#[20] "t" "u" "v" "w" "x" "y" "z"
#attr(,"class")
#[1] "byte.factor"
as.factor.byte.factor(b)
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
Check out how data.table overrides rbind.data.frame if you want to make as.factor generic and just add whatever functions you want to add; it should all be quite straightforward.
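A rough sketch of that dispatch setup (my illustration, not code from data.table): base R's as.factor is not an S3 generic, so you shadow it with one that falls back to the base version:
as.factor = function(x, ...) UseMethod("as.factor")
as.factor.default = function(x, ...) base::as.factor(x)
# as.factor.byte.factor defined above is now found by method dispatch:
as.factor(b)   # same result as calling as.factor.byte.factor(b) directly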
One other solution is using ff. ff supports the following vmodes/types (see ?vmode):
'boolean'    'as.boolean'    1 bit logical without NA
'logical'    'as.logical'    2 bit logical with NA
'quad'       'as.quad'       2 bit unsigned integer without NA
'nibble'     'as.nibble'     4 bit unsigned integer without NA
'byte'       'as.byte'       8 bit signed integer with NA
'ubyte'      'as.ubyte'      8 bit unsigned integer without NA
'short'      'as.short'      16 bit signed integer with NA
'ushort'     'as.ushort'     16 bit unsigned integer without NA
'integer'    'as.integer'    32 bit signed integer with NA
'single'     'as.single'     32 bit float
'double'     'as.double'     64 bit float
'complex'    'as.complex'    2x64 bit float
'raw'        'as.raw'        8 bit unsigned char
'character'  'as.character'  character
For example:
library(ff)
v <- ff(as.factor(sample(letters[1:4], 10000, replace=TRUE)), vmode="byte",
        levels=letters[1:4])
This will use only one byte per element. An added advantage (or disadvantage) is that when the data becomes too large to hold in memory, it is automatically stored on disk, which of course will affect performance.
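A quick way to inspect what you got (vmode and standard indexing are part of ff; the exact printed output may differ between ff versions):
vmode(v)   # "byte" -- one byte per element
v[1:5]     # indexing returns ordinary factor values
levels(v)  # the levels are stored only once, as with a regular factor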
However, whatever solution you use, you will probably run into reduced performance. R internally uses integers for factors, so before calling any R method the data will have to be translated from the compact storage back to R's integers, which costs time, unless you only use methods written specifically for the compact storage type (and those would probably have to be written in C/C++/...).
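To make that trade-off concrete with the byte.factor sketch from above: every base-R operation on the compact codes first pays for an expansion back to a full factor or integer vector, for example:
# tabulating the compact codes means converting back to a factor first:
table(as.factor.byte.factor(b))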