 

Save storage space for small integers or factors with few levels

R seems to require four bytes of storage per integer, even for small ones:

> object.size(rep(1L, 10000))
40040 bytes

And, what is more, even for factors:

> object.size(factor(rep(1L, 10000)))
40456 bytes

I think that, especially in the latter case, this could be handled much better. Is there a solution that would help me reduce the storage requirements for this case to eight or even two bits per row? Perhaps a solution that uses the raw type internally for storage but behaves like a normal factor otherwise. The bit package offers this for logical vectors, but I haven't found anything similar for factors.

My data frame with just a few million rows is consuming gigabytes, which is a huge waste of memory and run time (!). Compression reduces the required disk space, but again at the expense of run time.
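For comparison, the bit package mentioned above packs logicals into single bits. A minimal sketch of the savings it achieves (assuming the bit package is installed):

```r
library(bit)

x <- rep(TRUE, 1e5)
b <- as.bit(x)      # packs one bit per element instead of four bytes

object.size(x)      # ~400 KB as a plain logical vector
object.size(b)      # ~12.5 KB packed as bits
identical(as.logical(b), x)
```

Something analogous for factors would need the same trick: compact storage plus conversion on access.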

Related:

  • Why do logicals (booleans) in R require 4 bytes?
  • How can I efficiently construct a very long factor with few levels?
like image 870
krlmlr Avatar asked Jul 18 '13 08:07

krlmlr


2 Answers

Since you mention raw (and assuming you have fewer than 256 factor levels), you could do the conversions yourself if memory is your bottleneck and CPU time isn't. For example:

f = factor(rep(1L, 1e5))
object.size(f)
# 400456 bytes

f.raw = as.raw(f)
object.size(f.raw)
# 100040 bytes

# to go back:
identical(as.factor(as.integer(f.raw)), f)
#[1] TRUE

You can also save the factor levels separately and recover them later if that's something you're interested in, but as far as grouping and the like go, you can do it all with raw and never go back to factors (except for presentation).

If you have specific use cases where this method gives you trouble, please post them; otherwise I think this should work just fine.
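A sketch of the savings at data-frame scale (sizes are approximate; raw vectors are valid data-frame columns):

```r
n <- 1e6
f <- factor(sample(letters[1:4], n, replace = TRUE))

df.int <- data.frame(g = f)          # 4 bytes per row for the integer codes
df.raw <- data.frame(g = as.raw(f))  # 1 byte per row

object.size(df.int)  # ~4 MB
object.size(df.raw)  # ~1 MB
```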


Here's a starting point for your byte.factor class:

byte.factor = function(f) {
  res = as.raw(f)
  attr(res, "levels") <- levels(f)
  attr(res, "class") <- "byte.factor"
  res
}

as.factor.byte.factor = function(b) {
  factor(attr(b, "levels")[as.integer(b)], attr(b, "levels"))
}

So you can do things like:

f = factor(c('a','b'), letters)
f
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

b = byte.factor(f)
b
#[1] 01 02
#attr(,"levels")
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#[20] "t" "u" "v" "w" "x" "y" "z"
#attr(,"class")
#[1] "byte.factor"

as.factor.byte.factor(b)
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

Check out how data.table overrides rbind.data.frame if you want to make as.factor generic and add whatever methods you need. It should all be quite straightforward.
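Following that suggestion, a minimal sketch of making as.factor generic (this shadows base::as.factor in the global environment, and assumes the byte.factor constructor from above):

```r
# make as.factor a generic, falling back to the base version by default
as.factor <- function(x, ...) UseMethod("as.factor")
as.factor.default <- function(x, ...) base::as.factor(x)
as.factor.byte.factor <- function(x, ...) {
  lev <- attr(x, "levels")
  factor(lev[as.integer(x)], lev)
}

b <- byte.factor(factor(c("a", "b"), letters))
as.factor(b)   # now dispatches to as.factor.byte.factor
```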

answered Sep 28 '22 by eddi


Another solution is to use ff. ff supports the following vmodes/types (see ?vmode):

 ‘boolean’    ‘as.boolean’    1 bit logical without NA           
 ‘logical’    ‘as.logical’    2 bit logical with NA              
 ‘quad’       ‘as.quad’       2 bit unsigned integer without NA  
 ‘nibble’     ‘as.nibble’     4 bit unsigned integer without NA  
 ‘byte’       ‘as.byte’       8 bit signed integer with NA       
 ‘ubyte’      ‘as.ubyte’      8 bit unsigned integer without NA  
 ‘short’      ‘as.short’      16 bit signed integer with NA      
 ‘ushort’     ‘as.ushort’     16 bit unsigned integer without NA 
 ‘integer’    ‘as.integer’    32 bit signed integer with NA      
 ‘single’     ‘as.single’     32 bit float                       
 ‘double’     ‘as.double’     64 bit float                       
 ‘complex’    ‘as.complex’    2x64 bit float                     
 ‘raw’        ‘as.raw’        8 bit unsigned char                
 ‘character’  ‘as.character’  character

For example:

library(ff)
v <- ff(as.factor(sample(letters[1:4], 10000, replace=TRUE)), vmode="byte", 
    levels=letters[1:4])

This uses only one byte per element. An added advantage/disadvantage is that when the data becomes too large to fit into memory, it is automatically stored on disk (which of course affects performance).
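Subsetting an ff vector materializes an ordinary in-memory R factor, so the compact vector can still feed standard functions. A sketch, reusing the call from above (assuming ff is installed):

```r
library(ff)

v <- ff(as.factor(sample(letters[1:4], 10000, replace = TRUE)),
        vmode = "byte", levels = letters[1:4])

v[1:5]       # subsetting materializes a regular factor for just that chunk
table(v[])   # v[] pulls the whole vector back into RAM when needed
```

Chunk-wise access (v[i:j]) is what keeps the memory footprint low for large data.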

However, whichever solution you use, you will probably run into reduced performance. R internally uses integers for factors, so before calling any R method the data has to be translated from the compact storage to R's integers, which costs time. The only way around that is to use methods written specifically for the compact storage type (which will probably have to be written in C/C++/...).

answered Sep 28 '22 by Jan van der Laan