 

Rule of thumb for memory size of datasets in R

Are there any rules of thumb for knowing when R will have trouble dealing with a given dataset in RAM (given a PC configuration)?

For example, I have heard that one rule of thumb is that you should count 8 bytes for each cell. Then, if I have 1,000,000 observations of 1,000 columns, that would be close to 8 GB, so on most home computers we would probably have to store the data on disk and access it in chunks.

Is the above correct? Which rules of thumb for memory size and usage can we apply beforehand? By that I mean enough memory not only to load the object, but also to do some basic operations like data tidying, data visualisation and some analysis (regression).

PS: it would be nice to explain how the rule of thumb works, so it is not just a black box.
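To make the arithmetic explicit, the estimate I have in mind is simply the number of cells times 8 bytes:

rows <- 1e6
cols <- 1e3
rows * cols * 8          # 8e9 bytes, i.e. roughly 8 GB
rows * cols * 8 / 2^30   # about 7.45 GiB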

Asked Feb 13 '14 by Carlos Cinelli


3 Answers

Here is the memory footprint of some vectors at different sizes, in bytes:

# Vector lengths to test
n <- c(1, 1e3, 1e6)
names(n) <- n

# A single 100-character string, reused for the "identical strings" cases
one_hundred_chars <- paste(rep.int(" ", 100), collapse = "")

sapply(
  n,
  function(n)
  {
    strings_of_one_hundred_chars <- replicate(
      n,
      paste(sample(letters, 100, replace = TRUE), collapse = "")
    )
    sapply(
      list(
        Integers                                 = integer(n),
        Floats                                   = numeric(n),
        Logicals                                 = logical(n),
        "Empty strings"                          = character(n),
        "Identical strings, nchar=100"           = rep.int(one_hundred_chars, n),
        "Distinct strings, nchar=100"            = strings_of_one_hundred_chars,
        "Factor of empty strings"                = factor(character(n)),
        "Factor of identical strings, nchar=100" = factor(rep.int(one_hundred_chars, n)),
        "Factor of distinct strings, nchar=100"  = factor(strings_of_one_hundred_chars),
        Raw                                      = raw(n),
        "Empty list"                             = vector("list", n)
      ),
      object.size
    )
  }
)

Some values differ between 64-bit and 32-bit R.

## Under 64-bit R
##                                          1   1000     1e+06
## Integers                                48   4040   4000040
## Floats                                  48   8040   8000040
## Logicals                                48   4040   4000040
## Empty strings                           96   8088   8000088
## Identical strings, nchar=100           216   8208   8000208
## Distinct strings, nchar=100            216 176040 176000040
## Factor of empty strings                464   4456   4000456
## Factor of identical strings, nchar=100 584   4576   4000576
## Factor of distinct strings, nchar=100  584 180400 180000400
## Raw                                     48   1040   1000040
## Empty list                              48   8040   8000040

## Under 32-bit R
##                                          1   1000     1e+06
## Integers                                32   4024   4000024
## Floats                                  32   8024   8000024
## Logicals                                32   4024   4000024
## Empty strings                           64   4056   4000056
## Identical strings, nchar=100           184   4176   4000176
## Distinct strings, nchar=100            184 156024 156000024
## Factor of empty strings                272   4264   4000264
## Factor of identical strings, nchar=100 392   4384   4000384
## Factor of distinct strings, nchar=100  392 160224 160000224
## Raw                                     32   1024   1000024
## Empty list                              32   4024   4000024

Notice that factors have a smaller memory footprint than character vectors when there are lots of repetitions of the same string (but not when they are all unique).
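If you want to see that trade-off on a more realistic column with only a few distinct values, here is a quick sketch (exact byte counts will vary with your R version and platform):

# Character storage vs. factor storage for a low-cardinality column
x <- sample(c("low", "medium", "high"), 1e6, replace = TRUE)
object.size(x)          # character: an 8-byte pointer per element on 64-bit R
object.size(factor(x))  # factor: a 4-byte integer code per element plus a short levels attribute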

Answered Oct 07 '22 by 5 revs


The rule of thumb is correct for numeric vectors. A numeric vector uses 40 bytes to store information about the vector, plus 8 bytes for each element in the vector. You can use the object.size() function to see this:

object.size(numeric())  # an empty vector (40 bytes)  
object.size(c(1))       # 48 bytes
object.size(c(1.2, 4))  # 56 bytes
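If you want to convince yourself of the "header + 8 bytes per element" pattern, here is a quick sketch; the pattern is easiest to see for larger vectors, since small ones are allocated from fixed-size pools, and the header itself is roughly 40-48 bytes depending on the R build:

n <- c(1e3, 1e4, 1e5, 1e6)
bytes <- sapply(n, function(k) as.numeric(object.size(numeric(k))))
data.frame(n = n, bytes = bytes, overhead = bytes - 8 * n)  # overhead should be a small constant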

You probably won't just have numeric vectors in your analysis. Matrices grow similarly to vectors (this is to be expected, since they are just vectors with a dim attribute).

object.size(matrix())           # Not really empty (208 bytes)
object.size(matrix(1:4, 2, 2))  # 216 bytes
object.size(matrix(1:6, 3, 2))  # 232 bytes (2 * 8 more after adding 2 elements)

Data frames are more complicated (they have more attributes than a simple vector), and so they grow faster:

object.size(data.frame())                  # 560 bytes
object.size(data.frame(x = 1))             # 680 bytes
object.size(data.frame(x = 1:5, y = 1:5))  # 840 bytes
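If you need a rough estimate for a data frame you have not built yet, one hedged approach is to measure a small sample and scale it up (per-column overheads are small compared to the data itself):

# Estimate the footprint of a large data.frame from a 1000-row sample
sample_df <- data.frame(
  x = rnorm(1000),
  y = rnorm(1000),
  g = sample(letters, 1000, replace = TRUE)
)
bytes_per_row <- as.numeric(object.size(sample_df)) / nrow(sample_df)
bytes_per_row * 1e6 / 2^20   # rough size in MiB for a 1-million-row version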

A good reference for memory is Hadley Wickham's Advanced R Programming.

All of this said, remember that in order to do analyses in R, you need some cushion in memory to allow R to copy the data you may be working on.
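For example, here is a small sketch of the copying behaviour that makes the cushion necessary; tracemem() needs an R build with memory profiling enabled, which the CRAN binaries have:

# R copies on modify, so an operation can briefly need about twice the object's size
x <- rnorm(1e6)
tracemem(x)   # start reporting copies of x
y <- x        # no copy yet: both names point at the same memory
y[1] <- 0     # modifying y forces a full copy first (tracemem prints a message)
untracemem(x)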

Answered Oct 07 '22 by Christopher Louden


I cannot really answer your question fully, and I strongly suspect that several factors will affect what works in practice. But if you are just looking at the amount of raw memory a single copy of a given dataset would occupy, you can have a look at the R Internals documentation.

You will see that the amount of memory required depends on the type of data being held. If you are talking about numeric data, these would typically be integer or numeric/real values. These in turn are represented by the R internal types INTSXP and REALSXP, respectively, which are described as follows:

INTSXP

length, truelength followed by a block of C ints (which are 32 bits on all R platforms).

REALSXP

length, truelength followed by a block of C doubles

A double is 64 bits (8 bytes) in length, so your 'rule of thumb' would appear to be roughly correct for a dataset exclusively containing numeric values. Similarly, with integer data, each element would occupy 4 bytes.
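You can confirm those per-element sizes with object.size(); a minimal sketch:

n <- 1e6
object.size(integer(n))  # ~4 MB: 4 bytes per element plus a small fixed header
object.size(numeric(n))  # ~8 MB: 8 bytes per element plus a small fixed header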

Answered Oct 07 '22 by PhiS