 

Rule of thumb for memory size of datasets in R

Are there any rules of thumb for knowing when R will have trouble dealing with a given dataset in RAM (given a PC configuration)?

For example, I have heard that one rule of thumb is that you should count 8 bytes for each cell. Then, if I have 1,000,000 observations of 1,000 columns, that would be close to 8 GB, so on most home computers we would probably have to store the data on disk and access it in chunks.

Is the above correct? Which rules of thumb for memory size and usage can we apply beforehand? By that I mean enough memory not only to load the object, but also to do some basic operations like data tidying, data visualisation and some analysis (regression).

PS: it would be nice to explain how the rule of thumb works, so it is not just a black box.
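To make the arithmetic explicit, the estimate I have in mind is simply the number of cells times 8 bytes:

rows <- 1e6
cols <- 1e3
rows * cols * 8          # 8e9 bytes, i.e. roughly 8 GB
rows * cols * 8 / 2^30   # about 7.45 GiB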

Asked Feb 13 '14 by Carlos Cinelli


3 Answers

Here is the memory footprint of some vectors at different sizes, in bytes:

# Vector lengths to test
n <- c(1, 1e3, 1e6)
names(n) <- n

# A single 100-character string, reused for the "identical strings" cases
one_hundred_chars <- paste(rep.int(" ", 100), collapse = "")

sapply(
  n,
  function(n)
  {
    strings_of_one_hundred_chars <- replicate(
      n,
      paste(sample(letters, 100, replace = TRUE), collapse = "")
    )
    sapply(
      list(
        Integers                                 = integer(n),
        Floats                                   = numeric(n),
        Logicals                                 = logical(n),
        "Empty strings"                          = character(n),
        "Identical strings, nchar=100"           = rep.int(one_hundred_chars, n),
        "Distinct strings, nchar=100"            = strings_of_one_hundred_chars,
        "Factor of empty strings"                = factor(character(n)),
        "Factor of identical strings, nchar=100" = factor(rep.int(one_hundred_chars, n)),
        "Factor of distinct strings, nchar=100"  = factor(strings_of_one_hundred_chars),
        Raw                                      = raw(n),
        "Empty list"                             = vector("list", n)
      ),
      object.size
    )
  }
)

Some values differ between 64-bit and 32-bit R.

## Under 64-bit R
##                                          1   1000     1e+06
## Integers                                48   4040   4000040
## Floats                                  48   8040   8000040
## Logicals                                48   4040   4000040
## Empty strings                           96   8088   8000088
## Identical strings, nchar=100           216   8208   8000208
## Distinct strings, nchar=100            216 176040 176000040
## Factor of empty strings                464   4456   4000456
## Factor of identical strings, nchar=100 584   4576   4000576
## Factor of distinct strings, nchar=100  584 180400 180000400
## Raw                                     48   1040   1000040
## Empty list                              48   8040   8000040

## Under 32-bit R
##                                          1   1000     1e+06
## Integers                                32   4024   4000024
## Floats                                  32   8024   8000024
## Logicals                                32   4024   4000024
## Empty strings                           64   4056   4000056
## Identical strings, nchar=100           184   4176   4000176
## Distinct strings, nchar=100            184 156024 156000024
## Factor of empty strings                272   4264   4000264
## Factor of identical strings, nchar=100 392   4384   4000384
## Factor of distinct strings, nchar=100  392 160224 160000224
## Raw                                     32   1024   1000024
## Empty list                              32   4024   4000024

Notice that factors have a smaller memory footprint than character vectors when there are lots of repetitions of the same string (but not when they are all unique).
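If you want to see that trade-off on a more realistic column with only a few distinct values, here is a quick sketch (exact byte counts will vary with your R version and platform):

# Character storage vs. factor storage for a low-cardinality column
x <- sample(c("low", "medium", "high"), 1e6, replace = TRUE)
object.size(x)          # character: an 8-byte pointer per element on 64-bit R
object.size(factor(x))  # factor: a 4-byte integer code per element plus a short levels attribute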

Answered Oct 07 '22 by 5 revs


The rule of thumb is correct for numeric vectors. A numeric vector uses 40 bytes to store information about the vector, plus 8 bytes for each element in the vector. You can use the object.size() function to see this:

object.size(numeric())  # an empty vector (40 bytes)  
object.size(c(1))       # 48 bytes
object.size(c(1.2, 4))  # 56 bytes
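If you want to convince yourself of the "header + 8 bytes per element" pattern, here is a quick sketch; the pattern is easiest to see for larger vectors, since small ones are allocated from fixed-size pools, and the header itself is roughly 40-48 bytes depending on the R build:

n <- c(1e3, 1e4, 1e5, 1e6)
bytes <- sapply(n, function(k) as.numeric(object.size(numeric(k))))
data.frame(n = n, bytes = bytes, overhead = bytes - 8 * n)  # overhead should be a small constant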

You probably won't just have numeric vectors in your analysis. Matrices grow similarly to vectors (this is to be expected, since they are just vectors with a dim attribute).

object.size(matrix())           # Not really empty (208 bytes)
object.size(matrix(1:4, 2, 2))  # 216 bytes
object.size(matrix(1:6, 3, 2))  # 232 bytes (2 * 8 more after adding 2 elements)

Data frames are more complicated (they have more attributes than a simple vector), and so they grow faster:

object.size(data.frame())                  # 560 bytes
object.size(data.frame(x = 1))             # 680 bytes
object.size(data.frame(x = 1:5, y = 1:5))  # 840 bytes
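If you need a rough estimate for a data frame you have not built yet, one hedged approach is to measure a small sample and scale it up (per-column overheads are small compared to the data itself):

# Estimate the footprint of a large data.frame from a 1000-row sample
sample_df <- data.frame(
  x = rnorm(1000),
  y = rnorm(1000),
  g = sample(letters, 1000, replace = TRUE)
)
bytes_per_row <- as.numeric(object.size(sample_df)) / nrow(sample_df)
bytes_per_row * 1e6 / 2^20   # rough size in MiB for a 1-million-row version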

A good reference for memory is Hadley Wickham's Advanced R Programming.

All of this said, remember that in order to do analyses in R, you need some cushion in memory to allow R to copy the data you may be working on.
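For example, here is a small sketch of the copying behaviour that makes the cushion necessary; tracemem() needs an R build with memory profiling enabled, which the CRAN binaries have:

# R copies on modify, so an operation can briefly need about twice the object's size
x <- rnorm(1e6)
tracemem(x)   # start reporting copies of x
y <- x        # no copy yet: both names point at the same memory
y[1] <- 0     # modifying y forces a full copy first (tracemem prints a message)
untracemem(x)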

Answered Oct 07 '22 by Christopher Louden


I cannot really answer your question fully, and I strongly suspect that several factors will affect what works in practice. But if you are just looking at the amount of raw memory a single copy of a given dataset would occupy, you can have a look at the R Internals documentation.

You will see that the amount of memory required depends on the type of data being held. If you are talking about numeric data, these would typically be integer or numeric/real values. These in turn are represented by the R internal types INTSXP and REALSXP, respectively, which are described as follows:

INTSXP

length, truelength followed by a block of C ints (which are 32 bits on all R platforms).

REALSXP

length, truelength followed by a block of C doubles

A double is 64 bits (8 bytes) in length, so your 'rule of thumb' would appear to be roughly correct for a dataset exclusively containing numeric values. Similarly, with integer data, each element would occupy 4 bytes.
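You can confirm those per-element sizes with object.size(); a minimal sketch:

n <- 1e6
object.size(integer(n))  # ~4 MB: 4 bytes per element plus a small fixed header
object.size(numeric(n))  # ~8 MB: 8 bytes per element plus a small fixed header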

Answered Oct 07 '22 by PhiS