Are there any rules of thumb for knowing when R will have problems dealing with a given dataset in RAM (given a PC configuration)?
For example, I have heard that one rule of thumb is that you should allow 8 bytes for each cell. Then, if I have 1,000,000 observations of 1,000 columns, that would be close to 8 GB; hence, on most domestic computers, we would probably have to store the data on disk and access it in chunks.
Is the above correct? Which rules of thumb for memory size and usage can we apply beforehand? By that I mean enough memory not only to load the object, but also to do some basic operations such as data tidying, data visualisation and some analysis (regression).
PS: it would be nice if you could explain how the rule of thumb works, so it is not just a black box.
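For concreteness, the arithmetic behind the 8-bytes-per-cell estimate can be written out in R; the row and column counts below are the ones from the example in the question:
# Back-of-the-envelope estimate: a purely numeric dataset stored as doubles
# costs roughly 8 bytes per cell (ignoring per-object overhead).
n_rows <- 1e6
n_cols <- 1e3
bytes  <- n_rows * n_cols * 8
bytes / 1e9    # ~8 GB in decimal gigabytes
bytes / 2^30   # ~7.45 GiB, as the operating system usually reports it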
Today, R can address up to 8 TB of RAM when running on a 64-bit machine, which in many situations is a big improvement over the roughly 2 GB addressable on 32-bit machines. As an alternative, there are packages available that avoid storing data in memory.
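If the data genuinely do not fit, the basic idea behind those packages is to process the file in chunks rather than holding it all in RAM. Here is a rough base-R sketch of that idea (the file name, chunk size, and the assumption that all columns are numeric are hypothetical; the out-of-memory packages do this for you far more conveniently and robustly):
chunk_size <- 1e5                        # hypothetical number of rows per chunk
con <- file("big_data.csv", open = "r")  # hypothetical file name
first <- read.csv(con, nrows = chunk_size)   # first chunk, also consumes the header
col_sums <- colSums(first)               # running statistic; assumes numeric columns
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = names(first)),
    error = function(e) NULL             # read.csv errors when no lines are left
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  col_sums <- col_sums + colSums(chunk)
}
close(con)
col_sums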
R is memory intensive, so it's best to get as much RAM as possible. If you use virtual machines you might have restrictions on how much memory you can allocate to a single instance. In that case, we recommend getting as much memory as possible and considering multiple nodes. A reasonable minimum is 2 cores and 4 GB of RAM.
To find the size of an object in R, we can use the object.size() function. For example, if we have a data frame called df, then its size can be found with object.size(df).
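For instance (the data frame below is just a made-up illustration):
df <- data.frame(x = rnorm(1e5), y = rnorm(1e5))
object.size(df)                         # size in bytes
print(object.size(df), units = "auto")  # the same size in human-readable units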
The numbers reported by object.size() include a fixed overhead of roughly 40 bytes per vector (on 64-bit builds). Those 40 bytes store components possessed by every object in R: the object metadata (4 bytes), which records the base type (e.g. integer) and information used for debugging and memory management, plus pointers used by the garbage collector and the attribute list, and the vector's length and "true length".
The following code measures the memory footprint of several types of vector at different sizes, in bytes.
n <- c(1, 1e3, 1e6)
names(n) <- n
one_hundred_chars <- paste(rep.int(" ", 100), collapse = "")

sapply(
  n,
  function(n)
  {
    strings_of_one_hundred_chars <- replicate(
      n,
      paste(sample(letters, 100, replace = TRUE), collapse = "")
    )
    sapply(
      list(
        Integers = integer(n),
        Floats = numeric(n),
        Logicals = logical(n),
        "Empty strings" = character(n),
        "Identical strings, nchar=100" = rep.int(one_hundred_chars, n),
        "Distinct strings, nchar=100" = strings_of_one_hundred_chars,
        "Factor of empty strings" = factor(character(n)),
        "Factor of identical strings, nchar=100" = factor(rep.int(one_hundred_chars, n)),
        "Factor of distinct strings, nchar=100" = factor(strings_of_one_hundred_chars),
        Raw = raw(n),
        "Empty list" = vector("list", n)
      ),
      object.size
    )
  }
)
Some values differ between 64-bit and 32-bit R.
## Under 64-bit R
##                                             1     1000       1e+06
## Integers                                   48     4040     4000040
## Floats                                     48     8040     8000040
## Logicals                                   48     4040     4000040
## Empty strings                              96     8088     8000088
## Identical strings, nchar=100              216     8208     8000208
## Distinct strings, nchar=100               216   176040   176000040
## Factor of empty strings                   464     4456     4000456
## Factor of identical strings, nchar=100    584     4576     4000576
## Factor of distinct strings, nchar=100     584   180400   180000400
## Raw                                        48     1040     1000040
## Empty list                                 48     8040     8000040
##
## Under 32-bit R
##                                             1     1000       1e+06
## Integers                                   32     4024     4000024
## Floats                                     32     8024     8000024
## Logicals                                   32     4024     4000024
## Empty strings                              64     4056     4000056
## Identical strings, nchar=100              184     4176     4000176
## Distinct strings, nchar=100               184   156024   156000024
## Factor of empty strings                   272     4264     4000264
## Factor of identical strings, nchar=100    392     4384     4000384
## Factor of distinct strings, nchar=100     392   160224   160000224
## Raw                                        32     1024     1000024
## Empty list                                 32     4024     4000024
Notice that factors have a smaller memory footprint than character vectors when there are lots of repetitions of the same string (but not when they are all unique).
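A quick, illustrative comparison of the two cases:
# Many repetitions of a few distinct strings: the factor is about half the size,
# because it stores a 4-byte integer code per element instead of an 8-byte pointer.
x <- sample(c("treatment", "control"), 1e6, replace = TRUE)
object.size(x)          # ~8 MB
object.size(factor(x))  # ~4 MB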
The rule of thumb is correct for numeric vectors. A numeric vector uses 40 bytes to store information about the vector, plus 8 bytes for each element. You can use the object.size() function to see this:
object.size(numeric()) # an empty vector (40 bytes)
object.size(c(1)) # 48 bytes
object.size(c(1.2, 4)) # 56 bytes
You probably won't just have numeric vectors in your analysis. Matrices grow similarly to vectors (this is to be expected, since they are just vectors with a dim attribute).
object.size(matrix()) # Not really empty (208 bytes)
object.size(matrix(1:4, 2, 2)) # 216 bytes
object.size(matrix(1:6, 3, 2)) # 232 bytes (16 more after adding 2 elements; small vectors are allocated in rounded size classes)
Data frames are more complicated (they have more attributes than a simple vector), and so they grow faster:
object.size(data.frame()) # 560 bytes
object.size(data.frame(x = 1)) # 680 bytes
object.size(data.frame(x = 1:5, y = 1:5)) # 840 bytes
A good reference for memory is Hadley Wickham's Advanced R Programming.
All of this said, remember that in order to do analyses in R, you need some cushion in memory to allow R to copy the data you may be working on.
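One simple way to see this copying on your own data is base R's tracemem(), sketched here on a throwaway data frame (tracemem() requires an R build with memory profiling enabled, which the standard CRAN binaries have):
x <- data.frame(a = rnorm(1e5), b = rnorm(1e5))
tracemem(x)      # start reporting whenever R duplicates x
x$a <- x$a * 2   # modifying a column typically prints a tracemem[...] copy message
untracemem(x)    # stop tracing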
I cannot answer your question fully, and I strongly suspect that several factors will affect what works in practice, but if you are just looking at the amount of raw memory a single copy of a given dataset would occupy, you can have a look at the R Internals documentation.
You will see that the amount of memory required depends on the type of data being held. If you are talking about number data, these would typically be integer or numeric/real data. These in turn correspond to the R internal types INTSXP and REALSXP, respectively, which are described as follows:
INTSXP: length, truelength followed by a block of C ints (which are 32 bits on all R platforms).
REALSXP: length, truelength followed by a block of C doubles.
A double is 64 bits (8 bytes) in length, so your 'rule of thumb' would appear to be roughly correct for a dataset consisting exclusively of numeric values. Similarly, with integer data, each element would occupy 4 bytes.