Is there a way to guess the size of data.frame based on rows, columns and variable types?

I expect to generate a lot of data and then capture it in R. How can I estimate the size of the data.frame (and thus the memory needed) from the number of rows, the number of columns and the variable types?

Example.

If I have 10000 rows and 150 columns, of which 120 are numeric, 20 are strings and 10 are factors, what data frame size can I expect? Will the result change depending on the data stored in the columns (for example, on max(nchar(column)))?

> m <- matrix(1,nrow=1e5,ncol=150)
> m <- as.data.frame(m)
> object.size(m)
120009920 bytes
> a=object.size(m)/(nrow(m)*ncol(m))
> a
8.00066133333333 bytes
> m[,1:150] <- sapply(m[,1:150],as.character)
> b=object.size(m)/(nrow(m)*ncol(m))
> b
4.00098133333333 bytes
> m[,1:150] <- sapply(m[,1:150],as.factor)
> c=object.size(m)/(nrow(m)*ncol(m))
> c
4.00098133333333 bytes
> m <- matrix("ajayajay",nrow=1e5,ncol=150)
> 
> m <- as.data.frame(m)
> object.size(m)
60047120 bytes
> d=object.size(m)/(nrow(m)*ncol(m))
> d
4.00314133333333 bytes
Ajay Ohri asked Jul 23 '15 14:07


3 Answers

You can simulate an object of the expected shape and use object.size to estimate the memory it occupies as an R object:

m <- matrix(1,nrow=1e5,ncol=150)
m <- as.data.frame(m)
m[,1:20] <- sapply(m[,1:20],as.character)
m[,29:30] <- sapply(m[,29:30],as.factor)
object.size(m)
120017224 bytes
print(object.size(m),units="Gb")
0.1 Gb
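
The per-type cost can also be approximated with plain arithmetic before anything is simulated. The sketch below is not from the answer above: `estimate_df_bytes` and its parameters are hypothetical names, and the per-cell figures (8 bytes per numeric cell, 4 per factor code, 8 per character pointer, roughly 64 bytes per distinct short string) are assumptions for 64-bit R that ignore vector headers and attributes. It then builds a frame with the shape from the question and compares the estimate with object.size.

```r
# Rough size estimate for a data frame on 64-bit R.
# Assumed costs: numeric = 8 bytes/cell, factor = 4 bytes/cell (integer codes),
# character = 8 bytes/cell (a pointer) plus each distinct string once per column.
estimate_df_bytes <- function(n_rows, n_numeric, n_character, n_factor,
                              n_unique_strings = 1, bytes_per_string = 64) {
  numeric_bytes   <- n_numeric   * n_rows * 8
  factor_bytes    <- n_factor    * n_rows * 4
  character_bytes <- n_character * (n_rows * 8 +
                                    n_unique_strings * bytes_per_string)
  numeric_bytes + factor_bytes + character_bytes
}

n <- 1e4
est <- estimate_df_bytes(n_rows = n, n_numeric = 120,
                         n_character = 20, n_factor = 10)

# Build a frame with the same shape: 120 numeric, 20 character, 10 factor.
cols <- c(
  replicate(120, rnorm(n), simplify = FALSE),
  replicate(20, rep("ajayajay", n), simplify = FALSE),
  replicate(10, factor(sample(c("a", "b", "c"), n, replace = TRUE)),
            simplify = FALSE)
)
names(cols) <- paste0("V", seq_along(cols))
df <- as.data.frame(cols, stringsAsFactors = FALSE)

actual <- as.numeric(object.size(df))
print(c(estimate = est, actual = actual, ratio = actual / est))
```

The columns are built with factor() directly because sapply(..., as.factor) returns a character matrix, so assigning its result back into a data frame produces character columns rather than factors. For character columns the estimate depends on how many distinct strings each column holds and how long they are, so `n_unique_strings` and `bytes_per_string` should be adjusted to match the real data.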
agstudy answered Oct 22 '22 20:10


You could create dummy variables that store examples of the data you will be storing in the dataframe.

Then use object.size() to find their size and multiply by the number of rows and columns accordingly.
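
A minimal sketch of this sample-and-scale idea, assuming a small sample is representative of the full data. The helper `make_sample` and its three columns are hypothetical stand-ins; replace them with values that resemble the real columns.

```r
# Build a small frame with the planned column types.
make_sample <- function(n) {
  data.frame(
    x     = rnorm(n),                                        # numeric
    label = sprintf("id_%06d", seq_len(n)),                  # distinct strings
    group = factor(sample(letters[1:5], n, replace = TRUE)), # factor
    stringsAsFactors = FALSE
  )
}

# Measure the sample and derive an average cost per row.
small <- make_sample(1e3)
bytes_per_row <- as.numeric(object.size(small)) / nrow(small)

# Scale up to the target row count.
target_rows <- 1e5
projected <- bytes_per_row * target_rows

# Compare with a frame that really has that many rows.
big <- make_sample(target_rows)
actual <- as.numeric(object.size(big))
print(c(projected = projected, actual = actual))
```

The sample should reflect the real cardinality of the string columns: a character column costs a pointer per row plus storage for each distinct string, so a column of repeated values is much cheaper than a column of unique identifiers of the same length.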

Buzz Lightyear answered Oct 22 '22 21:10


Check out the pryr package as well. Its object_size function may be slightly better for you. From Advanced R:

This function is better than the built-in object.size() because it accounts for shared elements within an object and includes the size of environments.

You also need to account for the size of the attributes, not just the column types.

object.size(attributes(m))
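
As an illustration, the sketch below compares the total size of a data frame with the size of its attributes and, if pryr happens to be installed, prints pryr::object_size for the same object. The frame `m` here is a stand-in with 1e4 rows. One caveat, as far as I understand it: attributes() returns automatic row names in expanded form, so the attribute figure may overstate what the row names cost inside the data frame itself.

```r
# Stand-in data frame: 1e4 rows, 150 numeric columns.
m <- as.data.frame(matrix(1, nrow = 1e4, ncol = 150))

total_bytes <- as.numeric(object.size(m))              # whole object
attr_bytes  <- as.numeric(object.size(attributes(m)))  # names, class, row.names
print(c(total = total_bytes, attributes = attr_bytes))

# Optional comparison; skipped when pryr is not installed.
if (requireNamespace("pryr", quietly = TRUE)) {
  print(pryr::object_size(m))
}
```

For a wide numeric frame like this one the attributes are a small fraction of the total; they matter more for frames with few rows, many columns, or long column names.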
Rorschach answered Oct 22 '22 21:10