faster alternative to object.size?

Tags:

r

Is there a faster way to determine an object's size than object.size() (or a way to make it run more quickly)?

start.time <- Sys.time()
object.size(DB.raw)
#  5361302280 bytes
Sys.time() - start.time
#  Time difference of 1.644485 mins  <~~~  a minute and a half just to report the size

print.dims(DB.raw)   # (custom helper that prints dimensions)
#  43,581,894 rows X 15 cols 

I'm also wondering why computing the object size takes so long. Presumably, for each column, it has to traverse every row to find that column's total size?

asked Feb 14 '23 by Ricardo Saporta

2 Answers

On a Windows box, you might be able to get a pretty close estimate using gc() and memory.size() before and after creating DB.raw.

gc()
x <- memory.size()
# DB.raw created, extra variables rm'ed
gc()
memory.size() - x # estimate of DB.raw size in Mb
# object.size(DB.raw) / 1048576 # for comparison
answered Feb 20 '23 by Matthew Plourde

The most likely reason it takes so long is that you have character columns. It seems object.size() has to examine each string to determine its size, although I'm not sure.

x <- rep(paste0(letters[1:3], collapse = ""), 1e8)
system.time(object.size(x))
##   user  system elapsed 
##  1.608   0.592   2.202 
x <- rep(0.5, 1e9)
system.time(object.size(x))
##   user  system elapsed 
##  0.000   0.000   0.001 

We can see that longer strings take up more space (at least in some cases):

> x <- replicate(1e5, paste0(letters[sample(26, 3)], collapse = ""))
> x1 <- replicate(1e5, paste0(letters[sample(26, 2)], collapse = ""))
> object.size(x)
1547544 bytes
> object.size(x1)
831240 bytes

I can't think of any way around this if you need an exact size. However, you can get a very accurate estimate by sampling a large number of elements, calling object.size() on the sample to get a size per element, and then scaling up by the total length.

For example:

estObjectSize <- function(x, n = 1e5) {
  # Measure a random sample of n elements, then scale up to the full length
  n <- min(n, length(x))   # guard against vectors shorter than the sample size
  length(x) * object.size(sample(x, n)) / n
}
x0 <- sapply(1:20, function(x) paste0(letters[1:x], collapse = ""))
x <- x0[sample(20, 1e8, TRUE)]

> system.time(size<-object.size(x))
   user  system elapsed 
  1.632   0.856   2.495 
> system.time(estSize<-estObjectSize(x))
   user  system elapsed 
  0.012   0.000   0.013 
> size
800001184 bytes
> estSize
801184000 bytes

You have to tweak the code a bit to get it to work for a data frame, but this is the idea; one possible variant is sketched below.

To add: it looks like the number of bytes per character needed to store a vector of strings depends on a few things, including string interning and the excess buffer memory allocated during string construction. It's certainly not as simple as multiplying a fixed per-string cost by the number of strings, so it's not surprising that it takes longer:

> bytesPerString<-sapply(1:20,
+   function(x)
+       object.size(replicate(1e5,paste0(letters[sample(26,x)],collapse="")))/1e5)
> bytesPerString
 [1]  8.01288  8.31240 15.47928 49.87848 55.71144 55.98552 55.99848 64.00040
 [9] 64.00040 64.00040 64.00040 64.00040 64.00040 64.00040 64.00040 80.00040
[17] 80.00040 80.00040 80.00040 80.00040
> bytesPerChar<-(bytesPerString-8)/(1:20+1)
> bytesPerChar
 [1] 0.0064400 0.1041333 1.8698200 8.3756960 7.9519067 6.8550743 5.9998100
 [8] 6.2222667 5.6000400 5.0909455 4.6667000 4.3077231 4.0000286 3.7333600
[15] 3.5000250 4.2353176 4.0000222 3.7894947 3.6000200 3.4285905
answered Feb 20 '23 by mrip