I've checked several related questions, such as this one:
How to load data quickly into R?
I'm quoting a specific part of the highest-rated answer:
It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting speed here is the speed of your hard drive, not R. The binary form is the internal representation of the dataframe in the workspace, so there is no transformation needed anymore.
I really thought that. However, life is about experimenting. I have a 1.22 GB file containing an igraph object. That said, I don't think what I found here is related to the object class, mainly because you can load('file.RData') even before you call library().
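To illustrate that point, a minimal sketch of what I mean (hypothetical fresh session; g is the graph object stored in the file):

# Fresh R session, before library(igraph) is ever called
load("mygraph.RData")   # restores the saved object(s) without needing the package
class(g)                # the class attribute travels with the data, so this reports "igraph"
# library(igraph) is only needed once you want to use igraph methods on g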
The disks in this server are pretty fast, as you can see from the time it takes to read the file into memory:
user@machine data$ pv mygraph.RData > /dev/null
1.22GB 0:00:03 [ 384MB/s] [==================================>] 100%
However, when I load this data from R:
> system.time(load('mygraph.RData'))
   user  system elapsed
178.533  16.490 202.662
So it seems loading *.RData files is about 60 times slower than the disk limit, which means R must actually be doing something during load().
I've had the same impression with different R versions on different hardware; it's just that this time I had the patience to benchmark it (mainly because, with such fast disk storage, it was striking how long the load actually takes).
Any ideas on how to overcome this?
Following the ideas in the answers:
save(g,file="test.RData",compress=F)
Now the file is 3.1 GB versus 1.22 GB before. In my case, loading the uncompressed file is a bit faster (the disk is far from being my bottleneck):
> system.time(load('test.RData'))
   user  system elapsed
126.254   2.701 128.974
Reading the uncompressed file into memory takes about 12 seconds, so I can confirm that most of the time is spent setting up the environment.
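For reference, that raw read can also be timed from within R, without any deserialization (a minimal sketch on the uncompressed test.RData saved above):

f <- "test.RData"
# read only the raw bytes, roughly what `pv file > /dev/null` measures
system.time(raw_bytes <- readBin(f, what = "raw", n = file.info(f)$size))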
I'll be back with RDS results; that sounds interesting.
Here we are, as promised:
> system.time(saveRDS(g, file="test2.RData", compress=F))
   user  system elapsed
  7.714   2.820  18.112
And I get a 3.1 GB file, just like with uncompressed save(), although the md5sum is different, probably because save() also stores the object name.
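A minimal sketch of that difference, using a toy object (file names are just examples):

x <- 1:10
save(x, file = "x.RData")     # stores the value together with the name "x"
saveRDS(x, file = "x.rds")    # stores only the value
load("x.RData")               # recreates a variable named x in the environment
y <- readRDS("x.rds")         # with readRDS you pick the name yourself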
Now reading...
> system.time(a <- readRDS('test2.RData'))
   user  system elapsed
 41.902   2.166  44.077
So combining both ideas (uncompressed and RDS) runs about 5 times faster. Thanks for your contributions!
save compresses by default, so it takes extra time to uncompress the file. Then it takes a bit longer to load the larger file into memory. Your pv example is just copying the compressed data to memory, which isn't very useful to you. ;-)
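For reference, the compression behaviour is controlled by save()'s compress argument (a sketch, assuming g is the graph object from the question; file names are just examples):

save(g, file = "g_gzip.RData")                    # default: gzip-compressed
save(g, file = "g_none.RData", compress = FALSE)  # uncompressed, larger file on disk
save(g, file = "g_xz.RData",   compress = "xz")   # smaller file, slower to write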
UPDATE:
I tested my theory and it was incorrect (at least on my Windows XP machine with a 3.3 GHz CPU and 7200 RPM HDD). Loading compressed files is faster (probably because it reduces disk I/O).
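A minimal way to reproduce that comparison (a sketch that assumes g is already in memory and ignores filesystem caching):

save(g, file = "g_comp.RData")                   # gzip-compressed (the default)
save(g, file = "g_raw.RData", compress = FALSE)  # uncompressed

system.time(load("g_comp.RData"))
system.time(load("g_raw.RData"))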
The extra time is spent in RestoreToEnv (in saveload.c) and/or R_Unserialize (in serialize.c). So you could make loading faster by changing those files, or maybe by using saveRDS to individually save the objects in myGraph.RData, then somehow using readRDS across multiple R processes to load the data into shared memory...
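A minimal sketch of the per-object saveRDS part of that idea (the multi-process, shared-memory part stays speculative; file names are just examples):

# One-time conversion: restore the workspace file into its own environment,
# then write each object out as a separate uncompressed .rds file.
e <- new.env()
load("myGraph.RData", envir = e)
for (nm in ls(e)) {
  saveRDS(get(nm, envir = e), file = paste0(nm, ".rds"), compress = FALSE)
}

# Later, read back only the object(s) you actually need, e.g.:
g <- readRDS("g.rds")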
For variables that big, I suspect that most of the time is taken up inside the internal C code (http://svn.r-project.org/R/trunk/src/main/saveload.c). You can run some profiling to see if I'm right. (All the R code in the load function does is check that your file is non-empty and hasn't been corrupted.)
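A minimal profiling sketch along those lines (the output file name is just an example; most of the time will show up in internal calls):

Rprof("load_profile.out")
load("mygraph.RData")
Rprof(NULL)
summaryRprof("load_profile.out")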
As well as reading the variables into memory, they (amongst other things) need to be stored inside an R environment.
The only obvious way of getting a big speedup in loading variables would be to rewrite the code in a parallel way to allow simultaneous loading of variables. This presumably requires a substantial rewrite of R's internals, so don't hold your breath for such a feature.