
When does it become beneficial to store data out of memory in RStudio?

I am working with a large (8 GB) dataset, the HIGGS dataset. When looking at the vignette for the dbplyr package (see vignette('dbplyr')) I came across this line:

(If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating.)

The HIGGS dataset does fit in memory on my machine, so my questions are:

  1. Is this always true? And if not, when is it not true?
  2. More generally, are there any performance benefits to keeping the data out of memory even if it does fit, and why?

Edit: after looking at the link provided by @Waldi (RAM 100x faster than HDD), an additional question is: how does this change for an SSD?
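For concreteness, the kind of comparison I have in mind is sketched below; the file names, the choice of SQLite, and treating column V1 as the class label are only placeholders for illustration:

    library(dplyr)
    library(dbplyr)
    library(DBI)

    # In-memory approach: the whole ~8 GB table is held in RAM
    higgs <- data.table::fread("HIGGS.csv")
    higgs %>% filter(V1 == 1) %>% count()

    # Database approach: the table lives on disk in an SQLite file
    con <- dbConnect(RSQLite::SQLite(), "higgs.sqlite")
    copy_to(con, higgs, "higgs", temporary = FALSE)  # one-off load into the database

    tbl(con, "higgs") %>%   # lazy reference, nothing is read into R yet
      filter(V1 == 1) %>%   # translated to SQL and executed inside SQLite
      count() %>%
      collect()             # only the small result is pulled back into R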

asked Dec 02 '25 by ztkpat001

1 Answer

R is memory intensive, so it's best to have as much RAM as possible. The amount of RAM you have can limit the size of the dataset you can analyse.

Adding a solid-state drive (SSD) typically won't have much impact on the speed of your R code, because R loads objects into RAM (this is also the point the dbplyr vignette is making: a database adds nothing for data that already fits in memory). However, the reduction in load times and the boost to your overall productivity, since I/O is much faster, make an SSD a worthwhile purchase.

The benchmarkme package (library(benchmarkme)) lets you assess your CPU's number-crunching ability and inspect your hardware. The number of CPU cores is another area worth exploring for big-data performance: the more cores the better, if your workload can use them. For example:
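A quick sketch with benchmarkme (the number of benchmark runs is arbitrary, and the results depend entirely on your own hardware):

    library(benchmarkme)

    get_cpu()   # CPU model and number of cores
    get_ram()   # total RAM, e.g. to check whether the 8 GB HIGGS data fits

    res <- benchmark_std(runs = 3)  # standard CPU benchmarks (matrix calculations, programming tasks)
    plot(res)                       # compare against results uploaded by other users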

The multidplyr package (library(multidplyr)) is a backend for dplyr that partitions a data frame across multiple cores. This minimizes the time spent moving data around and maximizes parallel performance. For example:
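A rough sketch of that workflow (the number of workers, the file name, and grouping by the first column V1 are assumptions for illustration, not requirements of the package):

    library(dplyr)
    library(multidplyr)

    cluster <- new_cluster(4)                # start 4 background R workers

    higgs <- data.table::fread("HIGGS.csv")  # the data still has to fit in RAM first

    higgs_part <- higgs %>%
      group_by(V1) %>%                       # rows of a group stay on one worker
      partition(cluster)                     # spread the data across the workers

    higgs_part %>%
      summarise(across(everything(), mean)) %>%  # computed in parallel on each worker
      collect()                                  # gather the per-worker results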

answered Dec 05 '25 by linkonabe


