 

What is a good way to get an in-memory cache with data.table?

Tags:

r

data.table

Let's say I have a 4GB dataset on a server with 32 GB.

I can read all of that into R, make a data.table global variable and have all of my functions use that global as a kind of in-memory database. However, when I exit R and restart, I have to read that from disk again. Even with smart disk caching strategies (save/load or R.cache) I have a delay of about 10 seconds getting that data in. Copying that data takes about 4 seconds.
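
For concreteness, the pattern I'm describing looks roughly like this (a minimal sketch: the real 4GB table comes from my own data source, and the file name is just a placeholder):

    library(data.table)

    # Stand-in for the real 4GB table, which would be read from my own source.
    DT <- data.table(id = 1:1e6, x = rnorm(1e6))

    # All of my functions treat DT as an in-memory database via the global environment.
    topTen <- function() DT[order(-x)][1:10]

    # On exit, persist to disk; in the next session, reload before doing anything else.
    save(DT, file = "DT.RData")
    # ... new R session ...
    load("DT.RData")   # recreates DT, but takes ~10 seconds for the 4GB table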

Is there a good way to cache this in memory that survives the exit of an R session?

A couple of things come to mind: Rserve, redis/rredis, Memcached, multicore ... Shiny Server and RStudio Server also seem to have ways of solving this problem.

But then again, it seems to me that perhaps data.table could provide this functionality since it appears to move data outside of R's memory block anyway. That would be ideal in that it wouldn't require any data copying, restructuring etc.

Update:

I ran some more detailed tests and I agree with the comment below that I probably don't have much to complain about.

But here are some numbers that others might find useful. I have a 32GB server. I created a data.table of 4GB size. According to gc() and also looking at top, it appeared to use about 15GB peak memory and that includes making one copy of the data. That's pretty good I think.

I wrote the table to disk with save(), deleted the object and used load() to recreate it. These took 17 and 10 seconds, respectively.

I did the same with the R.cache package and it was actually slower: 23 and 14 seconds.

Still, both of those reload times are quite fast. The load() route gave me about a 357 MB/s transfer rate. By comparison, an in-memory copy took 4.6 seconds. This is a virtual server, and I'm not sure what kind of storage it has or how much that read speed is influenced by the cache.
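
For the record, the numbers above were collected roughly like this (a sketch; DT is the 4GB table and the file name is a placeholder):

    library(data.table)

    system.time(save(DT, file = "DT.RData"))   # ~17 seconds to write
    system.time(DT2 <- copy(DT))               # ~4.6 seconds for an in-memory copy
    rm(DT, DT2); invisible(gc())
    system.time(load("DT.RData"))              # ~10 seconds, roughly 357 MB/s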

Asked Mar 27 '13 by Dave31415

1 Answer

Very true: data.table hasn't got to on-disk tables yet. In the meantime, some options are:

  • Don't exit R. Leave it running on a server and use svSocket's evalServer() to talk to it, as the video on the data.table homepage demonstrates (a rough sketch follows this list). Or the other similar options you mentioned.

  • Use a database for persistence, such as a SQL database or any of the noSQL databases.

  • If you have large delimited files, then some people have recently reported that fread() appears (much) faster than load(). But do experiment with compress=FALSE in save(). Also, fwrite() has just been pushed to the current development version (1.9.7; install with devtools::install_github("Rdatatable/data.table")), and it has some reported write times on par with native save(). A second sketch after this list compares the two routes.

  • Packages ff, bigmemory and sqldf, too. See the HPC Task View, the "Large memory and out-of-memory data" section.
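
A minimal sketch of the svSocket route from the first bullet (the port is arbitrary and DT stands in for the large table; treat this as an outline rather than the exact setup shown in the video):

    ## In the long-running R session that holds the data (the "server"):
    library(svSocket)
    library(data.table)
    DT <- data.table(id = 1:1e6, x = rnorm(1e6))   # the big table lives only here
    startSocketServer(port = 8888)                 # serves until stopSocketServer()

    ## In any short-lived client session:
    library(svSocket)
    con <- socketConnection(host = "localhost", port = 8888)
    ans <- evalServer(con, DT[, .(meanx = mean(x)), by = id %% 10])
    close(con)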
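
And a sketch of the fread()/fwrite() comparison from the third bullet (file names are placeholders; fwrite() needs the 1.9.7 development version mentioned above):

    library(data.table)

    # Dump the table to a delimited file; fwrite() is in the 1.9.7 devel branch.
    fwrite(DT, "DT.csv")

    # Compare reload routes: fread() on the delimited file versus load() on an
    # uncompressed .RData image.
    system.time(DT2 <- fread("DT.csv"))
    save(DT, file = "DT.RData", compress = FALSE)
    system.time(load("DT.RData"))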

In enterprises where data.table is being used, my guess is that it is mostly being fed with data from some other persistent database, currently. Those enterprises probably:

  • use 64bit with say 16GB, 64GB or 128GB of RAM. RAM is cheap these days. (But I realise this doesn't address persistency.)

The internals have been written with on-disk tables in mind. But don't hold your breath!

Answered by Matt Dowle