
Best practices for storing and using data frames too large for memory?

I'm working with a large data frame and have run up against RAM limits. At this point I probably need to work with a serialized version on disk. There are a few packages that support out-of-memory operations, but I'm not sure which one will suit my needs. I'd prefer to keep everything in data frames, so the ff package looks encouraging, but there are compatibility problems I can't work around.

What's the first tool to reach for when you realize that your data has reached out-of-memory scale?

asked Dec 09 '09 by MW Frost


2 Answers

You probably want to look at these packages:

  • ff for "flat-file" storage and very efficient retrieval (supports data.frames with mixed column types)
  • bigmemory for out-of-R-memory but still in-RAM (or file-backed) use (matrices only, with a single data type)
  • biglm for out-of-memory model fitting with lm()- and glm()-style models

Also see the High-Performance Computing task view on CRAN. A minimal sketch of each package follows below.
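To make the trade-offs concrete, here is a rough sketch of each package's typical entry point. It is not authoritative: the file name big.csv, the columns y, x1, and x2, and the chunk sizes are all hypothetical.

```r
library(ff)        # disk-backed data.frames (ffdf)
library(biglm)     # chunked lm()/glm()-style model fitting

# ff: read the CSV into an ffdf; the data lives in flat files on disk
# and chunks are paged into RAM only as needed.
fdf <- read.csv.ffdf(file = "big.csv", header = TRUE)
dim(fdf)  # behaves like a data.frame for many operations

# biglm: fit on a first chunk of rows, then fold in further chunks
# with update(), so the full data set is never in memory at once.
fit <- biglm(y ~ x1 + x2, data = fdf[1:100000, ])
fit <- update(fit, fdf[100001:200000, ])
summary(fit)

# bigmemory: matrices only (a single storage type), optionally
# backed by a file so the object outlives the R session.
library(bigmemory)
m <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                           backingfile = "m.bin",
                           descriptorfile = "m.desc")
m[1, ] <- rnorm(10)
```

The update() step is what keeps biglm's memory use bounded: only one chunk of rows is materialized at a time.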

answered Oct 07 '22 by Dirk Eddelbuettel


I would say disk.frame is a good candidate for this type of task. I am the primary author of the package.

Unlike ff and bigmemory, which restrict the data types that can be handled easily, disk.frame tries to mimic data.frames and provides dplyr verbs for manipulating the data. A sketch of the workflow is below.
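As a minimal sketch (not the package's only workflow), assuming a large CSV file big.csv with hypothetical columns id and value:

```r
library(disk.frame)
library(dplyr)

# Spread chunk processing across background workers.
setup_disk.frame(workers = 2)

# Split the CSV into on-disk chunks under outdir; the data is never
# loaded into RAM all at once.
df <- csv_to_disk.frame("big.csv", outdir = "big.df")

# Familiar dplyr verbs run chunk-by-chunk; collect() brings the
# (hopefully much smaller) result back as a regular data.frame.
result <- df %>%
  group_by(id) %>%
  summarise(total = sum(value)) %>%
  collect()
```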

answered Oct 07 '22 by xiaodai