I'm working with a large data frame and have run up against RAM limits. At this point, I probably need to work with a serialized version on the disk. There are a few packages to support out-of-memory operations, but I'm not sure which one will suit my needs. I'd prefer to keep everything in data frames, so the ff package looks encouraging, but there are still compatibility problems that I can't work around.
What's the first tool to reach for when you realize that your data has reached out-of-memory scale?
A few options: use an in-process, in-memory database such as H2, keeping its limitations in mind (H2 can even rely on its own in-memory file system). Use an off-process memory store such as Memcached with the corresponding Java client. Or set up a RAM disk (or use tmpfs, or something similar) and work with memory as a file system from Java.
A solution that costs money: buy a new machine with a faster CPU and enough RAM to hold the entire dataset, or rent cloud instances and set up some clustering arrangement to handle the workload.
You can work with datasets that are much larger than memory, as long as each partition (a regular pandas DataFrame) fits in memory.
You probably want to look at packages like ff and bigmemory for file-backed storage, and biglm for out-of-memory lm()- and glm()-style models, and also see the High-Performance Computing task view on CRAN.
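For the modelling side, here is a minimal sketch of chunked fitting with biglm, assuming a large CSV that can be read in pieces; the file name, chunk size, and formula are placeholders for illustration:

```r
# Minimal sketch: fit a linear model chunk by chunk with biglm.
# "big_data.csv", the 100000-row chunk size, and the formula y ~ x1 + x2
# are hypothetical placeholders.
library(biglm)

con <- file("big_data.csv", open = "r")
first <- read.csv(con, nrows = 100000)        # first chunk (reads the header)
fit <- biglm(y ~ x1 + x2, data = first)       # initialise the model

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 100000, header = FALSE, col.names = names(first)),
    error = function(e) NULL                  # hitting end-of-file raises an error
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)                   # fold the next chunk into the fit
}
close(con)
summary(fit)
```

Only the model's summary statistics are kept in memory, so the data never has to fit in RAM all at once.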
I would say disk.frame is a good candidate for these kinds of tasks. I am the primary author of the package. Unlike ff and bigmemory, which restrict what data types can be easily handled, it tries to "mimic" data.frames and provides dplyr verbs for manipulating the data.
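A minimal sketch of typical disk.frame usage, assuming the package is installed; the CSV path, output directory, and column names are hypothetical:

```r
# Minimal sketch: chunk a large CSV onto disk and query it with dplyr verbs.
# "big_data.csv", "big_data.df", and the columns x and y are placeholders.
library(disk.frame)
library(dplyr)

setup_disk.frame()                            # start background workers for parallel chunk processing

df <- csv_to_disk.frame("big_data.csv",       # split the CSV into chunks stored on disk
                        outdir = "big_data.df")

result <- df %>%                              # verbs are applied chunk by chunk
  filter(x > 0) %>%
  select(x, y) %>%
  collect()                                   # collect() pulls the (smaller) result into RAM
```

Because each verb runs per chunk, only the collected result needs to fit in memory.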