I'm working with a large data frame and have run up against RAM limits. At this point, I probably need to work with a serialized version on the disk. There are a few packages to support out-of-memory operations, but I'm not sure which one will suit my needs. I'd prefer to keep everything in data frames, so the ff package looks encouraging, but there are still compatibility problems that I can't work around.
What's the first tool to reach for when you realize that your data has reached out-of-memory scale?
A few options: use an in-process, in-memory database such as H2, keeping its limitations in mind (H2 can even rely on its own in-memory file system). Use an off-process memory store such as Memcached with the corresponding Java client. Or set up a RAM disk (or use tmpfs, or something similar) and work with memory as a file system from Java.
A solution that costs money: buy a new machine with a faster CPU and enough RAM to hold the entire dataset, or rent cloud instances and set up some clustering arrangement to handle the workload.
You can work with datasets that are much larger than memory, as long as each partition (a regular pandas DataFrame) fits in memory.
You probably want to look at packages like ff and bigmemory for file-backed storage, and biglm for out-of-memory lm()- and glm()-style models, and also see the High-Performance Computing task view on CRAN.
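For the modelling side, here is a minimal sketch of chunked fitting with biglm, assuming a large CSV that can be read in pieces; the file name, chunk size, and formula are placeholders for illustration:

```r
# Minimal sketch: fit a linear model chunk by chunk with biglm.
# "big_data.csv", the 100000-row chunk size, and the formula y ~ x1 + x2
# are hypothetical placeholders.
library(biglm)

con <- file("big_data.csv", open = "r")
first <- read.csv(con, nrows = 100000)        # first chunk (reads the header)
fit <- biglm(y ~ x1 + x2, data = first)       # initialise the model

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 100000, header = FALSE, col.names = names(first)),
    error = function(e) NULL                  # hitting end-of-file raises an error
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)                   # fold the next chunk into the fit
}
close(con)
summary(fit)
```

Only the model's summary statistics are kept in memory, so the data never has to fit in RAM all at once.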
I would say disk.frame is a good candidate for these kinds of tasks. I am the primary author of the package. Unlike ff and bigmemory, which restrict what data types can be easily handled, it tries to "mimic" data.frames and provides dplyr verbs for manipulating the data.
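A minimal sketch of typical disk.frame usage, assuming the package is installed; the CSV path, output directory, and column names are hypothetical:

```r
# Minimal sketch: chunk a large CSV onto disk and query it with dplyr verbs.
# "big_data.csv", "big_data.df", and the columns x and y are placeholders.
library(disk.frame)
library(dplyr)

setup_disk.frame()                            # start background workers for parallel chunk processing

df <- csv_to_disk.frame("big_data.csv",       # split the CSV into chunks stored on disk
                        outdir = "big_data.df")

result <- df %>%                              # verbs are applied chunk by chunk
  filter(x > 0) %>%
  select(x, y) %>%
  collect()                                   # collect() pulls the (smaller) result into RAM
```

Because each verb runs per chunk, only the collected result needs to fit in memory.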