Am I using plyr right? I seem to be using way too much memory

Tags: r, data.table, plyr

I have the following, somewhat large dataset:

 > dim(dset)
 [1] 422105     25
 > class(dset)
 [1] "data.frame"
 > 

Without doing anything, the R process seems to take about 1 GB of RAM.

I am trying to run the following code:

  dset <- ddply(dset, .(tic), transform,
                date.min = min(date),
                date.max = max(date),
                daterange = max(date) - min(date),
                .parallel = TRUE)

Running that code, RAM usage skyrockets. It completely saturated 60 GB of RAM, running on a 32-core machine. What am I doing wrong?

asked Dec 10 '11 by stevejb

4 Answers

If performance is an issue, it might be a good idea to switch to data.tables, from the package of the same name. They are fast. You'd do something roughly equivalent to this:

library(data.table)
library(plyr)    # needed for mutate()

dat <- data.frame(x = runif(100),
                  dt = seq.Date(as.Date('2010-01-01'), as.Date('2011-01-01'), length.out = 100),
                  grp = rep(letters[1:4], each = 25))

dt <- as.data.table(dat)
setkey(dt, grp)

dt[, mutate(.SD, date.min = min(dt),
                 date.max = max(dt),
                 daterange = max(dt) - min(dt)), by = grp]
answered by joran


Here's an alternative application of data.table to the problem, illustrating how blazing-fast it can be. (Note: this uses dset, the data.frame constructed by Brian Diggs in his answer, except with 30000 rather than 10 levels of tic).

(The reason this is much faster than @joran's solution is that it avoids the use of .SD and instead uses the columns directly. The style is a bit different from plyr, but it typically buys huge speed-ups. For another example, see the data.table Wiki, which (a) includes this as recommendation #1 and (b) shows a 50X speedup for code that drops .SD.)

library(data.table)
system.time({
    dt <- data.table(dset, key="tic")
    # Summarize by groups and store results in a summary data.table
    sumdt <- dt[ ,list(min.date=min(date), max.date=max(date)), by="tic"]
    sumdt[, daterange:= max.date-min.date]
    # Merge the summary data.table back into dt, based on key
    dt <- dt[sumdt]
})
# ELAPSED TIME IN SECONDS
# user  system elapsed 
# 1.45    0.25    1.77 
answered by Josh O'Brien


A couple of things come to mind.

First, I would write it as:

dset <- ddply(dset, .(tic), summarise,
                date.min = min(date),
                date.max = max(date),
                daterange = max(date) - min(date),
                .parallel = TRUE)

Well, actually, I would probably avoid calculating the min/max dates twice and write

dset <- ddply(dset, .(tic), function(DF) {
              mutate(summarise(DF, date.min = min(date),
                               date.max = max(date)),
                     daterange = date.max - date.min)},
              .parallel = TRUE)

but that's not the main point you are asking about.

With a dummy data set of your dimensions

n <- 422105
dset <- data.frame(date=as.Date("2000-01-01")+sample(3650, n, replace=TRUE),
    tic = factor(sample(10, n, replace=TRUE)))
for (i in 3:25) {
    dset[i] <- rnorm(n)
}

this ran comfortably (under 1 minute) on my laptop. In fact, the plyr step took less time than creating the dummy data set, so it can't have been swapping at the scale you saw.
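
(A minimal sketch of checking that timing on the dummy dset above, using the summarise version and assuming plyr is loaded; serial, no parallel backend registered.)

library(plyr)

# Time the grouped summary on the dummy data (serial run)
system.time(
    res <- ddply(dset, .(tic), summarise,
                 date.min  = min(date),
                 date.max  = max(date),
                 daterange = max(date) - min(date))
)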

A second possibility is that there are a large number of unique values of tic; that could increase the memory needed. However, when I increased the number of unique tic values to 1000, it didn't really slow down.
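
(For instance, a quick sketch of that test against the dummy data above: rebuild tic with more levels and rerun the same call.)

# Rebuild the grouping column with 1000 distinct values instead of 10
dset$tic <- factor(sample(1000, n, replace = TRUE))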

Finally, it could be something in the parallelization. I don't have a parallel backend registered for foreach, so my run was serial. Perhaps the parallelization is what causes your memory explosion.
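
(The question doesn't show how the backend was registered; a minimal sketch with doParallel, which is just one possible foreach backend, would look roughly like this. With .parallel = TRUE each worker may receive its own copy of the per-group data, which is one plausible source of the memory blow-up.)

library(doParallel)

cl <- makeCluster(4)       # 4 workers is an arbitrary choice for illustration
registerDoParallel(cl)

# ... run ddply(..., .parallel = TRUE) here; the per-group work is farmed out
# to the workers, and each worker may hold its own copy of the pieces it gets ...

stopCluster(cl)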

answered by Brian Diggs


Are there many factor levels in the data frame? I've found that this type of excessive memory usage is common in adply and possibly other plyr functions, but it can be remedied by removing unnecessary factors and levels. If the large data frame was read into R, make sure stringsAsFactors is set to FALSE in the import:

dat = read.csv(header=TRUE, sep="\t", file="dat.tsv", stringsAsFactors=FALSE)

Then assign the factors you actually need.
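
(A minimal sketch of that step; the column name tic is borrowed from the question, so adjust it to whichever columns actually need to be factors.)

# Convert only the columns that genuinely need to be factors ...
dat$tic <- factor(dat$tic)

# ... and drop any unused levels left over from subsetting
dat <- droplevels(dat)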

I haven't looked into Hadley's source yet to discover why.

answered by Nathan Siemers