I have the following, somewhat large dataset:
> dim(dset)
[1] 422105 25
> class(dset)
[1] "data.frame"
>
Without doing anything, the R process seems to take about 1GB of RAM.
I am trying to run the following code:
dset <- ddply(dset, .(tic), transform,
              date.min = min(date),
              date.max = max(date),
              daterange = max(date) - min(date),
              .parallel = TRUE)
Running that code, RAM usage skyrockets. It completely saturated 60 GB of RAM, running on a 32-core machine. What am I doing wrong?
If performance is an issue, it might be a good idea to switch to using data.tables from the package of the same name. They are fast. You'd do something roughly equivalent like this:
library(data.table)
library(plyr)   # for mutate()

dat <- data.frame(x   = runif(100),
                  dt  = seq.Date(as.Date('2010-01-01'), as.Date('2011-01-01'), length.out = 100),
                  grp = rep(letters[1:4], each = 25))

dt <- as.data.table(dat)
setkey(dt, grp)   # idiomatic form of key(dt) <- "grp"

dt[, mutate(.SD, date.min  = min(dt),
                 date.max  = max(dt),
                 daterange = max(dt) - min(dt)), by = grp]
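For completeness, here is a sketch of how the same thing might be written entirely in data.table syntax, without plyr::mutate. It assumes the same dat as above; dt2 is just a fresh name chosen here to avoid the clash with the date column, which is also called dt:
library(data.table)

dt2 <- as.data.table(dat)
setkey(dt2, grp)

# Add the three columns by reference, computed within each grp
dt2[, `:=`(date.min  = min(dt),
           date.max  = max(dt),
           daterange = max(dt) - min(dt)), by = grp]
Because := modifies dt2 in place, nothing is printed; inspect dt2 afterwards to see the new columns.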
Here's an alternative application of data.table to the problem, illustrating how blazing-fast it can be. (Note: this uses dset, the data.frame constructed by Brian Diggs in his answer, except with 30000 rather than 10 levels of tic.)
(The reason this is much faster than @joran's solution is that it avoids the use of .SD, instead using the columns directly. The style is a bit different from plyr's, but it typically buys huge speed-ups. For another example, see the data.table Wiki, which: (a) includes this as recommendation #1; and (b) shows a 50X speedup for code that drops the .SD.)
library(data.table)
system.time({
  dt <- data.table(dset, key = "tic")

  # Summarize by groups and store results in a summary data.table
  sumdt <- dt[, list(min.date = min(date), max.date = max(date)), by = "tic"]
  sumdt[, daterange := max.date - min.date]

  # Merge the summary data.table back into dt, based on key
  dt <- dt[sumdt]
})
# ELAPSED TIME IN SECONDS
#    user  system elapsed
#    1.45    0.25    1.77
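If the intermediate summary table isn't needed, roughly the same effect can be had with a single grouped update by reference. This is only a sketch, again assuming Brian Diggs' dset with its date and tic columns:
library(data.table)

dt <- data.table(dset, key = "tic")

# One grouped := pass: the per-group values are computed from the columns
# directly (no .SD) and written straight back into dt, with no join step
dt[, `:=`(min.date  = min(date),
          max.date  = max(date),
          daterange = max(date) - min(date)),
   by = tic]
Whether this beats the summary-plus-join version above will depend on the data; the point is simply that both avoid .SD.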
A couple of things come to mind.
First, I would write it as:
dset <- ddply(dset, .(tic), summarise,
              date.min = min(date),
              date.max = max(date),
              daterange = max(date) - min(date),
              .parallel = TRUE)
Well, actually, I would probably avoid calculating the min/max dates twice and write
dset <- ddply(dset, .(tic), function(DF) {
  mutate(summarise(DF, date.min = min(date),
                       date.max = max(date)),
         daterange = date.max - date.min)
}, .parallel = TRUE)
but that's not the main point you are asking about.
With a dummy data set of your dimensions
n <- 422105
dset <- data.frame(date = as.Date("2000-01-01") + sample(3650, n, replace = TRUE),
                   tic  = factor(sample(10, n, replace = TRUE)))
for (i in 3:25) {
  dset[i] <- rnorm(n)
}
this ran comfortably (sub 1 minute) on my laptop. In fact, the plyr step took less time than creating the dummy data set, so it couldn't have been swapping to the extent you saw.
A second possibility is that there is a large number of unique values of tic. That could increase the memory needed. However, when I tried increasing the number of unique tic values to 1000, it didn't really slow down.
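Something along these lines (not necessarily the exact code used) would bump the level count in the dummy data:
# Rebuild the grouping column with ~1000 distinct tic values instead of 10
dset$tic <- factor(sample(1000, n, replace = TRUE))
nlevels(dset$tic)   # the number of groups ddply has to split on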
Finally, it could be something in the parallelization. I don't have a parallel backend registered for foreach, so it just ran serially here. Perhaps that is what's causing your memory explosion.
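For what it's worth, here is a minimal sketch of what a registered backend looks like (doParallel is one option; the worker count and the dset2 name are just illustrations). Each worker holds its own copies of the pieces it works on, which is one way memory use can multiply on a 32-core machine:
library(plyr)
library(doParallel)

cl <- makeCluster(4)      # 4 is arbitrary here; you presumably used more cores
registerDoParallel(cl)

dset2 <- ddply(dset, .(tic), summarise,
               date.min  = min(date),
               date.max  = max(date),
               daterange = max(date) - min(date),
               .parallel = TRUE)

stopCluster(cl)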
Are there many factor levels in the data frame? I've found that this type of excessive memory usage is common in adply, and possibly other plyr functions, but it can be remedied by removing unnecessary factors and levels. If the large data frame was read into R, make sure stringsAsFactors is set to FALSE in the import:
dat <- read.csv("dat.tsv", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
Then convert only the columns you actually need into factors. I haven't looked into Hadley's source yet to discover why.
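A sketch of that cleanup, assuming (as in the question) that tic is the only column that really needs to be a factor and that date should be a Date; the column names are just taken from the question:
# Assuming dat was read in as above with stringsAsFactors = FALSE
dat$tic  <- factor(dat$tic)      # the grouping column used by ddply
dat$date <- as.Date(dat$date)    # dates as Date, not character

# If dat was subset from a larger frame that already had factors,
# droplevels() discards the unused levels plyr would otherwise split on
dat <- droplevels(dat)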