I have the following, somewhat large dataset:
> dim(dset)
[1] 422105 25
> class(dset)
[1] "data.frame"
>
Without doing anything, the R process seems to take about 1GB of RAM.
I am trying to run the following code:
dset <- ddply(dset, .(tic), transform,
              date.min = min(date),
              date.max = max(date),
              daterange = max(date) - min(date),
              .parallel = TRUE)
Running that code, RAM usage skyrockets. It completely saturated 60 GB of RAM, running on a 32-core machine. What am I doing wrong?
If performance is an issue, it might be a good idea to switch to using data.tables from the package of the same name. They are fast. You'd do something roughly equivalent like this:
library(data.table)
library(plyr)   # for mutate()

dat <- data.frame(x   = runif(100),
                  dt  = seq.Date(as.Date('2010-01-01'), as.Date('2011-01-01'), length.out = 100),
                  grp = rep(letters[1:4], each = 25))

dt <- as.data.table(dat)
setkey(dt, grp)   # idiomatic form of key(dt) <- "grp"

dt[, mutate(.SD, date.min  = min(dt),
                 date.max  = max(dt),
                 daterange = max(dt) - min(dt)), by = grp]
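For completeness, here is a sketch of how the same thing might be written entirely in data.table syntax, without plyr::mutate. It assumes the same dat as above; dt2 is just a fresh name chosen here to avoid the clash with the date column, which is also called dt:
library(data.table)

dt2 <- as.data.table(dat)
setkey(dt2, grp)

# Add the three columns by reference, computed within each grp
dt2[, `:=`(date.min  = min(dt),
           date.max  = max(dt),
           daterange = max(dt) - min(dt)), by = grp]
Because := modifies dt2 in place, nothing is printed; inspect dt2 afterwards to see the new columns.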
Here's an alternative application of data.table to the problem, illustrating how blazing-fast it can be. (Note: this uses dset, the data.frame constructed by Brian Diggs in his answer, except with 30000 rather than 10 levels of tic.)
(The reason this is much faster than @joran's solution is that it avoids the use of .SD, instead using the columns directly. The style is a bit different from plyr's, but it typically buys huge speed-ups. For another example, see the data.table Wiki, which: (a) includes this as recommendation #1; and (b) shows a 50X speedup for code that drops the .SD.)
library(data.table)
system.time({
  dt <- data.table(dset, key = "tic")

  # Summarize by groups and store results in a summary data.table
  sumdt <- dt[, list(min.date = min(date), max.date = max(date)), by = "tic"]
  sumdt[, daterange := max.date - min.date]

  # Merge the summary data.table back into dt, based on key
  dt <- dt[sumdt]
})
# ELAPSED TIME IN SECONDS
#    user  system elapsed
#    1.45    0.25    1.77
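If the intermediate summary table isn't needed, roughly the same effect can be had with a single grouped update by reference. This is only a sketch, again assuming Brian Diggs' dset with its date and tic columns:
library(data.table)

dt <- data.table(dset, key = "tic")

# One grouped := pass: the per-group values are computed from the columns
# directly (no .SD) and written straight back into dt, with no join step
dt[, `:=`(min.date  = min(date),
          max.date  = max(date),
          daterange = max(date) - min(date)),
   by = tic]
Whether this beats the summary-plus-join version above will depend on the data; the point is simply that both avoid .SD.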
A couple of things come to mind.
First, I would write it as:
dset <- ddply(dset, .(tic), summarise,
              date.min = min(date),
              date.max = max(date),
              daterange = max(date) - min(date),
              .parallel = TRUE)
Well, actually, I would probably avoid calculating the min/max dates twice and write
dset <- ddply(dset, .(tic), function(DF) {
  mutate(summarise(DF, date.min = min(date),
                       date.max = max(date)),
         daterange = date.max - date.min)
}, .parallel = TRUE)
but that's not the main point you are asking about.
With a dummy data set of your dimensions
n <- 422105
dset <- data.frame(date = as.Date("2000-01-01") + sample(3650, n, replace = TRUE),
                   tic  = factor(sample(10, n, replace = TRUE)))
for (i in 3:25) {
  dset[i] <- rnorm(n)
}
this ran comfortably (sub 1 minute) on my laptop. In fact, the plyr step took less time than creating the dummy data set, so it couldn't have been swapping to the extent you saw.
A second possibility is that there is a large number of unique values of tic. That could increase the memory needed. However, when I tried increasing the number of unique tic values to 1000, it didn't really slow down.
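Something along these lines (not necessarily the exact code used) would bump the level count in the dummy data:
# Rebuild the grouping column with ~1000 distinct tic values instead of 10
dset$tic <- factor(sample(1000, n, replace = TRUE))
nlevels(dset$tic)   # the number of groups ddply has to split on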
Finally, it could be something in the parallelization. I don't have a parallel backend registered for foreach, so it just ran serially here. Perhaps that is what's causing your memory explosion.
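For what it's worth, here is a minimal sketch of what a registered backend looks like (doParallel is one option; the worker count and the dset2 name are just illustrations). Each worker holds its own copies of the pieces it works on, which is one way memory use can multiply on a 32-core machine:
library(plyr)
library(doParallel)

cl <- makeCluster(4)      # 4 is arbitrary here; you presumably used more cores
registerDoParallel(cl)

dset2 <- ddply(dset, .(tic), summarise,
               date.min  = min(date),
               date.max  = max(date),
               daterange = max(date) - min(date),
               .parallel = TRUE)

stopCluster(cl)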
Are there many factor levels in the data frame? I've found that this type of excessive memory usage is common in adply, and possibly other plyr functions, but it can be remedied by removing unnecessary factors and levels. If the large data frame was read into R, make sure stringsAsFactors is set to FALSE in the import:
dat <- read.csv("dat.tsv", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
Then convert only the columns you actually need into factors. I haven't looked into Hadley's source yet to discover why.
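A sketch of that cleanup, assuming (as in the question) that tic is the only column that really needs to be a factor and that date should be a Date; the column names are just taken from the question:
# Assuming dat was read in as above with stringsAsFactors = FALSE
dat$tic  <- factor(dat$tic)      # the grouping column used by ddply
dat$date <- as.Date(dat$date)    # dates as Date, not character

# If dat was subset from a larger frame that already had factors,
# droplevels() discards the unused levels plyr would otherwise split on
dat <- droplevels(dat)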