I have to do extensive data manipulation on a big data set (mostly using data.table in RStudio). I would like to monitor the run time of each step without explicitly calling system.time() on every one.
Is there a package or an easy way to show run time by default on each step?
Thank you.
It's not exactly what you're asking for, but I've written time_file (https://gist.github.com/4183595), which source()s an R file, runs the code, and then rewrites the file, inserting comments containing how long each top-level statement took to run.
i.e. time_file() turns this:
{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}
# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))
# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)
# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")
into this:
{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}
# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))
#: user system elapsed
#: 0.451 0.003 0.453
# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)
#: user system elapsed
#: 0.029 0.000 0.029
# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")
#: user system elapsed
#: 0.008 0.000 0.008
It doesn't time code inside a top-level { block, so you can choose not to time stuff you're not interested in.
I don't think there's any way to automatically add timing as a top-level effect without somehow modifying the way that you run the code, i.e. using something like time_file instead of source.
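To make the idea concrete, here is a minimal sketch of timing each top-level expression in a file. It is not the code from the gist: time_top_level is a hypothetical helper written for illustration, and it returns the timings as a data frame instead of rewriting the file. It assumes the same convention as above, so anything wrapped in a top-level { block is run but not timed.

```r
# Hypothetical helper (an illustration, not the gist's time_file):
# run every top-level expression in `path` and record its elapsed time.
time_top_level <- function(path, envir = new.env(parent = globalenv())) {
  exprs <- parse(file = path, keep.source = FALSE)

  elapsed <- vapply(exprs, function(e) {
    # A top-level `{` block is evaluated but deliberately left untimed
    if (is.call(e) && identical(e[[1]], as.name("{"))) {
      eval(e, envir)
      return(NA_real_)
    }
    # gcFirst = FALSE skips the garbage collection system.time() does by default
    system.time(eval(e, envir), gcFirst = FALSE)[["elapsed"]]
  }, numeric(1))

  data.frame(
    code = vapply(exprs, function(e) paste(deparse(e), collapse = " "),
                  character(1)),
    elapsed = elapsed,
    stringsAsFactors = FALSE
  )
}

# Usage: write a small throwaway script, then time it
script <- tempfile(fileext = ".R")
writeLines(c(
  "{",
  "  n <- 1e4",
  "}",
  "x <- runif(n)",
  "y <- sort(x)"
), script)

timings <- time_top_level(script)
print(timings)
```

The result has one row per top-level expression, with NA in the elapsed column for the untimed { block. You still have to call the helper instead of source(), which is the point made above: the way the code is run has to change somehow.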
You might wonder what effect timing every top-level operation has on the overall speed of your code. Well, that's easy to answer with a microbenchmark ;)
library(microbenchmark)
microbenchmark(
  runif(1e4),
  system.time(runif(1e4)),
  # gcFirst is the full argument name; gc = FALSE only works via partial matching
  system.time(runif(1e4), gcFirst = FALSE)
)
So timing itself adds relatively little overhead (about 20 µs on my computer), but the garbage collection that system.time() runs first by default adds about 27 ms per call. So unless you have thousands of top-level calls, you're unlikely to see much impact.