
Return system.time by default

Tags:

r

I have to do extensive data manipulation on a big data set (mostly using data.table in RStudio). I would like to monitor the run time of each of my steps without explicitly calling system.time() on each one.

Is there a package or an easy way to show the run time of each step by default?

Thank you.

AdamNYC asked Dec 01 '12 17:12


1 Answer

It's not exactly what you're asking for, but I've written time_file (https://gist.github.com/4183595), which source()s an R file, runs the code, and then rewrites the file, inserting comments that record how long each top-level statement took to run.

i.e. time_file() turns this:

{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}

# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))

# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)

# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")

into this:

{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}

# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))
#:    user  system elapsed
#:   0.451   0.003   0.453

# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)
#:    user  system elapsed
#:   0.029   0.000   0.029

# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")
#:    user  system elapsed
#:   0.008   0.000   0.008

It doesn't time code inside a top-level { block, so you can wrap anything you're not interested in timing in braces.

I don't think there's any way to automatically add timing as a top-level effect without somehow modifying the way that you run the code, i.e. using something like time_file instead of source.
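To make the idea concrete, here is a minimal sketch of timing each top-level expression while sourcing a file. It is not the time_file from the gist: the name time_source and its arguments are hypothetical, and it only prints the timings instead of rewriting the file. It assumes that skipping top-level { blocks is done by checking the head of each parsed call.

```r
# Hypothetical helper (not the author's time_file): evaluates a file one
# top-level expression at a time and prints how long each one took.
time_source <- function(path, envir = globalenv()) {
  exprs <- parse(path, keep.source = TRUE)
  timings <- vector("list", length(exprs))

  for (i in seq_along(exprs)) {
    expr <- exprs[[i]]

    # Run top-level { blocks without timing them, mirroring the behaviour
    # described above.
    if (is.call(expr) && identical(expr[[1]], as.name("{"))) {
      eval(expr, envir)
      next
    }

    # gcFirst = FALSE skips the garbage collection that system.time()
    # otherwise performs before timing.
    timings[[i]] <- system.time(eval(expr, envir), gcFirst = FALSE)

    # Show the first line of the statement, then its timing.
    cat(deparse(expr)[1], "\n")
    print(timings[[i]])
  }

  invisible(timings)
}
```

You would then call time_source("analysis.R") (a placeholder file name) where you would otherwise call source("analysis.R"). Untimed entries in the returned list are left as NULL.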

You might wonder what effect timing every top-level operation has on the overall speed of your code. Well, that's easy to answer with a microbenchmark ;)

library(microbenchmark)
microbenchmark(
  runif(1e4),                               # baseline, no timing
  system.time(runif(1e4)),                  # default: runs gc() before timing
  system.time(runif(1e4), gcFirst = FALSE)  # timing without the initial gc()
)

So timing itself adds relatively little overhead (20µs on my computer), but the garbage collection that system.time() runs by default before timing adds about 27 ms per call. So unless you have thousands of top-level calls, you're unlikely to see much impact.

hadley answered Sep 22 '22 05:09
