
Preventing performance regressions in R

What is a good workflow for detecting performance regressions in R packages? Ideally, I'm looking for something that integrates with R CMD check that alerts me when I have introduced a significant performance regression in my code.

What is a good workflow in general? What other languages provide good tools? Is it something that can be built on top of unit testing, or something that is usually done separately?

asked Dec 11 '11 by hadley


1 Answer

This is a very challenging question, and one that I'm frequently dealing with, as I swap out different code in a package to speed things up. Sometimes a performance regression comes along with a change in algorithms or implementation, but it may also arise due to changes in the data structures used.

What is a good workflow for detecting performance regressions in R packages?

In my case, I tend to have very specific use cases that I'm trying to speed up, with different fixed data sets. As Spacedman wrote, it's important to have a fixed computing system, but that's almost infeasible: sometimes a shared computer may have other processes that slow things down 10-20%, even when it looks quite idle.

My steps:

  1. Standardize the platform (e.g. one or a few machines, a particular virtual machine, or a virtual machine + specific infrastructure, a la Amazon's EC2 instance types).
  2. Standardize the data set that will be used for speed testing.
  3. Create scripts and fixed intermediate data output (i.e. saved to .rdat files) that involve very minimal data transformations. My focus is on some kind of modeling, rather than data manipulation or transformation. This means that I want to give exactly the same block of data to the modeling functions. If, however, data transformation is the goal, then be sure that the pre-transformed/manipulated data is as close as possible to standard across tests of different versions of the package. (See this question for examples of memoization, caching, etc., that can be used to standardize or speed up non-focal computations. It references several packages by the OP.)
  4. Repeat tests multiple times.
  5. Scale the results relative to fixed benchmarks, e.g. the time to perform a linear regression, to sort a matrix, etc. This can allow for "local" or transient variations in infrastructure, such as may be due to I/O, the memory system, dependent packages, etc. (A minimal sketch of steps 2-5 follows this list.)
  6. Examine the profiling output as vigorously as possible (see this question for some insights, also referencing tools from the OP).
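
As a rough illustration of steps 2 through 5, here is a minimal base-R sketch; `lm()` and `sort()` are only stand-ins for the focal package call and the fixed reference benchmark:

```r
## Minimal sketch of steps 2-5: a fixed data set, repeated timings, and
## scaling against a reference task. lm() and sort() are stand-ins for
## the focal package call and the fixed reference benchmark.
set.seed(1)                                          # standardized data set
dat <- data.frame(x = rnorm(1e5), y = rnorm(1e5))

time_once <- function(expr) system.time(expr)[["elapsed"]]

n_reps   <- 20
focal    <- replicate(n_reps, time_once(lm(y ~ x, data = dat)))
baseline <- replicate(n_reps, time_once(sort(rnorm(1e6))))

## Scale the focal timings by the baseline to absorb transient machine effects
ratio <- median(focal) / median(baseline)
cat(sprintf("median focal: %.3fs  median baseline: %.3fs  ratio: %.2f\n",
            median(focal), median(baseline), ratio))
```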

Ideally, I'm looking for something that integrates with R CMD check that alerts me when I have introduced a significant performance regression in my code.

Unfortunately, I don't have an answer for this.

What is a good workflow in general?

For me, it's quite similar to general dynamic code testing: is the output (execution time in this case) reproducible, optimal, and transparent? Transparency comes from understanding what affects the overall time. This is where Mike Dunlavey's suggestions are important, but I prefer to go further, with a line profiler.

Regarding a line profiler, see my previous question, which refers to options in Python and Matlab for other examples. It's most important to examine clock time, but also very important to track memory allocation, number of times the line is executed, and call stack depth.
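
In R itself, a minimal sketch of line-level profiling with base R's `Rprof()` (available since R 3.0.0); the profiled script name here is a hypothetical placeholder:

```r
## Line-level profiling with base R's Rprof() (R >= 3.0.0). Source the
## script with keep.source = TRUE so samples can be attributed to
## individual source lines.
Rprof("profile.out", line.profiling = TRUE, memory.profiling = TRUE)
source("slow_script.R", keep.source = TRUE)   # hypothetical script under test
Rprof(NULL)                                   # stop profiling

## Time (and memory) broken down by source line
summaryRprof("profile.out", lines = "show", memory = "both")
```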

What other languages provide good tools?

Almost all other languages have better tools. :) Interpreted languages like Python and Matlab have good and possibly familiar examples of tools that can be adapted for this purpose. Although dynamic analysis is very important, static analysis can help identify where there may be some serious problems. Matlab has a great static analyzer that can report when objects (e.g. vectors, matrices) are growing inside of loops, for instance. It is terrible to find this only via dynamic analysis - you've already wasted execution time to discover something like this, and it's not always discernible if your execution context is pretty simple (e.g. just a few iterations, or small objects).
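
To make that concrete in R terms, here is a small sketch of the pattern such an analyzer would flag, and of why dynamic analysis only reveals it at scale:

```r
## The pattern a static analyzer would flag: an object growing inside a loop.
## Dynamically, the cost only becomes visible once n is large enough.
n <- 1e5

grow <- function() {
  out <- numeric(0)
  for (i in seq_len(n)) out <- c(out, i^2)   # reallocates and copies each time
  out
}

prealloc <- function() {
  out <- numeric(n)                          # allocated once, filled in place
  for (i in seq_len(n)) out[i] <- i^2
  out
}

system.time(grow())       # roughly quadratic in n
system.time(prealloc())   # linear, usually far faster
```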

As far as language-agnostic methods, you can look at:

  1. Valgrind & cachegrind
  2. Monitoring of disk I/O, dirty buffers, etc.
  3. Monitoring of RAM (Cachegrind is helpful, but you could also just monitor RAM allocation and the details of RAM usage; see the sketch after this list)
  4. Usage of multiple cores
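
Within R, one modest angle on item 3 is to snapshot memory use around the focal call; note that `Rprofmem()` only records allocations if R was built with `--enable-memory-profiling`, and `lm()` is again just a placeholder for the package call:

```r
## Crude in-R memory monitoring. gc(reset = TRUE) clears the "max used"
## counters so the next gc() call reports the peak reached by the focal
## computation; Rprofmem() logs individual allocations, but only if R was
## configured with --enable-memory-profiling.
set.seed(1)
dat <- data.frame(x = rnorm(1e5), y = rnorm(1e5))

gc(reset = TRUE)
fit <- lm(y ~ x, data = dat)       # placeholder for the package call
print(gc())                        # inspect the "max used" columns

Rprofmem("memprof.out")
fit <- lm(y ~ x, data = dat)
Rprofmem(NULL)
```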

Is it something that can be built on top of unit testing, or something that is usually done separately?

This is hard to answer. For static analysis, it can occur before unit testing. For dynamic analysis, one may want to add more tests. Think of it as sequential design (i.e. from an experimental design framework): if the execution costs appear to be the same, within some statistical allowance for variation, then no further tests are needed. If, however, method B seems to have an average execution cost greater than method A, then one should perform more intensive tests.
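
A rough sketch of that sequential idea, using a rank-based test on repeated timings; `method_a()` and `method_b()` are hypothetical stand-ins for the old and new implementations:

```r
## Sequential-design flavour: compare timing samples from two versions and
## only escalate to more intensive testing when the difference looks real.
method_a <- function() sort(rnorm(1e5))
method_b <- function() sort(rnorm(1e5), method = "radix")

time_once <- function(f) system.time(f())[["elapsed"]]
a <- replicate(30, time_once(method_a))
b <- replicate(30, time_once(method_b))

## A rank-based test copes with the skew that timing data usually shows
wt <- wilcox.test(a, b)
if (wt$p.value < 0.05 && median(b) > median(a)) {
  message("possible regression: run a larger, more controlled experiment")
}
```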


Update 1: If I may be so bold, there's another question that I'd recommend including, which is: "What are some gotchas in comparing the execution time of two versions of a package?" This is analogous to assuming that two programs that implement the same algorithm should have the same intermediate objects. That's not exactly true (see this question - not that I'm promoting my own questions, here - it's just hard work to make things better and faster...leading to multiple SO questions on this topic :)). In a similar way, two executions of the same code can differ in time consumed due to factors other than the implementation.

So, some gotchas that can occur, either within the same language or across languages, within the same execution instance or across "identical" instances, which can affect runtime:

  1. Garbage collection - different implementations or languages can hit garbage collection under different circumstances. This can make two executions appear different, though it can be very dependent on context, parameters, data sets, etc. The GC-obsessive execution will look slower. (See the sketch after this list.)
  2. Caching at the level of the disk, motherboard (e.g. L1, L2, L3 caches), or other levels (e.g. memoization). Often, the first execution will pay a penalty.
  3. Dynamic voltage scaling - This one sucks. When there is a problem, this may be one of the hardest beasties to find, since it can go away quickly. It looks like cacheing, but it isn't.
  4. Any job priority manager that you don't know about.
  5. One method uses multiple cores or does some clever stuff about how work is parceled among cores or CPUs. For instance, getting a process locked to a core can be useful in some scenarios. One execution of an R package may be luckier in this regard, another package may be very clever...
  6. Unused variables, excessive data transfer, dirty caches, unflushed buffers, ... the list goes on.
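
To illustrate the first gotcha in R: `gcinfo(TRUE)` reports collections as they happen, so an unusually slow repetition can be attributed to GC rather than to the code change (a rough sketch, not a controlled benchmark):

```r
## Gotcha 1 in miniature: garbage collection can inflate individual timings.
## gcinfo(TRUE) prints a line whenever the collector runs, which helps
## attribute a slow repetition to GC rather than to a code change.
x <- replicate(50, rnorm(1e5), simplify = FALSE)

old <- gcinfo(TRUE)                    # report collections as they happen
timings <- replicate(10, system.time(lapply(x, function(v) v^2))[["elapsed"]])
gcinfo(old)                            # restore the previous setting

summary(timings)   # slow outliers often line up with the GC messages above
```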

The key question is: ideally, how should we test for differences in expected values, subject to the randomness created by order effects? Well, pretty simple: go back to experimental design. :)

When the empirical differences in execution times are different from the "expected" differences, it's great to have enabled additional system and execution monitoring so that we don't have to re-run the experiments until we're blue in the face.

answered Oct 04 '22 by Iterator