Ensuring reproducibility in an R environment

Tags:

r

I work in a computational biology lab, where we have several folks working on multiple projects, mostly in R (which is what I care about for this post). In the past, people would simply develop their code for each project, which may or may not involve boilerplate code copied over from previous projects. One thing that I've pushed over the years was to bring some centralized structure to this mess and have people identify common patterns such that we can turn these repeated/common blocks of code into packages for all of the many reasons one might think that is a good thing to do. So now our folks are using a mix of centralized packages/routines within their project specific scripts.

There's one gotcha here. We have a mandate from the powers that be that every script for every project need to be 100% reproducible over time to the best of our ability (and this includes 100% of all code we have direct access to, including our packages). That is, if I call function foo in package bar with parameter A to get result X today, 4 years from now I should get the exact same result. (erroneous output due to bugs is excepted here)

The topic of reproducibility has come up now and then in R within various circles, but typically it seems to be discussed in terms of reproducibility of process (e.g. vignettes). This is not the same thing - I can run a vignette today and then run the same code 6 months from now using updated packages and receive wildly different results.

The solution that's been agreed upon (which I'm not a fan of) is that if a function or package needs to be changed in a non-backwards compatible change that it simply gets a new name. Thus, if we needed to radically change function foo(), it'd be called foo2(), and if that needs a radical change it gets called foo3(). This ensures that any script that called foo() will always get the original result, while allowing things to march forward within the package repository. It works, but I really dislike this - it seems aesthetically extremely cluttered, and I worry that it will lead to mass confusion over time having packages bar, bar2, bar3, bar4 ... functions foo1, foo2, foo3, etc.

The problem is that I haven't come up with an alternate solution that's really better. One possibility would be to note version numbers of packages, R, etc and make sure those are loaded, but that has multiple problems - not the least of which is that it relies on proper package versioning discipline and that's prone to error. Also, this alternative was already rejected ;) Ideally what we'd have is some sort of notion of devel & release as most of these changes tend to happen earlier on and then level off with changes happening much less frequently. OTOH what devel really means here is "not actually in a package yet" (which we do), but it can be hard to determine exactly at what point is the right one to transport stuff over. Invariably the moment you think you're safe, that's when you realize you're not.

So with all this in mind, I'm curious if anyone else out there has dealt with similar situations, and how they might have resolved things.

edit: just to be clear, by non-backwards compatible, I'm not just talking about APIs and such, but also outputs for a given set of inputs.

591

asked Nov 03 '10 23:11

geoffjentry

3 Answers

This is indeed an important thing to think about and I think ultimately requires the institutionalization of a couple of different processes.

Version Control (svn, git, bzr, cvs, etc)
Unit Tests

My first reaction is that you need to institutionalize some sort of code management system. This will make it easier, because the old version of foo() is still available, if you really want it. From what you have said, it sounds like you need to package up your common functions and institute some sort of a release schedule. Scripts which require backward compatibility must include the package name and release information. This way it is possible to ALWAYS obtain foo() exactly as it was when the script was written. You should also make sure people only use official release versions in their work, because otherwise this could become quite a pain.

I agree, having a collection of foo:foo99 is doomed to failure. But at least it will be a gloriously confusing failure. Aesthetics aside, it will drive you all bonkers. If foo2() is an improvement (more accurate, faster, etc) of foo(), then it should be called foo() and released for use according to your company-wide release schedule. If it does something different, it is no longer foo(). It might be fooo() or superFoo() or fooMe(), but it ain't foo().

Finally, you need to start testing your functions. (Unit Tests) For each function that is published and made available for others, you should have a clearly defined test suite. Unless someone fixes a bug in foo(), the results should stay the same. If someone fixes a bug, then the results should be more accurate and will probably more desirable in most cases. If you do need to reproduce the old, incorrect, results, you can dig out an old version of foo() from your version control system. By instituting rigorous unit tests, you will know if/when the results of foo have changed. This knowledge should help minimize the number of foo() functions you need. Rather than create a version every time someone tweaks something, you can test the new version to see whether or not the results conform to expectations. But, this is tricky, because you have to make sure that your tests cover anything the function is ever likely to see, including bizarre edge cases. In a research setting, I would imagine that could become a challenge.

141

answered Sep 19 '22 22:09

Choens

I'm not sure about integrating it with R, but Sumatra might be worth looking into. It appears to allow you to keep track of code and results. So if you need to go back an re-run that simulation from 4 years ago, the code should be there.

answered Sep 20 '22 22:09

kmm

Well, ask yourself how you would do that in any other language. There's really nothing more to it than good bookkeeping I'm afraid:

record version numbers of all software involved
put the code in manageable chunks, say in packages.
make sure you have all software/packages involved still available in 5 years.

R can easily be made portable, including all installed packages. Keep a portable version of R together with the used packages, the code and the data on a CD-ROM for each analysis, and you're sure you can reproduce whenever you want. OK, you miss the OS, but can't have them all. In any case, if the OS makes a difference important enough to call the analysis not reproducible, the problem is very likely your analysis. You don't want to tell anybody your result is dependent on the version of Windows you use, do you?

PS : please get into peoples head that they should never ever in their life copy-paste code. They should wrap it in functions and use those. A whole lot easier and far less error-prone. I mean, what's the difference between copying

x <- read.table("sometable")
y <- ColSums(x)/4.3

and adjusting the values, or typing

myfun <- function(i,j){
  x <- read.table(i)
  y <- ColSums(x)/j
}

Saves you and a lot of other people a whole lot of copy-paste trouble. (How so, object not found? What object?)

answered Sep 20 '22 22:09

Joris Meys

Related questions
                            
                                Conditional assignment of one variable to the value of one of two other variables
                            
                                rstudio - is it possible to run a code in the background
                            
                                Favicon in Shiny
                            
                                Is there a way to have conditional markdown chunk execution in Rmarkdown?
                            
                                R programming - Adding extra column to existing matrix
                            
                                If file exists in folder read it else skip the processing part
                            
                                grab n letter words don't count apostrophes regex
                            
                                Changing Column Names in a List of Data Frames in R
                            
                                simplest python equivalent to R's grepl
                            
                                How to increase size of the points in ggplot2, similar to cex in base plots?
                            
                                Multiple ROC curves in one plot ROCR
                            
                                Get the min of two columns
                            
                                Changing tick intervals when x axis values are dates
                            
                                Index value for matrix in R?
                            
                                how to create a list in R from two vectors (one would be the keys, the other the values)?
                            
                                How to loop through a list in R
                            
                                Difference between r-base and r-recommended packages
                            
                                How can I generate a GUID in R?
                            
                                Reduce size of legend area in barplot
                            
                                R and Leaflet: How to arrange label text across multiple lines

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With