 

Strategies for repeating large chunk of analysis

Tags:

r

I find myself in the position of having completed a large chunk of analysis and now need to repeat the analysis with slightly different input assumptions.

The analysis, in this case, involves cluster analysis, plotting several graphs, and exporting cluster ids and other variables of interest. The key point is that it is an extensive analysis, and needs to be repeated and compared only twice.

I considered:

  • Creating a function. This isn't ideal, because then I have to modify my code to know whether I am evaluating in the function or in the parent environment. This additional effort seems excessive, makes the code harder to debug and may introduce side-effects.
  • Wrapping it in a for-loop. Again, not ideal, because then I have to create indexing variables, which can also introduce side-effects.
  • Creating some preamble code, wrapping the analysis in a separate file and sourcing it. This works, but seems very ugly and sub-optimal.

The objective of the analysis is to finish with a set of objects (in a list, or in separate output files) that I can analyse further for differences.

What is a good strategy for dealing with this type of problem?

Andrie asked Jun 22 '11 08:06

4 Answers

Making code reusable takes some time and effort, and brings a few extra challenges, as you mention yourself.

The question of whether to invest that effort is probably the key issue in informatics (if not in a lot of other fields): do I write a script to rename 50 files in a similar fashion, or do I go ahead and rename them manually?

The answer, I believe, is highly personal, and even then it differs case by case. If programming comes easily to you, you may be quicker to go the reuse route, as the effort will be relatively low for you (and even then, programmers typically like to learn new tricks, so that's a hidden, often counterproductive motivation).

That said, in your particular case I'd go with the sourcing option: since you plan to reuse the code only two more times, any greater effort would probably be wasted (you indicate that the analysis is rather extensive). So what if it's not an elegant solution? Nobody is ever going to see you do it, and everybody will be happy with the swift results.
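For what it's worth, a minimal sketch of that pattern (the file and variable names here are hypothetical, just to illustrate the idea):

    ## run1.R -- preamble: the input assumptions that differ between runs
    n.clusters  <- 4
    input.file  <- "data_run1.csv"
    output.file <- "results_run1.rds"

    ## the extensive analysis lives unchanged in its own file; it reads the
    ## objects defined above and writes its results to output.file
    source("analysis.R")

A second preamble file with different assumptions then reuses analysis.R untouched.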

If it turns out in a year or so that the reuse is higher than expected, you can still invest then. And by that time you will also have (at least) three cases for which you can compare the results of the rewritten, funky reusable version of your code with your current results.

If/when I do know up front that I'm going to reuse code, I try to keep that in mind while developing it. Either way I hardly ever write code that is not in a function (well, barring the two-liners for SO and other out-of-the-box analyses): I find this makes it easier for me to structure my thoughts.

Nick Sabbe answered Nov 11 '22 08:11


If at all possible, set parameters that differ between sets/runs/experiments in an external parameter file. Then, you can source the code, call a function, even utilize a package, but the operations are determined by a small set of externally defined parameters.

For instance, JSON works very well for this, and the RJSONIO and rjson packages allow you to load such a file into a list. Suppose you keep your parameters in a file called parametersNN.json. An example is as follows:

{
 "Version": "20110701a",
 "Initialization":
 {
   "indices": [1,2,3,4,5,6,7,8,9,10],
   "step_size": 0.05
 },
 "Stopping":
 {
   "tolerance": 0.01,
   "iterations": 100
 }
}

Save that as "parameters01.json" and load it as:

library(RJSONIO)
Params <- fromJSON("parameters01.json")

and you're off and running. (NB: I like to use unique version numbers within my parameters files, just so that I can identify the set later if I'm looking at the "parameters" list within R.) Just call your script and point it to the parameters file, e.g.:

Rscript --vanilla MyScript.R parameters01.json

then, within the program, identify the parameters file from the commandArgs() function.
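As a rough sketch, the top of MyScript.R could then look something like this (the two settings picked out at the end are just illustrations):

    library(RJSONIO)

    ## the first trailing argument is the parameters file, e.g. "parameters01.json"
    args   <- commandArgs(trailingOnly = TRUE)
    Params <- fromJSON(args[1])

    ## pick individual settings out of the list
    step.size <- Params$Initialization$step_size
    max.iter  <- Params$Stopping$iterations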

Later, you can break the code out into functions and packages, but this is probably the easiest way to make a vanilla script generalizable in the short term, and it's good practice for the long term, as code should be separated from the specification of run-, dataset-, or experiment-dependent parameters.

Edit: to be more precise, I would even specify input and output directories or files (or naming patterns/prefixes) in the JSON. This makes it very clear how one set of parameters led to one particular output set. Everything in between is just code that runs with a given parametrization, but the code shouldn't really change much, should it?


Update: Three months, and many thousands of runs, wiser than my previous answer, I'd say that the external storage of parameters in JSON is useful for 1-1000 different runs. When the parameters or configurations number in the thousands and up, it's better to switch to using a database for configuration management. Each configuration may originate in a JSON (or XML), but being able to grapple with different parameter layouts requires a larger scale solution, for which a database like SQLite (via RSQLite) is a fine solution.
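To give a flavour, a hypothetical sketch with RSQLite might look like this (the database, table, and column names are made up for the example):

    library(DBI)
    library(RSQLite)

    ## one row per configuration/run
    con <- dbConnect(SQLite(), "configurations.db")
    dbWriteTable(con, "configs",
                 data.frame(run_id    = 1:3,
                            step_size = c(0.05, 0.01, 0.10),
                            tolerance = 0.01),
                 overwrite = TRUE)

    ## later, fetch the parameters for a particular run
    params <- dbGetQuery(con, "SELECT * FROM configs WHERE run_id = 2")
    dbDisconnect(con)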

I realize this answer is overkill for the original question (how to repeat work only a couple of times, with a few parameter changes), but when scaling up to hundreds or thousands of parameter changes in ongoing research, more extensive tools are necessary. :)

Iterator answered Nov 11 '22 08:11


In those cases I like to work with a combination of a little shell script, a PDF-cropping program, and Sweave. That gives you back nice reports and encourages you to source. Typically I work with several files, almost like creating a package (at least I think it feels like that :) ). I have a separate file for the data juggling and separate files for the different types of analysis, such as descriptiveStats.R and regressions.R.

Btw, here's my little shell script:

 #!/bin/sh
 R CMD Sweave docSweave.Rnw
 for file in `ls pdfs`;
 do pdfcrop  pdfs/"$file" pdfs/"$file"
 done
 pdflatex docSweave.tex
 open docSweave.pdf 

The Sweave file typically sources the R files mentioned above when needed. I am not sure whether that's what you're looking for, but that's my strategy so far. At the very least, I believe that creating transparent, reproducible reports helps you to follow at least some strategy.
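For illustration, a chunk in docSweave.Rnw that pulls in those analysis files could look roughly like this (the chunk name and options are just an example):

    <<analysis, echo=FALSE, results=hide>>=
    source("descriptiveStats.R")   # the separate analysis files mentioned above
    source("regressions.R")
    @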

Matt Bannert answered Nov 11 '22 08:11


Your third option is not so bad. I do this in many cases. You can build in a bit more structure by putting the results of your preamble code into environments and attaching the one you want to use for further analysis. An example:

    setup1 <- local({
      x <- rnorm(50, mean = 2.0)
      y <- rnorm(50, mean = 1.0)
      # ...
      environment()  # return the local environment so it can be attached later
    })

    setup2 <- local({
      x <- rnorm(50, mean = 1.8)
      y <- rnorm(50, mean = 1.5)
      # ...
      environment()
    })

Then attach(setup1) and run/source your analysis code:

plot(x, y)
t.test(x, y, paired = TRUE, var.equal = TRUE)
...

When finished, detach(setup1) and attach the second one.

Now, at least you can easily switch between setups. Helped me a few times.

Harald Brendel answered Nov 11 '22 07:11