How do you combine "Revision Control" with "Workflow" for R?

I remember coming across R users writing that they use "revision control" (also called "source control"), and I am curious to know: how do you combine revision control with your statistical analysis workflow?

Two (very) interesting discussions talk about how to deal with the workflow, but neither of them refers to the revision control element:

  • How to organize large R programs?
  • Workflow for statistical analysis and report writing

A Long Update To The Question: Following some of the answers, and Dirk's question in the comments, I would like to focus my question a bit more.

After reading the Wikipedia article about "revision control" (which I was previously not familiar with), it became clear to me that when using revision control, one builds up a development structure for one's code. This structure either leads to a "final product" or to several branches.

When building something like, let's say, a website, there is usually one end product you work towards (the website), with some prototypes along the way.

But when doing a statistical analysis, the work (in my view) is different. Sometimes you know where you want to get to. But more often, you explore: you explore ways of cleaning the dataset, explore different methods of statistical analysis, and ask various questions of your data (and I am writing this knowing how Frank Harrell and other experienced statisticians feel about data dredging).

That is why the workflow question with statistical programming is (in my view) a serious and deep question, raising many issues. The simpler ones are technical:

  • Which revision control software do you use (and why)?
  • Which IDE do you use (and why)?

The more interesting questions are about work process:

  • How do you structure your files?
  • What do you keep as a separate file and what as a revision? Or, asking in a different way: what should be a "branch" and what should be a "sub project" in your code? For example: when starting to explore your data, should a plot be created and then erased because it didn't lead anywhere (but kept as a revision), or should there be a backup file of that path?

How you solve this tension was my initial curiosity. The second question is "what might I be missing?" What rules (of thumb) should one follow so as to avoid common pitfalls when doing statistical programming with version control?

My intuition is that statistical programming is inherently different than software development (I am writing this without being a real expert in statistical programming, and even less so in software development). That's why I am unsure which of the lessons I have read here about version control would be applicable.

Thanks a lot, Tal

Tal Galili asked Feb 18 '10 06:02


2 Answers

My workflow is not that different from Bernd's. I usually have a main directory where I put all my *.R code files. As soon as I have more than about 5 lines in a text file I start version control, in my case git. Most of my work is not in a team context, meaning that I'm the only one changing my code. As soon as I make a substantive change (yes, that is subjective) I do a check in. I agree with Dirk that this process is orthogonal to the workflow.

I use Eclipse + StatET, and while there is a plugin for git in Eclipse (EGit and probably others), I don't use it. I'm on Windows and just use git-gui for Windows. Here are some more options.

There's a lot of room for personal idiosyncrasies in version control, but I recommend this one tip as a best practice: if you report results to others (e.g. a journal article, your team, management in your firm) ALWAYS do a version control check in right before running results that go out to others. Invariably, 3 months later someone will look at your results and ask some question about the code which you can't answer unless you know the EXACT state of the code when you produced those results. So make it a practice and put in the comments "this is the version of the code that I used for 4th quarter financials" or whatever your use case is.
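
With git, the check in described above might look roughly like the sketch below. It builds a throwaway repository so that it can be run as-is; the file name, identity, commit message, tag name, and the results_code_version.txt file are all placeholders made up for illustration, not part of anyone's actual setup.

```shell
#!/bin/sh
set -eu

# Throwaway repository so the example runs anywhere; in practice you would
# already be inside your analysis directory.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Analyst"          # placeholder identity
git config user.email "analyst@example.com"     # placeholder identity

# Stand-in for the real analysis script (hypothetical file name).
printf 'x <- c(1, 2, 3)\nprint(mean(x))\n' > analysis.R

# Check in the exact code that is about to produce the reported results.
git add analysis.R
git commit -q -m "Code used for 4th quarter financials"

# Give that commit a name you can still find three months later.
git tag -a q4-financials -m "Version used for 4th quarter financials"

# Keep the commit hash next to the results that go out (hypothetical file).
git rev-parse HEAD > results_code_version.txt
cat results_code_version.txt
```

When the question arrives later, something like `git show q4-financials:analysis.R` prints the script exactly as it was at that check in.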

Also keep in mind that version control is no replacement for a good backup plan. My motto is: "3 copies. 2 geographies. 1 mind at peace."

EDIT (Feb 24, 2010): Joel Spolsky, one of the founders of Stack Overflow, just released a highly visual and very cool intro to Mercurial. This tutorial alone may be reason to adopt Mercurial if you have not already chosen a revision control system. I think when it comes to Git vs. Mercurial the most important advice is to choose one and use it. Maybe use what your friends/coworkers use or use the one with the best tutorial. But just use one already! ;)

JD Long answered Sep 25 '22 17:09


Rather than focusing on revision control in particular, it sounds like you're really asking a bigger question about how statistical analysis compares to software development. That's an interesting question. Here are some thoughts:

Data analysis can be more like an art than a science. In a sense, you might want to look for inspiration to the process that an author would follow when writing a book more than the process that a software developer would follow. On the other hand, I have yet to encounter a software project that followed a straight line. And even at a theoretical level, there is a great amount of variance in software development methodologies. Of these, given that a statistical analysis can be a discovery process (i.e. one that can't be fully planned up front), it would make sense to follow something like an agile methodology (much more so than something like the waterfall methodology). In other words, you need to plan for your analysis to be iterative and self-reflective.

That said, I think the notion that statistical analysis is purely exploratory with no goal in mind is potentially problematic. That can lead to the point where you are 5 steps past your eureka moment, and have no way to get back to it. There is always a goal of some sort, even if the goal itself is changing. Moreover, if there is no goal, how will you know when you've reached the end?

One approach is to start off with one R file as you start a project (or a set of files like in the Josh and Bernd examples), and progressively add to it (so that it grows in size) as you make discoveries. This is especially true when you have data that needs to be kept as part of the analysis. This file should be version controlled regularly to ensure that you can always step backwards if you make mistakes (allowing for incremental gains). Version control systems are immensely helpful in development not just because they ensure that you don't lose things, but also because they provide you with a timeline. And tag your check-ins so that you know what's in them at a glance, and note major milestones. I love JD's point about checking in before submitting something.
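
As a rough illustration of the timeline and of stepping backwards, here is a minimal git sketch. It assumes git (other systems have equivalents), creates a throwaway repository so it can be run as-is, and uses a made-up file name, identity, commit messages, and tag name (`eureka`).

```shell
#!/bin/sh
set -eu

# Throwaway repository; file name and identity are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# First check-in: a result worth being able to return to.
printf 'fit <- lm(y ~ x, data = d)\n' > analysis.R
git add analysis.R
git commit -q -m "First linear model"
git tag eureka                      # mark the milestone

# Later check-in: an exploration that turns out to be a dead end.
printf 'fit <- lm(y ~ poly(x, 9), data = d)\n' > analysis.R
git commit -q -a -m "Try a 9th-degree polynomial"

# The timeline, newest first.
git log --oneline

# Step backwards: restore the file as it was at the milestone and check
# that in, so the dead end stays in the history but not in the working file.
git checkout eureka -- analysis.R
git commit -q -m "Return to the first linear model"
cat analysis.R
```

Restoring the old version as a new check-in, rather than rewriting history, keeps the abandoned exploration available in case it turns out to matter later.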

Once you have reached your final set of conclusions, it's often best to create a final version of your file that summarizes your analysis from start to end. You might even consider putting this into a Sweave document so that it's fully self-contained and literate.

You should also give serious thought to what others around you are doing. Nothing makes me cringe more than to see people reinventing the wheel, especially when it means extra work for the group as a whole to integrate with.

Your decisions about which version control system to use, which IDE, etc. (implementation issues) are ultimately extremely low on the totem pole in relation to the overall project management. Just use any one of them properly and you're already 95% of the way there, and the differences between them are small in comparison to the alternative of using nothing.

Lastly, if you are using something like GitHub, Google Code, or R-Forge, you will note something that they all have in common: a suite of tools beyond just a version control system. Namely, you should consider using things like the issue tracking system and the wiki to document progress and log open issues/tasks. The more organized you are with your analysis, the greater the likelihood of success.

Shane answered Sep 25 '22 17:09