I write scientific research code, specifically in bioinformatics. Of course, in science, results should be reproducible. People who are not involved in a project on a regular basis and don't understand the infrastructure in detail may legitimately want to see my code to reproduce results. The problem is that making code self-contained enough to easily hand over or explain to such a person seems to severely limit the amount of reuse that's possible.
It's often convenient to factor out functionality that's used in several related projects into a personal library, but it's not convenient to dump said library with 5,000 lines of (admittedly poorly documented, since it's not intended to be production/release quality) code that have nothing to do with the problem at hand on someone who wants to reproduce a result real quick.
It's often convenient to have a set of a few key libraries installed on your system and readily available for use without thinking twice, but it's not convenient to explain to someone who's primarily a scientist, not a programmer, how you set all this stuff up. This is especially true if you don't remember some of the details yourself. (Note, though, that the details in question are technical minutiae that have nothing to do with the science.)
It's often convenient to keep all the code for several related facets of a research project in one big program with tons of options rather than writing completely self-contained code for each slight variation/thing you tried, but again, it's not convenient to dump all this on, or explain all this to, someone who just wants to reproduce a result.
What are some ways to deal with these issues so that I can reuse code, but still allow someone who wants to reproduce my results to get my code up and running with a reasonable amount of effort? Observe that at the core of my question is the possibility of creating reusable libraries of code that is not very mature.
I think one way to answer this question is to consider how other tools out in the scientific programming world do it. I'm going to make this answer a community wiki, and people can add to it with codes they know about; then maybe we can end up with a list of ideas and examples we can all use for these sorts of things.
The "bazillion options" approach
The UNIX small-tools approach - build lots of little tools that can be strung together to make complex tools. Works well if what your tools do can be decomposed that way.
Build a tool out of routines: here the program is distributed as a kit that comes with a script (and some examples) that builds a problem-specific application out of the bits and pieces.
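To make the small-tools idea a bit more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the function names `drop_comments` and `select_column`, and the tab-separated input format, are assumptions and not taken from any particular package. Each function does one small job on a stream of lines, so the pieces can be chained, and someone reproducing a result only needs the one or two they actually use.

```python
from typing import Iterable, Iterator


def drop_comments(lines: Iterable[str]) -> Iterator[str]:
    """Small tool 1 (hypothetical): skip blank lines and '#' comment lines."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield line


def select_column(lines: Iterable[str], index: int, sep: str = "\t") -> Iterator[str]:
    """Small tool 2 (hypothetical): keep a single column of delimited text."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        yield fields[index] + "\n"


# Chaining the tools works like a shell pipeline, just with generators.
# The sample records below are invented for this example.
sample = ["# name\tscore\n", "geneA\t1.5\n", "\n", "geneB\t2.0\n"]
names = list(select_column(drop_comments(sample), 0))
```

To turn either function into a stand-alone filter, one could wrap it in a short script that reads `sys.stdin` and writes to `sys.stdout`; a script of that kind is also roughly what the "kit plus build script" approach hands to the user.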
These should have been comments, but they would not fit in that small box...
I write scientific research code, specifically in bioinformatics. Of course, in science, results should be reproducible. People who are not involved in a project on a regular basis and don't understand the infrastructure
You're talking about the infrastructure here, programming-wise, right?
in detail may legitimately want to see my code to reproduce results. The problem is that making code self-contained enough to easily hand over or explain to such a person seems to severely limit the amount of reuse that's possible.
I don't understand. Why wouldn't they be able to reproduce results? Or did you mean to say they wish to reuse your programs?
It's often convenient to factor out functionality that's used in several related projects into a personal library, but it's not convenient to dump said library with 5,000 lines of (admittedly poorly documented, since it's not intended to be production/release quality) code that have nothing to do with the problem at hand on someone who wants to reproduce a result real quick.
(Apart from the "result reproducing" part, but that may be a language issue on my side.) Ask yourself how many people are actually going to use your libraries. If, as in many cases, it is only one or two, then I don't think it is reasonable to change them for their sake.
I usually write libraries for my private use in a way that suits my way of thinking. Being asked to adjust them purely for someone else's convenience (i.e. without getting paid specifically for that, which I assume you're not) is really just another way of them saying "I didn't feel like writing my own, and I don't feel like thinking about how you composed yours, so go and restructure it so I can easily use it without thinking".
It's often convenient to have a set of a few key libraries installed on your system and readily available for use without thinking twice, but it's not convenient to explain to someone who's primarily a scientist, not a programmer, how you set all this stuff up. This is especially true if you don't remember some of the details yourself. (Note, though, that the details in question are technical minutiae that have nothing to do with the science.)
It's often convenient to keep all the code for several related facets of a research project in one big program with tons of options rather than writing completely self-contained code for each slight variation/thing you tried, but again, it's not convenient to dump all this on, or explain all this to, someone who just wants to reproduce a result.
Of course. The problem with "scientific coding" (how I dislike that expression) is that the program is just a tool in the process of working on something else, meaning you're making it without actually wishing to self-contain it, since it is expected to be modified as work goes on.
What are some ways to deal with these issues so that I can reuse code, but still allow someone who wants to reproduce my results to get my code up and running with a reasonable amount of effort?
Branching the code in a VCS for specific cases, and then giving someone the version closest to what they need, has always worked for me.