 

Scientific Computing: Balancing Self-Contained-ness and Reuse? [closed]

I write scientific research code, specifically in bioinformatics. Of course, in science, results should be reproducible. People who are not involved in a project on a regular basis and don't understand the infrastructure in detail may legitimately want to see my code to reproduce results. The problem is that making code self-contained enough to easily give/explain to such a person seems to severely limit the amount of reuse that's possible.

  • It's often convenient to factor out functionality that's used in several related projects into a personal library, but it's not convenient to dump a 5,000-line library (admittedly poorly documented, since it's not intended to be production/release quality), most of which has nothing to do with the problem at hand, on someone who just wants to reproduce a result quickly.

  • It's often convenient to have a set of a few key libraries installed on your system and readily available for use without thinking twice, but it's not convenient to explain to someone who's primarily a scientist, not a programmer, how you set all this stuff up. This is especially true if you don't remember some of the details yourself. (Note, though, that the details in question are technical minutiae that have nothing to do with the science.)

  • It's often convenient to keep all the code for several related facets of a research project in one big program with tons of options rather than writing completely self-contained code for each slight variation/thing you tried, but again, it's not convenient to dump all this on, or explain all this to, someone who just wants to reproduce a result.

What are some ways to deal with these issues so that I can reuse code, but still allow someone who wants to reproduce my results to get my code up and running with a reasonable amount of effort? Observe that at the core of my question is the possibility of creating reusable libraries of code that is not very mature.

asked Mar 02 '11 by dsimcha

2 Answers

I think one way to answer this question is to consider how other tools out in the scientific programming world do it. I'm going to make this answer a community wiki, and people can add to it with codes they know about; then maybe we can end up with a list of ideas and examples we can all use for these sorts of things.

  1. The "bazillion options" approach

    1. GUIs with lots of menus and sub-menus
    2. Command-line tools with many arguments, hopefully many of them optional
      • Lots and lots! Tools built on PETSc use this approach to control their linear algebra solvers
    3. Tools, command-line or otherwise, that have configuration files with lots of arguments that are hopefully optional
      • The Gadget SPH code takes this approach;
      • So does MESA for stellar evolution
      • So does the ADIOS parallel I/O library, which uses XML for the configuration file.
  2. The UNIX small-tools approach - build lots of little tools that can be strung together to make complex tools. Works well if what your tools do can be decomposed that way. (A minimal sketch of this pattern appears after this list.)

    • Molecular dynamics package Gromacs
    • The NEMO stellar dynamics toolbox
    • Many visualization packages also sort of work this way; within the GUI, one defines a pipeline of small tools. ParaView, OpenDX, VisIT
    • For general Python computations, Ruffus can be used to organize the small tools into a larger workflow
  3. Build a tool out of routines: here the program is distributed as a kit that comes with a script (and some examples) that builds a problem-specific application out of the bits and pieces.

    • The FLASH code is one that does this.
  4. Exposing the functionality as one or more libraries that can be linked in:
    • Tools, often mathematical in nature, such as FFTW, PETSc, GSL...
  5. Related to 3+4: A plugin-type approach where a tool (often, but not always, a GUI) exposes plugin functionality that can be easily incorporated into a larger workflow
    • Lots of Visualization packages, like ParaView
  6. Related to 2: Instead of the tools being called at the command line, the tool has its own command line in which one can call many individual routines; having one's own command line allows you to exercise a bit more control over the environment than just leaving it to the shell (but, of course, requires more work).
    • The venerable n-body visualization tool Tipsy
    • Lots of general analysis tools - Octave, SciLab, IDL
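
To make the optional-arguments and small-tools approaches (1.2 and 2 above) concrete, here is a minimal sketch of my own; it is not taken from any of the packages above, and the script name, options, and file names are all hypothetical:

    #!/usr/bin/env python
    # filter_seqs.py -- keep sequences within a length range.
    # Reads one sequence per line on stdin and writes survivors to
    # stdout, so it can be chained with other small tools in a shell
    # pipeline. This is a sketch of the pattern, not a real tool.
    import argparse
    import sys

    def main():
        parser = argparse.ArgumentParser(
            description="Filter sequences by length.")
        # Optional arguments with sensible defaults: casual users can
        # ignore them, power users get fine-grained control.
        parser.add_argument("--min-len", type=int, default=50,
                            help="discard sequences shorter than this")
        parser.add_argument("--max-len", type=int, default=None,
                            help="also discard sequences longer than this")
        args = parser.parse_args()

        for line in sys.stdin:
            seq = line.strip()
            if not seq or len(seq) < args.min_len:
                continue
            if args.max_len is not None and len(seq) > args.max_len:
                continue
            print(seq)

    if __name__ == "__main__":
        main()

Because it only talks to stdin and stdout, it composes with anything else on the command line, e.g. `python filter_seqs.py --min-len 100 < reads.txt | sort | uniq -c`.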
answered Dec 26 '22 by 2 revs


These should have been comments, but they wouldn't fit whole in that small box...

> I write scientific research code, specifically in bioinformatics. Of course, in science, results should be reproducible. People who are not involved in a project on a regular basis and don't understand the infrastructure

You're talking about the infrastructure here, programming-wise, right?

> in detail may legitimately want to see my code to reproduce results. The problem is that making code self-contained enough to easily give/explain to such a person seems to severely limit the amount of reuse that's possible.

I don't understand. Why wouldn't they be able to reproduce results? Or did you mean to say they wish to reuse your programs?

> It's often convenient to factor out functionality that's used in several related projects into a personal library, but it's not convenient to dump a 5,000-line library (admittedly poorly documented, since it's not intended to be production/release quality), most of which has nothing to do with the problem at hand, on someone who just wants to reproduce a result quickly.

(Apart from the "result reproducing" part, which may be a language issue on my side.) Ask yourself how many people are actually going to use your libraries. If, as is often the case, it's only one or two, then I don't see it as reasonable to change the library for their sake.

I usually make libraries for my private use in a way that suits my way of thinking. Adjusting them to others, purely for the sake of their convenience (i.e. without getting paid specifically for that, which I assume you're not), is really another way of them saying "I didn't feel like writing my own, and I don't feel like thinking about how you composed yours, so go and restructure it so I can easily use it without thinking".

> It's often convenient to have a set of a few key libraries installed on your system and readily available for use without thinking twice, but it's not convenient to explain to someone who's primarily a scientist, not a programmer, how you set all this stuff up. This is especially true if you don't remember some of the details yourself. (Note, though, that the details in question are technical minutiae that have nothing to do with the science.)

> It's often convenient to keep all the code for several related facets of a research project in one big program with tons of options rather than writing completely self-contained code for each slight variation/thing you tried, but again, it's not convenient to dump all this on, or explain all this to, someone who just wants to reproduce a result.

Of course. The problem with "scientific coding" (how I dislike that expression) is that the program is just a tool in the process of working on something else, meaning you write it without actually intending to make it self-contained, since it is expected to be modified as the work goes on.

> What are some ways to deal with these issues so that I can reuse code, but still allow someone who wants to reproduce my results to get my code up and running with a reasonable amount of effort?

Branching the code in a VCS for specific cases, and then giving someone the version closest to what they need, has always worked for me.
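
One small complement to this (my own illustration, not part of the original answer): whichever branch or version you hand out, it helps if the analysis records exactly which commit it ran from, so a result can always be traced back to the code that produced it. A minimal Python sketch, assuming the code lives in a git repository:

    import subprocess

    def code_version():
        # Ask git for the commit hash of the working copy so it can be
        # stamped into result files and logs for traceability.
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()

    print("produced by commit", code_version())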

answered Dec 26 '22 by Rook