
'make'-like dependency-tracking library?

Tags: r, makefile

There are many things to like about Makefiles, and many pains in the butt.

In the course of doing various projects (I'm a research scientist, "data scientist", or whatever) I often find myself starting out with a few data objects on disk, generating various artifacts from those, generating artifacts from those artifacts, and so on.

It would be nice if I could just say "this object depends on these other objects", and "this object is created in the following manner from these objects", and then ask a Make-like framework to handle the details of actually building them, figuring out which objects need to be updated, farming out work to multiple processors (like Make's -j option), and so on. Makefiles can do all this - but the huge problem is that all the actions have to be written as shell commands. This is not convenient if I'm working in R or Perl or another similar environment. Furthermore, a strong assumption in Make is that all targets are files - there are some exceptions and workarounds, but if my targets are e.g. rows in a database, that would be pretty painful.
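For context, here is roughly what this looks like with a plain Makefile today: every action is a shell command (here shelling out to Rscript), and every target is a file on disk. The script and file names are invented for illustration, and recipe lines must be indented with a literal tab:

    # Hypothetical two-step pipeline driven by shell commands.
    summary.csv: cleaned.rds summarize.R
    	Rscript summarize.R cleaned.rds summary.csv

    cleaned.rds: raw.csv clean.R
    	Rscript clean.R raw.csv cleaned.rds

Running make -j2 summary.csv rebuilds only the stale pieces, in parallel where possible - but every action is a fresh shell process, and every intermediate result has to round-trip through a file.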

To be clear, I'm not after a software-build system. I'm interested in something that (more generally?) deals with dependency webs of artifacts.

Anyone know of a framework for these kinds of dependency webs? Seems like it could be a nice tool for doing data science, for visually showing how results were generated, and so on.

One extremely interesting example I saw recently was IncPy, but it looks like it hasn't been touched in quite a while, and it's very closely coupled with Python. It's probably also much more ambitious than I'm hoping for, which is why it has to be so closely coupled with Python.

Sorry for the vague question, let me know if some clarification would be helpful.

asked Dec 12 '12 by Ken Williams

3 Answers

A new system called "Drake" was announced today that targets this exact situation: http://blog.factual.com/introducing-drake-a-kind-of-make-for-data. Looks very promising, though I haven't actually tried it yet.

answered Oct 27 '22 by Ken Williams


This question is several years old, but I thought adding a link to remake here would be relevant.

From the GitHub repository:

The idea here is to re-imagine a set of ideas from make but built for R. Rather than having a series of calls to different instances of R (as happens if you run make on R scripts), the idea is to define pieces of a pipeline within an R session. Rather than being language agnostic (like make must be), remake is unapologetically R focussed.

It is not on CRAN yet, and I haven't tried it, but it looks very interesting.
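For the curious: the README shows the pipeline being declared in a remake.yml file and built from within an R session with remake::make(). A minimal sketch in the spirit of the README example (untested; the function names are placeholders you would define in code.R):

    sources:
      - code.R        # defines download_data(), process_data(), myplot()

    targets:
      all:
        depends: plot.png

      data:
        command: download_data()

      processed:
        command: process_data(data)

      plot.png:
        command: myplot(processed)
        plot: true

Calling remake::make() then rebuilds only the targets whose dependencies have changed.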

answered Oct 27 '22 by Guillaume


I would give Bazel a try for this. It is primarily a software build system, but its genrule targets can perform fairly arbitrary file generation, too.

Bazel is very extensible via its Python-like Starlark language, which should be far easier to use for complicated tasks than make. You can start by writing simple genrule steps by hand, then refactor common patterns into macros, and if things become more complicated, even write your own rules. So you should be able to express your individual transformations at a high level that models how you think about them, then turn that representation into lower-level constructs using something that feels like a proper programming language.
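As a rough sketch of the starting point (target, file, and script names invented here, and assuming Rscript is available to the build), a genrule wrapping an R script in a BUILD file could look like:

    genrule(
        name = "summary",
        srcs = [
            "raw.csv",
            "summarize.R",
        ],
        outs = ["summary.csv"],
        # $(location f) resolves a label to a path; $@ is the single output.
        cmd = "Rscript $(location summarize.R) $(location raw.csv) $@",
    )

bazel build :summary would then regenerate summary.csv only when the contents of raw.csv or summarize.R actually change.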

Where make depends on timestamps, Bazel checks fingerprints (content digests). So if any one step produces the same output even though one of its inputs changed, subsequent steps won't need to be recomputed. If some of your data-processing steps project or filter data, there is a good chance of this kind of thing happening.

I see your question is tagged for R, even though it doesn't mention it much. Under the hood, R computations in Bazel would still boil down to R CMD invocations in the shell. But you could have complicated multi-line commands assembled in complicated ways to read your inputs, process them, and store the outputs. If the cost of initializing the R binary is a concern, Rserve might help, although using it would make the setup depend on a locally accessible Rserve instance, I believe. Even with that, I see nothing that would avoid the cost of storing the data to file and loading it back from file. If you want something that avoids that cost by keeping things in memory between steps, then you'd be looking at a very R-specific tool, not a generic tool like you requested.

In terms of “visually showing how results were generated”, bazel query --output graph can be used to generate a graphviz dot file of the dependency graph.
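For example (the target name //analysis:summary is made up):

    bazel query 'deps(//analysis:summary)' --output graph > graph.dot
    dot -Tpng graph.dot -o graph.png    # render with Graphviz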

Disclaimer: I currently work at Google, which internally uses a variant of Bazel called Blaze; Bazel is the open-source release of Blaze. I'm very familiar with using Blaze, but not with setting up Bazel from scratch.

answered Oct 27 '22 by MvG