 

What is the state-of-the-art way to handle what Makefiles do for Python data analysis?

I have a program structured as a DAG that processes and cleans certain files, combines them, then does additional calculations. I want a way to run the whole analysis pipeline and re-run it if anything changes, without having to re-process every single component.

I read about Makefiles and thought that they sound like the perfect solution. I am also aware that make is probably outdated and that better alternatives probably exist, but I generally only find large lists of workflow scheduler tools that are not quite suited to this purpose, as far as I can tell (e.g., Airflow, Luigi, Nextflow, Dagobah, etc.).

It seems like many of these are overkill with schedulers, GUIs, etc. which I don't really need. I just want one file that does the following:

  • makes it obvious what all of the python scripts are that need to run
  • shows file dependencies so that a full re-run will only redo parts where something has been changed upstream
  • has the potential for some parallelization (not very necessary)
  • doesn't have too much boilerplate

Makefile example:

.PHONY : dats
dats : isles.dat abyss.dat

# Recipe lines must start with a tab character, not spaces.
isles.dat : books/isles.txt
	python countwords.py books/isles.txt isles.dat

abyss.dat : books/abyss.txt
	python countwords.py books/abyss.txt abyss.dat

.PHONY : clean
clean :
	rm -f *.dat

Is this the best procedure to run something like this in python or is there a better way?

teepee asked Nov 08 '19 00:11



2 Answers

DVC (Data Version Control) includes a modern re-implementation and extension of make that is particularly well suited to data-science pipelines.

Handling pipelines in DVC has important benefits over make in many scenarios, such as relying on file checksums rather than modification times. On the other hand, make is simpler in some respects and has a powerful macro mechanism. Still, some elements of makefile syntax are quite subtle (e.g., multiple outputs, intermediate files), and make generally doesn't support whitespace in filenames.
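To illustrate the checksum idea mentioned above, here is a minimal Python sketch of content-based change detection. It is not DVC's actual implementation: the function names, the choice of SHA-256, and the JSON state file are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_since_last_run(deps, state_file):
    """Return the dependencies whose contents differ from the recorded digests.

    `state_file` is a hypothetical JSON file written by record_run();
    if it does not exist yet, every dependency counts as changed.
    """
    state_path = Path(state_file)
    recorded = json.loads(state_path.read_text()) if state_path.exists() else {}
    return [d for d in deps if recorded.get(str(d)) != file_digest(d)]


def record_run(deps, state_file):
    """Store the current digests so the next run can compare against them."""
    digests = {str(d): file_digest(d) for d in deps}
    Path(state_file).write_text(json.dumps(digests))
```

With this approach, touching a file or checking it out again (which changes its modification time but not its contents) does not mark it as changed, whereas a timestamp comparison would trigger a rebuild.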

amka66 answered Nov 15 '22 19:11


Is this the best procedure to run something like this in python or is there a better way?

"Best" is surely in the eye of the beholder. However, if the make-based approach presented in the question is satisfactorily representative of the problem then it is a good way. make implementations are very widely available, and their behavior is well understood and generally well-suited to problems such as the one presented.
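For readers unfamiliar with how make decides what to redo, the basic rule can be sketched in Python as follows. This is only an illustration of the timestamp comparison, with a hypothetical function name; real make also handles missing prerequisites, phony targets, and chained rules, which this sketch does not.

```python
import os


def needs_rebuild(target, prerequisites):
    """Mimic make's basic rule for a single target.

    The target is out of date if it does not exist or if any
    prerequisite has a newer modification time. Prerequisites are
    assumed to exist; a missing one raises FileNotFoundError here.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For the Makefile in the question, needs_rebuild("isles.dat", ["books/isles.txt"]) corresponds to the check make performs before running the countwords.py recipe.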

There are other build tools that compete with make, some written in Python, and there are undoubtedly some more esoteric software frameworks that could be applied to the task. Nevertheless, if you want to focus on doing the work instead of on building the framework to do the work, then I don't see any reason to look past the make-based solution you already have.

John Bollinger answered Nov 15 '22 20:11