I have a program structured as a DAG that processes and cleans certain files, combines them, and then does additional calculations. I want a way to run the whole analysis pipeline, and to re-run it if anything changes, without having to re-process every single component.
I read about Makefiles and thought they sounded like the perfect solution. I am also aware that make is probably dated and that better alternatives probably exist, but I generally only find long lists of workflow-scheduler tools that are not quite suited to this purpose, as far as I can tell (e.g., Airflow, Luigi, Nextflow, Dagobah, etc.).
It seems like many of these are overkill, with schedulers, GUIs, etc. that I don't really need. I just want one file that does the following:
Makefile example:
.PHONY : dats
dats : isles.dat abyss.dat

# Note: recipe lines must be indented with a TAB character, not spaces.
isles.dat : books/isles.txt
	python countwords.py books/isles.txt isles.dat

abyss.dat : books/abyss.txt
	python countwords.py books/abyss.txt abyss.dat

.PHONY : clean
clean :
	rm -f *.dat
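To illustrate the behavior I want, a session with GNU make would look something like this (a sketch, assuming books/isles.txt, books/abyss.txt, and countwords.py all exist):

$ make dats                # runs both countwords.py commands
$ make dats                # make: Nothing to be done for 'dats'.
$ touch books/isles.txt    # simulate a change to one input
$ make dats                # re-runs only the isles.dat recipe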
Is this the best procedure to run something like this in Python, or is there a better way?
make essentially keeps your project up to date by rebuilding only those parts whose dependencies are out of date. It can also automate compilation, builds, and testing. In this context, a dependency (prerequisite) is a file, such as a library or a chunk of code, that is essential for building its parent target.
The make utility requires a file, Makefile (or makefile), which defines a set of tasks to be executed. You may have used make to compile a program from source code. Most open source projects use make to compile a final executable binary, which can then be installed using make install.
A simple makefile consists of “rules” with the following shape:

target … : prerequisites …
	recipe
	…

A target is usually the name of a file that is generated by a program; examples of targets are executable or object files. A target can also be the name of an action to carry out, such as 'clean' (see Phony Targets).
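For instance, here is one of the rules from your own Makefile annotated with those parts:

isles.dat : books/isles.txt                          # target : prerequisites
	python countwords.py books/isles.txt isles.dat   # recipe (TAB-indented)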
DVC (Data Version Control) includes a modern re-implementation and extension of make that is particularly suited to data-science pipelines. Handling pipelines in DVC has important benefits over make for many scenarios, such as relying on file checksums rather than modification times. On the other hand, make is simpler in some respects, and it has a powerful macro mechanism. Still, there are elements of makefile syntax that are quite subtle (e.g., multiple outputs, intermediate files), and make generally doesn't support whitespace in filenames.
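For comparison, here is a minimal sketch of how the first stage of the question's pipeline might be declared in DVC's dvc.yaml (the stage name count_isles is an arbitrary label; file names are taken from the question):

stages:
  count_isles:
    # re-run this stage whenever the checksum of any dep changes
    cmd: python countwords.py books/isles.txt isles.dat
    deps:
      - countwords.py
      - books/isles.txt
    outs:
      - isles.dat

Running dvc repro then rebuilds only the stages whose dependencies' checksums have changed.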
Is this the best procedure to run something like this in Python, or is there a better way?
"Best" is surely in the eye of the beholder. However, if the make
-based approach presented in the question is satisfactorily representative of the problem then it is a good way. make
implementations are very widely available, and their behavior is well understood and generally well-suited to problems such as the one presented.
There are other build tools that compete with make
, some written in Python, and there are undoubtedly some more esoteric software frameworks that could be applied to the task. Nevertheless, if you want to focus on doing the work instead of on building the framework to do the work, then I don't see any reason to look past the make
-based solution you already have.
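As an aside, if you do stick with make, GNU make's pattern rules and variables can collapse the two near-identical .dat rules from the question into one (a sketch, assuming GNU make and the same file layout as the question):

DATS := isles.dat abyss.dat

.PHONY : dats
dats : $(DATS)

# One pattern rule replaces both per-file rules; $< expands to the
# prerequisite (books/NAME.txt) and $@ to the target (NAME.dat).
%.dat : books/%.txt
	python countwords.py $< $@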