What is the state of the art way to handle what makefiles do for python data analysis?

I have a program structured as a DAG that processes and cleans certain files, combines them, and then does additional calculations. I want a way to run the whole analysis pipeline and re-run it if anything changes, but without having to re-process every single component.

I read about Makefiles and thought they sounded like the perfect solution. I am also aware that make is probably outdated and that better alternatives probably exist, but I generally only find long lists of workflow scheduler tools that, as far as I can tell, are not quite suited to this purpose (e.g., Airflow, Luigi, Nextflow, Dagobah, etc.).

It seems like many of these are overkill, with schedulers, GUIs, etc. that I don't really need. I just want one file that does the following:

- makes it obvious which Python scripts need to run
- shows file dependencies, so that a full re-run only redoes the parts where something has changed upstream
- has the potential for some parallelization (not very necessary)
- doesn't have too much boilerplate

Makefile example:

.PHONY : dats
dats : isles.dat abyss.dat

isles.dat : books/isles.txt
	python countwords.py books/isles.txt isles.dat

abyss.dat : books/abyss.txt
	python countwords.py books/abyss.txt abyss.dat

.PHONY : clean
clean :
	rm -f *.dat

Is this the best procedure to run something like this in Python, or is there a better way?

asked Nov 08 '19 00:11

teepee

2 Answers

DVC (Data Version Control) includes a modern re-implementation and extension of make that is particularly suited to data-science pipelines.

Handling pipelines in DVC has important benefits over make in many scenarios, such as relying on file checksums rather than modification times. On the other hand, make is simpler in some respects and has a powerful macro mechanism. Still, some elements of makefile syntax are quite subtle (e.g., multiple outputs, intermediate files), and make generally doesn't support whitespace in filenames.
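To illustrate the checksum idea, here is a minimal sketch of content-based staleness detection using only the standard library. It is not DVC's implementation: the function names, the JSON state file, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def _load_state(state_file):
    """Load the recorded checksums (hypothetical JSON cache), or {} if absent."""
    state_path = Path(state_file)
    return json.loads(state_path.read_text()) if state_path.exists() else {}


def is_stale(deps, target, state_file):
    """Return True if target is missing or any dependency's content differs
    from the checksum recorded at the last successful run."""
    if not Path(target).exists():
        return True
    recorded = _load_state(state_file)
    return any(recorded.get(str(d)) != file_checksum(d) for d in deps)


def record_checksums(deps, state_file):
    """Record current dependency checksums; call after a successful run."""
    recorded = _load_state(state_file)
    recorded.update({str(d): file_checksum(d) for d in deps})
    Path(state_file).write_text(json.dumps(recorded))
```

With this approach, merely touching books/isles.txt without changing its contents would not trigger a re-run, whereas a timestamp comparison would.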

answered Nov 15 '22 19:11

amka66

Is this the best procedure to run something like this in python or is there a better way?

"Best" is surely in the eye of the beholder. However, if the make-based approach presented in the question is satisfactorily representative of the problem, then it is a good way to go. make implementations are very widely available, and their behavior is well understood and generally well-suited to problems such as the one presented.
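For reference, the core rule make applies (rebuild a target when it is missing or older than any of its prerequisites) can be sketched in a few lines of Python. The function name needs_rebuild is hypothetical, and this is only an approximation of make's behavior, not a replacement for it.

```python
import os


def needs_rebuild(target, deps):
    """Return True if target is missing or older than any dependency,
    mirroring make's modification-time comparison."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # A dependency modified after the target was written makes it out of date.
    return any(os.path.getmtime(d) > target_mtime for d in deps)
```

For the example in the question, needs_rebuild("isles.dat", ["books/isles.txt"]) would decide whether countwords.py has to run again. make gives you this check, plus dependency ordering and parallel execution via -j, without any extra code.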

There are other build tools that compete with make, some written in Python, and there are undoubtedly some more esoteric software frameworks that could be applied to the task. Nevertheless, if you want to focus on doing the work instead of on building the framework to do the work, then I don't see any reason to look past the make-based solution you already have.

answered Nov 15 '22 20:11

John Bollinger

Tags: python, python-3.x, makefile, scheduled-tasks
