
Processing a simple workflow in Python

I am working on code that takes a dataset and runs some algorithms on it.

A user uploads a dataset, selects which algorithms to run on it, and thereby creates a workflow like this:

workflow = {
    0: {'dataset': 'some dataset'},
    1: {'algorithm1': "parameters"},
    2: {'algorithm2': "parameters"},
    3: {'algorithm3': "parameters"},
}

This means I'll take workflow[0] as my dataset and run algorithm1 on it. Then I'll take its results and run algorithm2 on them as my new dataset, then take those results and run algorithm3 on them. It continues like this until the last item, and there is no limit on the length of the workflow.

I am writing this in Python. Can you suggest some strategies for processing this workflow?

Stephen T. asked Jan 24 '10 11:01


4 Answers

You want to run a pipeline on some dataset. That sounds like a reduce operation (fold in some languages). No need for anything complicated:

from functools import reduce
result = reduce(lambda data, step: algo_by_name(step[0])(step[1], data), workflow)

This assumes workflow looks like (text-oriented so you can load it with YAML/JSON):

workflow = ['data', ('algo0', {}), ('algo1', {'param': value}), … ]

And that your algorithms look like:

def algo0(p, data):
    …
    return output_data.filename

algo_by_name takes a name and gives you an algo function; for example:

def algo_by_name(name):
    return {'algo0': algo0, 'algo1': algo1, }[name]
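
Putting the pieces above together, a minimal runnable sketch in Python 3 might look like the following. The algorithms scale and offset, their parameter names, and the use of a plain list as the dataset are hypothetical stand-ins, not part of the original workflow. In Python 3, reduce lives in functools, and the lambda indexes the step tuple because tuple parameters are no longer allowed in function signatures.

```python
from functools import reduce

# Hypothetical algorithms: each takes (params, data) and returns new data.
def scale(p, data):
    return [x * p["factor"] for x in data]

def offset(p, data):
    return [x + p["amount"] for x in data]

ALGORITHMS = {"scale": scale, "offset": offset}

def algo_by_name(name):
    # Look up the algorithm function registered under this name.
    return ALGORITHMS[name]

def run_workflow(workflow):
    # workflow[0] is the initial dataset; every later item is (name, params).
    dataset, steps = workflow[0], workflow[1:]
    return reduce(
        lambda data, step: algo_by_name(step[0])(step[1], data),
        steps,
        dataset,
    )

workflow = [[1, 2, 3], ("scale", {"factor": 2}), ("offset", {"amount": 1})]
result = run_workflow(workflow)
print(result)  # [3, 5, 7]
```

Passing the dataset as the explicit initial value means a workflow with no algorithm steps simply returns the dataset unchanged.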

(old edit: if you want a framework for writing pipelines, you could use Ruffus. It's like a make tool, but with progress support and pretty flow charts.)

Tobu answered Sep 21 '22 20:09


If each algorithm works on each element of the dataset independently, map() would be an elegant option:

dataset = workflow[0]
for algorithm in workflow[1:]:
    dataset = list(map(algorithm, dataset))

For example, to square the odd numbers and replace the even ones with zero:

>>> algo1 = lambda x: 0 if x % 2 == 0 else x
>>> algo2 = lambda x: x * x
>>> dataset = range(10)
>>> workflow = (dataset, algo1, algo2)
>>> for algo in workflow[1:]:
...     dataset = list(map(algo, dataset))
...
>>> dataset
[0, 1, 0, 9, 0, 25, 0, 49, 0, 81]
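
Note that in Python 3 map() returns a lazy iterator rather than a list, so the steps can also be chained without building intermediate lists and materialised once at the end. A small sketch of that variant, where run_elementwise is a hypothetical helper name and the two lambdas are the ones from the example above:

```python
def run_elementwise(dataset, algorithms):
    # Chain lazy map() iterators; nothing is computed until list() is called.
    for algorithm in algorithms:
        dataset = map(algorithm, dataset)
    return list(dataset)

algo1 = lambda x: 0 if x % 2 == 0 else x  # zero out even numbers
algo2 = lambda x: x * x                   # square each element

result = run_elementwise(range(10), [algo1, algo2])
print(result)  # [0, 1, 0, 9, 0, 25, 0, 49, 0, 81]
```

This only fits algorithms that treat each element independently; a step that needs the whole dataset at once (sorting, normalising) would not work with map().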

Adam Matan answered Sep 23 '22 20:09


The way you want to do it seems sound to me; otherwise, you need to post more information about what you are trying to accomplish.

One piece of advice: I would put the workflow structure in a list of tuples rather than a dictionary, so that the order of the steps is explicit:

workflow = [ ('dataset', 'some dataset'),
             ('algorithm1', "parameters"),
             ('algorithm2', "parameters"),
             ('algorithm3', "parameters")]
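
With that layout, the first tuple can be split off as the dataset and the remaining tuples iterated in order. A minimal sketch, assuming a hypothetical REGISTRY dict that maps each algorithm name to a function taking (data, parameters); the two functions here are placeholders, not algorithms from the question:

```python
# Hypothetical placeholder algorithms taking (data, parameters).
def uppercase(data, parameters):
    return data.upper()

def append_suffix(data, parameters):
    return data + parameters

REGISTRY = {"uppercase": uppercase, "append_suffix": append_suffix}

workflow = [("dataset", "some dataset"),
            ("uppercase", None),
            ("append_suffix", "!")]

# The first tuple holds the dataset; the rest are (name, parameters) steps.
(_, data), steps = workflow[0], workflow[1:]
for name, parameters in steps:
    data = REGISTRY[name](data, parameters)

print(data)  # SOME DATASET!
```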

fabrizioM answered Sep 25 '22 20:09


Define a Dataset class that tracks the data for your set, and define the algorithms as methods of this class. Something like this:

class Dataset:
    # Some member fields here that define your data, and a constructor

    def algorithm1(self, param1, param2, param3):
        # Update member fields based on algorithm
        pass

    def algorithm2(self, param1, param2):
        # More updating/processing
        pass

Now, iterate over your "workflow" dict. For the first entry, simply instantiate your Dataset class.

myDataset = Dataset() # Whatever actual construction you need to do

For each subsequent entry...

  • Extract the key/value somehow (I'd recommend changing your workflow data structure if possible; a dict is inconvenient here).
  • Parse the param string to a tuple of arguments (this step is up to you).
  • Assuming you now have the string algorithm and the tuple params for the current iteration...

    getattr(myDataset, algorithm)(*params)

  • This will call the function on myDataset with the name specified by "algorithm" with the argument list contained in "params".
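
The steps above can be sketched end to end as follows. The Dataset fields, the two algorithm bodies, the parse_params helper, and the comma-separated parameter format are all assumptions made for illustration, since parameter parsing is left up to you. The workflow is written as a list of tuples rather than a dict, in line with the recommendation above.

```python
class Dataset:
    def __init__(self, values):
        self.values = list(values)

    def algorithm1(self, factor):
        # Hypothetical step: multiply every value by a factor.
        self.values = [v * factor for v in self.values]

    def algorithm2(self, low, high):
        # Hypothetical step: keep only values within [low, high].
        self.values = [v for v in self.values if low <= v <= high]

def parse_params(param_string):
    # Assumed format: comma-separated integers, e.g. "2" or "4,8".
    return tuple(int(part) for part in param_string.split(","))

workflow = [("dataset", [1, 2, 3, 4, 5]),
            ("algorithm1", "2"),
            ("algorithm2", "4,8")]

my_dataset = Dataset(workflow[0][1])
for algorithm, param_string in workflow[1:]:
    params = parse_params(param_string)
    # Look up the method by name and call it with the parsed arguments.
    getattr(my_dataset, algorithm)(*params)

print(my_dataset.values)  # [4, 6, 8]
```

If the algorithm names come from user input, it may be worth checking them against a list of allowed names before calling getattr, since getattr will happily return any attribute of the object.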

Sapph answered Sep 24 '22 20:09