I am working on code that takes a dataset and runs some algorithms on it.
The user uploads a dataset, selects which algorithms to run on it, and creates a workflow like this:
workflow = {
    0: {'dataset': 'some dataset'},
    1: {'algorithm1': "parameters"},
    2: {'algorithm2': "parameters"},
    3: {'algorithm3': "parameters"},
}
This means I'll take workflow[0] as my dataset and run algorithm1 on it. Then I will take its results and run algorithm2 on them as my new dataset, then take the new results and run algorithm3 on them. It goes on like this until the last item, and there is no length limit for the workflow.
I am writing this in Python. Can you suggest some strategies for processing this workflow?
You want to run a pipeline on some dataset. That sounds like a reduce operation (fold in some languages). No need for anything complicated:
from functools import reduce
result = reduce(lambda data, step: algo_by_name(step[0])(step[1], data), workflow)
This assumes workflow looks like (text-oriented so you can load it with YAML/JSON):
workflow = ['data', ('algo0', {}), ('algo1', {'param': value}), … ]
And that your algorithms look like:
def algo0(p, data):
    …
    return output_data.filename
algo_by_name takes a name and gives you an algo function; for example:
def algo_by_name(name):
    return {'algo0': algo0, 'algo1': algo1, }[name]
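To make the reduce idea concrete, here is a minimal end-to-end sketch for Python 3. The algorithms scale and offset, the ALGORITHMS registry, and run_workflow are hypothetical names made up for illustration, and the data is kept as an in-memory list rather than a filename; the real algorithms would take their place.

```python
from functools import reduce

# Hypothetical algorithms: each takes a parameter dict and the current data,
# and returns the data that the next step will consume.
def scale(p, data):
    return [x * p.get('factor', 1) for x in data]

def offset(p, data):
    return [x + p.get('amount', 0) for x in data]

# Hypothetical registry mapping workflow names to functions.
ALGORITHMS = {'scale': scale, 'offset': offset}

def algo_by_name(name):
    # Raises KeyError if the workflow names an unknown algorithm.
    return ALGORITHMS[name]

def run_workflow(workflow):
    # workflow[0] is the initial dataset; the rest are (name, params) pairs.
    dataset, steps = workflow[0], workflow[1:]
    return reduce(
        lambda data, step: algo_by_name(step[0])(step[1], data),
        steps,
        dataset,
    )

workflow = [[1, 2, 3], ('scale', {'factor': 10}), ('offset', {'amount': 1})]
result = run_workflow(workflow)
print(result)
```

Passing the dataset as the explicit initial value of reduce keeps the first element from being mistaken for a step, and an explicit registry means only known names can be run.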
(old edit: if you want a framework for writing pipelines, you could use Ruffus. It's like a make tool, but with progress support and pretty flow charts.)
If each algorithm works on each element of the dataset, map() would be an elegant option:
dataset = workflow[0]
for algorithm in workflow[1:]:
    dataset = list(map(algorithm, dataset))
For example, to square the odd numbers and replace the even ones with 0:
>>> algo1 = lambda x: 0 if x % 2 == 0 else x
>>> algo2 = lambda x: x * x
>>> dataset = range(10)
>>> workflow = (dataset, algo1, algo2)
>>> for algo in workflow[1:]:
...     dataset = list(map(algo, dataset))
...
>>> dataset
[0, 1, 0, 9, 0, 25, 0, 49, 0, 81]
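If the per-element algorithms also take parameters, as in the question's workflow, one option is to bind them with functools.partial before mapping. This is only a sketch: clip and power are hypothetical algorithms made up for illustration.

```python
from functools import partial

# Hypothetical per-element algorithms that take extra parameters.
def clip(x, low, high):
    return max(low, min(high, x))

def power(x, exponent):
    return x ** exponent

# Bind the parameters up front so every step is a one-argument callable.
workflow = (
    range(10),
    partial(clip, low=2, high=5),
    partial(power, exponent=2),
)

dataset = workflow[0]
for algo in workflow[1:]:
    # In Python 3, map() is lazy: nothing is computed until it is consumed.
    dataset = map(algo, dataset)

result = list(dataset)  # consuming the iterator runs the whole chain
print(result)
```

Because map() is lazy in Python 3, the chained steps process one element at a time, which avoids building an intermediate list per step.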
The way you want to do it seems sound to me; if it is not enough, you will need to post more information about what you are trying to accomplish.
One piece of advice: I would put the workflow structure in a list of tuples rather than a dictionary:
workflow = [ ('dataset', 'some dataset'),
('algorithm1', "parameters"),
('algorithm2', "parameters"),
('algorithm3', "parameters")]
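With a list of tuples, the order is explicit and a plain loop is enough to process it. The following is a sketch only: make_step and registry are hypothetical helpers, and the stand-in algorithms just record that they ran so the order of execution is visible.

```python
# Hypothetical factory producing stand-in algorithms; real ones would
# transform the data instead of appending a trace to a string.
def make_step(name):
    def step(data, params):
        return data + ' -> ' + name + '(' + params + ')'
    return step

# Hypothetical registry mapping names in the workflow to callables.
registry = {name: make_step(name)
            for name in ('algorithm1', 'algorithm2', 'algorithm3')}

workflow = [('dataset', 'some dataset'),
            ('algorithm1', "parameters"),
            ('algorithm2', "parameters"),
            ('algorithm3', "parameters")]

# The first entry is the dataset; every later entry is (name, params).
(_, dataset), *steps = workflow
for name, params in steps:
    dataset = registry[name](dataset, params)

print(dataset)
```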
Define a Dataset class that tracks the data for your set, and define the algorithms as methods of this class. Something like this:
class Dataset:
    # Some member fields here that define your data, and a constructor
    def algorithm1(self, param1, param2, param3):
        pass  # Update member fields based on algorithm
    def algorithm2(self, param1, param2):
        pass  # More updating/processing
Now, iterate over your "workflow" dict. For the first entry, simply instantiate your Dataset class.
myDataset = Dataset() # Whatever actual construction you need to do
For each subsequent entry, extract the algorithm name and its parameters (a dict is inconvenient here). Assuming you now have the string algorithm and the tuple params for the current iteration...
getattr(myDataset, algorithm)(*params)
This will call the method of myDataset whose name is given by "algorithm", passing the argument list contained in "params".
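Putting the pieces together, a minimal sketch of this approach could look like the following. The methods scale and clip, the values field, and the steps list are hypothetical stand-ins for the real algorithms and workflow. Since the method names come from user input, the sketch checks them against an explicit set before calling getattr.

```python
class Dataset:
    # Hypothetical dataset holding a list of numbers.
    def __init__(self, values):
        self.values = list(values)

    def scale(self, factor):
        # Multiply every value by factor, updating the dataset in place.
        self.values = [v * factor for v in self.values]

    def clip(self, low, high):
        # Limit every value to the range [low, high].
        self.values = [max(low, min(high, v)) for v in self.values]

# Only these method names may be requested by a workflow.
ALLOWED = {'scale', 'clip'}

def apply_step(dataset, algorithm, params):
    if algorithm not in ALLOWED:
        raise ValueError('unknown algorithm: ' + algorithm)
    # Look the method up by name and call it with the unpacked params.
    getattr(dataset, algorithm)(*params)

# Hypothetical workflow: (method name, tuple of arguments) per step.
steps = [('scale', (3,)), ('clip', (0, 10))]

my_dataset = Dataset([1, 2, 5])
for algorithm, params in steps:
    apply_step(my_dataset, algorithm, params)

print(my_dataset.values)
```

Without the ALLOWED check, getattr would happily resolve any attribute of the object, including __init__ or fields that are not algorithms at all.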