
Dataflow computing in Python

I have n processes (typically n < 10, but it should scale) running on different machines and communicating over AMQP via RabbitMQ. The processes are typically long-running and may be implemented in any language (though most are Java or Python).

Each process requires a number of inputs (numbers or strings) and produces a number of outputs (also just numbers or strings). A process is executed asynchronously: a message is sent on its input queue, and a callback is triggered when the result arrives on its output queue.

Ideally the user specifies some inputs and desired outputs and the system should:

  • detect which processes are needed and generate the dependency graph
  • topologically sort the graph and execute it; node transitions will need to be event-driven
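
To make the first step concrete, here is a minimal sketch of how the needed processes could be detected by working backwards from the desired outputs. It is not taken from any of the libraries mentioned. The registry `PROCESSES`, the process and value names, and the helper `build_graph` are all hypothetical, and it assumes each value is produced by exactly one process. The resulting dictionary is in the form expected by the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical registry: process name -> (required inputs, produced outputs).
PROCESSES = {
    "mesh": ({"geometry"}, {"grid"}),
    "solve": ({"grid", "viscosity"}, {"pressure"}),
    "report": ({"pressure"}, {"summary"}),
    "unused": ({"grid"}, {"preview"}),
}


def build_graph(processes, available, wanted):
    """Return {process: set of processes it depends on} needed to produce `wanted`.

    `available` holds the values the user supplies directly.
    Assumes every value has at most one producing process.
    """
    # Index each output value by the process that produces it.
    producer = {}
    for name, (_, outputs) in processes.items():
        for value in outputs:
            producer[value] = name

    def find_producer(value):
        if value not in producer:
            raise ValueError(f"no process produces {value!r}")
        return producer[value]

    graph = {}
    pending = [v for v in wanted if v not in available]
    while pending:
        name = find_producer(pending.pop())
        if name in graph:
            continue  # already scheduled via another output
        inputs, _ = processes[name]
        missing = [v for v in inputs if v not in available]
        graph[name] = {find_producer(v) for v in missing}
        pending.extend(missing)
    return graph


graph = build_graph(PROCESSES, {"geometry", "viscosity"}, {"summary"})
order = list(TopologicalSorter(graph).static_order())
```

With the example registry, only `mesh`, `solve` and `report` end up in the graph; `unused` is skipped because nothing requested depends on it.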

A node should fire as soon as all of its inputs are ready, which allows independent branches to run in parallel. I can assume no cycles for now, but eventually there will be cycles (e.g., two processes may need to iterate until the output no longer changes).
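
The firing rule for the acyclic case can be sketched with the standard library alone. The sketch below uses `graphlib.TopologicalSorter` in its incremental mode (`prepare`, `get_ready`, `done`), with a thread pool standing in for the remote processes. In the real system the "node finished" event would presumably be the callback from the RabbitMQ output queue rather than a completed future. `run_graph` and `run_node` are hypothetical names, and cycles are not handled: `prepare()` raises `graphlib.CycleError` if one is present.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_graph(graph, run_node, max_workers=4):
    """Fire each node as soon as all of its predecessors have finished.

    `graph` maps each node to the set of nodes it depends on.
    Returns the nodes in the order in which they completed.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    running = {}  # future -> node currently executing
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Start every node whose inputs are now all available.
            for node in sorter.get_ready():
                running[pool.submit(run_node, node)] = node
            # Block until at least one running node reports completion.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                node = running.pop(future)
                future.result()  # re-raise any error from the node
                finished.append(node)
                sorter.done(node)  # may unlock successors
    return finished


# Diamond-shaped example: b and c both need a; d needs b and c.
diamond = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
completed = run_graph(diamond, lambda node: node)
```

Here `b` and `c` are submitted together once `a` has finished, and `d` only fires after both are done. Supporting cycles would need a different loop, for example re-running the nodes in a cycle until their outputs stop changing.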

This should be a known problem from (data)flow programming (discussed here before), and I want to avoid reinventing the wheel. I would prefer a Python solution, and a search leads to Trellis and Pypes. Trellis is no longer developed but seems to support cycles, while Pypes does not. I am also not sure how actively Pypes is developed.

Further searches reveal a whole list of event-based programming frameworks, none of which I am particularly knowledgeable about. There are of course workflow environments like Taverna and KNIME, but those seem like overkill.

Does anybody have any experience tackling this type of problem or with the libraries mentioned?

Edit: Other libraries I found are:

  • Stream
  • zflow
  • pyf
  • javafbp (Java)

asked Mar 28 '11 16:03 by dgorissen


2 Answers

python.org has a Wiki page on "Flow Based Programming" -- http://wiki.python.org/moin/FlowBasedProgramming

answered Sep 19 '22 17:09 by David Stolarsky


The bottom line is that if you can reinvent the wheel in a small number of lines of code (a few hundred) that you completely understand and can document, then do it.

This is an area where the abstractions are not that hard to implement, given some basic foundation tools. RabbitMQ is one such tool; Node.js is another. There are lots of libraries around that implement useful ways to manage dataflows, workflows, finite state machines, etc., but they overlap a lot and tend to be incomplete. Probably the original developer built just enough to get past their initial problem, and since this type of programming was not that popular, there was never the critical mass to keep development going.

There is a lot to be said for ranking all the possible solutions by popularity, picking the most popular one, and putting your effort into making it work (while sharing your work, of course).

answered Sep 17 '22 17:09 by Michael Dillon