 

Dataflow processing

I have a class of computations that seems to naturally take a graph structure. The graph is far from linear: there are multiple inputs, nodes that fan out, and nodes that require the results of several other nodes. These computations may also have several sinks, but no cycles are ever present. Input nodes are updated (not necessarily one at a time) and their values flow through the (at this point purely conceptual) graph. Nodes retain state as inputs change, and the computations have to run sequentially with respect to the inputs.

As I have to write such computations quite frequently and am reluctant to write ad-hoc code each time, I have tried writing a small library that makes it easy to piece such computations together by writing classes for the various vertices. My code, however, is fairly inelegant and takes no advantage of the parallel structure of these computations. While each vertex is usually lightweight, computations can get quite complex and "wide". To make the problem even more complicated, the inputs for these computations are updated very frequently in a loop. Luckily the problems are of small enough scale that I can handle them on a single node.
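To make that concrete, this is the kind of vertex abstraction I mean; the sketch is purely hypothetical and the names are illustrative, not my actual code:

```java
import java.util.List;

// Minimal sketch of a dataflow vertex: each node knows its upstream
// dependencies, keeps its last computed value as node-local state, and
// recomputes whenever an input changes.
interface Node<T> {
    List<Node<?>> inputs(); // upstream dependencies (empty for input nodes)
    T value();              // last computed value (retained state)
    void recompute();       // re-evaluate from the current input values
}
```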

Has anyone ever dealt with anything similar? What ideas/approaches/tools would you recommend?

asked Nov 18 '15 by em70

1 Answer

Apache Storm: reliable, realtime stream processing on distributed hardware

This sounds like a problem for which Apache Storm (Open Source) would be perfect: http://storm.apache.org/

Apache Storm is about real-time streaming computation: it processes single tuples (data points) one at a time. Storm guarantees that every tuple is processed at least once. With Storm Trident, a higher-level abstraction on top of Storm, you can get exactly-once semantics.
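To make the at-least-once guarantee concrete, here is a minimal bolt sketch (assuming a recent Storm 2.x release with the org.apache.storm packages; the "value" field and the doubling logic are just illustrations). Emitting with the input tuple as anchor ties the output into Storm's tracking tree; ack() marks the tuple as processed, while fail() or a timeout triggers a replay from the spout:

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class DoubleBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        long value = input.getLongByField("value");
        collector.emit(input, new Values(2 * value)); // anchored emit
        collector.ack(input);                         // mark tuple as processed
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("value"));
    }
}
```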

Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.

My company and I have been working with Apache Storm for several years, and it is one of the most mature Big Data technologies (by which I mean technology that runs in a horizontally distributed manner on cheap commodity hardware).

API and documentation

The main API is for Java, but there are adapters for Ruby, Python, JavaScript, and Perl. In fact, you can use any language that speaks Storm's multilang protocol: http://storm.apache.org/about/multi-language.html
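As a concrete example of the multi-language support: on the Java side you wrap the external script in a ShellBolt, roughly the way the storm-starter examples do (the script name here is a placeholder for your own, and the script has to be shipped in the topology's resources):

```java
import java.util.Map;
import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

// Bolt whose processing logic lives in an external script that speaks
// Storm's multilang protocol over stdin/stdout.
public class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        super("python", "splitsentence.py"); // placeholder script
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```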

The documentation is good (though the JavaDoc could use more detail): http://storm.apache.org/documentation.html

Basic idea: Spouts and Bolts (= graph nodes)

[Diagram: Spouts feeding Bolts in an Apache Storm topology]

Storm reads data into your so-called topology through Spouts. The topology is exactly the graph you described. When new tuples arrive at the spouts, they are sent through the topology, and each inner node of the graph is a Storm Bolt.
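Putting it together, wiring a fan-out/fan-in graph like the one you described looks roughly like this (again assuming Storm 2.x; the spout and bolt classes are placeholders for your own nodes, and LocalCluster runs the topology in-process for development, while StormSubmitter deploys it to a cluster):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class GraphTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // One input spout fanning out to two bolts, joined by a third.
        builder.setSpout("input", new InputSpout());
        builder.setBolt("left", new LeftBolt()).shuffleGrouping("input");
        builder.setBolt("right", new RightBolt()).shuffleGrouping("input");
        builder.setBolt("join", new JoinBolt())
               .shuffleGrouping("left")
               .shuffleGrouping("right");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("graph", new Config(), builder.createTopology());
            Thread.sleep(10_000); // let it run for a bit, then shut down
        }
    }
}
```

The grouping calls control how tuples are partitioned between the parallel instances of each bolt. Since your nodes retain state, fieldsGrouping on the relevant key is usually the right choice: it guarantees that all tuples with the same key reach the same bolt instance.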

Use Cases

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

answered Sep 22 '22 by Make42