 

Is there a distributed data processing pipeline framework, or a good way to organize one?

I am designing an application that requires a distributed set of processing workers that need to asynchronously consume and produce data in a specific flow. For example:

  • Component A fetches pages.
  • Component B analyzes pages from A.
  • Component C stores analyzed bits and pieces from B.

There are obviously more than just three components involved.
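For concreteness, the shape of this flow is trivial to prototype inside a single Go process using channels (the Page/Analysis types and stage functions below are invented for the example); the hard part is preserving that shape once each stage is a separate process:

```go
package main

import (
	"fmt"
	"strings"
)

// Page and Analysis are hypothetical payload types for this example.
type Page struct {
	URL, Body string
}

type Analysis struct {
	URL    string
	Tokens []string
}

// fetch plays the role of component A: it produces pages.
func fetch(urls []string) <-chan Page {
	out := make(chan Page)
	go func() {
		defer close(out)
		for _, u := range urls {
			out <- Page{URL: u, Body: "stub body for " + u}
		}
	}()
	return out
}

// analyze plays the role of component B: it consumes pages from A
// and produces analyzed pieces.
func analyze(pages <-chan Page) <-chan Analysis {
	out := make(chan Analysis)
	go func() {
		defer close(out)
		for p := range pages {
			out <- Analysis{URL: p.URL, Tokens: strings.Fields(p.Body)}
		}
	}()
	return out
}

func main() {
	// Component C: store (here, just print) what B produced.
	for a := range analyze(fetch([]string{"http://example.com/a", "http://example.com/b"})) {
		fmt.Println(a.URL, a.Tokens)
	}
}
```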

Further requirements:

  • Each component needs to be a separate process (or set of processes).
  • Producers don't know anything about their consumers. In other words, component A just produces data, not knowing which components consume that data.

This is a kind of data flow solved by topology-oriented systems like Storm. While Storm looks good, I'm skeptical; it's a Java system and it's based on Thrift, neither of which I am a fan of.

I am currently leaning towards a pub/sub-style approach which uses AMQP as the data transport, with HTTP as the protocol for data sharing/storage. This means the AMQP queue model becomes a public API — in other words, a consumer needs to know which AMQP host and queue the producer uses — which I'm not particularly happy about, but it might be worth the compromise.
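As a rough sketch of what that looks like with the github.com/streadway/amqp client (the exchange name and message format are made up for the example), component A publishes to a fanout exchange and never learns who consumes:

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// A fanout exchange decouples producer from consumers: component A
	// publishes here; B (and anyone else) binds its own queue to it.
	if err := ch.ExchangeDeclare("pages", "fanout", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	err = ch.Publish("pages", "", false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        []byte(`{"url": "http://example.com/a"}`),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Consumers would each declare their own queue and bind it to the "pages" exchange, which keeps the producer ignorant of them, at the cost of the exchange name itself becoming the public contract.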

Another issue with the AMQP approach is that each component will have to have very similar logic for:

  • Connecting to the queue
  • Handling connection errors
  • Serializing/deserializing data into a common format
  • Running the actual workers (goroutines or forking subprocesses)
  • Dynamic scaling of workers
  • Fault tolerance
  • Node registration
  • Processing metrics
  • Queue throttling
  • Queue prioritization (some workers are less important than others)

…and many other little details that each component will need.

Even if a consumer is logically very simple (think MapReduce jobs, something like splitting text into tokens), there is a lot of boilerplate. Certainly I can do all this myself — I am very familiar with AMQP and queues and everything else — and wrap all this up in a common package shared by all the components, but then I am already on my way to inventing a framework.
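A sketch of what that shared package might reduce to, with all names invented here and most of the concerns listed above stubbed out, is a handler type plus a reconnect-and-ack loop:

```go
package worker

import (
	"log"
	"time"

	"github.com/streadway/amqp"
)

// Handler is the only thing a component author writes: consume one
// message body, optionally emit results through the publish callback.
type Handler func(body []byte, publish func(routingKey string, body []byte) error) error

// Run hides the connect/reconnect/ack boilerplate behind one call.
func Run(url, queue string, h Handler) {
	for { // reconnect loop: a dead connection just starts over
		conn, err := amqp.Dial(url)
		if err != nil {
			log.Println("dial:", err)
			time.Sleep(2 * time.Second)
			continue
		}
		ch, err := conn.Channel()
		if err != nil {
			conn.Close()
			continue
		}
		msgs, err := ch.Consume(queue, "", false, false, false, false, nil)
		if err != nil {
			conn.Close()
			continue
		}
		pub := func(key string, body []byte) error {
			return ch.Publish("", key, false, false, amqp.Publishing{Body: body})
		}
		for m := range msgs {
			if err := h(m.Body, pub); err != nil {
				m.Nack(false, true) // requeue on failure
			} else {
				m.Ack(false)
			}
		}
		conn.Close() // consume channel closed; retry from the top
	}
}
```

Everything else on the list (scaling, metrics, throttling, prioritization) would accrete onto Run, which is exactly how such a package drifts into being a framework.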

Does a good framework exist for this kind of stuff?

Note that I am asking specifically about Go. I want to avoid Hadoop and the whole Java stack.

Edit: Added some points for clarity.

Asked Mar 16 '13 by Alexander Staubo




1 Answer

Because Go has CSP channels, it provides a special opportunity to implement a framework for parallelism that is simple, concise, and yet completely general. It should be possible to do rather better than most existing frameworks with rather less code. Java and the JVM have nothing like this.

It requires just the implementation of channels using configurable TCP transports. This would consist of:

  • a writing channel-end API, including some general specification of the intended server for the reading end
  • a reading channel-end API, including listening port configuration and support for select
  • marshalling/unmarshalling glue to transfer data - probably encoding/gob

An acceptance test for such a framework would be that a program using channels can be divided across multiple processors and yet retain the same functional behaviour (even if the performance differs).
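A minimal, untested sketch of the two channel ends, using encoding/gob over plain TCP (the Msg type and function names are invented for illustration):

```go
package main

import (
	"encoding/gob"
	"fmt"
	"log"
	"net"
)

// Msg is a stand-in for whatever the pipeline actually carries.
type Msg struct {
	Key  string
	Data []byte
}

// writeEnd is the writing channel end: it drains a local channel and
// gob-encodes each value onto a TCP connection to the reading end.
func writeEnd(addr string, in <-chan Msg) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	enc := gob.NewEncoder(conn)
	for m := range in {
		if err := enc.Encode(m); err != nil {
			return err
		}
	}
	return nil
}

// readEnd is the reading channel end: it accepts one peer and delivers
// decoded values on an ordinary local channel.
func readEnd(l net.Listener, out chan<- Msg) {
	conn, err := l.Accept()
	if err != nil {
		close(out)
		return
	}
	defer conn.Close()
	dec := gob.NewDecoder(conn)
	for {
		var m Msg
		if err := dec.Decode(&m); err != nil {
			close(out) // EOF or network error ends the stream
			return
		}
		out <- m
	}
}

func main() {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	out := make(chan Msg)
	go readEnd(l, out)

	in := make(chan Msg, 1)
	in <- Msg{Key: "page", Data: []byte("hello")}
	close(in)
	go writeEnd(l.Addr().String(), in)

	for m := range out {
		fmt.Println(m.Key, string(m.Data))
	}
}
```

Because readEnd delivers on an ordinary Go channel, the receiving side keeps full select support for free.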

There are quite a few existing transport-layer networking projects in Go. Notable are the ZeroMQ (0MQ) bindings (gozmq, zmq2, zmq3).
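For instance, a PUSH/PULL pair over 0MQ might look like this (a hedged sketch assuming the gozmq binding's API; both ends are shown in one process only for brevity):

```go
package main

import (
	"fmt"
	"log"

	zmq "github.com/alecthomas/gozmq"
)

func main() {
	ctx, err := zmq.NewContext()
	if err != nil {
		log.Fatal(err)
	}
	defer ctx.Close()

	// PULL end: a downstream worker binds and waits for work.
	pull, err := ctx.NewSocket(zmq.PULL)
	if err != nil {
		log.Fatal(err)
	}
	defer pull.Close()
	if err := pull.Bind("tcp://127.0.0.1:5555"); err != nil {
		log.Fatal(err)
	}

	// PUSH end: an upstream producer connects and sends without knowing
	// anything about who pulls on the other side.
	push, err := ctx.NewSocket(zmq.PUSH)
	if err != nil {
		log.Fatal(err)
	}
	defer push.Close()
	if err := push.Connect("tcp://127.0.0.1:5555"); err != nil {
		log.Fatal(err)
	}

	if err := push.Send([]byte("page: http://example.com/a"), 0); err != nil {
		log.Fatal(err)
	}
	msg, err := pull.Recv(0)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(msg))
}
```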

Answered Sep 20 '22 by Rick-777