 

Is there a distributed data processing pipeline framework, or a good way to organize one?

I am designing an application that requires a distributed set of processing workers that need to asynchronously consume and produce data in a specific flow. For example:

  • Component A fetches pages.
  • Component B analyzes pages from A.
  • Component C stores analyzed bits and pieces from B.

There are obviously more than just three components involved.
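For concreteness, the shape of this flow is trivial to prototype inside a single Go process using channels (the Page/Analysis types and stage functions below are invented for the example); the hard part is preserving that shape once each stage is a separate process:

```go
package main

import (
	"fmt"
	"strings"
)

// Page and Analysis are hypothetical payload types for this example.
type Page struct {
	URL, Body string
}

type Analysis struct {
	URL    string
	Tokens []string
}

// fetch plays the role of component A: it produces pages.
func fetch(urls []string) <-chan Page {
	out := make(chan Page)
	go func() {
		defer close(out)
		for _, u := range urls {
			out <- Page{URL: u, Body: "stub body for " + u}
		}
	}()
	return out
}

// analyze plays the role of component B: it consumes pages from A
// and produces analyzed pieces.
func analyze(pages <-chan Page) <-chan Analysis {
	out := make(chan Analysis)
	go func() {
		defer close(out)
		for p := range pages {
			out <- Analysis{URL: p.URL, Tokens: strings.Fields(p.Body)}
		}
	}()
	return out
}

func main() {
	// Component C: store (here, just print) what B produced.
	for a := range analyze(fetch([]string{"http://example.com/a", "http://example.com/b"})) {
		fmt.Println(a.URL, a.Tokens)
	}
}
```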

Further requirements:

  • Each component needs to be a separate process (or set of processes).
  • Producers don't know anything about their consumers. In other words, component A just produces data, not knowing which components consume that data.

This is a kind of data flow solved by topology-oriented systems like Storm. While Storm looks good, I'm skeptical; it's a Java system and it's based on Thrift, neither of which I am a fan of.

I am currently leaning towards a pub/sub-style approach which uses AMQP as the data transport, with HTTP as the protocol for data sharing/storage. This means the AMQP queue model becomes a public API — in other words, a consumer needs to know which AMQP host and queue the producer uses — which I'm not particularly happy about, but it might be worth the compromise.
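As a rough sketch of what that looks like with the github.com/streadway/amqp client (the exchange name and message format are made up for the example), component A publishes to a fanout exchange and never learns who consumes:

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// A fanout exchange decouples producer from consumers: component A
	// publishes here; B (and anyone else) binds its own queue to it.
	if err := ch.ExchangeDeclare("pages", "fanout", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	err = ch.Publish("pages", "", false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        []byte(`{"url": "http://example.com/a"}`),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Consumers would each declare their own queue and bind it to the "pages" exchange, which keeps the producer ignorant of them, at the cost of the exchange name itself becoming the public contract.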

Another issue with the AMQP approach is that each component will have to have very similar logic for:

  • Connecting to the queue
  • Handling connection errors
  • Serializing/deserializing data into a common format
  • Running the actual workers (goroutines or forking subprocesses)
  • Dynamic scaling of workers
  • Fault tolerance
  • Node registration
  • Processing metrics
  • Queue throttling
  • Queue prioritization (some workers are less important than others)

…and many other little details that each component will need.

Even if a consumer is logically very simple (think MapReduce jobs, something like splitting text into tokens), there is a lot of boilerplate. Certainly I can do all this myself — I am very familiar with AMQP and queues and everything else — and wrap all this up in a common package shared by all the components, but then I am already on my way to inventing a framework.
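A sketch of what that shared package might reduce to, with all names invented here and most of the concerns listed above stubbed out, is a handler type plus a reconnect-and-ack loop:

```go
package worker

import (
	"log"
	"time"

	"github.com/streadway/amqp"
)

// Handler is the only thing a component author writes: consume one
// message body, optionally emit results through the publish callback.
type Handler func(body []byte, publish func(routingKey string, body []byte) error) error

// Run hides the connect/reconnect/ack boilerplate behind one call.
func Run(url, queue string, h Handler) {
	for { // reconnect loop: a dead connection just starts over
		conn, err := amqp.Dial(url)
		if err != nil {
			log.Println("dial:", err)
			time.Sleep(2 * time.Second)
			continue
		}
		ch, err := conn.Channel()
		if err != nil {
			conn.Close()
			continue
		}
		msgs, err := ch.Consume(queue, "", false, false, false, false, nil)
		if err != nil {
			conn.Close()
			continue
		}
		pub := func(key string, body []byte) error {
			return ch.Publish("", key, false, false, amqp.Publishing{Body: body})
		}
		for m := range msgs {
			if err := h(m.Body, pub); err != nil {
				m.Nack(false, true) // requeue on failure
			} else {
				m.Ack(false)
			}
		}
		conn.Close() // consume channel closed; retry from the top
	}
}
```

Everything else on the list (scaling, metrics, throttling, prioritization) would accrete onto Run, which is exactly how such a package drifts into being a framework.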

Does a good framework exist for this kind of stuff?

Note that I am asking specifically about Go. I want to avoid Hadoop and the whole Java stack.

Edit: Added some points for clarity.

Asked Mar 16 '13 by Alexander Staubo




1 Answer

Because Go has CSP channels, it provides a special opportunity to implement a framework for parallelism that is simple, concise, and yet completely general. It should be possible to do rather better than most existing frameworks with rather less code. Java and the JVM have nothing like this.

It requires just the implementation of channels using configurable TCP transports. This would consist of:

  • a writing channel-end API, including some general specification of the intended server for the reading end
  • a reading channel-end API, including listening port configuration and support for select
  • marshalling/unmarshalling glue to transfer data - probably encoding/gob

An acceptance test for such a framework would be that a program using channels can be divided across multiple processors and yet retain the same functional behaviour (even if the performance differs).
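A minimal, untested sketch of the two channel ends, using encoding/gob over plain TCP (the Msg type and function names are invented for illustration):

```go
package main

import (
	"encoding/gob"
	"fmt"
	"log"
	"net"
)

// Msg is a stand-in for whatever the pipeline actually carries.
type Msg struct {
	Key  string
	Data []byte
}

// writeEnd is the writing channel end: it drains a local channel and
// gob-encodes each value onto a TCP connection to the reading end.
func writeEnd(addr string, in <-chan Msg) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	enc := gob.NewEncoder(conn)
	for m := range in {
		if err := enc.Encode(m); err != nil {
			return err
		}
	}
	return nil
}

// readEnd is the reading channel end: it accepts one peer and delivers
// decoded values on an ordinary local channel.
func readEnd(l net.Listener, out chan<- Msg) {
	conn, err := l.Accept()
	if err != nil {
		close(out)
		return
	}
	defer conn.Close()
	dec := gob.NewDecoder(conn)
	for {
		var m Msg
		if err := dec.Decode(&m); err != nil {
			close(out) // EOF or network error ends the stream
			return
		}
		out <- m
	}
}

func main() {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	out := make(chan Msg)
	go readEnd(l, out)

	in := make(chan Msg, 1)
	in <- Msg{Key: "page", Data: []byte("hello")}
	close(in)
	go writeEnd(l.Addr().String(), in)

	for m := range out {
		fmt.Println(m.Key, string(m.Data))
	}
}
```

Because readEnd delivers on an ordinary Go channel, the receiving side keeps full select support for free.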

There are quite a few existing transport-layer networking projects in Go. Notable are the ZeroMQ (0MQ) bindings (gozmq, zmq2, zmq3).
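For instance, a PUSH/PULL pair over 0MQ might look like this (a hedged sketch assuming the gozmq binding's API; both ends are shown in one process only for brevity):

```go
package main

import (
	"fmt"
	"log"

	zmq "github.com/alecthomas/gozmq"
)

func main() {
	ctx, err := zmq.NewContext()
	if err != nil {
		log.Fatal(err)
	}
	defer ctx.Close()

	// PULL end: a downstream worker binds and waits for work.
	pull, err := ctx.NewSocket(zmq.PULL)
	if err != nil {
		log.Fatal(err)
	}
	defer pull.Close()
	if err := pull.Bind("tcp://127.0.0.1:5555"); err != nil {
		log.Fatal(err)
	}

	// PUSH end: an upstream producer connects and sends without knowing
	// anything about who pulls on the other side.
	push, err := ctx.NewSocket(zmq.PUSH)
	if err != nil {
		log.Fatal(err)
	}
	defer push.Close()
	if err := push.Connect("tcp://127.0.0.1:5555"); err != nil {
		log.Fatal(err)
	}

	if err := push.Send([]byte("page: http://example.com/a"), 0); err != nil {
		log.Fatal(err)
	}
	msg, err := pull.Recv(0)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(msg))
}
```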

Answered Sep 20 '22 by Rick-777