Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why should I not loop or block in Spout.nextTuple()

Tags:

apache-storm

I saw many code snippets in which a loop was used inside Spout.nextTuple() (for example to read a whole file and emit a tuple for each line):

public void nextTuple() {
    // do other stuff here

    // reader might be BufferedReader that is initialized in open()
    String str;
    while((str = reader.readLine()) != null) {
        _collector.emit(new Values(str));
    }

    // do some more stuff here
}

This code seems to be straight forward, however, I was told that one should not loop inside nextTuple(). The question is why?

like image 702
Matthias J. Sax Avatar asked Sep 13 '15 08:09

Matthias J. Sax


People also ask

Why do we need Apache Storm?

Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use!

What is bolt in Apache Storm?

Bolt represents a node in the topology having the smallest processing logic and the output of a bolt can be emitted into another bolt as input. Storm keeps the topology always running, until you kill the topology. Apache Storm's main job is to run the topology and will run any number of topology at a given time.

How Apache Storm works?

Apache Storm – released by Twitter, is a distributed open-source framework that helps in the real-time processing of data. Apache Storm works for real-time data just as Hadoop works for batch processing of data (Batch processing is the opposite of real-time.

What is Storm UI?

The Storm UI is a web interface used to manage the state of our cluster. We'll get to this later. Our topology is submitted to the Nimbus daemon on the primary node, and then distributed among the worker processes running on the replica/supervisor nodes.


1 Answers

When a Spout is executed it runs in a single thread. This thread loops "forever" and has multiple duties:

  1. call Spout.nextTuple()
  2. retrieve "acks" and process them
  3. retrieve "fails" and process them
  4. time-out tuples

For this to happen, it is essential, that you do not stay "forever" (ie, loop or block) in nextTuple() but return after emitting a tuple to the system (or just return if no tuple can be emitted, but do not block). Otherwise, the Spout cannot does its work properly. nextTuple() will be called in a loop by Storm. Thus, after ack/fail messages are processed etc. the next call to nextTuple() happens quickly.

Therefore, it is also considered bad practice to emit multiple tuples in a single call to nextTuple(). As long as the code stays in nextTuple(), the spout thread cannot (for example) react on incoming acks. This might lead to unnecessary time-outs because acks cannot be processed timely.

Best practice is to emit a single tuple for each call to nextTuple(). If no tuple is available to be emitted, you should return (without emitting) and not wait until a tuple is available.

like image 50
Matthias J. Sax Avatar answered Sep 21 '22 02:09

Matthias J. Sax