Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Storm max spout pending

This is a question regarding how Storm's max spout pending works. I currently have a spout that reads a file and emits a tuple for each line in the file (I know Storm is not the best solution for dealing with files but I do not have a choice for this problem).

I set the topology.max.spout.pending to 50k to throttle how many tuples get into the topology to be processed. However, I see this number not having any effect in the topology. I see all records in a file being emitted every time. My guess is this might be due to a loop I have in the nextTuple() method that emits all records in a file.

My question is: Does Storm just stop calling nextTuple() for the Spout task when topology.max.spout.pending is reached? Does this mean I should only emit one tuple every time the method is called?

like image 710
Naresh Avatar asked Jun 25 '14 15:06

Naresh


People also ask

What is Max spout pending?

Max spout pending allows you to set a maximum number of tuples that can be unacked at any given time.

What is a storm topology?

A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings.


1 Answers

Exactly! Storm can only limit your spout with the next command, so if you transmit everything when you receive the first next, there is no way for Storm to throttle your spout.

The Storm developers recommend emitting a single tuple with a single next command. The Storm framework will then throttle your spout as needed to meet the "max spout pending" requirement. If you're emitting a high number of tuples, you can batch your emits to at most a tenth of your max spout pending, to give Storm the chance to throttle.

like image 94
John Gilmore Avatar answered Oct 03 '22 13:10

John Gilmore