We’re using Storm with the Kafka Spout. When we fail messages, we’d like to replay them, but in some cases bad data or code errors will cause messages to always fail in a Bolt, so we’ll get into an infinite replay cycle. Obviously we’re fixing errors when we find them, but we’d like our topology to be generally fault tolerant. How can we ack() a tuple after it’s been replayed more than N times?
Looking through the code for the Kafka Spout, I see that it was designed to retry with an exponential backoff timer and the comments on the PR state:
"The spout does not terminate the retry cycle (it is my conviction that it should not do so, because it cannot report context about the failure that happened to abort the reqeust), it only handles delaying the retries. A bolt in the topology is still expected to eventually call ack() instead of fail() to stop the cycle."
I've seen StackOverflow responses that recommend writing a custom spout, but I'd rather not be stuck maintaining a custom patch of the internals of the Kafka Spout if there's a recommended way to do this in a Bolt.
What’s the right way to do this in a Bolt? I don’t see any state in the tuple that exposes how many times it’s been replayed.
Storm itself does not provide any support for your problem, so a customized solution is the only way to go. Even if you do not want to patch KafkaSpout itself, I think introducing a counter and breaking the replay cycle with it would be the best approach. As an alternative, you could also inherit from KafkaSpout and put the counter in your subclass. This is of course somewhat similar to a patch, but it might be less intrusive and easier to implement.
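A subclass along these lines might work. This is only a rough sketch, assuming the pre-1.0 storm-kafka artifact (storm.kafka.KafkaSpout; newer Storm releases moved the packages under org.apache.storm.*); the class name and the MAX_RETRIES threshold are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;

// Sketch only: gives up on a message after MAX_RETRIES failures by ack'ing it.
// NOTE: this assumes the message ID has stable equals()/hashCode() across replays;
// depending on your storm-kafka version you may need to derive a key from the
// Kafka partition and offset instead.
public class BoundedRetryKafkaSpout extends KafkaSpout {

    private static final int MAX_RETRIES = 5;                      // illustrative threshold
    private final Map<Object, Integer> failCounts = new HashMap<>();

    public BoundedRetryKafkaSpout(SpoutConfig config) {
        super(config);
    }

    @Override
    public void fail(Object msgId) {
        int count = failCounts.getOrDefault(msgId, 0) + 1;
        if (count >= MAX_RETRIES) {
            // Too many attempts: pretend the tuple succeeded so KafkaSpout stops replaying it.
            failCounts.remove(msgId);
            super.ack(msgId);
        } else {
            failCounts.put(msgId, count);
            super.fail(msgId);                                     // normal retry/backoff path
        }
    }

    @Override
    public void ack(Object msgId) {
        failCounts.remove(msgId);                                  // forget tuples that eventually succeeded
        super.ack(msgId);
    }
}
```

Since a spout's nextTuple(), ack(), and fail() are all called on the same executor thread, a plain HashMap is sufficient here.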
If you want to use a Bolt, you could do the following (which also requires some changes to the KafkaSpout or a subclass of it):

- Connect the new bolt directly to the KafkaSpout via fieldsGrouping on the tuple ID (to ensure that a tuple that is replayed is streamed to the same bolt instance). This assumes the spout emits a unique ID attribute with each tuple (for example, derived from the Kafka partition and offset).
- In the bolt, keep a HashMap<ID,Counter> that buffers all tuples and counts the number of (re-)tries (see the sketch after this list). If the counter is smaller than your threshold value, forward the input tuple so it gets processed by the actual topology that follows (of course, you need to anchor the tuple appropriately). If the count is larger than your threshold, ack the tuple to break the cycle and remove its entry from the HashMap (you might also want to log all failed tuples).
- In order to remove successfully processed tuples from the HashMap, each time a tuple is acked in KafkaSpout you need to forward the tuple ID to the bolt so that it can remove the tuple from the HashMap. Just declare a second output stream for your KafkaSpout subclass and override Spout.ack(...) (of course, you need to call super.ack(...) to ensure KafkaSpout gets the ack, too).
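To make that concrete, the counting bolt could look roughly like the sketch below. The field names "id" and "payload", the stream name "ack" (the spout subclass's second output stream), and the MAX_RETRIES constant are assumptions for this example, not anything KafkaSpout provides out of the box:

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Sketch of the counting bolt described above (names are illustrative).
public class RetryLimitBolt extends BaseRichBolt {

    private static final int MAX_RETRIES = 5;

    private OutputCollector collector;
    private final Map<Object, Integer> retryCounts = new HashMap<>();

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        Object id = input.getValueByField("id");   // both the data stream and the "ack" stream carry an "id" field

        if ("ack".equals(input.getSourceStreamId())) {
            // the spout reports that this tuple was fully processed: forget it
            retryCounts.remove(id);
        } else {
            int attempts = retryCounts.merge(id, 1, Integer::sum);
            if (attempts <= MAX_RETRIES) {
                // forward the tuple, anchored to the input, so a downstream fail() still triggers a replay
                collector.emit(input, new Values(id, input.getValueByField("payload")));
            } else {
                // threshold exceeded: drop the tuple and break the cycle by just acking below
                retryCounts.remove(id);
                // (this is also the place to log or persist the poisoned message)
            }
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "payload"));
    }
}
```

The bolt can ack its input right away in both branches: in the forward case the emitted tuple is anchored to the input, so the tuple tree stays open until the downstream bolts ack it; in the drop case acking without emitting completes the tree and the spout stops replaying.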
This approach might consume a lot of memory though. As an alternative to keeping an entry for each tuple in the HashMap, you could also use a third stream (connected to the bolt like the other two) and forward a tuple ID whenever a tuple fails (i.e., in Spout.fail(...)). Each time the bolt receives a "fail" message on this third stream, it increases the counter. As long as a tuple has no entry in the HashMap (or its counter has not reached the threshold), the bolt simply forwards the tuple for processing. This should reduce the memory used, but requires some more logic in your spout and bolt.
Both approaches have the disadvantage that each acked tuple results in an additional message to your newly introduced bolt (thus increasing network traffic). For the second approach, it might seem that you only need to send an "ack" message to the bolt for tuples that failed before; however, without extra bookkeeping you do not know which tuples failed and which did not. If you want to get rid of this network overhead, you could introduce a second HashMap in KafkaSpout that buffers the IDs of failed messages. Then you only send an "ack" message if a previously failed tuple was replayed successfully. Of course, this third approach makes the logic to be implemented even more complex.
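A spout subclass for this third variant might look roughly as follows. The stream names "ack"/"fail", the field name "id", and the use of msgId.toString() as the notification value are assumptions; in a real topology the notification value must be the same ID your data tuples carry (e.g., derived from the Kafka partition and offset) and it must be serializable:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;

// Sketch only. Assumes (1) msgId has stable equals()/hashCode() across replays and
// (2) the value emitted on the notification streams matches the "id" field of the
// corresponding data tuple (e.g., both derived from the Kafka partition and offset).
public class NotifyingKafkaSpout extends KafkaSpout {

    public static final String ACK_STREAM = "ack";
    public static final String FAIL_STREAM = "fail";

    private final Set<Object> failed = new HashSet<>();   // IDs of messages that failed at least once
    private SpoutOutputCollector collector;

    public NotifyingKafkaSpout(SpoutConfig config) {
        super(config);
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        super.open(conf, context, collector);
        this.collector = collector;                       // keep our own reference for the extra streams
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        super.declareOutputFields(declarer);              // the regular data stream (from the Scheme)
        declarer.declareStream(FAIL_STREAM, new Fields("id"));
        declarer.declareStream(ACK_STREAM, new Fields("id"));
    }

    @Override
    public void fail(Object msgId) {
        failed.add(msgId);
        collector.emit(FAIL_STREAM, new Values(msgId.toString()));   // tell the bolt to bump its counter
        super.fail(msgId);                                           // normal retry/backoff path
    }

    @Override
    public void ack(Object msgId) {
        if (failed.remove(msgId)) {
            // notify the bolt only for tuples that failed before, to save network traffic
            collector.emit(ACK_STREAM, new Values(msgId.toString()));
        }
        super.ack(msgId);
    }
}
```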
Without modifying KafkaSpout to some extent, I see no solution for your problem. I personally would patch KafkaSpout, or would use the third approach with a HashMap in the KafkaSpout subclass and the bolt (because it consumes little memory and does not put a lot of additional load on the network compared to the first two solutions).
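For completeness, wiring the sketches above together might look roughly like this; the component names, parallelism hints, and the assumption that the spout's Scheme emits an "id" field with every message are all illustrative:

```java
import backtype.storm.generated.StormTopology;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;
import storm.kafka.SpoutConfig;

// Sketch: builds a topology using the illustrative classes from the snippets above.
public class RetryLimitTopology {

    public static StormTopology build(SpoutConfig spoutConfig) {
        TopologyBuilder builder = new TopologyBuilder();

        // assumes the spout's Scheme declares an "id" field for every Kafka message
        builder.setSpout("kafka", new NotifyingKafkaSpout(spoutConfig), 2);

        // fieldsGrouping on the same ID value for all streams routes a replayed tuple
        // and its ack notifications to the same RetryLimitBolt task
        builder.setBolt("retry-limit", new RetryLimitBolt(), 4)
               .fieldsGrouping("kafka", new Fields("id"))
               .fieldsGrouping("kafka", NotifyingKafkaSpout.ACK_STREAM, new Fields("id"));
               // for the memory-reduced variant, also subscribe to NotifyingKafkaSpout.FAIL_STREAM
               // and count those messages in the bolt instead of counting data-stream arrivals

        // attach the rest of your topology to "retry-limit" instead of the spout, e.g.:
        // builder.setBolt("process", new MyProcessingBolt(), 8).shuffleGrouping("retry-limit");

        return builder.createTopology();
    }
}
```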