Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Kafka as a message queue for long running tasks

I am wondering if there is something I am missing about my set up to facilitate long running jobs.

For my purposes it is ok to have At most once message delivery, this means it is not required to think about committing offsets (or at least it is ok to commit each message offset upon receiving it).

I have the following in order to achieve the competing consumer pattern:

  • A topic
  • X consumers in the same group
  • P partitions in a topic (where P >= X always)

My problem is that I have messages that can take ~15 minutes (but this may fluctuate by up to 50% lets say) in order to process. In order to avoid consumers having their partition assignments revoked I have increased the value of max.poll.interval.ms to reflect this. However this comes with some negative consequences:

  • if some message exceeds this length of time then in a worst case scenario a the consumer processing this message will have to wait up to the value of max.poll.interval.ms for a rebalance
  • if I need to scale and increase the number of consumers based on load then any new consumers might also have to wait the value of max.poll.interval.ms for a rebalance to occur in order to process any new messages

As it stands at the moment I see that I can proceed as follows:

  • Set max.poll.interval.ms to be a small value and accept that every consumer processing every message will time out and go through the process of having assignments revoked and waiting a small amount of time for a rebalance

However I do not like this, and am considering looking at alternative technology for my message queue as I do not see any obvious way around this. Admittedly I am new to Kafka, and it is just a gut feeling that the above is not desirable. I have used RabbitMQ in the past for these scenarios, however we need Kafka in our architecture for other purposes at the moment and it would be nice not to have to introduce another technology if Kafka can achieve this.

I appreciate any advise that anybody can offer on this subject.

like image 607
Gerard Murphy Avatar asked Jun 14 '19 09:06

Gerard Murphy


People also ask

Can Kafka be used as message queue?

We can use Kafka as a Message Queue or a Messaging System but as a distributed streaming platform Kafka has several other usages for stream processing or storing data.

Can Kafka be used as a scheduler?

How does it work ? Kafka message scheduler is simply using kafka topics. These topics contains all the schedules to trigger. These messages are regular kafka messages with headers and a payload.

How long Kafka will keep the messages in the queue?

Basics. We can notice here that the default retention time is seven days.

Is Kafka a task queue?

Kafka as a Queue For fault-tolerance and scalability, a Kafka topic is divided into partitions. Because each consumer instance in a group processes data from a non-overlapping set of partitions, Consumer Groups enable Kafka to behave like a Queue (within a Kafka topic).


2 Answers

Using Kafka as a Job queue for scheduling long running process is not a good idea as Kafka is not a queue in the strictest sense and semantics for failure handling and retries are limited. Though you might be able to achieve a compromise by playing around with certain configuration for rebalance or timeout, it is likely to remain brittle design. Simple answer is that Kafka was not designed for these kind of usecases.

The idea of max.poll.interval.ms is to prevent livelock situation (see), but in your case, consumer will send a false positive to the Kafka broker and will trigger a rebalance as there is no way to distinguish between a livelock and a legitimate long process.

You should think about the tradeoffs between living with the negative consequences you mentioned Vs. introducing a new technology which helps you to model a job queue in a better way. For a more complex usecase, check out how slack is doing it.

like image 95
senseiwu Avatar answered Oct 26 '22 09:10

senseiwu


The way we got around the issues we were having was as suggested in the comments. We decided to decouple the message processing from the consumer polling.

On each worker/consumer there were 2 threads, one for doing the actual processing and the other for phoning home to Kafka periodically.

We also did some work with trying to reduce the processing times for messages. However some messages still take time that can be measured in minutes. This has worked for us now for some time with no issues.

Thanks for this suggestions in comments @Donal

like image 33
Gerard Murphy Avatar answered Oct 26 '22 10:10

Gerard Murphy