How to achieve reliability with RabbitMQ?

My data is stored in many repositories, and we expect a set of tasks (a.k.a. jobs) that are supposed to process this data. Each job requires access to one or two data repositories. Tasks are expected to run for up to 8 hours for large files and a few milliseconds for small ones. It is important that each job is executed exactly once and that none is missed.

We need to set up multiple agents running in containers to execute the tasks. At startup, each agent is granted access to a set of repositories. Each agent should run only the jobs it can fulfill. For example, it makes no sense to assign a job that needs access to the "R1" and "R2" repositories to an agent that only has access to "R2", "R3", "R4" and "R5".

It seems RabbitMQ is a great candidate for this scenario, but I feel it is not reliable, for the following reasons:

  • It can deliver the same message twice.
  • It might crash, so messages might get lost.
  • Some agents might start at a later point in time, and the jobs might get lost.

Should I use Redis to avoid processing the same message twice?

To achieve excellent reliability, should I run a process that re-populates the queue from time to time?

Are "topic" exchanges a good solution for directing the messages only to the agents that can process them? If so, how do I deal with the case where a message was sent before the corresponding agent started?

Of course, if you think other technologies are better equipped for this job than AMQP, feel free to recommend them.

asked Mar 07 '23 18:03 by user3429660


1 Answer

Let's first summarize the situation:

  • You have an arbitrary number of unknown-length jobs
  • Jobs must be processed exactly once
  • Certain jobs can only be run on certain machines

On the face of it, this seems to be a moderately challenging job shop scheduling problem. However, that's not what you're asking. Instead, your question seems to be geared toward how to ensure that jobs are processed only once, and you're looking for RabbitMQ to provide that answer.

So let's be clear. RabbitMQ is not able to provide that answer, but neither is any other message queue. There are two reasons for this. First, a message queue is not a job; it is a holding place for a job. The actual job is something that represents a change of state in your system. The message queue is only responsible for delivering the job, not for processing it.

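Second, a message broker can only really make one of two delivery guarantees: at-most-once or at-least-once. In RabbitMQ you get at-most-once with automatic acknowledgements (auto-ack), and at-least-once with manual consumer acknowledgements, combined with publisher confirms and durable queues holding persistent messages so that messages survive a broker restart. The mandatory flag only returns unroutable messages to the publisher, and the immediate flag has not been supported since RabbitMQ 3.0. The two guarantees are mutually exclusive, and neither one is exactly-once delivery.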
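To make the difference concrete, here is a minimal in-memory sketch of the two guarantees. It is not RabbitMQ's API: `ToyBroker`, `run` and the `crash_when` parameter are hypothetical names invented for this illustration, and the "crash" is simulated. It only shows why auto-ack can lose a message and why manual ack can deliver one twice.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for a broker queue (hypothetical, not RabbitMQ's API)."""

    def __init__(self, messages, auto_ack):
        self.queue = deque(messages)
        self.auto_ack = auto_ack
        self.unacked = None  # message handed out but not yet acknowledged

    def deliver(self):
        """Hand the next message to the consumer, or None if the queue is empty."""
        if not self.queue:
            return None
        msg = self.queue.popleft()
        if not self.auto_ack:
            self.unacked = msg  # remember it until the consumer acknowledges
        return msg

    def ack(self):
        self.unacked = None

    def consumer_died(self):
        """Requeue the in-flight message, if the broker still remembers one."""
        if self.unacked is not None:
            self.queue.appendleft(self.unacked)
            self.unacked = None


def run(auto_ack, crash_when):
    """Consume two jobs; the consumer crashes once on job-1, either
    'before_work' or 'after_work' (after the work, before the ack)."""
    broker = ToyBroker(["job-1", "job-2"], auto_ack)
    processed = []
    crashed = False
    while True:
        msg = broker.deliver()
        if msg is None:
            return processed
        if msg == "job-1" and not crashed and crash_when == "before_work":
            crashed = True
            broker.consumer_died()
            continue
        processed.append(msg)  # the actual work happens here
        if msg == "job-1" and not crashed and crash_when == "after_work":
            crashed = True
            broker.consumer_died()
            continue
        if not auto_ack:
            broker.ack()


print(run(auto_ack=True, crash_when="before_work"))   # job-1 is lost
print(run(auto_ack=False, crash_when="after_work"))   # job-1 runs twice
```

With auto-ack the broker forgets the message as soon as it is handed out, so a crash before the work loses it. With manual ack the broker redelivers anything unacknowledged, so a crash after the work but before the ack repeats it.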

Takeaway #1: It is clear that looking for a solution in the delivery mechanism, rather than the processing mechanism, is futile.

But, there is a solution.

Idempotency is the property of a process whereby applying it repeatedly leaves the system in the same state as applying it once. The outcome does not depend on how many times the command was received. A simple example involves a light switch. Suppose you tell someone to flip a light switch 100 times, and the person does it. Even assuming you knew the switch was initially off, can you make any guarantees about the state of the switch at the end of the 100th flip? No, because nothing in the world is perfectly reliable: a single missed or repeated flip inverts the result.

However, suppose you tweak this a bit, and say "flip the switch to the up position." Now, you have a defined end state that results from the command. At the end of the process, the switch is to be "up". A person receiving this command multiple times can easily observe the state of the switch and take no action should the switch already be in the correct state.

If you define your behavior in terms of the results it achieves as opposed to the process that achieves it, you will be much better positioned to have an idempotent system. Thus, an at-least-once delivery mechanism, which RabbitMQ provides through consumer acknowledgements, will work for you, because a duplicate delivery no longer changes the outcome.
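The light switch example can be written down in a few lines. This is only an illustration; `flip`, `set_up` and `apply_all` are made-up names, and `deliveries` stands for how many times the same command happens to arrive.

```python
def flip(state):
    """Non-idempotent command: the result depends on how often it runs."""
    return "down" if state == "up" else "up"


def set_up(state):
    """Idempotent command: the switch ends up 'up' however often it runs."""
    return "up"


def apply_all(command, state, deliveries):
    """Apply the same command once per delivery and return the final state."""
    for _ in range(deliveries):
        state = command(state)
    return state


# One delivery versus an accidental duplicate:
print(apply_all(flip, "down", 1), apply_all(flip, "down", 2))      # up down
print(apply_all(set_up, "down", 1), apply_all(set_up, "down", 2))  # up up
```

A duplicated `flip` undoes the work, while a duplicated `set_up` is harmless. That is the property you want your job handlers to have.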

Takeaway #2: Define your behavior in terms of the result, not the process.

The final question is how to do this. There are many ways, but in none of them is the message system the state container. All computer systems rely upon some sort of persistent storage mechanism (file, database, punch cards?) to store and retrieve the system state. You should rely upon the messages to provide cues as to (1) what needs to be done and (2) when it needs to be done, but not (3) how it needs to be done. You'll have to figure out #3 by examining the current state prior to beginning work triggered by a message.
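As one possible sketch of "examine the current state before beginning work", the handler below records each job in a small SQLite table and skips any delivery whose job has already been claimed. The table layout and the names `open_store`, `handle_message`, `job_id` and `do_work` are assumptions made for this example; any database with an atomic insert-if-absent operation would serve.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the state store and make sure the (hypothetical) jobs table exists."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " job_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL)"
    )
    return db


def handle_message(db, job_id, do_work):
    """Run do_work(job_id) unless the store shows the job was already claimed.

    Returns True if this delivery performed the work, False if it was skipped.
    """
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "INSERT OR IGNORE INTO jobs (job_id, status) VALUES (?, 'running')",
            (job_id,),
        )
        if cur.rowcount == 0:
            return False  # the row already existed: an earlier delivery owns it
    do_work(job_id)
    with db:
        db.execute("UPDATE jobs SET status = 'done' WHERE job_id = ?", (job_id,))
    return True


db = open_store()
done = []
print(handle_message(db, "job-42", done.append))  # True: first delivery does the work
print(handle_message(db, "job-42", done.append))  # False: duplicate is ignored
```

The primary key makes the claim atomic, so two agents receiving the same message cannot both start the job. What this sketch leaves out is recovery: a job whose worker dies while the row says 'running' stays claimed, so a real system would also need some timeout or lease policy for long-running jobs.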

Takeaway #3: Do not use message queues as a state container. Use a database.

answered Mar 15 '23 19:03 by theMayer