 

Debouncing messages from SQS

I need to build a thin service whose job is to:

Send update notification webhooks via HTTP POST to clients, notifying them that a job has been updated.

A job with ID XYZ consists of 10 - 1000 smaller parts, and we need to notify clients of the status of those parts. A job of 1000 parts might take 2 minutes. We only want to send an update every 10s or so (so for that job we would send this update webhook about 12 times).

I was planning on queueing the update messages via SQS from the worker units, then dequeuing them, performing a debounce, and sending the webhook. My issue is that I have no idea how to perform a debounce keyed on a particular identifier like XYZ.

Does anyone have any thoughts or experiences to share? How to perform a debounce?

asked Jan 06 '23 by Dominic Bou-Samra


1 Answer

In a strict sense, I suspect "debouncing" is not precisely the right term here, since it technically refers to the avoidance of misinterpreting an intended single action as being multiple actions (pressing track advance ⏭ once, but inadvertently skipping ahead two songs instead of just one) because the button "bounced" -- and what you're wanting to avoid is not true duplicate messages but unnecessary chattiness that could be consolidated down to a single action representing the latest or most relevant from a collection of triggers.

This is very much like debouncing, of course, since it boils down fundamentally to a task of "remembering" the last time you did a specific thing, and not allowing an identical or similar thing to happen again until a certain time has elapsed.

The relevant data structure you are probably looking for is an associative array -- which goes by numerous other names, like hash map -- storing keys (like the job identifier) and values (like the last timestamp when you touched a job, i.e., fired an event).

But it gets a little trickier when you become aware of the next thing and decide not to act on it right away: you then have no guarantee of whether or when there may be more subsequent related things, so you need timers to ensure you don't leave the last thing undone, or delay it inappropriately. You also need to clean up the memory structure when a particular job is known to be done, and periodically sweep it for abandoned keys -- to handle the case of jobs that never properly finished, for whatever reason -- or you'll eventually run out of memory. And you have to be aware of the order of arrival of events, particularly if there's a final "done" event, since anything received after that, for whatever reason, is no longer relevant.

Certainly, this is all doable, but it can get messy, as you have doubtlessly concluded.

A recently added capability in SQS might help somewhat in this regard: FIFO queues.

Important: at the risk of being misunderstood, I am not suggesting that a FIFO queue is a magic bullet, or that it is either necessary or sufficient. Rather, I am suggesting that it has some features that will probably help achieve or simplify the task, even though some of these features are not even directly related to the strict-ordered FIFO behavior of a FIFO queue.

Each message in a FIFO queue has a MessageGroupId property, which is an opaque alphanumeric string that you specify when sending each message to the queue. This could represent the specific job ID -- thereby ensuring grouped-delivery and in-order-delivery of queue messages for the same job. Useful, no?
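For concreteness, a sketch of how a worker might build such a message for boto3's `sqs.send_message()`. The helper name and payload fields are illustrative; note that FIFO queues also require a `MessageDeduplicationId` unless content-based deduplication is enabled on the queue:

```python
import json

def build_job_update(queue_url: str, job_id: str,
                     parts_done: int, parts_total: int, seq: int) -> dict:
    """Build kwargs for sqs.send_message() for a FIFO queue (illustrative)."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(
            {"job_id": job_id, "done": parts_done, "total": parts_total}
        ),
        "MessageGroupId": job_id,  # groups and orders messages per job
        # Required for FIFO queues unless content-based deduplication is on:
        "MessageDeduplicationId": f"{job_id}-{seq}",
    }

# Usage (requires boto3 and a queue whose name ends in ".fifo"):
# import boto3
# sqs = boto3.client("sqs")
# sqs.send_message(**build_job_update(queue_url, "XYZ", 42, 1000, seq=42))
```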

When receiving messages from the queue, all the messages you receive in one batch should (according to the docs) be messages with the same MessageGroupId... so if you set this to a string uniquely representing the job, this means that if multiple messages for a given job are in the queue, you'll get many or all of them together, and you'll get them in the order they were sent to the queue -- which means you can review and potentially discard all but the last one, fire a notification, delete the messages from the queue, and go back to the top of the loop and poll SQS again.¹
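The consolidation step described above -- keep only the last message per group, mark the rest for deletion -- can be sketched as a pure function over a received batch. This assumes `ReceiveMessage` was called with `AttributeNames=["MessageGroupId"]` so the group ID is present on each message, and relies on the FIFO ordering guarantee within a group:

```python
def latest_per_group(messages: list[dict]) -> tuple[dict[str, dict], list[dict]]:
    """Split a batch into the latest message per MessageGroupId and the
    superseded ones, which can simply be deleted without acting on them."""
    latest: dict[str, dict] = {}
    superseded: list[dict] = []
    for msg in messages:  # FIFO: within a group, later messages arrive later
        group = msg["Attributes"]["MessageGroupId"]
        if group in latest:
            superseded.append(latest[group])
        latest[group] = msg
    return latest, superseded

# Usage sketch: fire one webhook per entry in `latest`, then delete both the
# acted-on and superseded messages (e.g. via sqs.delete_message_batch), and poll again.
```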

This seems like it greatly simplifies the process, but doesn't necessarily fully solve the core issue of the problem as you have described it -- because the next batch could be for the same job, so you'd send another notification almost immediately. On the other hand, maybe that's okay, because since you could potentially read up to 10 messages² from the queue for the same job in one request, you still have the potential to eliminate up to 90% of the unnecessary messages. Your worker will only deliver one notification to the remote endpoint at a time.

You could also leverage another SQS capability by changing the message visibility timeout if you encounter a message you don't want to turn into an event right now -- FIFO queues apparently change the behavior of message visibility so that all of the messages with a given MessageGroupId, including future messages with the same ID, remain invisible together, as a group. (You'll want to verify this behavior.)

Now, the above might work together to fundamentally accomplish what you need, even though it is admittedly not all of what you asked for -- because at times of low workload, it will likely deliver more messages than at times of higher workload. If you need to literally constrain your external notification delivery to "not more than once every n seconds" per job, things do get a little more complex, because you definitely need to "remember" when you last sent a message for a given job so that you can decide whether it is "too soon" to send another. So you'll need a data structure (the associative array, mentioned above) where you can "remember" when you sent the last message for each job, so that you can tell SQS how long to leave a batch of messages invisible. But it isn't likely to be a critical structure, since the worst thing that happens if its contents are lost (such as by restarting this microservice, assuming it's stored in-memory) is that you send the next message sooner than you otherwise would.
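Tying those two pieces together, a sketch of computing the visibility timeout to request for a "too soon" batch, given the last-sent timestamp for that job (interval and names are illustrative; losing the map just means a `0` result and a slightly early notification):

```python
MIN_INTERVAL = 10.0  # seconds between webhooks per job, per the question

def visibility_delay(last_sent: dict[str, float], job_id: str, now: float) -> int:
    """Seconds to keep a job's messages invisible before acting on them.
    Returns 0 when it is already safe to notify for this job."""
    previous = last_sent.get(job_id)
    if previous is None:
        return 0
    remaining = MIN_INTERVAL - (now - previous)
    # SQS visibility timeouts are whole seconds, so round up.
    return max(0, int(remaining + 0.999))

# Usage sketch: if the delay is nonzero, call sqs.change_message_visibility()
# with VisibilityTimeout=delay instead of sending the webhook now.
```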


¹ Depending on how aggressive you want to be in this design, with each successful long-poll of SQS in the outer loop, if you receive the maximum number of messages, you may want to account for the fact that there may yet be more messages for the same MessageGroupId waiting in the queue, which means that even though the last message in the batch will necessarily be the latest one within the batch, the last message you received isn't necessarily the latest message for the group because a later one remains in the queue. Accounting for this special case increases complexity. Not accounting for it simply increases the frequency of notifications you generate during times of low traffic... which may be an acceptable tradeoff for simplicity.

² 10 is still, according to the documentation, the largest number of messages you can receive in a single read request, even with FIFO queues. The documentation does suggest that a FIFO queue is in a sense more likely than a conventional queue to return as many messages as possible, since the distributed scaling inside SQS is implemented somewhat differently. This comes with a tradeoff of a performance limit of 300 TPS with a FIFO queue. Conventional queues, because of SQS's internal scaling, have effectively unlimited TPS at the cost of possible out-of-order and potential (but anecdotally rare) duplicate delivery of messages.

answered Jan 13 '23 by Michael - sqlbot