Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create a distributed 'debounce' task to drain a Redis List?

I have the following usecase: multiple clients push to a shared Redis List. A separate worker process should drain this list (process and delete). Wait/multi-exec is in place to make sure, this goes smoothly.

For performance reasons I don't want to call the 'drain'-process right away, but after x milliseconds, starting from the moment the first client pushes to the (then empty) list.

This is akin to a distributed underscore/lodash debounce function, for which the timer starts to run the moment the first item comes in (i.e.: 'leading' instead of 'trailing')

I'm looking for the best way to do this reliably in a fault tolerant way.

Currently I'm leaning to the following method:

  1. Use Redis Set with the NX and px method. This allows:
    • to only set a value (a mutex) to a dedicated keyspace, if it doesn't yet exist. This is what the nx argument is used for
    • expires the key after x milliseconds. This is what the px argument is used for
  2. This command returns 1 if the value could be set, meaning no value did previously exist. It returns 0 otherwise. A 1 means the current client is the first client to run the process since the Redis List was drained. Therefore,
  3. this client puts a job on a distributed queue which is scheduled to run in x milliseconds.
  4. After x milliseconds, the worker to receive the job starts the process of draining the list.

This works on paper, but feels a bit complicated. Any other ways to make this work in a distributed fault-tolerant way?

Btw: Redis and a distributed queue are already in place so I don't consider it an extra burden to use it for this issue.

like image 717
Geert-Jan Avatar asked Oct 09 '14 19:10

Geert-Jan


1 Answers

Sorry for that, but normal response would require a bunch of text/theory. Because your good question you've already written a good answer :)

First of all we should define the terms. The 'debounce' in terms of underscore/lodash should be learned under the David Corbacho’s article explanation:

Debounce: Think of it as "grouping multiple events in one". Imagine that you go home, enter in the elevator, doors are closing... and suddenly your neighbor appears in the hall and tries to jump on the elevator. Be polite! and open the doors for him: you are debouncing the elevator departure. Consider that the same situation can happen again with a third person, and so on... probably delaying the departure several minutes.

Throttle: Think of it as a valve, it regulates the flow of the executions. We can determine the maximum number of times a function can be called in certain time. So in the elevator analogy you are polite enough to let people in for 10 secs, but once that delay passes, you must go!

Your are asking about debounce sinse first element would be pushed to list:

So that, by analogy with the elevator. Elevator should go up after 10 minutes after the lift came first person. It does not matter how many people crammed into the elevator more.

In case of distributed fault-tolerant system this should be viewed as a set of requirements:

  1. Processing of the new list must begin within X time, after inserting the first element (ie the creation of the list).
  2. The worker crash should not break anything.
  3. Dead lock free.
  4. The first requirement must be fulfilled regardless of the number of workers - be it 1 or N.

I.e. you should know (in distributed way) - group of workers have to wait, or you can start the list processing. As soon as we utter the phrase "distributed" and "fault-tolerant". These concepts always lead with they friends:

  1. Atomicity (eg by blocking)
  2. Reservation

In practice

In practice, i am afraid that your system needs to be a little bit more complicated (maybe you just do not have written, and you already have it).

Your method:

  1. Pessimistic locking with mutex via SET NX PX. NX is a guarantee that only one process at a time doing the work (atomicity). The PX ensures that if something happens with this process the lock is released by the Redis (one part of fault-tolerant about dead locking).
  2. All workers try to catch one mutex (per list key), so just one be happy and would process list after X time. This process can update TTL of mutex (if need more time as originally wanted). If process would crash - the mutex would be unlocked after TTL and be grabbed with other worker.

My suggestion

The fault-tolerant reliable queue processing in Redis built around RPOPLPUSH:

  • RPOPLPUSH item from processing to special list (per worker per list).
  • Process item
  • Remove item from special list

Requirements So, if worker would crashed we always can return broken message from special list to main list. And Redis guarantees atomicity of RPOPLPUSH/RPOP. That is, there is only a problem group of workers to wait a while.

And then two options. First - if have much of clients and lesser workers use locking on side of worker. So try to lock mutex in worker and if success - start processing.

And vice versa. Use SET NX PX each time you execute LPUSH/RPUSH (to have "wait N time before pop from me" solution if you have many workers and some push clients). So push is:

SET myListLock 1 PX 10000 NX 
LPUSH myList value

And each worker just check if myListLock exists we should wait not at least key TTL before set processing mutex and start to drain.

like image 193
Nick Bondarenko Avatar answered Oct 14 '22 17:10

Nick Bondarenko