I have just started looking into etcd, and one of the use cases mentioned in the talks given by its creators is a work queue system.
But how would you actually implement this?
The basic pattern would be something like this:
One process generates "work description tickets" and places each ticket in an etcd folder, say "/queue/worktickets/00000000001/".
One or more processes listen to the "/queue/worktickets/" folder for changes. When a new work ticket appears, every process tries to grab the ticket by creating a new value at "/queue/locks/00000001" to lock that ticket. Only the first one will be able to create the lock value.
The process that created the lock value does its work, removes the ticket from the queue (and maybe the lock value), and then tries to grab the next available ticket. If no more tickets are available, it starts listening for changes in the "/queue/worktickets/" folder again.
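The folder-style description above matches the older v2 API, where the atomic step is a PUT with prevExist=false. Here is a minimal sketch of the same "only the first process gets the lock" step using the current v3 Go client (go.etcd.io/etcd/client/v3): the transaction only succeeds if the lock key has never been created. The key names simply mirror the placeholders above, and the local endpoint is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// tryLockTicket attempts to claim a ticket by creating /queue/locks/<ticketID>.
// The transaction succeeds only if the key does not exist yet (CreateRevision == 0),
// so exactly one of the competing workers wins.
func tryLockTicket(ctx context.Context, cli *clientv3.Client, ticketID string) (bool, error) {
	lockKey := "/queue/locks/" + ticketID
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision(lockKey), "=", 0)).
		Then(clientv3.OpPut(lockKey, "worker-1")).
		Commit()
	if err != nil {
		return false, err
	}
	return resp.Succeeded, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed local single-node etcd
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	won, err := tryLockTicket(ctx, cli, "00000000001")
	if err != nil {
		panic(err)
	}
	fmt.Println("won the ticket:", won)
}
```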
In my head this should be fairly simple to implement, but if the queue gets large (tickets are easy to generate but hard to process), then it seems there will be a lot of data being transferred from etcd to each of the clients. To my knowledge, there is no way of saying "give me the first value in this folder that does not exist in that other folder", nor is there a "give me the top n items from this folder".
Anyone care to share their thoughts on this?
So I came up with a solution that I think is robust; the design below reflects the goals I had in mind.
The idea is to have two queues: a pending queue and a running queue. Initially, all items are in the pending queue.
When a worker tries to retrieve an item, it transfers it from the pending queue to the running queue using a transaction (available in etcd 3). In the same transaction, the worker also creates a lock for the item. The lock is guarded by a lease, so it's automatically removed if the worker dies.
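Here is a hedged sketch of that claim step with the Go v3 client. The key layout (/queue/pending/<id>, /queue/running/<id>, /queue/locks/<id>) and the 30-second lease TTL are my own choices for illustration, not something etcd prescribes.

```go
package workqueue

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// claimItem moves one item from the pending queue to the running queue and, in
// the same transaction, creates a lease-guarded lock. If the lease expires
// (the worker died), the lock key disappears automatically while the item
// stays in the running queue. payload is the ticket body, read from the
// pending key beforehand.
func claimItem(ctx context.Context, cli *clientv3.Client, id, payload string) (bool, clientv3.LeaseID, error) {
	pendingKey := "/queue/pending/" + id
	runningKey := "/queue/running/" + id
	lockKey := "/queue/locks/" + id

	// The lock vanishes 30 seconds after the worker stops renewing the lease.
	lease, err := cli.Grant(ctx, 30)
	if err != nil {
		return false, 0, err
	}

	// The transaction succeeds only if the item is still pending and nobody
	// else holds the lock; it then deletes the pending key, writes the running
	// key, and creates the lock attached to the lease.
	resp, err := cli.Txn(ctx).
		If(
			clientv3.Compare(clientv3.CreateRevision(pendingKey), ">", 0),
			clientv3.Compare(clientv3.CreateRevision(lockKey), "=", 0),
		).
		Then(
			clientv3.OpDelete(pendingKey),
			clientv3.OpPut(runningKey, payload),
			clientv3.OpPut(lockKey, "worker-1", clientv3.WithLease(lease.ID)),
		).
		Commit()
	if err != nil {
		return false, 0, err
	}
	return resp.Succeeded, lease.ID, nil
}
```

A worker would typically Get a batch of keys from /queue/pending/ (for example with clientv3.WithPrefix() and clientv3.WithLimit(1)), then call claimItem with that key's id and value; if Succeeded comes back false, another worker won the race and it simply tries the next key.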
If a worker successfully finishes processing the item, it removes the item from the running queue and we are done. If the worker fails, the lock will expire, and the item is left in the running queue.
Thus, workers are also expected to look into the running queue once the pending queue has been exhausted. The expectation is that the running queue will be small compared to the pending queue, so finding an item that is currently not locked (by simply listing the running queue) should not be expensive.
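A sketch of that scan, continuing the same hypothetical package and key layout as above: list /queue/running/ and return the first item whose lock key no longer exists, i.e. whose worker's lease expired.

```go
// findOrphanedItem lists the running queue and returns the first item whose
// lock key is gone. The caller would then try to re-claim it with the same
// kind of transaction used for claiming a pending item.
func findOrphanedItem(ctx context.Context, cli *clientv3.Client) (id, payload string, err error) {
	resp, err := cli.Get(ctx, "/queue/running/",
		clientv3.WithPrefix(),
		clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend))
	if err != nil {
		return "", "", err
	}
	for _, kv := range resp.Kvs {
		itemID := string(kv.Key[len("/queue/running/"):])
		lock, err := cli.Get(ctx, "/queue/locks/"+itemID, clientv3.WithCountOnly())
		if err != nil {
			return "", "", err
		}
		if lock.Count == 0 {
			// No live lock: the previous worker presumably died.
			return itemID, string(kv.Value), nil
		}
	}
	return "", "", nil // nothing orphaned right now
}
```

The check here is only advisory: the actual re-claim still goes through a transaction that verifies the lock is absent, so two workers racing on the same orphaned item cannot both win.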
Or, as @dannysauer mentioned in the comments, you could also have a separate fault-tolerant process that transfers items from the running queue back to the pending queue.
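And a sketch of that variant, again under the same assumed key layout: a sweeper that periodically moves lock-less items from the running queue back to the pending queue, with a per-item transaction guarding against a race with a still-live worker.

```go
// requeueOrphans moves every running item whose lock has expired back to the
// pending queue. The transaction re-checks that the lock is still absent at
// commit time, so a worker that is merely slow is left alone.
func requeueOrphans(ctx context.Context, cli *clientv3.Client) error {
	resp, err := cli.Get(ctx, "/queue/running/", clientv3.WithPrefix())
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		itemID := string(kv.Key[len("/queue/running/"):])
		_, err := cli.Txn(ctx).
			If(
				clientv3.Compare(clientv3.CreateRevision("/queue/locks/"+itemID), "=", 0),
				clientv3.Compare(clientv3.CreateRevision("/queue/running/"+itemID), ">", 0),
			).
			Then(
				clientv3.OpDelete("/queue/running/"+itemID),
				clientv3.OpPut("/queue/pending/"+itemID, string(kv.Value)),
			).
			Commit()
		if err != nil {
			return err
		}
	}
	return nil
}
```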