I have an application which currently has the following setup:
The setup looks something like this:

The job producer queries the database for new items which need to be added to its list of recurring jobs that need to be added to the work queue every N minutes. This job producer is the only node in my whole architecture which if failed, would cause the entire process to fail. I can have a DB server, a queue server, or several worker servers fail and the process would continue to operate.
How can I modify the job producer so that it isn't a single point of failure? I don't know how to distribute the work it does, which is querying the database every N minutes and enqueuing new jobs to be processed. It is a singular task.
I considered having multiple producers, and each producer would use modulus to only process 1/P jobs where P is the number of producers.
Something like:
itemsToBeProcess = db.FetchItems()
for (item in itemsToBeProcessed) {
    if item.id % producerNumber == 0) // Queue job
}
This would divide the work of the producers to multiple servers. However, this still isn't ideal, because if a single producer goes down than 1/P worth of jobs will stop being processed. So, it would still be a partial failure.
Can anyone give any guidance on how I can make this job producer not be a single point of failure in my application?
Is there any specific reason to query db every N minutes? I would solve such problem in a way that instead of N minutes I would query for N items and change an item state (eg. "open" -> "in progress") using "select for update"* (to make sure an item is processing (retrieve and update the state) by one and only one producer). Thanks to that you would be able to scale/provide FO etc. without any problem.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With