I have a job queue (using Amazon SQS) which hands off jobs to many machines for fetching and processing various documents over HTTP. There are hundreds of different hosts which are accessed, and there is no predictable order for the jobs.
In order to be polite, I don't want my system to hammer repeatedly on a single host. Thus, if I get a job #123 to fetch something from example.com, but I see that I have just fetched another thing from example.com in the past X seconds, I should move on to something else and save job #123 for later.
The question is, what's a good way to implement this pattern?
It seems the first step would be to have the job runners keep a list somewhere of all domains and the last time something on that domain was accessed. I suppose this could be a simple DB table.
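To make that first step concrete, here is a minimal sketch of the lookup, assuming an in-memory dict in place of the DB table. The names `LastAccessTable`, `seconds_until_allowed`, and `try_claim` are hypothetical and only for illustration. With a real shared table, the check and the update would need to be one atomic operation, or two job runners could both pass the check at the same moment.

```python
import time


class LastAccessTable:
    """In-memory stand-in for a shared domain -> last-access-time table."""

    def __init__(self, min_delay, clock=time.monotonic):
        self.min_delay = min_delay  # X seconds required between hits on a domain
        self.clock = clock          # injectable so the sketch can be tested
        self.last_access = {}       # domain -> time of the last claimed fetch

    def seconds_until_allowed(self, domain):
        """How long a job for this domain must still wait (0.0 if it may run now)."""
        last = self.last_access.get(domain)
        if last is None:
            return 0.0
        return max(0.0, last + self.min_delay - self.clock())

    def try_claim(self, domain):
        """Record an access and return True, or return False if it is too soon."""
        if self.seconds_until_allowed(domain) > 0:
            return False
        self.last_access[domain] = self.clock()
        return True
```

A job runner would call `try_claim("example.com")` before fetching, and defer the job when it returns `False`.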
There are then many possible options for what to do if a message processor gets a job that must be deferred.
Simply push a copy of the message onto the end of the queue, and throw it away without executing it. Hopefully, by the next time it comes around, enough time will have passed. This may result in a lot of redundant SQS messages, especially if a large cluster of jobs for the same domain goes through at once.
Sleep for however many seconds are necessary until politeness dictates that the job can be executed. This may result in a lot of queue processors simultaneously doing nothing.
Accept the job, but save it in a local queue somewhere on each queue processor. I imagine each processor could "claim" a number of jobs this way, and then elect to process them in whatever order achieves maximum politeness. This can still be unpredictable, because each queue processor needs to be aware of the domains hit by all the others.
Establish separate queues for every domain and have one process dedicated to each queue. Each process would have to pause for X seconds between doing each job, so there's a lot of sleeping process overhead, but maybe this isn't such a bad thing.
Do you have any experience with designing this sort of thing? What strategy would you recommend?
Use a separate queue for each domain, plus a queue of domains.
Each processor should:
It may help to organize the queue of domains as a time-priority queue: store the domains in order of their next update time.
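One possible reading of this design, as a rough single-process sketch: each domain has its own FIFO of jobs, and a heap holds the domains keyed by the earliest time each may be hit again. The class and method names (`DomainScheduler`, `add_job`, `next_job`) are hypothetical, and the delay is measured from when a job is handed out rather than from when the fetch finishes. In a multi-machine setup the heap and the per-domain queues would have to live in some shared store, which this sketch does not attempt.

```python
import heapq
import time
from collections import defaultdict, deque


class DomainScheduler:
    """Per-domain job queues plus a heap of domains keyed by next allowed time."""

    def __init__(self, min_delay, clock=time.monotonic):
        self.min_delay = min_delay      # X seconds between hits on one domain
        self.clock = clock              # injectable so the sketch can be tested
        self.jobs = defaultdict(deque)  # domain -> pending jobs, FIFO
        self.ready_at = []              # heap of (next_allowed_time, domain)
        self.scheduled = set()          # domains currently present in the heap
        self.next_allowed = {}          # domain -> earliest time of its next hit

    def add_job(self, domain, job):
        self.jobs[domain].append(job)
        if domain not in self.scheduled:
            # Respect the delay even if the domain's queue had gone empty.
            due = max(self.clock(), self.next_allowed.get(domain, 0.0))
            heapq.heappush(self.ready_at, (due, domain))
            self.scheduled.add(domain)

    def next_job(self):
        """Return (domain, job) if some domain is due now, else None."""
        now = self.clock()
        if not self.ready_at or self.ready_at[0][0] > now:
            return None
        _, domain = heapq.heappop(self.ready_at)
        job = self.jobs[domain].popleft()
        self.next_allowed[domain] = now + self.min_delay
        if self.jobs[domain]:
            # More work for this domain: put it back with its next update time.
            heapq.heappush(self.ready_at, (self.next_allowed[domain], domain))
        else:
            self.scheduled.discard(domain)
        return domain, job
```

A processor would loop on `next_job()`, and when it returns `None`, sleep until the time at the top of the heap instead of polling. Because only the domain at the front of the heap is ever examined, a burst of jobs for one domain does not block jobs for the others.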