
Multiple delayed job processes starting same job

I'm using delayed job in a setup where I run multiple workers. For the sake of my question, it doesn't really matter, but let's say I run 10 workers (doing that in development mode currently).

The problem I am having is that two different workers sometimes start working on the same job, calling the perform method on my job object.

To the best of my understanding, Delayed Job uses pessimistic locking to prevent this from happening, but it seems a second worker sometimes still has enough time to steal the job before the first worker has actually locked it.

I'm just asking to see if anyone else has experienced this problem, or if it is my setup that is misbehaving. I'm using Postgres, and this happens both on my dev machine and on Heroku, where I host it.

I will try to work around it within my jobs, but it is still a bit problematic that this happens. Ideally it would never happen that delayed job works on the same job from two processes.
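A minimal sketch of what such an in-job workaround could look like, assuming each job carries a unique key and the first step of perform is an atomic claim of that key. `ClaimStore` and `ReportJob` are hypothetical names, not part of Delayed Job, and the in-memory store only stands in for something like a database table with a unique index:

```ruby
require "set"

# Hypothetical stand-in for a table with a unique index on the key.
# In a real app the claim would be an INSERT that fails on a duplicate key.
class ClaimStore
  def initialize
    @mutex = Mutex.new
    @keys = Set.new
  end

  # Returns true only for the first caller that claims a given key.
  def claim(key)
    # Set#add? returns nil when the key is already present.
    @mutex.synchronize { !@keys.add?(key).nil? }
  end
end

# Hypothetical job whose perform is safe to call more than once per key.
class ReportJob
  def initialize(store, key)
    @store = store
    @key = key
  end

  def perform
    # A second worker that picked up the same job bails out here.
    return :skipped unless @store.claim(@key)

    # ... the real work would go here ...
    :done
  end
end
```

With this guard, ten workers calling perform for the same key would produce one `:done` and nine `:skipped`. It makes a duplicate pickup harmless but does not explain why the duplicate pickup happens.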

Thanks!

Kenny Lövrin asked Mar 25 '13 15:03



1 Answer

We've run about 60 million jobs through Delayed Job with 12 workers and never had a report of this. What's the SQL that your Delayed Job worker is running? Are you using a gem that changes the locking behavior of Postgres?

Here is what the DJ SQL looks like for me:

UPDATE "delayed_jobs" SET locked_at = '2014-05-02 21:16:35.419748', locked_by =
'host:whatever.local pid:4729' WHERE id IN (SELECT id FROM "delayed_jobs" 
WHERE ((run_at <= '2014-05-02 21:16:35.415923' 
AND (locked_at IS NULL OR locked_at < '2014-05-02 17:16:35.415947') 
OR locked_by = 'host:whatever.local pid:4729') AND failed_at IS NULL) 
ORDER BY priority ASC, run_at ASC LIMIT 1 FOR UPDATE) RETURNING *
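To make the WHERE clause of that query easier to read, here is a rough Ruby restatement of which rows it treats as claimable. The `Job` struct and `claimable?` are hypothetical helpers, and the four-hour `LOCK_TIMEOUT` is an assumption read off the gap between the two timestamps in the query:

```ruby
# Hypothetical in-memory row; field names mirror the delayed_jobs columns above.
Job = Struct.new(:run_at, :locked_at, :locked_by, :failed_at, keyword_init: true)

# Seconds; assumed from the 21:16 vs 17:16 timestamps in the query.
LOCK_TIMEOUT = 4 * 60 * 60

# A job is claimable if it is due and either unlocked or holding an expired lock,
# or if this worker already holds it; in every case it must not have failed.
def claimable?(job, worker, now)
  due_and_free = job.run_at <= now &&
                 (job.locked_at.nil? || job.locked_at < now - LOCK_TIMEOUT)
  (due_and_free || job.locked_by == worker) && job.failed_at.nil?
end
```

The predicate alone does not prevent two workers from choosing the same row. In the query, that is the job of the `FOR UPDATE` in the subselect together with the single `UPDATE` statement that stamps `locked_at` and `locked_by`.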

Do you have locking problems with any other code? Could you try running two rails console sessions and doing this:

Console Session 1:

User.find(1).with_lock do sleep(10); puts "worker 1 done" end

Console Session 2:

User.find(1).with_lock do sleep(1); puts "worker 2 done" end

Start session 1 first and session 2 right after it. If session 2 finishes before session 1, you've got a locking problem more general than Delayed Job.
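For reference, the outcome to expect when locking works can be sketched in a single Ruby process, with a Mutex playing the role of the row lock. This is only an analog of the console test, not a check of Postgres itself; `run_lock_check` and the shortened sleep times are assumptions made for illustration:

```ruby
# In-process analog of the two-console test; the Mutex stands in for the row lock.
def run_lock_check(hold_first: 0.2, hold_second: 0.02)
  lock = Mutex.new
  finished = []              # completion order, only appended while holding the lock
  first_has_lock = Queue.new # signals that "session 1" has acquired the lock

  first = Thread.new do
    lock.synchronize do
      first_has_lock << true
      sleep(hold_first)
      finished << 1
    end
  end

  # Start "session 2" only once "session 1" really holds the lock.
  first_has_lock.pop

  second = Thread.new do
    lock.synchronize do
      sleep(hold_second)
      finished << 2
    end
  end

  [first, second].each(&:join)
  finished
end
```

When the lock is respected, the second thread has to wait for the first one even though its own sleep is much shorter, so the first thread finishes first. The console test checks for the same ordering against the database.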

John Naegle answered Nov 06 '22 23:11
