
How to lock Resque jobs to one server

I have a "cluster" of Resque servers in my infrastructure. They all have the exact same job priorities, etc. I automagically scale the number of Resque servers up and down based on how many jobs are pending and the resources available on the servers to handle those jobs. I always have a minimum of two Resque servers up.

My issue is that when I run a quick, one-off job, sometimes both servers process that job. This is bad.

I've tried adding a lock to my job with something like the following:

require 'resque-lock-timeout'

class ExampleJob
  extend Resque::Plugins::LockTimeout

  def self.perform
    # some code
  end
end

This plugin works for longer-running jobs. However, for these super tiny one-off jobs, processing happens right away. Neither Resque server sees the lock set by its sister server: both set a lock, process the job, unlock, and finish.

I'm not entirely sure what to do at this point, or what solutions there are other than having one dedicated server handle this type of job. That would be a serious pain to configure and scale. I really want both servers to be able to handle the job, but once one of them grabs it from the queue, I need to ensure the other does not run it.

Can anyone suggest some viable solution(s)?

asked Oct 24 '12 by randombits


2 Answers

Write your lock interpreter to wait T milliseconds before it looks for a lock with a unique_id less than the value of the lock it made.

This will determine who won the race, and the loser will self-terminate.

T is the parallelism latency between all N servers in the pool for a given queue. You can determine it heuristically by scaling back from 1000 milliseconds until you again see the job running in duplicate, then add padding for latency variation.

This is the busy-wait approach to mutual exclusion. It is considered an acceptable trade-off among the various ways the mutex problem can be solved (e.g. locking).
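
Here is a minimal sketch of that race, assuming Resque's shared Redis connection is used for the claims; the key scheme, wait time, and 60-second cleanup are made-up values you would tune for your own setup:

require 'resque'
require 'securerandom'

class ExampleJob
  LOCK_WAIT_MS = 250 # scale back from 1000 ms until duplicates reappear, then pad

  def self.perform(*args)
    claim_key = "busywait:example_job:#{args.hash}"       # hypothetical key scheme
    my_claim  = "#{Time.now.to_f}-#{SecureRandom.hex(4)}"  # unique, time-ordered id

    Resque.redis.rpush(claim_key, my_claim) # register this worker's claim
    sleep(LOCK_WAIT_MS / 1000.0)            # wait out the parallelism latency T

    # The earliest claim wins the race; every other worker self-terminates.
    return unless Resque.redis.lindex(claim_key, 0) == my_claim

    # ... do the actual work ...
  ensure
    Resque.redis.expire(claim_key, 60)      # let stale claim lists clean themselves up
  end
end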

I'll post some links when I'm off mobile; the Wikipedia entry on mutual exclusion should explain all of this.

If this won't work for you, then:

1. Use a scheduler to control duplication.
2. Route short-running jobs to a queue designed to run them serially (see the sketch below).
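
A rough sketch of option 2, assuming the short jobs are enqueued onto a queue named one_off (the name is made up). Running exactly one of these workers across the whole cluster keeps those jobs strictly serial:

require 'resque'

# A single dedicated process for the serial queue; everything else stays on
# the normal queues and scales as before.
worker = Resque::Worker.new(:one_off)
worker.work(5) # poll the queue every 5 seconds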

TL;DR: there is no perfect solution, only a good trade-off for your conditions.

answered by New Alexandria


It should not be possible for two workers to get the same 'payload', because items are dequeued using BLPOP: Redis will only send a queued item to the first client that calls BLPOP. It sounds like you are enqueueing the job more than once, and therefore two workers are able to acquire different payloads with the same arguments.

The purpose of 'resque-lock-timeout' is to ensure that payloads with the same method and arguments do not run concurrently; it does not, however, stop the second payload from being worked if the first job releases the lock before the second job tries to acquire it.

It would make sense that this only happens with short-running jobs. Here is what might be happening:

payload 1 is enqueued
payload 2 is enqueued
payload 1 is locked 
payload 1 is worked
payload 1 is unlocked
payload 2 is locked
payload 2 is worked
payload 2 is unlocked

Whereas with long-running jobs the following scenario might happen:

payload 1 is enqueued
payload 2 is enqueued
payload 1 is locked
payload 1 is worked 
payload 2 fails to get lock
payload 1 is unlocked

Try turning off Resque and enqueueing your job. Take a look in Redis at the list for your Resque queue (or monitor Redis using redis-cli monitor) and see if Resque has queued more than one payload. If you still only see one payload, then monitor the list to see if another one of your Resque workers is calling recreate on failed jobs.
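
For example, a quick way to peek at the raw queue from a console (the queue name 'example' is an assumption; substitute your own):

require 'resque'

# Resque stores each queue as a Redis list under "queue:<name>".
Resque.redis.lrange('queue:example', 0, -1).each do |payload|
  puts payload # each entry is a JSON blob like {"class":"ExampleJob","args":[...]}
end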

If you want 'resque-lock-timeout' to hold the lock for longer than the time it takes to process the job, you can override the release_lock! method to set an expiry on the lock instead of just deleting it.

module Resque
  module Plugins
    module LockTimeout  
      def release_lock!(*args)
        lock_redis.expire(redis_lock_key(*args), 60) # expire lock after 60 seconds
      end
    end
  end
end

https://github.com/lantins/resque-lock-timeout/blob/master/lib/resque/plugins/lock_timeout.rb#l153-155

answered by lastcanal