Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Atomically mark and return a group of rows in database

I'm writing a background service that needs to process a series of jobs, stored as records in a sqlserver table. The service needs to find the oldest 20 jobs that need to be worked (where status = 'new'), mark them (set status = 'processing'), run them, and update the jobs afterward.

It's the first part I need help with. There could be multiple threads accessing the database at the same time, and I want to make sure that the "mark & return" query runs atomically, or nearly so.

This service will be spending comparatively little time accessing the database, and it's not the end of the world if a job gets run twice, so I might be able to accept a small probability of jobs running more than once for increased simplicity in the code.

What is the best way to do this? I'm using linq-to-sql for my data layer, but I assume I'll have to drop down into t-sql for this.

like image 873
Gabe Moothart Avatar asked Nov 30 '09 17:11

Gabe Moothart


3 Answers

Your table of jobs is a queue. Writing user tables backed up queues is a notoriously error prone as it leads to deadlocks and concurency issues.

The simplest thing would be to drop the user table and use a true queue instead. This will give you deadlock free concurency free queue on system tested and validated code base. The problem is that the whole paradigm around queues changes from INSERT and DELETE/UPDATE to SEND/RECEIVE. On the other hand with built-in queue you get some very powerfull free goodies, namely Activation and correlated items locking.

If you want to continue down the path of user table backed queues then the second most important trick in writing user tables queues is to use UPDATE ... OUTPUT:

WITH cte AS (
  SELECT TOP(20) status, id, ...
  FROM table WITH (ROWLOCK, READPAST, UPDLOCK)
  WHERE status = 'new'
  ORDER BY enqueue_time)
UPDATE cte
  SET status = 'processing'
OUTPUT
  INSERTED.id, ...

The CTE syntax is just for convenience of placing the TOP and ORDER BY properly, the query can be written using derived tables just as esily. You cannot use straight UPDATE ... TOP because UPDATE does not support an ORDER BY and you require this to satisfy the 'oldest' part of your requirement. The lock hints are needed to facilitate high concurency between parallel processing threads.

I said this is the second most important trick. The most important is how you organize the table. For a queue it must be clustered by (status, enqueue_time). If you don't organize the table properly you'll end up with deadlocks. Pre-emptive comment: fragmentation is irelevant in this scenario.

like image 86
Remus Rusanu Avatar answered Oct 21 '22 10:10

Remus Rusanu


Please see my answer here: SQL Server Process Queue Race Condition which also manages 20 rows in one go.

Basically, it's quite simple in SQL Server to manage concurrency and polling using the hints ROWLOCK, READPAST and UPDLOCK.

I can't comment about Linq, but a transaction still leaves you open to concurrency issues: you need to use the hints I mentioned

like image 21
gbn Avatar answered Oct 21 '22 10:10

gbn


Building on gbn's answer...

If you're using SQL Server 2005 or newer, you can return the updated rows atomically by using an OUTPUT clause in your UPDATE statement:

UPDATE TOP (20) your_table
SET status = 'processing'
OUTPUT INSERTED.*
FROM your_table WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE status = 'new'
like image 4
LukeH Avatar answered Oct 21 '22 10:10

LukeH