Can multiple threads cause duplicate updates on constrained set?

In Postgres, if I run the following statement

update table set col = 1 where col = 2

In the default READ COMMITTED isolation level, from multiple concurrent sessions, am I guaranteed that:

  1. In the case of a single match, only one thread will get a ROWCOUNT of 1 (meaning only one thread writes)
  2. In the case of a multi-row match, only one thread will get a ROWCOUNT > 0 (meaning only one thread writes the batch)
asked Aug 11 '12 by Sam Saffron

1 Answer

Your stated guarantees apply in this simple case, but not necessarily in slightly more complex queries. See the end of the answer for examples.

The simple case

Assuming that col is unique, has exactly one row with the value 2, or has stable ordering so that every UPDATE matches the same rows in the same order:
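For concreteness, the sketches below assume a minimal setup like this, using the test table from the demonstrations further down; the id primary key is an assumption added here to give a deterministic ordering:

CREATE TABLE test (id serial PRIMARY KEY, col integer);
INSERT INTO test (col) VALUES (2);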

What'll happen for this query is that the threads will find the row with col=2 and all try to grab a write lock on that tuple. Exactly one of them will succeed. The others will block waiting for the first thread's transaction to commit.

That first tx will write, commit, and return a rowcount of 1. The commit will release the lock.

The other tx's will again try to grab the lock. One by one they'll succeed. Each transaction will in turn go through the following process:

  • Obtain the write lock on the contested tuple.
  • Re-check the WHERE col=2 condition after getting the lock.
  • The re-check will show that the condition no longer matches so the UPDATE will skip that row.
  • The UPDATE has no other rows so it will report zero rows updated.
  • Commit, releasing the lock for the next tx trying to get hold of it.
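As a sketch of that sequence with two concurrent sessions against the test table above:

-- Session 1:
BEGIN;
UPDATE test SET col = 1 WHERE col = 2;  -- reports UPDATE 1 and holds the row lock

-- Session 2, concurrently:
BEGIN;
UPDATE test SET col = 1 WHERE col = 2;  -- blocks waiting on session 1's lock

-- Session 1:
COMMIT;  -- releases the lock

-- Session 2 wakes up, re-checks col = 2 against the committed row version,
-- finds no match, and reports UPDATE 0:
COMMIT;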

In this simple case the row-level locking and the condition re-check effectively serialize the updates. In more complex cases, not so much.

You can easily demonstrate this. Open, say, four psql sessions. In the first, lock the table with BEGIN; LOCK TABLE test;*. In the others, run identical UPDATEs - they'll block on the table-level lock. Now release the lock by COMMITting your first session and watch them race. Only one will report a row count of 1; the others will report 0. This is easily automated and scripted for repetition, and scales up to more connections/threads.
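Spelled out, the demonstration looks like this (the session labels are just for illustration):

-- Session 1: take a table lock so the concurrent UPDATEs all start together
BEGIN;
LOCK TABLE test;

-- Sessions 2, 3 and 4: each runs the same statement and blocks on the table lock
UPDATE test SET col = 1 WHERE col = 2;

-- Session 1: release the lock and watch the others race
COMMIT;

-- Exactly one of sessions 2-4 reports UPDATE 1; the others report UPDATE 0.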

To learn more, read "Rules for Concurrent Writing" on page 11 of the PostgreSQL Concurrency Issues presentation - and then read the rest of that presentation.

And if col is non-unique?

As Kevin noted in the comments, if col isn't unique, so you might match multiple rows, then different executions of the UPDATE could get different orderings. This can happen if they choose different plans (say one is via a PREPARE and EXECUTE and another is direct, or you're messing with the enable_ GUCs) or if the plan they all use produces an unstable sort of equal values. If they get the rows in a different order then tx1 will lock one tuple and tx2 will lock another, then each will try to get a lock on the tuple the other already holds. PostgreSQL will abort one of them with a deadlock exception. This is yet another good reason why all your database code should always be prepared to retry transactions.
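Here's a contrived sketch of that deadlock, forcing the opposite lock orders explicitly with FOR UPDATE; in real workloads the differing orders come from differing plans, but the effect is the same:

-- Ensure at least two rows match:
INSERT INTO test (col) VALUES (2), (2);

-- Session 1:
BEGIN;
SELECT 1 FROM test WHERE col = 2 ORDER BY id ASC  LIMIT 1 FOR UPDATE;  -- locks the first row

-- Session 2:
BEGIN;
SELECT 1 FROM test WHERE col = 2 ORDER BY id DESC LIMIT 1 FOR UPDATE;  -- locks the last row

-- Session 1: now wants the last row, so it blocks...
SELECT 1 FROM test WHERE col = 2 ORDER BY id DESC LIMIT 1 FOR UPDATE;

-- Session 2: wants the first row, completing the cycle:
SELECT 1 FROM test WHERE col = 2 ORDER BY id ASC  LIMIT 1 FOR UPDATE;
-- ERROR:  deadlock detected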

If you're careful to make sure concurrent UPDATEs always get the same rows in the same order you can still rely on the behaviour described in the first part of the answer.

Frustratingly, PostgreSQL doesn't offer UPDATE ... ORDER BY so ensuring that your updates always select the same rows in the same order isn't as simple as you might wish. A SELECT ... FOR UPDATE ... ORDER BY followed by a separate UPDATE is often safest.
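A sketch of that safer pattern, using the id key from the setup above to fix the lock order:

BEGIN;
-- Take the row locks in a deterministic order first:
SELECT 1 FROM test WHERE col = 2 ORDER BY id FOR UPDATE;
-- Every session doing this acquires (or waits on) the locks in id order,
-- so the following UPDATE can no longer deadlock against its twins:
UPDATE test SET col = 1 WHERE col = 2;
COMMIT;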

More complex queries, queuing systems

If you're doing queries with multiple phases, involving multiple tuples, or conditions other than equality you can get surprising results that differ from the results of a serial execution. In particular, concurrent runs of anything like:

UPDATE test SET col = 1 WHERE col = (SELECT t.col FROM test t ORDER BY t.col LIMIT 1);

or other efforts to build a simple "queue" system will *fail* to work how you expect. See the PostgreSQL docs on concurrency and this presentation for more info.

If you want a work queue backed by a database there are well-tested solutions that handle all the surprisingly complicated corner cases. One of the most popular is PgQ. There's a useful PgCon paper on the topic, and a Google search for 'postgresql queue' is full of useful results.
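For what it's worth, PostgreSQL 9.5 and later also offer FOR UPDATE ... SKIP LOCKED, which makes a simple hand-rolled queue much more tractable. A minimal sketch; the jobs table and its columns are assumptions for illustration:

-- One-time setup (illustrative schema):
CREATE TABLE jobs (id serial PRIMARY KEY, payload text, done boolean NOT NULL DEFAULT false);

-- Each worker claims one job; concurrent workers skip rows others have locked:
BEGIN;
SELECT id, payload
FROM jobs
WHERE NOT done
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- ...process the claimed job, then mark it and commit:
UPDATE jobs SET done = true WHERE id = $1;  -- bind the id returned above
COMMIT;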


*BTW, instead of a LOCK TABLE you can use SELECT 1 FROM test WHERE col = 2 FOR UPDATE; to obtain a write lock on just that one tuple. That'll block updates against it, but won't block writes to other tuples or block any reads. That allows you to simulate different kinds of concurrency issues.

answered by Craig Ringer