Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgres, update and lock ordering

I'm working on Postgres 9.2.

There are 2 UPDATEs, each in their own transactions. One looks like:

UPDATE foo SET a=1 WHERE b IN (1,2,3,4);

The other is similar:

UPDATE foo SET a=2 WHERE b IN (1,2,3,4);

These could possibly run at the same time and in reality have 500+ in the 'IN' expression. I'm sometimes seeing deadlocks. Is is true that that order of items in the 'IN' expression may not actually influence the true lock ordering?

like image 331
seand Avatar asked Dec 02 '22 18:12

seand


2 Answers

There is no ORDER BY in the UPDATE command.
But there is for SELECT. Use row-level locking with the FOR UPDATE clause in a subquery:

UPDATE foo f
SET    a = 1
FROM (
   SELECT b FROM foo
   WHERE  b IN (1,2,3,4)
   ORDER BY b
   FOR   UPDATE
   ) upd
WHERE f.b = upd.b;

Of course, b has to be UNIQUE or you need to add more expressions to the ORDER BY clause to make it unambiguous.

And you need to enforce the same order for all UPDATE, DELETE and SELECT .. FOR UPDATE statements on the table.

Related, with more details:

  • Postgres UPDATE … LIMIT 1
  • Avoiding PostgreSQL deadlocks when performing bulk update and delete operations
  • Optimizing concurrent updates in Postgres
like image 84
Erwin Brandstetter Avatar answered Dec 20 '22 22:12

Erwin Brandstetter


Yes. I think the main issue here is that IN checks for membership in the set specified, but does not confer any sort of ordering on the UPDATE, which in turn, means that no concrete ordering is conferred upon the lock ordering.

The WHERE clause in an UPDATE statement essentially behaves the same way it does in a SELECT. For example, I will often simulate an UPDATE using a SELECT to check what will be updated to see that it's what I expected.

With that in mind, the following example using SELECT demonstrates that IN does not in itself confer ordering:

Given this schema/data:

create table foo
(
  id serial,
  val text
);

insert into foo (val)
values ('one'), ('two'), ('three'), ('four');

The following queries:

select *
from foo
where id in (1,2,3,4);


select *
from foo
where id in (4,3,2,1);

yield the exact same results -- the rows in order from id 1-4.

Even that isn't guaranteed, since I did not use an ORDER BY in the select. Rather, without it, Postgres uses whatever order the server decides is fastest (see point 8 about ORDER BY in the Postgres SELECT doc). Given a fairly static table, it's often the same order in which it was inserted (as was the case here). However, there's nothing guaranteeing that, and if there's a lot of churn on the table (lots of dead tuples, rows removed, etc.), it's less likely to be the case.

I suspect that's what's happening here with your UPDATE. Sometimes -- if not even most of the time -- it may end up in numerical order if that's the same way the rows were inserted, but there's nothing to guarantee that, and the cases where you see the deadlocks are likely scenarios where the data has changed such that one update is ordered different than the other.

sqlfiddle with the above code.

Possible fixes/workarounds:

In terms of what you could do about it, there are various options, depending on your requirements. You could explicitly take out a table lock on the table, although that would of course have the effect of serializing the updates there, which may prove to be too large a bottleneck.

Another option, which would still allow for concurrency -- is to explicitly iterate over the items using dynamic SQL in, say, Python. That way, you'd have a set of one-row updates that occurred always in the same order, and since you could ensure that consistent order, the normal Postgres locking should be able to handle the concurrency without deadlocking.

That won't perform as well as batch-updating in pure SQL, but it should solve the lock issue. One suggestion to bump up performance is to only COMMIT every so often, and not after every single row -- that saves a lot of overhead.

Another option would be to do the loop in a Postgres function written in PL/pgSQL. That function could then be called externally, in, say, Python, but the looping would be done (also explicitly) server-side, which may save on some overhead, since the looping and UPDATEs are done entirely server-side without having to go over the wire each loop iteration.

like image 44
khampson Avatar answered Dec 21 '22 00:12

khampson