I am hitting a three-way deadlock in Postgres, and I don't really understand what is causing it. The log message is:
Process 5671 waits for ExclusiveLock on tuple (33,12) of relation 31709 of database 16393; blocked by process 5652.
Process 5652 waits for ShareLock on transaction 3382643; blocked by process 5670.
Process 5670 waits for ExclusiveLock on tuple (33,12) of relation 31709 of database 16393; blocked by process 5671.
Process 5671: UPDATE "user" SET
seqno = seqno + 1,
push_pending = true
WHERE user_id = $1
RETURNING seqno
Process 5652: UPDATE "user" SET
seqno = seqno + 1,
push_pending = true
WHERE user_id = $1
RETURNING seqno
Process 5670: UPDATE "user" SET
seqno = seqno + 1,
push_pending = true
WHERE user_id = $1
RETURNING seqno
(Relation 31709 is the "user" table. The user_id is the same in all three transactions.)
This doesn't seem like your run-of-the-mill deadlock you see demonstrated in the documentation. I am not updating multiple rows of this table out of order. I suspect the RETURNING clause has something to do with it, but I can't see exactly why. Any ideas on how I can solve this, or debug it further?
Update for Erwin's questions in the comments: This is Postgres 9.3. There are other commands in this transaction, although I don't believe they are touching the "user" table. There is one trigger on the table for updating an updated_at column with current_timestamp():
CREATE OR REPLACE FUNCTION update_timestamp() RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
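The function is attached to the table with a plain row-level BEFORE trigger, roughly like this (the trigger name below is just illustrative, not the real one):

CREATE TRIGGER user_update_timestamp    -- illustrative name
BEFORE UPDATE ON "user"
FOR EACH ROW
EXECUTE PROCEDURE update_timestamp();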
I'll look into getting details for the transaction when I can reproduce this; I am analyzing the logs after the fact.
Update #2: I rebuilt Postgres with LOCK_DEBUG and ran with trace_locks=on in an attempt to wrap my head around the order in which locks are taken.
The updated deadlock message is:
Process 54131 waits for ShareLock on transaction 4774; blocked by process 54157.
Process 54157 waits for ExclusiveLock on tuple (1,16) of relation 18150 of database 18136; blocked by process 54131.
I can see the blocking on the ExclusiveLock pretty clearly:
2014-08-05 10:35:15 EDT apiary [54131] 2/316 4773 LOG: 00000: LockAcquire: new: lock(0x10f039f88) id(18136,18150,1,16,3,1) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(ExclusiveLock)
2014-08-05 10:35:15 EDT apiary [54131] 2/316 4773 LOG: 00000: GrantLock: lock(0x10f039f88) id(18136,18150,1,16,3,1) grantMask(80) req(0,0,0,0,0,0,1)=1 grant(0,0,0,0,0,0,1)=1 wait(0) type(ExclusiveLock)
2014-08-05 10:35:15 EDT apiary [54157] 3/188 4774 LOG: 00000: LockAcquire: found: lock(0x10f039f88) id(18136,18150,1,16,3,1) grantMask(80) req(0,0,0,0,0,0,1)=1 grant(0,0,0,0,0,0,1)=1 wait(0) type(ExclusiveLock)
2014-08-05 10:35:15 EDT apiary [54157] 3/188 4774 LOG: 00000: WaitOnLock: sleeping on lock: lock(0x10f039f88) id(18136,18150,1,16,3,1) grantMask(80) req(0,0,0,0,0,0,2)=2 grant(0,0,0,0,0,0,1)=1 wait(0) type(ExclusiveLock)
(Format is date program [pid] virtualtxid txid msg)
But I don't see where the ShareLock is created, or why transaction 4773 would be blocking on transaction 4774. I see similar results when querying the pg_locks table: always a transaction waiting on a ShareLock of another transaction which is blocking on a tuple from the first transaction. Any suggestions for how to dig into the source of the ShareLock?
Update #3: I needed to update the LOCK_DEBUG_ENABLED inline function to unconditionally return true to see the ShareLock creations. Once I did that, I started seeing their creation:
2014-08-05 12:53:22 EDT apiary [76705] 2/471 6294 LOG: 00000: LockAcquire: lock [6294,0] ExclusiveLock
2014-08-05 12:53:22 EDT apiary [76705] 2/471 6294 LOG: 00000: LockAcquire: new: lock(0x115818378) id(6294,0,0,0,4,1) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(ExclusiveLock)
2014-08-05 12:53:22 EDT apiary [76705] 2/471 6294 LOG: 00000: GrantLock: lock(0x115818378) id(6294,0,0,0,4,1) grantMask(80) req(0,0,0,0,0,0,1)=1 grant(0,0,0,0,0,0,1)=1 wait(0) type(ExclusiveLock)
2014-08-05 12:53:22 EDT apiary [76706] 4/153 6295 LOG: 00000: LockAcquire: lock [6294,0] ShareLock
2014-08-05 12:53:22 EDT apiary [76706] 4/153 6295 LOG: 00000: LockAcquire: found: lock(0x115818378) id(6294,0,0,0,4,1) grantMask(80) req(0,0,0,0,0,0,1)=1 grant(0,0,0,0,0,0,1)=1 wait(0) type(ShareLock)
2014-08-05 12:53:22 EDT apiary [76706] 4/153 6295 LOG: 00000: WaitOnLock: sleeping on lock: lock(0x115818378) id(6294,0,0,0,4,1) grantMask(80) req(0,0,0,0,1,0,1)=2 grant(0,0,0,0,0,0,1)=1 wait(0) type(ShareLock)
But I'm still not really sure why that ShareLock is being created, and why 6295 (in this case) must wait for 6294.
The first thing to do is to look at the Postgres logs. The Postgres documentation recommends avoiding these kinds of deadlocks by ensuring that transactions acquire locks in a consistent order. As before, Postgres will lock rows in the updated table, and those locks will be held until the transaction commits or rolls back.
The reason is that transactions have to wait for one another. If two transactions are in conflict, PostgreSQL will not resolve the problem immediately; rather, it will wait for deadlock_timeout and then trigger the deadlock detection algorithm to resolve the problem.
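Two settings control how visible these waits are in the logs; a minimal sketch (the values here are only examples):

-- How long a backend waits on a lock before running deadlock detection (default 1s)
SHOW deadlock_timeout;

-- In postgresql.conf: log every lock wait that lasts longer than deadlock_timeout
-- log_lock_waits = on

-- Per-session example of raising the timeout while experimenting (superuser only)
SET deadlock_timeout = '2s';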
This might be how the deadlock is happening. Each table has a foreign key on user_id to the user table. When you insert into a table that has a foreign key constraint, Postgres needs to lock the row of the referenced table to ensure that it's not deleted while you insert the row that refers to it (which would violate the FK constraint on commit). As of 9.3, this is a shared lock on the referenced row.
It looks like all of the inserts/updates on tables that refer to user also update the user seq on the user table after the insert on the primary table. Those updates require an exclusive lock and get blocked by any shared locks that are not part of the current transaction. If two such transactions run concurrently, they deadlock.
For example, two transactions inserting into media_size and source concurrently might deadlock with something like this:
T1                                       T2
-----------------------------------------------------------------------
1. Insert media size
   1a. Excl Lock media size row
   1b. Shared Lock on user row (FK)
                                         2. Insert Source
                                            2a. Excl Lock source row
                                            2b. Shared lock on user row (FK)
3. Update user seq
   3a. Excl Lock on user row
       (BLOCKS on 2b)
                                         4. Update user seq
                                            4a. Excl Lock on user row
                                                (Blocks on 1b)
5. Deadlock
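In SQL terms the same interleaving might look like this across two psql sessions; the table and column names here are hypothetical, chosen only to mirror the scenario above:

-- Assumed schema: media_size(user_id REFERENCES "user", ...) and source(user_id REFERENCES "user", ...)

-- Session 1:
BEGIN;
INSERT INTO media_size (user_id, width, height) VALUES (42, 100, 100);
    -- FK check takes a shared lock on the user 42 row

-- Session 2:
BEGIN;
INSERT INTO source (user_id, url) VALUES (42, 'http://example.com');
    -- FK check takes its own shared lock on the same user row

-- Session 1:
UPDATE "user" SET seqno = seqno + 1, push_pending = true WHERE user_id = 42;
    -- needs an exclusive row lock, waits behind session 2's shared lock

-- Session 2:
UPDATE "user" SET seqno = seqno + 1, push_pending = true WHERE user_id = 42;
    -- waits behind session 1's shared lock: deadlock detected, one session is aborted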
I think switching the update-user-seq step to be first makes sense, as it would force T1 and T2 to block before trying to acquire the shared lock (which they wouldn't need anyway, since they would already hold an exclusive lock on the user row).
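A sketch of that reordering, reusing the hypothetical tables from the example above:

BEGIN;
-- Take the exclusive lock on the user row up front
UPDATE "user" SET seqno = seqno + 1, push_pending = true
WHERE user_id = 42
RETURNING seqno;

-- The FK check behind the insert now only needs a lock this transaction already covers
INSERT INTO media_size (user_id, width, height) VALUES (42, 100, 100);
COMMIT;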
Assuming the default transaction isolation level Read Committed.
The UPDATE statement always locks updated rows. Concurrent transactions trying to update the same row cannot proceed until the first transaction is rolled back or committed.
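For instance, in two psql sessions against the same row (a sketch, not taken from your setup):

-- Session A:
BEGIN;
UPDATE "user" SET seqno = seqno + 1 WHERE user_id = 1;
    -- row is now locked until COMMIT or ROLLBACK

-- Session B:
UPDATE "user" SET seqno = seqno + 1 WHERE user_id = 1;
    -- hangs here until session A finishes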
The RETURNING clause is orthogonal to the problem.
The deadlock would have happened without Process 5671 just as well. That's just another transaction queuing for the same row that came in between. Process 5670 and Process 5652 are deadlocking, really, most probably due to other commands in the same transaction. A less likely candidate would be triggers on the table.
Try to break transactions into smaller pieces that update tables, and rows within the same tables, in the same straight order. Then they cannot interlock.
Since you mention foreign keys in a later comment: those can play their part in a deadlock, too. Postgres 9.3 introduced new locking levels to address that: FOR KEY SHARE and FOR NO KEY UPDATE. Details in this blog post by Michael Paquier. These should render FK constraints less of a problem. Still not ruled out, though.
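A rough sketch of why the weaker levels help: the FK check behind an insert behaves roughly like the explicit FOR KEY SHARE below, and a non-key update no longer conflicts with it:

-- Session A: roughly what an FK check on user 42 takes in 9.3
BEGIN;
SELECT 1 FROM "user" WHERE user_id = 42 FOR KEY SHARE;

-- Session B: an update that leaves the key columns alone takes FOR NO KEY UPDATE,
-- which does not conflict with FOR KEY SHARE, so it proceeds without waiting
UPDATE "user" SET seqno = seqno + 1, push_pending = true WHERE user_id = 42;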
There have been a number of small fixes to the locking mechanisms since version 9.3.0. Upgrade to the latest point release; it might help.
In reply to your comment:
You can find (most) currently held locks in the system catalog view pg_locks. Be sure to read the manual and possibly more before drawing conclusions.
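For example, something along these lines (a sketch using the 9.3 column names) lists holders and waiters side by side:

SELECT l.pid,
       l.locktype,
       l.mode,
       l.granted,
       l.relation::regclass AS relation,
       l.transactionid,
       a.query
FROM   pg_locks l
LEFT   JOIN pg_stat_activity a USING (pid)
ORDER  BY l.granted, l.pid;

-- Rows with granted = false are the waiters; match them against the granted rows
-- for the same relation or transactionid to see who is blocking whom.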