 

Postgres concurrency and serializability. Do I need a SERIALIZABLE isolation level?

I have an Items and Jobs table:

Items

  • id = PK
  • job_id = Jobs FK
  • status = IN_PROGRESS | COMPLETE

Jobs

  • id = PK

Items start out as IN_PROGRESS; work is performed on them and then handed off to a worker to record. I have an updater process that updates each Item with its new status as results come in. The approach I have taken so far has been (in pseudocode):

def work(item: Item) = {
  insideTransaction {
    updateItemWithNewStatus(item)
    job, items = getParentJobAndAllItems(item)
    newJobStatus = computeParentJobStatus(job, items)
    // do some stuff depending on newJobStatus
  }
}

Does that make sense? I want this to work in a concurrent environment. The issue I have right now is that COMPLETE is arrived at multiple times for a job, when I only want to run the on-COMPLETE logic once.

If I change my transaction isolation level to SERIALIZABLE, I do get the "ERROR: could not serialize access due to read/write dependencies among transactions" error as described.

So my questions are:

  • Do I need SERIALIZABLE?
  • Can I get away with SELECT FOR UPDATE, and where?
  • Can someone explain to me what is happening, and why?

Edit: I have reopened this question because I was not satisfied with the previous answer's explanation. Is anyone able to explain this for me? Specifically, I want some example queries for that pseudocode.

asked Feb 15 '17 by Dominic Bou-Samra




1 Answer

If you want the jobs to be able to run concurrently, neither SERIALIZABLE nor SELECT FOR UPDATE will work directly.

If you lock the row using SELECT FOR UPDATE, then another process that issues the same SELECT FOR UPDATE will simply block until the first process commits its transaction.
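To make that concrete, here is a minimal sketch (mine, not from the answer), assuming the question's schema and psql-style :item_id / :job_id placeholders. One reasonable place to take the lock is the parent jobs row, which serializes workers per job, so the "all items COMPLETE" check is reached by only one transaction at a time:

BEGIN;

-- Lock the parent job row; a second worker running the same
-- statement for the same job blocks here until our COMMIT.
SELECT id FROM jobs WHERE id = :job_id FOR UPDATE;

UPDATE items SET status = 'COMPLETE' WHERE id = :item_id;

-- Safe to aggregate: no other worker for this job can be
-- between its own lock and its COMMIT right now.
SELECT count(*) AS remaining
FROM items
WHERE job_id = :job_id AND status <> 'COMPLETE';

-- if remaining = 0, run the on-COMPLETE logic exactly once
COMMIT;

Because the lock is on the jobs row rather than the items rows, two workers finishing different items of the same job can never both observe remaining = 0, which is exactly the double-COMPLETE the question describes.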

If you use SERIALIZABLE, both processes can run concurrently (even processing the same row), but at least one should be expected to fail by the time it COMMITs, since the database will detect the conflict. SERIALIZABLE can also fail if it conflicts with any other query running in the database at the same time that touches related rows. The real reason to use SERIALIZABLE is precisely when you are trying to protect against concurrent updates made by other jobs, as opposed to blocking the same job from executing twice.
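For comparison, a sketch of the SERIALIZABLE variant (again using the question's schema and placeholder parameters): the statements are the same as above minus the explicit lock, but the isolation level is raised, and the caller must catch serialization failures (SQLSTATE 40001) and re-run the whole transaction:

BEGIN ISOLATION LEVEL SERIALIZABLE;

UPDATE items SET status = 'COMPLETE' WHERE id = :item_id;

-- Count what is still unfinished for the parent job.
SELECT count(*) AS remaining
FROM items
WHERE job_id = :job_id AND status <> 'COMPLETE';

-- May raise "could not serialize access ..." (SQLSTATE 40001);
-- on that error, retry the whole transaction from BEGIN.
COMMIT;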

Note there are tricks to make SELECT FOR UPDATE skip locked rows. If you do that then you can have actual concurrency. See Select unlocked row in Postgresql.
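In PostgreSQL 9.5 and later no tricks are needed: SKIP LOCKED is built in. A sketch of a worker grabbing the next available item without blocking on rows other workers hold (column and status names taken from the question):

BEGIN;

-- Pick one unfinished item nobody else has locked; returns
-- zero rows, instead of blocking, if all candidates are taken.
SELECT id, job_id
FROM items
WHERE status = 'IN_PROGRESS'
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- ...do the work, mark the item COMPLETE...
COMMIT;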

Another approach I see more often is to give your "status" column a third, temporary state that is used while a job is being processed. Typically one would have states like 'PENDING', 'IN_PROGRESS', 'COMPLETE'. When your process searches for work to do, it finds a 'PENDING' job, immediately moves it to 'IN_PROGRESS' and commits the transaction, then carries on with the work and finally moves it to 'COMPLETE' (the claim step is sketched below). The disadvantage is that if the process dies while processing a job, it will be left in 'IN_PROGRESS' indefinitely.
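A sketch of that claim step, assuming the 'PENDING' state has been added to the question's schema as the answer suggests. A single compare-and-set UPDATE makes the claim atomic, so two workers can never take the same item:

-- Compare-and-set claim: if another worker got there first,
-- status is no longer PENDING, the UPDATE matches zero rows,
-- and this worker simply moves on to another item.
UPDATE items
SET status = 'IN_PROGRESS'
WHERE id = :item_id
  AND status = 'PENDING'
RETURNING id, job_id;

Doing the claim in one statement (rather than a SELECT followed by an UPDATE) is what makes it safe under the default READ COMMITTED level: a concurrent UPDATE on the same row re-checks the status predicate after the first one commits and matches nothing.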

answered Sep 20 '22 by zeroimpl