I have an Items table and a Jobs table; each Item belongs to a parent Job.
Items start out as IN_PROGRESS, work is performed on them, and they are handed off to a worker to update. I have an updater process that updates Items with a new status as they come in. The approach I have been using so far has been (in pseudocode):
def work(item: Item) = {
  insideTransaction {
    updateItemWithNewStatus(item)
    job, items = getParentJobAndAllItems(item)
    newJobStatus = computeParentJobStatus(job, items)
    // do some stuff depending on newJobStatus
  }
}
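To make the pseudocode concrete, here is a sketch of what its queries might look like, using Python's sqlite3 for illustration. The table and column names (`items.id`, `items.status`, `items.job_id`, `jobs.id`, `jobs.status`) and the all-items-complete rule are assumptions, not the asker's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs  (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, job_id INTEGER, status TEXT);
    INSERT INTO jobs  VALUES (1, 'IN_PROGRESS');
    INSERT INTO items VALUES (10, 1, 'IN_PROGRESS'), (11, 1, 'COMPLETE');
""")

def work(item_id, job_id, new_status):
    with conn:  # one transaction, committed on success
        # updateItemWithNewStatus(item)
        conn.execute("UPDATE items SET status = ? WHERE id = ?",
                     (new_status, item_id))
        # getParentJobAndAllItems(item)
        job_status = conn.execute("SELECT status FROM jobs WHERE id = ?",
                                  (job_id,)).fetchone()[0]
        statuses = [r[0] for r in conn.execute(
            "SELECT status FROM items WHERE job_id = ?", (job_id,))]
        # computeParentJobStatus(job, items) -- assumed rule:
        # the job is COMPLETE when every item is COMPLETE
        new_job_status = ("COMPLETE" if all(s == "COMPLETE" for s in statuses)
                          else "IN_PROGRESS")
        conn.execute("UPDATE jobs SET status = ? WHERE id = ?",
                     (new_job_status, job_id))
        return new_job_status

result = work(10, 1, "COMPLETE")
```

Note that under READ COMMITTED, two of these transactions can both read the items, both see "everything is COMPLETE", and both set the job to COMPLETE, which is exactly the duplicated-COMPLETE problem described in the question.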
Does that make sense? I want this to work in a concurrent environment. The issue I have right now is that COMPLETE is arrived at multiple times for a job, when I only want to run the on-COMPLETE logic once.
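One way (a sketch, not the only fix) to run the on-COMPLETE logic exactly once is to make the job's transition to COMPLETE a guarded, atomic UPDATE, and only run the logic in the transaction whose UPDATE actually changed a row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO jobs VALUES (1, 'IN_PROGRESS')")
conn.commit()

def try_complete(job_id):
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET status = 'COMPLETE' "
            "WHERE id = ? AND status <> 'COMPLETE'", (job_id,))
        # rowcount is 1 only for the single caller that won the transition;
        # every later caller finds no row matching the guard
        return cur.rowcount == 1

first = try_complete(1)   # performs the transition, returns True
second = try_complete(1)  # guard no longer matches, returns False
```

The same guarded-UPDATE shape works in PostgreSQL, where the row lock taken by the UPDATE serializes the competing transactions on that job row.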
If I change my transaction isolation level to SERIALIZABLE, I do get the "ERROR: could not serialize access due to read/write dependencies among transactions" error as described.
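If you stay on SERIALIZABLE, that error is the expected way the database reports the conflict, and the standard pattern is to retry the aborted transaction. A minimal sketch of such a retry loop (the `SerializationFailure` and `run_serializable` names here are hypothetical stand-ins; a real driver such as psycopg2 raises its own serialization-failure exception for SQLSTATE 40001):

```python
class SerializationFailure(Exception):
    """Stand-in for the driver's serialization-failure error (SQLSTATE 40001)."""

def run_serializable(txn_fn, max_retries=5):
    """Run txn_fn, retrying when the database aborts it with a
    serialization failure; each attempt would be a fresh transaction."""
    for _ in range(max_retries):
        try:
            return txn_fn()
        except SerializationFailure:
            continue  # the DB rolled the transaction back; safe to re-run
    raise RuntimeError("transaction kept conflicting; giving up")

# Demo with a fake transaction that conflicts twice, then succeeds.
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure()
    return "COMPLETE"

result = run_serializable(flaky_txn)
```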
So my questions are:
Edit: I have reopened this question because I was not satisfied with the previous answer's explanation. Can anyone explain this for me? Specifically, I would like some example queries for that pseudocode.
The Serializable isolation level provides the strictest transaction isolation. This level emulates serial transaction execution for all committed transactions; as if transactions had been executed one after another, serially, rather than concurrently.
Under READ COMMITTED (called Cursor Stability, CS, in some databases such as DB2), access to the uncommitted data of other transactions is not possible. However, nonrepeatable reads and phantom reads are possible. READ COMMITTED is the default isolation level. It is suitable when you want maximum concurrency and need to see only committed data.
It prevents the reader from seeing any uncommitted (think: intermediate, "dirty") data. Still, it does not exclude phantoms or non-repeatable reads. REPEATABLE READ guarantees all of the above and additionally guarantees that once data are read, they won't change.
SERIALIZABLE is the strongest isolation level, because it prevents all four of the concurrency problems already discussed.
If you want the jobs to be able to run concurrently, neither SERIALIZABLE nor SELECT FOR UPDATE will work directly.
If you lock the row using SELECT FOR UPDATE, then another process will simply block when it executes its own SELECT FOR UPDATE, until the first process commits its transaction.
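That blocking behaviour can be roughly illustrated with SQLite, which locks at the database level rather than at the row level as PostgreSQL's SELECT ... FOR UPDATE does, but the shape is the same: the second writer waits (here, until it times out) while the first holds its transaction open, and proceeds once the first commits. This is an analogy, not PostgreSQL semantics:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
a = sqlite3.connect(path, timeout=0.2)  # short busy timeout for the demo
b = sqlite3.connect(path, timeout=0.2)
a.execute("CREATE TABLE t (id INTEGER)")
a.commit()

a.execute("BEGIN IMMEDIATE")          # first process takes the write lock
blocked = False
try:
    b.execute("BEGIN IMMEDIATE")      # second process waits, then times out
except sqlite3.OperationalError:      # "database is locked"
    blocked = True

a.commit()                            # first process releases the lock
b.execute("BEGIN IMMEDIATE")          # now the second process proceeds
b.commit()
```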
If you use SERIALIZABLE, both processes can run concurrently (processing the same row), but at least one should be expected to fail by the time it does a COMMIT, since the database will detect the conflict. SERIALIZABLE might also fail if it conflicts with any other queries going on in the database at the same time that affect related rows. The real reason to use SERIALIZABLE is precisely if you are trying to protect against concurrent database updates made by other jobs, as opposed to blocking the same job from executing twice.
Note there are tricks to make SELECT FOR UPDATE skip locked rows. If you do that, then you can have actual concurrency. See "Select unlocked row in Postgresql".
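In PostgreSQL 9.5 and later, this is the FOR UPDATE SKIP LOCKED clause. A sketch of such a claim query, wrapped in Python (the table and column names are assumptions, and `conn` would be a live psycopg2-style connection, so this is not runnable against a real database here):

```python
# Each worker runs this inside its own transaction; locked rows are
# skipped instead of blocking, so concurrent workers claim different rows.
CLAIM_NEXT_ITEM = """
    SELECT id
    FROM items
    WHERE status = 'PENDING'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
"""

def claim_next_item(conn):
    cur = conn.cursor()
    cur.execute(CLAIM_NEXT_ITEM)
    row = cur.fetchone()
    # None means everything is either done or currently claimed by others
    return row[0] if row else None
```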
Another approach I see more often is to give your "status" column a third, temporary state that is used while a job is being processed. Typically one would have states like 'PENDING', 'IN_PROGRESS', and 'COMPLETE'. When your process searches for work to do, it finds a 'PENDING' job, immediately moves it to 'IN_PROGRESS', and commits the transaction; it then carries on with the work and finally moves the job to 'COMPLETE'. The disadvantage is that if the process dies while processing a job, the job will be left 'IN_PROGRESS' indefinitely.
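The three-state approach can be sketched like this (SQLite again, with an assumed schema): the guarded UPDATE claims a 'PENDING' row atomically, so two workers can never both claim the same one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO items VALUES (1, 'PENDING')")
conn.commit()

def claim(item_id):
    with conn:  # short transaction: claim and commit immediately
        cur = conn.execute(
            "UPDATE items SET status = 'IN_PROGRESS' "
            "WHERE id = ? AND status = 'PENDING'", (item_id,))
        return cur.rowcount == 1  # True only for the worker that won

won = claim(1)    # claims the item
lost = claim(1)   # a second claim finds no PENDING row
```

To mitigate the stuck-'IN_PROGRESS' problem, one common refinement (an assumption on my part, not something from the question) is to also record a claimed-at timestamp when claiming, so a sweeper can return long-stale 'IN_PROGRESS' rows to 'PENDING'.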