I am using Postgres and have a table users like this
| id | array_column | name |
|---|---|---|
| uuid1 | [A, B, C] | John1 |
| uuid2 | [D, E, F] | John2 |
where id is the PK and array_column is a list of labels
I want to write to this table using INSERT WHERE NOT EXISTS statements, basically to avoid writing a row if another row (with a different id PK) with overlapping labels already exists. Like this,
Transaction #1:
insert into users
values ('u1', ['A', 'B', 'C'], 'john1')
where not exists (select 1 from users
where array_column @> array['A', 'B', 'C'])`
Transaction #2 (in parallel):
insert into users values ('u2', ['B', 'C', 'D'], 'john2')
where not exists (select 1 from users
where array_column @> array['B', 'C', 'D'])`
So one of these transactions should fail.
What I want to understand is:;
Is this approach safe in a highly concurrent application i.e. no two rows with overlapping labels can be written at the same exact time? If yes, how does Postgres achieve it? I am thinking of a situation where both the insert statements are executed in parallel and the internal select statements returns false for both (parallel reads?) and then both transactions proceed to write different rows on commit. Or will Postgres re-evaluate one transaction in case the snapshot version of the rows is updated by the other transaction ? Please correct me you see any gap.
Will Postgres serialize these writes given there is a PK column involved? If there was no PK (or unique index) involved, would this INSERT WHERE NOT EXISTS still be safe?
If this approach is not 100% safe, is there any recommended solution without using full table level locks ? I want to avoid losing on write performance.
I am not able to reproduce this behaviour using a local installation of Postgres, which can show me that two rows with overlapping labels exits. Please suggest a better approach if any.
Normalization would be the simplest solution. Set up a separate table that assigns individual labels to users, many-to-one:
create table user_labels(
user_id int references users(id),
label text primary key);
Instead of cramming all labels belonging to a user into an array, it holds one per row. It's easier to find, sort, edit their labels this way, and it prevents any and all overlaps between users' label lists. That's assuming you're really trying to avoid overlap, not containment.
Is concurrent INSERT WHERE NOT EXISTS in Postgres safe?
Not according to your definition of safe.
- Is this approach safe in a highly concurrent application i.e. no two rows with overlapping labels can be written at the same exact time?
@> is a containment operator so it's enough for the incoming row to have one new element to be let in, the rest is allowed to overlap. If you want to disable any overlap, that's what && is for.
The check and the insert are within a single statement, making the operation atomic and safe on its own, but it will not guarantee the same safety a constraint/index would: by default, concurrent transactions all only read committed changes as of query start. On repeatable read and serializable levels they don't even see concurrently committed changes, only what happened before the entire transaction. That means all concurrent inserts like yours are free to ignore each other in all cases.
On top of that, everyone's free to update any row at any time in a way that would make it violate your condition.
- Will Postgres serialize these writes given there is a PK column involved? If there was no PK (or unique index) involved, would this INSERT WHERE NOT EXISTS still be safe?
If your payload passes the collision condition you explicitly defined in not exists, the incoming row is still checked against all constraints and indexes in place, including PK and any unique you might have. If you had no PK or unique indexes defined, it would change nothing about the reliability of this not exists. If by serializing writes you mean some kind of sequence/order you expect things to be written in, unique and primary key offer no such thing. Serializable isolation level doesn't solely rely on PK and unique either.
- If this approach is not 100% safe, is there any recommended solution without using full table level locks ? I want to avoid losing on write performance.
As established above, it's 0% safe and you're right that a table-level lock would be an overkill for what you're trying to achieve.
Ideally, you'd want an exclusion constraint. Problem is, those can only use b-tree, hash, GiST and SP-GiST indexes, while your array &> containment and && operators are only supported by GIN index array_ops class, so that's not an option. If you switched your labels from text to int, and thus the arrays from text[] to int[], you could use gist__int_ops class from intarray extension, which includes both of your operators. Demo at db<>fiddle:
create table users (
id uuid default gen_random_uuid() primary key,
array_column int[],
name text);
insert into users values
('05f2133d-c1bb-4f6d-8868-d500c2fa094c', '{1,2,3}', 'John1'),
('80eb8f90-a25b-47c1-a73e-1c5233b8a036', '{4,5,6}', 'John2');
create extension intarray;
alter table users add constraint non_overlapping_array_column
exclude using gist (array_column with &&);
insert into users values (gen_random_uuid(), '{7,8,1}', 'John3');
ERROR: conflicting key value violates exclusion constraint "non_overlapping_array_column" DETAIL: Key (array_column)=({7,8,1}) conflicts with existing key (array_column)=({1,2,3}).
There are also constraint triggers which can fit any logic any routine can fit, don't rely on indexes and offer constraint-like deferrability. In your case, this would unfortunately only narrow down the time window when concurrent transactions can collide, not eliminate it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With