Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgresql batch insert or ignore

I have the responsibility of switching our code from sqlite to postgres. One of the queries I am having trouble with is copied below.

INSERT INTO group_phones(group_id, phone_name)
SELECT g.id, p.name 
FROM phones AS p, groups as g
WHERE g.id IN ($add_groups) AND p.name IN ($phones);

The problem arises when there is a duplicate record. In this table the combination of both values must be unique. I have used a few plpgsql functions in other places to do update-or-insert operations, but in this case I can do several inserts at once. I am not sure how to write a stored routine for this. Thanks for all the help from all the sql gurus out there!

like image 481
jlunavtgrad Avatar asked Aug 09 '12 19:08

jlunavtgrad


1 Answers

There are 3 challenges.

  1. Your query has no JOIN condition between the tables phones and groups, making this effectively a limited CROSS JOIN - which you most probably do not intend. I.e. every phone that qualifies is combined with every group that qualifies. If you have 100 phones and 100 groups that's already 10,000 combinations.

  2. Insert distinct combinations of (group_id, phone_name)

  3. Avoid inserting rows that are already there in table group_phones .

All things considered it could look like this:

INSERT INTO group_phones(group_id, phone_name)
SELECT i.id, i.name
FROM  (
    SELECT DISTINCT g.id, p.name -- get distinct combinations
    FROM   phones p
    JOIN   groups g ON ??how are p & g connected??
    WHERE  g.id IN ($add_groups)
    AND    p.name IN ($phones)
    ) i
LEFT   JOIN group_phones gp ON (gp.group_id, gp.phone_name) = (i.id, i.name)
WHERE  gp.group_id IS NULL  -- avoid duping existing rows

Concurrency

This form minimizes the chance of a race condition with concurrent write operations. If your table has heavy concurrent write load, you may want to lock the table exclusively or use serializable transaction isolation, This safeguard against the extremely unlikely case that a row is altered by a concurrent transaction in the tiny time slot between the constraint verification (row isn't there) and the write operation in the query.

BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT ...
COMMIT;

Be prepared to repeat the transaction if it rolls back with a serialization error. For more on that topic good starting points could be this blog post by @depesz or this related question on SO.

Normally, though, you needn't even bother with any of this.

Performance

LEFT JOIN tbl ON right_col = left_col WHERE right_col IS NULL

is generally the fastest method with distinct columns in the right table. If you have dupes in the column (especially if there are many),

WHERE NOT EXISTS (SELECT 1 FROM tbl WHERE right_col = left_col)

May be faster because it can stop to scan as soon as the first row is found.

You can also use IN, like @dezso demonstrates, but it is usually slower in PostgreSQL.

like image 51
Erwin Brandstetter Avatar answered Oct 02 '22 14:10

Erwin Brandstetter