
How to include excluded rows in RETURNING from INSERT ... ON CONFLICT

I've got this table (generated by Django):

CREATE TABLE feeds_person (
  id serial PRIMARY KEY,
  created timestamp with time zone NOT NULL,
  modified timestamp with time zone NOT NULL,
  name character varying(4000) NOT NULL,
  url character varying(1000) NOT NULL,
  email character varying(254) NOT NULL,
  CONSTRAINT feeds_person_name_ad8c7469_uniq UNIQUE (name, url, email)
);

I'm trying to bulk insert a lot of data using INSERT with an ON CONFLICT clause.

The wrinkle is that I need to get the id back for all of the rows, whether they're already existing or not.

In other cases, I would do something like:

INSERT INTO feeds_person (created, modified, name, url, email)
VALUES blah blah blah
ON CONFLICT (name, url, email) DO UPDATE SET url = feeds_person.url
RETURNING id

Doing the UPDATE causes the statement to return the id of that row. Except, it doesn't work with this table. I think it doesn't work because I've got multiple fields unique together whereas in other instances I've used this method I've had just one unique field.

I get this error when trying to run the SQL through Django's cursor:

django.db.utils.ProgrammingError: ON CONFLICT DO UPDATE command cannot affect row a second time
HINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values.

How do I do the bulk insert with this table and get back the inserted and existing ids?

Dustin Wyatt asked Mar 11 '16 21:03





1 Answer

The error you get:

ON CONFLICT DO UPDATE command cannot affect row a second time

... indicates you are trying to upsert the same row more than once in a single command. In other words: you have dupes on (name, url, email) in your VALUES list. Fold duplicates (if that's an option) and the error goes away. This chooses an arbitrary row from each set of dupes:

INSERT INTO feeds_person (created, modified, name, url, email)
SELECT DISTINCT ON (name, url, email) *
FROM  (
   VALUES
   ('blah', 'blah', 'blah', 'blah', 'blah')
   -- ... more rows
   ) AS v(created, modified, name, url, email)  -- match column list
ON     CONFLICT (name, url, email) DO UPDATE
SET    url = feeds_person.url
RETURNING id;

Since we use a free-standing VALUES expression now, you have to add explicit type casts for non-default types. Like:

VALUES
    (timestamptz '2016-03-12 02:47:56+01'
   , timestamptz '2016-03-12 02:47:56+01'
   , 'n3', 'u3', 'e3')
   ...

Your timestamptz columns need an explicit type cast, while the string types can operate with default text. (You could still cast to varchar(n) right away.)
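When the statement is sent through a DB-API cursor (as with Django's cursor in the question), the casts can be written next to the placeholders. A minimal sketch, assuming %s-style placeholders; build_values_clause and flatten are hypothetical helper names, not part of Django or psycopg2:

```python
# Hypothetical helpers (not from Django or psycopg2): build a parameterized
# VALUES list with explicit casts on the two timestamptz columns.
# Assumes a cursor that uses %s placeholders.
ROW_TEMPLATE = "(%s::timestamptz, %s::timestamptz, %s, %s, %s)"


def build_values_clause(n_rows):
    """Return a VALUES list with n_rows placeholder rows."""
    if n_rows < 1:
        raise ValueError("need at least one row")
    return "VALUES\n   " + "\n , ".join([ROW_TEMPLATE] * n_rows)


def flatten(rows):
    """Flatten (created, modified, name, url, email) tuples into one parameter list."""
    return [value for row in rows for value in row]
```

The clause would be spliced into the statement in place of the literal VALUES list, with flatten(rows) passed as the parameter list to cursor.execute. Casting every row is more than strictly required, but it is always valid.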

If you want to have a say in which row to pick from each set of dupes, there are ways to do that:

  • Select first row in each GROUP BY group?
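If the rows are assembled in application code anyway, duplicates can also be folded there before the statement is built. A minimal sketch, assuming each row is a (created, modified, name, url, email) tuple and that the row with the latest modified value should win; fold_duplicates is a hypothetical name and the tie-breaking rule is only an example:

```python
from datetime import datetime, timezone


def fold_duplicates(rows):
    """Keep one row per (name, url, email), preferring the latest `modified`."""
    best = {}
    for row in rows:
        created, modified, name, url, email = row
        key = (name, url, email)
        # First row for a key is kept until a strictly newer one shows up.
        if key not in best or modified > best[key][1]:
            best[key] = row
    return list(best.values())


older = datetime(2016, 3, 11, tzinfo=timezone.utc)
newer = datetime(2016, 3, 12, tzinfo=timezone.utc)
rows = [
    (older, older, "n", "u", "e"),
    (older, newer, "n", "u", "e"),   # same key, newer `modified`
    (older, older, "n2", "u", "e"),
]
unique_rows = fold_duplicates(rows)  # two rows remain
```

With the batch folded like this, no two proposed rows share (name, url, email), which is the condition the error hint asks for.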

You are right, there is (currently) no way to use excluded columns in the RETURNING clause. I quote the Postgres Wiki:

Note that RETURNING does not make visible the "EXCLUDED.*" alias from the UPDATE (just the generic "TARGET.*" alias is visible there). Doing so is thought to create annoying ambiguity for the simple, common cases [30] for little to no benefit. At some point in the future, we may pursue a way of exposing if RETURNING-projected tuples were inserted and updated, but this probably doesn't need to make it into the first committed iteration of the feature [31].

However, you shouldn't be updating rows that are not supposed to be updated. Empty updates are almost as expensive as regular updates, and might have unintended side effects. You don't strictly need UPSERT to begin with; your case looks more like "SELECT or INSERT". Related:

  • Is SELECT or INSERT in a function prone to race conditions?

One cleaner way to insert a set of rows would be with data-modifying CTEs:

WITH val AS (
   SELECT DISTINCT ON (name, url, email) *
   FROM  (
      VALUES 
      (timestamptz '2016-1-1 0:0+1', timestamptz '2016-1-1 0:0+1', 'n', 'u', 'e')
    , ('2016-03-12 02:47:56+01', '2016-03-12 02:47:56+01', 'n1', 'u3', 'e3')
      -- more (type cast only needed in 1st row)
      ) v(created, modified, name, url, email)
   )
, ins AS (
   INSERT INTO feeds_person (created, modified, name, url, email)
   SELECT created, modified, name, url, email FROM val
   ON     CONFLICT (name, url, email) DO NOTHING
   RETURNING id, name, url, email
   )
SELECT 'inserted' AS how, id FROM ins  -- inserted
UNION  ALL
SELECT 'selected' AS how, f.id         -- not inserted
FROM   val v
JOIN   feeds_person f USING (name, url, email);

The added complexity should pay for big tables where INSERT is the rule and SELECT the exception.

Originally, I had added a NOT EXISTS predicate on the last SELECT to prevent duplicates in the result. But that was redundant. All CTEs of a single query see the same snapshots of tables. The set returned with ON CONFLICT (name, url, email) DO NOTHING is mutually exclusive to the set returned after the INNER JOIN on the same columns.

Unfortunately this also opens a tiny window for a race condition. If ...

  • a concurrent transaction inserts conflicting rows
  • has not committed yet
  • but commits eventually

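... some rows may be lost: DO NOTHING skips them because of the conflict, and the final SELECT cannot see them in its snapshot, so their ids are missing from the result.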

To overcome this, you might just run INSERT .. ON CONFLICT DO NOTHING, followed by a separate SELECT query for all rows within the same transaction. This in turn opens another tiny window for a race condition if concurrent transactions can commit writes to the table between INSERT and SELECT (in the default READ COMMITTED isolation level). That can be avoided with REPEATABLE READ transaction isolation (or stricter), or with a (possibly expensive or even unacceptable) write lock on the whole table. You can get any behavior you need, but there may be a price to pay.
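The insert-then-select pattern can be sketched from the client side as follows. This is only an illustration under loud assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in so that it runs without a server, the schema is a simplified copy of the question's table, and insert_then_select_ids is a hypothetical name. SQLite's locking differs from Postgres, so the isolation-level discussion does not carry over; only the shape of the two steps does.

```python
import sqlite3

# Assumption: simplified SQLite version of the question's table (stand-in for Postgres).
SCHEMA = """
CREATE TABLE feeds_person (
    id INTEGER PRIMARY KEY,
    created TEXT NOT NULL,
    modified TEXT NOT NULL,
    name TEXT NOT NULL,
    url TEXT NOT NULL,
    email TEXT NOT NULL,
    UNIQUE (name, url, email)
)
"""


def insert_then_select_ids(conn, rows):
    """Insert missing rows, then look up ids for all rows, in one transaction.

    rows: iterable of (created, modified, name, url, email) tuples.
    Returns a dict mapping (name, url, email) to id.
    """
    rows = list(rows)
    ids = {}
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO feeds_person (created, modified, name, url, email) "
            "VALUES (?, ?, ?, ?, ?) "
            "ON CONFLICT (name, url, email) DO NOTHING",
            rows,
        )
        # Looked up one key at a time for brevity; a single SELECT over
        # all keys would be the more usual choice for large batches.
        for _, _, name, url, email in rows:
            found = conn.execute(
                "SELECT id FROM feeds_person "
                "WHERE name = ? AND url = ? AND email = ?",
                (name, url, email),
            ).fetchone()
            ids[(name, url, email)] = found[0]
    return ids
```

Because DO NOTHING tolerates duplicates inside the batch, this variant does not need the DISTINCT ON step. With Django's cursor the placeholders would be %s instead of ?.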

Related:

  • How to use RETURNING with ON CONFLICT in PostgreSQL?
  • Return rows from INSERT with ON CONFLICT without needing to update

Erwin Brandstetter answered Oct 25 '22 02:10
