Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Update PostgreSQL table with values from self

I am attempting to update multiple columns on a table with values from another row in the same table:

CREATE TEMP TABLE person (
  pid INT
, name VARCHAR(40)
, dob DATE
, younger_sibling_name VARCHAR(40)
, younger_sibling_dob DATE
);

INSERT INTO person VALUES (pid, name, dob)
  (1, 'John' , '1980-01-05')
, (2, 'Jimmy', '1975-04-25')
, (3, 'Sarah', '2004-02-10')
, (4, 'Frank', '1934-12-12')
;

The task is to populate younger_sibling_name and younger_sibling_dob with the name and birthday of the person that is closest to them in age, but not older or the same age.

I can set the younger sibling dob easily because this is the value that determines the record to use with a correlated subquery (I think this is an example of that?):

UPDATE person SET younger_sibling_dob = (
SELECT MAX(dob)
FROM person AS sibling
WHERE sibling.dob < person.dob);

I just can't see any way to get the name?
The real query of this will run over about 1M rows in groups of 100-500 for each MAX selection so performance is a concern.

Edit

After trying many different approaches, I've decided on this one which I think is a good balance of being able to verify the data with the intermediate result, shows the intention of what the logic is, and performs adequately:

WITH sibling AS (
  SELECT person.pid, sibling.dob, sibling.name,
         row_number() OVER (PARTITION BY person.pid
                            ORDER BY sibling.dob DESC) AS age_closeness
  FROM person
  JOIN person AS sibling ON sibling.dob < person.dob
)
UPDATE person
  SET younger_sibling_name = sibling.name
     ,younger_sibling_dob  = sibling.dob
FROM sibling
WHERE person.pid = sibling.pid
   AND sibling.age_closeness = 1;

SELECT * FROM person ORDER BY dob;
like image 875
Jeremy Avatar asked Feb 18 '23 02:02

Jeremy


2 Answers

Rewrite 2022

I expect your added solution to perform poorly, as it's doing a of of unnecessary work. The following should be much faster.

The question and the added solution do not define which row to pick when there are multiple with the same dob. Typically you'll want a deterministic pick. This query picks the alphabetically first name from each group of peers with the same dob. Adapt to your needs.

UPDATE person p
SET    younger_sibling_name = y.name
     , younger_sibling_dob  = y.dob
FROM  (
   SELECT dob, name, lead(dob) OVER (ORDER BY dob) AS next_dob
   FROM  (
      SELECT DISTINCT ON (dob)
             dob, name
      FROM   person p
      ORDER  BY dob, name  -- ①
      ) sub
   ) y
WHERE  p.dob = y.next_dob;

db<>fiddle here - with extended test case

Works since at least Postgres 8.4.

Needs an index on dob to be fast, ideally a multicolumn index on (dob, name).

Subquery sub passes over the whole table once and distills distinct rows per dob.

① I added name to ORDER BY as tiebreaker to pick the row with the alphabetically first name. Adapt to our needs.

In the outer SELECT add the next later dob (next_dob) to each row with lead() - simple now with distinct dob. Then join to that next_dob and the rest is simple.

If no younger person exists, no UPDATE happens and the columns stay NULL.

About DISTINCT ON and possibly faster query techniques for many duplicates:

  • Select first row in each GROUP BY group?
  • Optimize GROUP BY query to retrieve latest row per user

Taking dob and name from the same row guarantees we stay in sync. Multiple correlated subqueries would not offer this guarantee, and would be more expensive anyway.

Original answer

Still valid.

Old query 1

WITH cte AS (
   SELECT *, dense_rank() OVER (ORDER BY dob) AS drk
   FROM   person
    )
UPDATE person p
SET    younger_sibling_name = y.name
     , younger_sibling_dob  = y.dob
FROM   cte x
JOIN   (SELECT DISTINCT ON (drk) * FROM cte) y ON y.drk = x.drk - 1
WHERE  x.pid = p.pid;

Old sqlfiddle

In the CTE cte use the window function dense_rank() to get a rank without gaps according to the dop for every person.

Join cte to itself, but remove duplicates on dob from the second instance. Thereby everybody gets exactly one UPDATE. If more than one person share the same dop, the same one is selected as younger sibling for all persons on the next dob. I do this with:

   (SELECT DISTINCT ON (rnk) * FROM cte)

Add ORDER BY rnk, ... to this subquery to pick a particular person for every dob.

Old query 2

WITH cte AS (
   SELECT dob, min(name) AS name
        , row_number() OVER (ORDER BY dob) rn
   FROM   person p
   GROUP  BY dob
   )
UPDATE person p
SET    younger_sibling_name = y.name
     , younger_sibling_dob  = y.dob
FROM   cte x
JOIN   cte y ON y.rn = x.rn - 1
WHERE  x.dob = p.dob;

Old sqlfiddle

This works, because aggregate functions are applied before window functions. And it should be very fast since both operations agree on the sort order.

Obviates the need for a later DISTINCT like in query 1.

Result is the same as query 1, exactly.
Again, you can add more columns to ORDER BY to pick a particular person for every dob.

like image 72
Erwin Brandstetter Avatar answered Feb 27 '23 17:02

Erwin Brandstetter


1) Finding the MAX() can alway be rewritten in terms of NOT EXISTS (...)

UPDATE person dst
SET younger_sibling_name = src.name
        ,younger_sibling_dob = src.dob
FROM person src
WHERE src.dob < dst.dob
   OR src.dob = dst.dob AND src.pid < dst.pid
AND NOT EXISTS (
        SELECT * FROM person nx
        WHERE nx.dob < dst.dob
           OR nx.dob = dst.dob AND nx.pid < dst.pid
        AND nx.dob > src.dob
           OR nx.dob = src.dob AND nx.pid > src.pid
        );

2) Instead of rank() / row_number(), you could also use a LAG() function over the WINDOW:

UPDATE person dst
SET younger_sibling_name = src.name
        ,younger_sibling_dob = src.dob
FROM    (
        SELECT pid
        , LAG(name) OVER win AS name
        , LAG(dob) OVER win AS dob 
        FROM person
        WINDOW win AS (ORDER BY dob, pid)
        ) src
WHERE src.pid = dst.pid
        ;

Both versions require a self-joined subquery (or CTE) because UPDATE does not allow window functions.

like image 38
wildplasser Avatar answered Feb 27 '23 16:02

wildplasser