I am attempting to update multiple columns on a table with values from another row in the same table:
CREATE TEMP TABLE person (
pid INT
, name VARCHAR(40)
, dob DATE
, younger_sibling_name VARCHAR(40)
, younger_sibling_dob DATE
);
INSERT INTO person VALUES (pid, name, dob)
(1, 'John' , '1980-01-05')
, (2, 'Jimmy', '1975-04-25')
, (3, 'Sarah', '2004-02-10')
, (4, 'Frank', '1934-12-12')
;
The task is to populate younger_sibling_name
and younger_sibling_dob
with the name and birthday of the person that is closest to them in age, but not older or the same age.
I can set the younger sibling dob
easily because this is the value that determines the record to use with a correlated subquery (I think this is an example of that?):
UPDATE person SET younger_sibling_dob = (
SELECT MAX(dob)
FROM person AS sibling
WHERE sibling.dob < person.dob);
I just can't see any way to get the name
?
The real query of this will run over about 1M rows in groups of 100-500 for each MAX selection so performance is a concern.
After trying many different approaches, I've decided on this one which I think is a good balance of being able to verify the data with the intermediate result, shows the intention of what the logic is, and performs adequately:
WITH sibling AS (
SELECT person.pid, sibling.dob, sibling.name,
row_number() OVER (PARTITION BY person.pid
ORDER BY sibling.dob DESC) AS age_closeness
FROM person
JOIN person AS sibling ON sibling.dob < person.dob
)
UPDATE person
SET younger_sibling_name = sibling.name
,younger_sibling_dob = sibling.dob
FROM sibling
WHERE person.pid = sibling.pid
AND sibling.age_closeness = 1;
SELECT * FROM person ORDER BY dob;
I expect your added solution to perform poorly, as it's doing a of of unnecessary work. The following should be much faster.
The question and the added solution do not define which row to pick when there are multiple with the same dob
. Typically you'll want a deterministic pick. This query picks the alphabetically first name from each group of peers with the same dob
. Adapt to your needs.
UPDATE person p
SET younger_sibling_name = y.name
, younger_sibling_dob = y.dob
FROM (
SELECT dob, name, lead(dob) OVER (ORDER BY dob) AS next_dob
FROM (
SELECT DISTINCT ON (dob)
dob, name
FROM person p
ORDER BY dob, name -- ①
) sub
) y
WHERE p.dob = y.next_dob;
db<>fiddle here - with extended test case
Works since at least Postgres 8.4.
Needs an index on dob
to be fast, ideally a multicolumn index on (dob, name)
.
Subquery sub
passes over the whole table once and distills distinct rows per dob
.
① I added name
to ORDER BY
as tiebreaker to pick the row with the alphabetically first name. Adapt to our needs.
In the outer SELECT
add the next later dob
(next_dob
) to each row with lead()
- simple now with distinct dob
. Then join to that next_dob
and the rest is simple.
If no younger person exists, no UPDATE
happens and the columns stay NULL
.
About DISTINCT ON
and possibly faster query techniques for many duplicates:
Taking dob
and name
from the same row guarantees we stay in sync. Multiple correlated subqueries would not offer this guarantee, and would be more expensive anyway.
Still valid.
WITH cte AS (
SELECT *, dense_rank() OVER (ORDER BY dob) AS drk
FROM person
)
UPDATE person p
SET younger_sibling_name = y.name
, younger_sibling_dob = y.dob
FROM cte x
JOIN (SELECT DISTINCT ON (drk) * FROM cte) y ON y.drk = x.drk - 1
WHERE x.pid = p.pid;
Old sqlfiddle
In the CTE cte
use the window function dense_rank()
to get a rank without gaps according to the dop
for every person.
Join cte
to itself, but remove duplicates on dob
from the second instance. Thereby everybody gets exactly one UPDATE
. If more than one person share the same dop
, the same one is selected as younger sibling for all persons on the next dob
. I do this with:
(SELECT DISTINCT ON (rnk) * FROM cte)
Add ORDER BY rnk, ...
to this subquery to pick a particular person for every dob
.
WITH cte AS (
SELECT dob, min(name) AS name
, row_number() OVER (ORDER BY dob) rn
FROM person p
GROUP BY dob
)
UPDATE person p
SET younger_sibling_name = y.name
, younger_sibling_dob = y.dob
FROM cte x
JOIN cte y ON y.rn = x.rn - 1
WHERE x.dob = p.dob;
Old sqlfiddle
This works, because aggregate functions are applied before window functions. And it should be very fast since both operations agree on the sort order.
Obviates the need for a later DISTINCT
like in query 1.
Result is the same as query 1, exactly.
Again, you can add more columns to ORDER BY
to pick a particular person for every dob
.
1) Finding the MAX() can alway be rewritten in terms of NOT EXISTS (...)
UPDATE person dst
SET younger_sibling_name = src.name
,younger_sibling_dob = src.dob
FROM person src
WHERE src.dob < dst.dob
OR src.dob = dst.dob AND src.pid < dst.pid
AND NOT EXISTS (
SELECT * FROM person nx
WHERE nx.dob < dst.dob
OR nx.dob = dst.dob AND nx.pid < dst.pid
AND nx.dob > src.dob
OR nx.dob = src.dob AND nx.pid > src.pid
);
2) Instead of rank() / row_number(), you could also use a LAG() function over the WINDOW:
UPDATE person dst
SET younger_sibling_name = src.name
,younger_sibling_dob = src.dob
FROM (
SELECT pid
, LAG(name) OVER win AS name
, LAG(dob) OVER win AS dob
FROM person
WINDOW win AS (ORDER BY dob, pid)
) src
WHERE src.pid = dst.pid
;
Both versions require a self-joined subquery (or CTE) because UPDATE does not allow window functions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With