I am doing a data migration between two tables (splitting out a related table). The existing table is reminders
, and it has a start
column and a newly-added dateset_id
column pointing to a new dateset
table, which also has a start
column. For every row in reminders
, I want to INSERT
a new row in dateset
with the start
value copied over, and UPDATE
the corresponding row in reminders
with the newly-inserted dateset
ID.
Here's the SQL I tried:
WITH inserted_datesets AS (
INSERT INTO dateset (start)
SELECT start FROM reminder
RETURNING reminder.id AS reminder_id, id AS dateset_id
)
UPDATE reminder
SET dateset_id = ids.dateset_id
FROM inserted_datesets AS ids
WHERE reminder.id = ids.reminder_id
I get an error missing FROM-clause entry for table "reminder"
, because I'm including the reminder.id
column in the RETURNING
clause, but not actually selecting it for the insert. This makes sense, but I can't figure out how to modify the query to do what I need. Is there a totally different approach I'm missing?
There are several ways to solve the problem.
1. temporarily add a column
As others mentioned, the straight-forward way is to temporarily add a column reminder_id
to the dateset
. Populate it with original IDs
from reminder
table. Use it to join reminder
with the dateset
table. Drop the temporary column.
2. when start is unique
If values of the start
column is unique it is possible to do it without extra column by joining reminder
table with the dateset
table on the start
column.
INSERT INTO dateset (start)
SELECT start FROM reminder;
WITH
CTE_Joined
AS
(
SELECT
reminder.id AS reminder_id
,reminder.dateset_id AS old_dateset_id
,dateset.id AS new_dateset_id
FROM
reminder
INNER JOIN dateset ON dateset.start = reminder.start
)
UPDATE CTE_Joined
SET old_dateset_id = new_dateset_id
;
3. when start is not unique
It is possible to do it without temporary column even in this case. The main idea is the following. Let's have a look at this example:
We have two rows in reminder
with the same start
value and IDs 3 and 7:
reminder
id start dateset_id
3 2015-01-01 NULL
7 2015-01-01 NULL
After we insert them into the dateset
, there will be new IDs generated, for example, 1 and 2:
dateset
id start
1 2015-01-01
2 2015-01-01
It doesn't really matter how we link these two rows. The end result could be
reminder
id start dateset_id
3 2015-01-01 1
7 2015-01-01 2
or
reminder
id start dateset_id
3 2015-01-01 2
7 2015-01-01 1
Both of these variants are correct. Which brings us to the following solution.
Simply insert all rows first.
INSERT INTO dateset (start)
SELECT start FROM reminder;
Match/join two tables on start
column knowing that it is not unique. "Make it" unique by adding ROW_NUMBER
and joining by two columns. It is possible to make the query shorter, but I spelled out each step explicitly:
WITH
CTE_reminder_rn
AS
(
SELECT
id
,start
,dateset_id
,ROW_NUMBER() OVER (PARTITION BY start ORDER BY id) AS rn
FROM reminder
)
,CTE_dateset_rn
AS
(
SELECT
id
,start
,ROW_NUMBER() OVER (PARTITION BY start ORDER BY id) AS rn
FROM dateset
)
,CTE_Joined
AS
(
SELECT
CTE_reminder_rn.id AS reminder_id
,CTE_reminder_rn.dateset_id AS old_dateset_id
,CTE_dateset_rn.id AS new_dateset_id
FROM
CTE_reminder_rn
INNER JOIN CTE_dateset_rn ON
CTE_dateset_rn.start = CTE_reminder_rn.start AND
CTE_dateset_rn.rn = CTE_reminder_rn.rn
)
UPDATE CTE_Joined
SET old_dateset_id = new_dateset_id
;
I hope it is clear from the code what it does, especially when you compare it to the simpler version without ROW_NUMBER
. Obviously, the complex solution will work even if start
is unique, but it is not as efficient, as a simple solution.
This solution assumes that dateset
is empty before this process.
Using INSERT RETURNING in subqueries is, according to the documentation, supported, for Postgres versions 9.1 and after. The hypothetical DML subquery in the original answer should work for Postgres >= 9.1:
UPDATE reminder SET dateset_id = (
INSERT INTO dateset (start)
VALUES (reminder.start)
RETURNING dateset.id));
Here's another way of doing it, distinct from the 3 ways Vladimir suggested so far.
A temporary function will let you read the id of the new rows created as well as other values in the query:
--minimal demonstration schema
CREATE TABLE dateset (
id SERIAL PRIMARY KEY,
start TIMESTAMP
-- other things here...
);
CREATE TABLE reminder (
id SERIAL PRIMARY KEY,
start TIMESTAMP,
dateset_id INTEGER REFERENCES dateset(id)
-- other things here...
);
--pre-migration data
INSERT INTO reminder (start) VALUES ('2014-02-14'), ('2014-09-06'), ('1984-01-01'), ('2014-02-14');
--all at once
BEGIN;
CREATE FUNCTION insertreturning(ts TIMESTAMP) RETURNS INTEGER AS $$
INSERT INTO dateset (start)
VALUES (ts)
RETURNING dateset.id;
$$ LANGUAGE SQL;
UPDATE reminder SET dateset_id = insertreturning(reminder.start);
DROP FUNCTION insertreturning(TIMESTAMP);
ALTER TABLE reminder DROP COLUMN start;
END;
This approach to the problem suggested itself after I realized that writing INSERT ... RETURNING
as a subquery would solve the issue; although INSERT
s are not allowed as subqueries, calls to functions certainly are.
Intriguingly, this suggests that DML subqueries that return values might be broadly useful. If they were possible, we would just write:
UPDATE reminder SET dateset_id = (
INSERT INTO dateset (start)
VALUES (reminder.start)
RETURNING dateset.id));
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With