UPDATE FROM with a JOIN (Large Table Performance) in PostgreSQL?

I'm trying to get the following query to execute with reasonable performance:

UPDATE order_item_imprint SET item_new_id = oi.item_new_id
      FROM order_item oi
      INNER JOIN order_item_imprint oii ON oi.item_number = oii.item_id          

It didn't complete within 8 days, so we killed it. The EXPLAIN output for the query is as follows:


Merge Join  (cost=59038021.60..33137238641.84 rows=1432184234121 width=1392)
  Merge Cond: ((oi.item_number)::text = (oii.item_id)::text)
  ->  Nested Loop  (cost=0.00..10995925524.15 rows=309949417305 width=1398)
        ->  Index Scan using unique_order_item_item_number on order_item oi  (cost=0.00..608773.05 rows=258995 width=14)
        ->  Seq Scan on order_item_imprint  (cost=0.00..30486.39 rows=1196739 width=1384)
  ->  Materialize  (cost=184026.24..198985.48 rows=1196739 width=6)
        ->  Sort  (cost=184026.24..187018.09 rows=1196739 width=6)
              Sort Key: oii.item_id
              ->  Seq Scan on order_item_imprint oii  (cost=0.00..30486.39 rows=1196739 width=6)

I have indexes on both tables, and I've ensured the columns being compared are of identical type and size. I am now at the point of trying to change the PostgreSQL server configuration in the hope that it helps, but I am not sure it will.
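For reference, the plan above already shows a unique index (unique_order_item_item_number) on order_item.item_number. A matching index on the other join column would look something like the sketch below; the index name here is hypothetical:

CREATE INDEX order_item_imprint_item_id_idx
    ON order_item_imprint (item_id);   -- the join column used in the UPDATE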

The order_item_imprint table is about 1.1 million rows with a 145 MB disk footprint, and the order_item table is about a third that size.

The main goal is that I need to be able to run this, along with several other queries, during a maintenance window of a few hours.

Autovacuum and ANALYZE were run prior to generating the execution plan.
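If the statistics are in doubt, they can also be refreshed by hand just before the maintenance run; a minimal sketch using the two tables from the question:

ANALYZE order_item;            -- refresh planner statistics
ANALYZE order_item_imprint;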

asked Dec 23 '22 by Martin Dale Lyness

2 Answers

Quoting the asker's self-answer below: "I found an alternate way to write the query that allowed the pgsql optimizer to build the query much more efficiently."

Actually, what you did was remove the unconstrained self-join on order_item_imprint.

If you look at the first line of the plan, you'll see the following row estimate:

rows=1432184234121

That's 1,432,184,234,121 rows it is trying to process, roughly 1.4 trillion; notably, that is exactly 1,196,739², the row count of order_item_imprint squared. When you aliased order_item_imprint in the join, the alias was treated as a separate table from the update target, so the target was cross-joined against the entire join result.
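To see why, remember that in UPDATE ... FROM the update target is joined implicitly to the FROM list, and nothing in the query ties the target to oii. The planner therefore had to evaluate something like this illustrative SELECT (a rewrite for illustration, not part of the original answer):

SELECT count(*)
FROM order_item_imprint                  -- the UPDATE target (~1.2M rows)
CROSS JOIN (order_item oi
            JOIN order_item_imprint oii  -- aliased copy from the FROM clause
              ON oi.item_number = oii.item_id);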

answered Jan 13 '23 by Richard Huxton

I found an alternate way to write the query that allowed the PostgreSQL optimizer to plan it much more efficiently:

UPDATE order_item_imprint SET item_new_id = oi.item_new_id
FROM order_item oi
WHERE oi.item_number = order_item_imprint.item_id

Apparently that join was redundant, and the following execution plan could have been used:

Hash Join  (cost=1.38..5.73 rows=48 width=1407)
  Hash Cond: ((order_item_imprint.item_id)::text = (oi.item_number)::text)
  ->  Seq Scan on order_item_imprint  (cost=0.00..3.63 rows=63 width=1399)
  ->  Hash  (cost=1.17..1.17 rows=17 width=23)
        ->  Seq Scan on order_item oi  (cost=0.00..1.17 rows=17 width=23)
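Before the real maintenance run, it is worth sanity-checking the estimate with EXPLAIN (a suggested check, not part of the original answer):

EXPLAIN
UPDATE order_item_imprint SET item_new_id = oi.item_new_id
FROM order_item oi
WHERE oi.item_number = order_item_imprint.item_id;
-- The top-level row estimate should be on the order of the table's
-- ~1.1M rows, not in the trillions as before.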
answered Jan 13 '23 by Martin Dale Lyness