UPDATE FROM with a JOIN (Large Table Performance) in PostgreSQL?

I'm trying to get the following query to execute with reasonable performance:

UPDATE order_item_imprint SET item_new_id = oi.item_new_id
      FROM order_item oi
      INNER JOIN order_item_imprint oii ON oi.item_number = oii.item_id          

It didn't complete within 8 days, so we killed it. The EXPLAIN output for the query is as follows:


Merge Join  (cost=59038021.60..33137238641.84 rows=1432184234121 width=1392)
  Merge Cond: ((oi.item_number)::text = (oii.item_id)::text)
  ->  Nested Loop  (cost=0.00..10995925524.15 rows=309949417305 width=1398)
        ->  Index Scan using unique_order_item_item_number on order_item oi  (cost=0.00..608773.05 rows=258995 width=14)
        ->  Seq Scan on order_item_imprint  (cost=0.00..30486.39 rows=1196739 width=1384)
  ->  Materialize  (cost=184026.24..198985.48 rows=1196739 width=6)
        ->  Sort  (cost=184026.24..187018.09 rows=1196739 width=6)
              Sort Key: oii.item_id
              ->  Seq Scan on order_item_imprint oii  (cost=0.00..30486.39 rows=1196739 width=6)

I have indexes on both tables, and I've ensured the columns being compared are of identical type and size. I am now at the point of trying to change the PostgreSQL server configuration in the hope that it helps, but I am not sure it will.
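For reference, the plan above already shows a unique index (unique_order_item_item_number) on order_item.item_number. A matching index on the other join column would look something like the sketch below; the index name here is hypothetical:

CREATE INDEX order_item_imprint_item_id_idx
    ON order_item_imprint (item_id);   -- the join column used in the UPDATE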

The order_item_imprint table is about 1.1 million rows with a 145 MB disk footprint, and the order_item table is about a third that size.

The main goal is that I need to be able to run this, along with several other queries, during a maintenance window of a few hours.

Autovacuum and ANALYZE were run prior to generating the execution plan.
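If the statistics are in doubt, they can also be refreshed by hand just before the maintenance run; a minimal sketch using the two tables from the question:

ANALYZE order_item;            -- refresh planner statistics
ANALYZE order_item_imprint;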

asked Dec 23 '22 by Martin Dale Lyness

2 Answers

Quoting the asker's self-answer below: "I found an alternate way to write the query that allowed the pgsql optimizer to build the query much more efficiently."

Actually, what you did was remove the unconstrained self-join on order_item_imprint.

If you look at the first line of the plan, you'll see the following row estimate:

rows=1432184234121

That's 1,432,184,234,121 rows it is trying to process, roughly 1.4 trillion; notably, that is exactly 1,196,739², the row count of order_item_imprint squared. When you aliased order_item_imprint in the join, the alias was treated as a separate table from the update target, so the target was cross-joined against the entire join result.
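To see why, remember that in UPDATE ... FROM the update target is joined implicitly to the FROM list, and nothing in the query ties the target to oii. The planner therefore had to evaluate something like this illustrative SELECT (a rewrite for illustration, not part of the original answer):

SELECT count(*)
FROM order_item_imprint                  -- the UPDATE target (~1.2M rows)
CROSS JOIN (order_item oi
            JOIN order_item_imprint oii  -- aliased copy from the FROM clause
              ON oi.item_number = oii.item_id);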

answered Jan 13 '23 by Richard Huxton

I found an alternate way to write the query that allowed the PostgreSQL optimizer to plan it much more efficiently:

UPDATE order_item_imprint SET item_new_id = oi.item_new_id
FROM order_item oi
WHERE oi.item_number = order_item_imprint.item_id

Apparently that join was redundant, and the following execution plan could have been used:

Hash Join  (cost=1.38..5.73 rows=48 width=1407)
  Hash Cond: ((order_item_imprint.item_id)::text = (oi.item_number)::text)
  ->  Seq Scan on order_item_imprint  (cost=0.00..3.63 rows=63 width=1399)
  ->  Hash  (cost=1.17..1.17 rows=17 width=23)
        ->  Seq Scan on order_item oi  (cost=0.00..1.17 rows=17 width=23)
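Before the real maintenance run, it is worth sanity-checking the estimate with EXPLAIN (a suggested check, not part of the original answer):

EXPLAIN
UPDATE order_item_imprint SET item_new_id = oi.item_new_id
FROM order_item oi
WHERE oi.item_number = order_item_imprint.item_id;
-- The top-level row estimate should be on the order of the table's
-- ~1.1M rows, not in the trillions as before.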
answered Jan 13 '23 by Martin Dale Lyness