How can I optimize this query (or is there a better way)?

Question

I recently asked a question and someone provided me with a solution that works but I forgot to mention there was that my table has millions of rows (~10 million for items table, ~1 millon for the others) and perhaps they thought I was working with a small dataset as in the example I provided.

Here is the SQL:

WITH a AS (
  SELECT item.id, string_agg(prefered_store.store::varchar, ',') wishlist_stores
  FROM item, list_wishlist, wishlist, prefered_store
  WHERE item.list=list_wishlist.list
    AND list_wishlist.wishlist=wishlist.id
    AND wishlist.prefered_stores=prefered_store.id
  GROUP BY item.id
), b AS (
  SELECT item.id, 
    string_agg(
      prefered_store.store::varchar || ',' || prefered_store.comment,
      ' ; ') item_stores_comments
    FROM item, prefered_store
    WHERE item.prefered_stores=prefered_store.id
    GROUP BY item.id
)
SELECT a.id,item_stores_comments,wishlist_stores 
FROM a,b
WHERE a.id=b.id

While it does exactly what I need, it's extremely slow. Takes ~10 minutes with limit of just one row and about 15 minutes for 10 rows. I'm still waiting to see how long it will take for a thousand rows (it's been almost and hour). Now my desktop is not the fastest: Pentium 4 with 1.5GB ram but it still doesn't feel right.

I've indexed all the fields in the WHERE clause and created primary keys where needed. Other than this is there any way to make this query run faster?

PostgreSQL 9.2

DDL: https://docs.google.com/file/d/0BwiyuwRCaqkCM09LVkJ4YlVNLWM/edit

Simple diagram with only relevant tables and fields: enter image description here

EXPLAIN ANALYZE:

Merge Join  (cost=23342752.95..12971604557.95 rows=863210883998 width=68) (actual time=1182616.544..1251542.167 rows=13139337 loops=1)
  Merge Cond: (a.id = b.id)
  CTE a
    ->  GroupAggregate  (cost=8477658.65..8992463.86 rows=13139337 width=8) (actual time=252170.500..307061.559 rows=13139337 loops=1)
          ->  Sort  (cost=8477658.65..8547771.35 rows=28045080 width=8) (actual time=252170.391..282495.516 rows=14870222 loops=1)
                Sort Key: public.item.id
                Sort Method: external merge  Disk: 261528kB
                ->  Merge Join  (cost=3010452.34..3474579.76 rows=28045080 width=8) (actual time=138444.102..210768.838 rows=14870222 loops=1)
                      Merge Cond: (list_wishlist.list = public.item.list)
                      ->  Sort  (cost=689954.53..695268.01 rows=2125390 width=8) (actual time=30482.447..55193.049 rows=1286901 loops=1)
                            Sort Key: list_wishlist.list
                            Sort Method: external merge  Disk: 22624kB
                            ->  Hash Join  (cost=66643.55..408462.52 rows=2125390 width=8) (actual time=10417.244..26147.517 rows=1286901 loops=1)
                                  Hash Cond: (wishlist.prefered_stores = public.prefered_store.id)
                                  ->  Hash Join  (cost=38565.70..96225.43 rows=1226863 width=8) (actual time=8188.097..19815.024 rows=1226863 loops=1)
                                        Hash Cond: (list_wishlist.wishlist = wishlist.id)
                                        ->  Seq Scan on list_wishlist  (cost=0.00..22266.63 rows=1226863 width=8) (actual time=12.786..7467.442 rows=1226863 loops=1)
                                        ->  Hash  (cost=20352.20..20352.20 rows=1110120 width=8) (actual time=7314.531..7314.531 rows=1110087 loops=1)
                                              Buckets: 4096  Batches: 64  Memory Usage: 689kB
                                              ->  Seq Scan on wishlist  (cost=0.00..20352.20 rows=1110120 width=8) (actual time=7.621..6572.731 rows=1110087 loops=1)
                                  ->  Hash  (cost=14027.49..14027.49 rows=856349 width=8) (actual time=2159.339..2159.339 rows=856349 loops=1)
                                        Buckets: 4096  Batches: 64  Memory Usage: 536kB
                                        ->  Seq Scan on prefered_store  (cost=0.00..14027.49 rows=856349 width=8) (actual time=0.071..1602.971 rows=856349 loops=1)
                      ->  Materialize  (cost=2320484.45..2386181.13 rows=13139337 width=8) (actual time=107961.603..149020.809 rows=14870219 loops=1)
                            ->  Sort  (cost=2320484.45..2353332.79 rows=13139337 width=8) (actual time=107961.575..145971.848 rows=13139337 loops=1)
                                  Sort Key: public.item.list
                                  Sort Method: external merge  Disk: 231088kB
                                  ->  Seq Scan on item  (cost=0.00..228006.37 rows=13139337 width=8) (actual time=27.636..47661.750 rows=13139337 loops=1)
  CTE b
    ->  GroupAggregate  (cost=7166704.38..7843349.46 rows=13139337 width=12) (actual time=524258.000..794537.585 rows=13139337 loops=1)
          ->  Sort  (cost=7166704.38..7223638.09 rows=22773483 width=12) (actual time=524257.908..755765.703 rows=13858612 loops=1)
                Sort Key: public.item.id
                Sort Method: external merge  Disk: 297912kB
                ->  Merge Join  (cost=2448353.26..2826901.79 rows=22773483 width=12) (actual time=201205.036..425873.108 rows=13858612 loops=1)
                      Merge Cond: (public.prefered_store.id = public.item.prefered_stores)
                      ->  Sort  (cost=127685.43..129826.31 rows=856349 width=12) (actual time=4545.447..12507.054 rows=856346 loops=1)
                            Sort Key: public.prefered_store.id
                            Sort Method: external merge  Disk: 18408kB
                            ->  Seq Scan on prefered_store  (cost=0.00..14027.49 rows=856349 width=12) (actual time=0.060..2707.353 rows=856349 loops=1)
                      ->  Materialize  (cost=2320484.45..2386181.13 rows=13139337 width=8) (actual time=196659.554..406944.706 rows=13858611 loops=1)
                            ->  Sort  (cost=2320484.45..2353332.79 rows=13139337 width=8) (actual time=196659.535..396917.629 rows=13139337 loops=1)
                                  Sort Key: public.item.prefered_stores
                                  Sort Method: external merge  Disk: 231096kB
                                  ->  Seq Scan on item  (cost=0.00..228006.37 rows=13139337 width=8) (actual time=0.032..54885.583 rows=13139337 loops=1)
  ->  Sort  (cost=3253469.82..3286318.16 rows=13139337 width=36) (actual time=344329.838..353118.692 rows=13139337 loops=1)
        Sort Key: a.id
        Sort Method: external sort  Disk: 259792kB
        ->  CTE Scan on a  (cost=0.00..262786.74 rows=13139337 width=36) (actual time=252170.512..320132.738 rows=13139337 loops=1)
  ->  Materialize  (cost=3253469.82..3319166.50 rows=13139337 width=36) (actual time=838286.670..888495.578 rows=13139337 loops=1)
        ->  Sort  (cost=3253469.82..3286318.16 rows=13139337 width=36) (actual time=838286.652..886198.912 rows=13139337 loops=1)
              Sort Key: b.id
              Sort Method: external sort  Disk: 385320kB
              ->  CTE Scan on b  (cost=0.00..262786.74 rows=13139337 width=36) (actual time=524258.017..811717.462 rows=13139337 loops=1)
Total runtime: 1253101.865 ms

vyegorov · Accepted Answer

See, in PostgreSQL CTEs are acting as an optimization fence. As a result no matter which predicates you will add to the outer part of the query, PostgreSQL will perform a full traverse of the WITH() clauses. So to optimize you need to get rid of CTEs.

This is not easy as you prefered_store table is denormalized.

Please, try this query (also on SQL Fiddle):

SELECT i.id item,
    (SELECT string_agg(store||','||comment, ';')
       FROM prefered_store WHERE id=i.prefered_stores) item_stores_comments,
    string_agg(ps.store::text, ',') whishlist_stores
  FROM item i
  JOIN list_wishlist lw ON lw.list=i.list
  JOIN wishlist w ON w.id=lw.wishlist
  JOIN prefered_store ps ON ps.id=w.prefered_stores
 GROUP BY i.id;

But I would recommend reviewing schema design.

Mike Sherrill 'Cat Recall' · Answer

Building columns of comma-separated values in SQL is almost always the wrong approach. Better to return rows of data, and let application code deal with display formatting.

Test by eliminating the string_agg() function and the GROUP BY clause.

with a as (
  select item.id, 
         prefered_store.store wishlist_stores
  from item
  inner join list_wishlist on item.list=list_wishlist.list
  inner join wishlist on list_wishlist.wishlist=wishlist.id
  inner join prefered_store on wishlist.prefered_stores=prefered_store.id
), b as (
  select item.id, 
         prefered_store.store,
         prefered_store.comment item_stores_comments
    from item
    inner join prefered_store on item.prefered_stores=prefered_store.id
)
select * from a
inner join b on a.id = b.id

The SQL Fiddle you posted isn't much use. It has no primary keys, no secondary indexes, and not enough rows to avoid sequential scans.

How can I optimize this query (or is there a better way)?

Tags:

sql

postgresql

postgresql-9.2

userBG

2 Answers

vyegorov

Mike Sherrill 'Cat Recall'

Recent Activity

Donate For Us

How can I optimize this query (or is there a better way)?

Tags:

sql

postgresql

postgresql-9.2

userBG

2 Answers

vyegorov

Mike Sherrill 'Cat Recall'

Related questions

Recent Activity

Donate For Us