
PostgreSQL query optimization: No inner/outer join allowed

I was given this query to optimize on PostgreSQL 9.2 as part of a school assignment:

SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C INNER JOIN Items I ON(C.id = I.category) 
                  INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name

I have created these indexes on the respective tables: a secondary B+-tree index on items(category), a secondary B+-tree index on bids(item_id), and the primary index on categories(id).
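
For reference, the DDL for those indexes would look roughly like this (bid_itemid_idx matches the name in the plan further down; the items index name is my own shorthand):

CREATE INDEX items_category_idx ON Items (category);  -- secondary B+-tree
CREATE INDEX bid_itemid_idx     ON Bids  (item_id);   -- secondary B+-tree
-- categories(id) is the primary key, so PostgreSQL already maintains
-- a unique B-tree index on it automatically.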

The weird part is that PostgreSQL is doing a sequential scan on my Items, Categories, and Bids tables, and when I set enable_seqscan=off, the index scans turn out to be even more horrendous than the result below.
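
For reference, that experiment is just a session-local planner toggle around the same query:

SET enable_seqscan = off;  -- discourage sequential scans for this session
EXPLAIN (ANALYZE, BUFFERS)
SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C INNER JOIN Items I ON (C.id = I.category)
                  INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name;
SET enable_seqscan = on;   -- restore the default afterwards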

When I run EXPLAIN (ANALYZE, BUFFERS) in PostgreSQL, this is the result (please don't remove the indentation, it is significant):

GroupAggregate  (cost=119575.55..125576.11 rows=20 width=23) (actual time=6912.523..9459.431 rows=20 loops=1)
  Buffers: shared hit=30 read=12306, temp read=6600 written=6598
  ->  Sort  (cost=119575.55..121075.64 rows=600036 width=23) (actual time=6817.015..8031.285 rows=600036 loops=1)
        Sort Key: c.name
        Sort Method: external merge  Disk: 20160kB
        Buffers: shared hit=30 read=12306, temp read=6274 written=6272
        ->  Hash Join  (cost=9416.95..37376.03 rows=600036 width=23) (actual time=407.974..3322.253 rows=600036 loops=1)
              Hash Cond: (b.item_id = i.id)
              Buffers: shared hit=30 read=12306, temp read=994 written=992
              ->  Seq Scan on bids b  (cost=0.00..11001.36 rows=600036 width=8) (actual time=0.009..870.898 rows=600036 loops=1)
                    Buffers: shared hit=2 read=4999
              ->  Hash  (cost=8522.95..8522.95 rows=50000 width=19) (actual time=407.784..407.784 rows=50000 loops=1)
                    Buckets: 4096  Batches: 2  Memory Usage: 989kB
                    Buffers: shared hit=28 read=7307, temp written=111
                    ->  Hash Join  (cost=1.45..8522.95 rows=50000 width=19) (actual time=0.082..313.211 rows=50000 loops=1)
                          Hash Cond: (i.category = c.id)
                          Buffers: shared hit=28 read=7307
                          ->  Seq Scan on items i  (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.004..144.554 rows=50000 loops=1)
                                Buffers: shared hit=27 read=7307
                          ->  Hash  (cost=1.20..1.20 rows=20 width=19) (actual time=0.062..0.062 rows=20 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                Buffers: shared hit=1
                                ->  Seq Scan on categories c  (cost=0.00..1.20 rows=20 width=19) (actual time=0.004..0.028 rows=20 loops=1)
                                      Buffers: shared hit=1
Total runtime: 9473.257 ms

See this plan on explain.depesz.com.

I just want to know why this happens, i.e. why indexes make the query horrendously slow compared to a sequential scan.

Edit: I think I have managed to uncover a couple of things by going through the PostgreSQL documentation. PostgreSQL decided to do a sequential scan on tables such as bids and items because it predicted it would have to retrieve every single row in the table (compare the estimated row count in the parentheses before the actual time with the row count in the actual-time part). A sequential scan is better for retrieving all rows, so nothing can be done about that part.

I have created an extra index on categories(name), and the result below is what I get. It improved somewhat, but now the hash join is replaced with a nested loop. Any clue why?
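
The extra index was created like this (the name matches categories_name_idx in the plan below):

CREATE INDEX categories_name_idx ON Categories (name);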

GroupAggregate  (cost=0.00..119552.02 rows=20 width=23) (actual time=617.330..7725.314 rows=20 loops=1)
  Buffers: shared hit=178582 read=37473 written=14, temp read=2435 written=436
  ->  Nested Loop  (cost=0.00..115051.55 rows=600036 width=23) (actual time=0.120..6186.496 rows=600036 loops=1)
        Buffers: shared hit=178582 read=37473 written=14, temp read=2109 written=110
        ->  Nested Loop  (cost=0.00..26891.55 rows=50000 width=19) (actual time=0.066..2827.955 rows=50000 loops=1)
              Join Filter: (c.id = i.category)
              Rows Removed by Join Filter: 950000
              Buffers: shared hit=2 read=7334 written=1, temp read=2109 written=110
              ->  Index Scan using categories_name_idx on categories c  (cost=0.00..12.55 rows=20 width=19) (actual time=0.039..0.146 rows=20 loops=1)
                    Buffers: shared hit=1 read=1
              ->  Materialize  (cost=0.00..8280.00 rows=50000 width=8) (actual time=0.014..76.908 rows=50000 loops=20)
                    Buffers: shared hit=1 read=7333 written=1, temp read=2109 written=110
                    ->  Seq Scan on items i  (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.007..170.464 rows=50000 loops=1)
                          Buffers: shared hit=1 read=7333 written=1
        ->  Index Scan using bid_itemid_idx on bids b  (cost=0.00..1.60 rows=16 width=8) (actual time=0.016..0.036 rows=12 loops=50000)
              Index Cond: (item_id = i.id)
              Buffers: shared hit=178580 read=30139 written=13
Total runtime: 7726.392 ms

Have a look at the plan here to see if it is better.

I have managed to reduce the estimated cost to 114062.92 by creating indexes on categories(id) and items(category); PostgreSQL used both indexes to reach that cost. However, now PostgreSQL is playing games with me by not using the indexes! Why is it so buggy?

asked Oct 20 '12 by ImNoob


1 Answer

Thank you for posting EXPLAIN output without being asked, and for using EXPLAIN (BUFFERS, ANALYZE).

A significant part of your query's performance issue is likely to be the outer sort plan node, which is doing an on-disk merge sort with a temporary file:

Sort Method: external merge Disk: 20160kB

You could do this sort in memory by setting:

SET work_mem = '50MB';

before running your query. The setting can also be changed per-user, per-database, or globally in postgresql.conf.
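
For example (the role and database names here are made up for illustration):

SET work_mem = '50MB';                            -- current session only
ALTER ROLE app_user SET work_mem = '50MB';        -- per-user default
ALTER DATABASE auctions SET work_mem = '50MB';    -- per-database default
-- globally: set  work_mem = 50MB  in postgresql.conf, then reload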

I'm not convinced that adding indexes will be of much benefit as the query is currently structured. It needs to read and join all rows from all three tables, and hash joins are likely to be the fastest way to do so.

I suspect there are other ways to express that query that will use entirely different and more efficient execution strategies, but I'm having a brain-fade about what they might be and don't want to spend the time making up dummy tables to play around with. More work_mem should significantly improve the query as it stands.
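
One candidate, sketched here untested against the schema in the question, is to pre-aggregate the bids per item in a subquery so the outer aggregate no longer needs COUNT(DISTINCT):

SELECT C.name,
       COUNT(*)         AS NumItems,  -- one row per item after pre-aggregation
       SUM(IB.num_bids) AS NumBids
FROM Categories C
JOIN (SELECT I.category, I.id, COUNT(B.id) AS num_bids
      FROM Items I
      JOIN Bids B ON B.item_id = I.id
      GROUP BY I.category, I.id) IB
  ON IB.category = C.id
GROUP BY C.name;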

answered Sep 17 '22 by Craig Ringer