
Improving PARTITION BY performance?

Consider the following table:

 foo | bar
-----+-----
  3  |  1
  8  |  1
  2  |  1
  8  |  5
  6  |  5
  5  |  5
  4  |  5
  5  |  7
  4  |  7

Column foo contains arbitrary values. Column bar is almost ordered, and rows sharing a common bar value follow each other. The table contains ~1.7 million rows in total and about 15 rows for each distinct bar value.

I find PARTITION BY quite slow and am wondering if I can do anything to improve its performance?

I tried to CREATE INDEX bar_idx ON foobar(bar) but it had no effect on performance (IRL there is already a primary key on another column of the table). I'm using PostgreSQL 9.3.5.

Here are the EXPLAIN ANALYZE plans for a simple query, with and without PARTITION BY:

> EXPLAIN ANALYZE SELECT count(foo) OVER (PARTITION BY bar) FROM foobar;
                                                           QUERY PLAN                                                       
--------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=262947.92..293133.35 rows=1724882 width=8) (actual time=2286.082..3504.372 rows=1724882 loops=1)
   ->  Sort  (cost=262947.92..267260.12 rows=1724882 width=8) (actual time=2286.068..2746.047 rows=1724882 loops=1)
         Sort Key: bar
         Sort Method: external merge  Disk: 27176kB
         ->  Seq Scan on foobar  (cost=0.00..37100.82 rows=1724882 width=8) (actual time=0.019..441.827 rows=1724882 loops=1)
 Total runtime: 3606.695 ms
(6 rows)

> EXPLAIN ANALYZE SELECT foo FROM foobar;
                                                     QUERY PLAN                                                 
--------------------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..37100.82 rows=1724882 width=4) (actual time=0.014..385.931 rows=1724882 loops=1)
 Total runtime: 458.776 ms
(2 rows)

First improvement, increasing work_mem:

In most cases, increasing work_mem, as suggested by hbn, should help. In my case I'm working on an SSD, so switching the sort to RAM (by increasing work_mem to 1 GB) only reduces the processing time by a factor of ~1.5:

> EXPLAIN (ANALYZE, BUFFERS) SELECT count(foo) OVER (PARTITION BY bar) FROM foobar;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=215781.92..245967.35 rows=1724882 width=8) (actual time=933.575..1931.656 rows=1724882 loops=1)
   Buffers: shared hit=2754 read=17098
   ->  Sort  (cost=215781.92..220094.12 rows=1724882 width=8) (actual time=933.558..1205.314 rows=1724882 loops=1)
         Sort Key: bar
         Sort Method: quicksort  Memory: 130006kB
         Buffers: shared hit=2754 read=17098
         ->  Seq Scan on foobar  (cost=0.00..37100.82 rows=1724882 width=8) (actual time=0.023..392.446 rows=1724882 loops=1)
               Buffers: shared hit=2754 read=17098
 Total runtime: 2051.494 ms
(9 rows)
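For reference, work_mem can be raised for the current session only, without touching postgresql.conf (a sketch; '1GB' is the value used in the test above):

```sql
-- Session-level override: affects only the current connection
-- and reverts when the session ends.
SET work_mem TO '1GB';
```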

Second improvement, using CLUSTER:

I tried a few of the suggestions from this post; increasing statistics had no significant effect in my case. The only one that helped, and was not already in effect, was to "rewrite the table in the physical order of the index" using CLUSTER (you may prefer pg_repack; see the original post):

> CLUSTER foobar USING bar_idx;
CLUSTER
> EXPLAIN (ANALYZE, BUFFERS) SELECT count(foo) OVER (PARTITION BY bar) FROM foobar;
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=0.43..150079.25 rows=1724882 width=8) (actual time=0.031..1372.416 rows=1724882 loops=1)
   Buffers: shared hit=64 read=24503
   ->  Index Scan using bar_idx on foobar  (cost=0.43..124206.02 rows=1724882 width=8) (actual time=0.018..581.665 rows=1724882 loops=1)
         Buffers: shared hit=64 read=24503
 Total runtime: 1484.974 ms
(5 rows)
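For completeness, the full sequence used here was roughly the following (a sketch; note that CLUSTER takes an ACCESS EXCLUSIVE lock while it rewrites the table, and the physical ordering is not maintained for rows written afterwards):

```sql
-- One-time physical reordering of the table by the bar index.
CREATE INDEX bar_idx ON foobar (bar);
CLUSTER foobar USING bar_idx;
ANALYZE foobar;  -- refresh planner statistics after the rewrite
```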

Third improvement, subset of the table:

In my case I eventually need to select from this table joined with another table, so it makes sense to extract the relevant subset into a table of its own:

CREATE TABLE subfoobar AS (SELECT * FROM foobar WHERE bar IN (SELECT DISTINCT bar FROM othertable) ORDER BY bar);

The new table has only ~700k rows instead of 1.7 million, and the query time (after recreating an index on bar) seems roughly proportional, so the gain is significant:

> EXPLAIN (ANALYZE, BUFFERS) SELECT count(foo) OVER (PARTITION BY bar) FROM subfoobar;
                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=0.42..37455.61 rows=710173 width=8) (actual time=0.025..543.437 rows=710173 loops=1)
   Buffers: shared hit=10290
   ->  Index Scan using bar_sub_idx on subfoobar  (cost=0.42..26803.02 rows=710173 width=8) (actual time=0.015..222.211 rows=710173 loops=1)
         Buffers: shared hit=10290
 Total runtime: 590.063 ms
(5 rows)
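The index recreation mentioned above is a one-liner; bar_sub_idx is the name appearing in the plan:

```sql
-- Recreate the bar index on the subset table so the planner can
-- use an index scan instead of an explicit sort, as in the plan above.
CREATE INDEX bar_sub_idx ON subfoobar (bar);
ANALYZE subfoobar;
```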

Fourth improvement, summary table:

Since in my real use case the window function appears many times in the query, the query itself is run many times (data mining), and the aggregate results over the partitions are always the same, I opted for a more efficient approach: I extracted all of these values into a new "summary table" (I'm not sure my usage matches the "official" definition).

In our simple example, this would give:

CREATE TABLE summary_foobar AS SELECT DISTINCT ON (bar) count(foo) OVER (PARTITION BY bar) AS cfoo, bar FROM foobar;

Actually, as suggested by hbn in the comments, it's even better to create a MATERIALIZED VIEW instead of a new table, so that it can be updated at any time simply by running REFRESH MATERIALIZED VIEW summary_foobar:

CREATE MATERIALIZED VIEW summary_foobar AS SELECT DISTINCT ON (bar) count(foo) OVER (PARTITION BY bar) AS cfoo, bar FROM foobar;
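As an aside, since only the per-bar counts are kept, a plain GROUP BY should produce the same (cfoo, bar) rows as the DISTINCT ON + window-function form, and lets the planner use a simple aggregate instead of a sort over the whole table (a sketch, not tested on the original data):

```sql
-- Equivalent summary using a plain aggregate rather than
-- DISTINCT ON + a window function.
CREATE MATERIALIZED VIEW summary_foobar AS
SELECT count(foo) AS cfoo, bar
FROM foobar
GROUP BY bar;
```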

Then, applying the initial query to my real-case tables:

> EXPLAIN (ANALYZE, BUFFERS) SELECT cfoo FROM subfoobar,summary_foobar WHERE subfoobar.bar=summary_foobar.bar;
                                                          QUERY PLAN                                                      
------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1254.64..28939.67 rows=424685 width=73) (actual time=9.893..268.704 rows=370393 loops=1)
   Hash Cond: (subfoobar.bar = summary_foobar.bar)
   Buffers: shared hit=8916
   ->  Seq Scan on subfoobar  (cost=0.00..15448.73 rows=710173 width=4) (actual time=0.003..70.850 rows=710173 loops=1)
         Buffers: shared hit=8347
   ->  Hash  (cost=873.73..873.73 rows=30473 width=77) (actual time=9.872..9.872 rows=30473 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 3347kB
         Buffers: shared hit=569
         ->  Seq Scan on summary_foobar  (cost=0.00..873.73 rows=30473 width=77) (actual time=0.003..4.569 rows=30473 loops=1)
               Buffers: shared hit=569
 Total runtime: 286.910 ms [~550 ms if using foobar instead of subfoobar]
(11 rows)

All in all, for my real-case queries I went down from 5000+ ms per query to about 150 ms (less than in the example above because of the WHERE clauses).

Skippy le Grand Gourou asked Sep 30 '14 17:09

1 Answer

You probably need to increase work_mem. Your query is using an on-disk sort (27 MB); try setting work_mem to 64 MB or so and see how it performs then.

You can set it per session or transaction as well as in postgresql.conf.

SET work_mem TO '64MB';

This will set it for your current session.

Obviously a sensible value depends on how much RAM you have in your machine and the number of concurrent connections you expect to have.
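The per-transaction form mentioned above uses SET LOCAL, which reverts automatically at COMMIT or ROLLBACK:

```sql
BEGIN;
SET LOCAL work_mem TO '64MB';  -- applies to this transaction only
-- run the window-function query here
COMMIT;                        -- work_mem reverts to its previous value
```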

hbn answered Nov 15 '22 11:11