
Different query plan for DISTINCT ON query on two different machines

Tags:

postgresql

I'm running the same query on two databases on separate machines, and I'm trying to find the setting (or settings) that causes them to use different query plans.

SETUP

CREATE TABLE my_table (
    condition_id BIGINT,
    group_by_id BIGINT,
    order_date DATE,
    data DOUBLE PRECISION
);

INSERT INTO my_table (condition_id, group_by_id, order_date, data)
SELECT 1000 * random(),
       10000 * random(),
       current_date - cast(cast(100 * random() AS BIGINT) || ' day' AS INTERVAL),
       random()
FROM generate_series(1, 10000000);

CREATE INDEX my_table_idx ON my_table (condition_id, group_by_id, order_date DESC);

VACUUM ANALYSE my_table;

QUERY

EXPLAIN (ANALYSE TRUE, BUFFERS TRUE)
SELECT DISTINCT ON (group_by_id) *
FROM my_table
    WHERE condition_id = 1
ORDER BY group_by_id, order_date DESC;

RESULTS

MACHINE 1 UNCACHED

Unique  (cost=0.56..10220.12 rows=6209 width=28) (actual time=0.034..8.820 rows=6393 loops=1)
  Buffers: shared hit=10106 read=54
  I/O Timings: read=0.367
  ->  Index Scan using my_table_idx on my_table  (cost=0.56..10195.97 rows=9662 width=28) (actual time=0.033..6.935 rows=10106 loops=1)
        Index Cond: (condition_id = 1)
        Buffers: shared hit=10106 read=54
        I/O Timings: read=0.367
Planning time: 0.177 ms
Execution time: 9.197 ms

MACHINE 1 CACHED

Unique  (cost=0.56..10220.12 rows=6209 width=28) (actual time=0.013..9.059 rows=6393 loops=1)
  Buffers: shared hit=10160
  ->  Index Scan using my_table_idx on my_table  (cost=0.56..10195.97 rows=9662 width=28) (actual time=0.013..7.245 rows=10106 loops=1)
        Index Cond: (condition_id = 1)
        Buffers: shared hit=10160
Planning time: 0.069 ms
Execution time: 9.430 ms

MACHINE 2 UNCACHED

Unique  (cost=10261.82..10271.81 rows=6319 width=28) (actual time=3604.888..3607.945 rows=6243 loops=1)
  Buffers: shared read=9236
  I/O Timings: read=3186.548
  ->  Sort  (cost=10261.82..10266.81 rows=9987 width=28) (actual time=3604.886..3606.016 rows=9808 loops=1)
        Sort Key: group_by_id, order_date DESC
        Sort Method: quicksort  Memory: 1151kB
        Buffers: shared read=9236
        I/O Timings: read=3186.548
        ->  Bitmap Heap Scan on my_table  (cost=70.59..10129.13 rows=9987 width=28) (actual time=29.201..3528.437 rows=9808 loops=1)
              Recheck Cond: (condition_id = 1)
              Heap Blocks: exact=9184
              Buffers: shared read=9236
              I/O Timings: read=3186.548
              ->  Bitmap Index Scan on my_table_idx  (cost=0.00..70.09 rows=9987 width=0) (actual time=27.637..27.637 rows=9808 loops=1)
                    Index Cond: (condition_id = 1)
                    Buffers: shared read=52
                    I/O Timings: read=2.168
Planning time: 0.771 ms
Execution time: 3608.721 ms

MACHINE 2 CACHED

Unique  (cost=10261.82..10271.81 rows=6319 width=28) (actual time=32.859..40.082 rows=6243 loops=1)
  Buffers: shared hit=9236
  ->  Sort  (cost=10261.82..10266.81 rows=9987 width=28) (actual time=32.856..36.278 rows=9808 loops=1)
        Sort Key: group_by_id, order_date DESC
        Sort Method: quicksort  Memory: 1151kB
        Buffers: shared hit=9236
        ->  Bitmap Heap Scan on my_table  (cost=70.59..10129.13 rows=9987 width=28) (actual time=2.884..26.122 rows=9808 loops=1)
              Recheck Cond: (condition_id = 1)
              Heap Blocks: exact=9184
              Buffers: shared hit=9236
              ->  Bitmap Index Scan on my_table_idx  (cost=0.00..70.09 rows=9987 width=0) (actual time=1.551..1.551 rows=9808 loops=1)
                    Index Cond: (condition_id = 1)
                    Buffers: shared hit=52
Planning time: 0.098 ms
Execution time: 40.907 ms

The machines run on different hardware and slightly different Postgres versions: Machine 1 is on 10.10 while Machine 2 is on 10.6. My next step is probably to upgrade both machines to the same version.

I've tried changing work_mem, random_page_cost, effective_cache_size and cpu_tuple_cost, but none of them seem to affect the plan choice (unless I set them to something silly, which makes both machines behave worse). The setup on both machines when running the test above was:

work_mem 20971kB
random_page_cost 1.1
effective_cache_size 24GB
cpu_tuple_cost 0.01
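
To rule out differences in other planner settings, the relevant values can be dumped on both machines and diffed. A minimal sketch (the parameter list here is just a guess at which settings matter; extend it as needed):

-- Compare planner-related settings on both machines and diff the output.
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('work_mem', 'random_page_cost', 'effective_cache_size',
               'cpu_tuple_cost', 'shared_buffers', 'seq_page_cost')
ORDER BY name;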

I believe I want Machine 2 to use the same query plan as Machine 1, because as it is now, Machine 2 is heavily affected by how the data is ordered on disk and by whether or not the data is already in memory.

It turned out the query plans weren't all that different; Machine 1 simply kept the table in cache forever. To keep the uncached reads reasonable I will probably cluster the table on the index from time to time, and also look into why Machine 2 evicts the table from cache so often:

CLUSTER my_table USING my_table_idx;
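
If the real problem is Machine 2 evicting the table from cache, another thing worth exploring is the pg_prewarm contrib module (assuming it is available on that server), which can load a relation back into shared buffers on demand:

-- Requires the pg_prewarm contrib module on the server.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load the table and its index into shared buffers.
SELECT pg_prewarm('my_table');
SELECT pg_prewarm('my_table_idx');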
asked Nov 07 '22 by Andreas Sjöroos

1 Answer

The two plans have almost the same estimated cost, so it could be randomness. Remember that you used random() to generate data. Does the plan remain stable per machine across several test runs?
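
A cheap way to check plan stability is to repeat a plain EXPLAIN (no ANALYSE, so nothing is executed) on each machine and watch whether the top of the plan flips between an Index Scan and a Bitmap Heap Scan + Sort:

-- Run several times per machine (optionally after a fresh ANALYSE my_table)
-- and compare which plan shape the optimizer chooses.
EXPLAIN
SELECT DISTINCT ON (group_by_id) *
FROM my_table
WHERE condition_id = 1
ORDER BY group_by_id, order_date DESC;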

The observed difference may well be a caching effect. The slow plan performs a lot of I/O, while the fast plan had everything cached.
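
To verify the caching explanation, the pg_buffercache contrib module (assuming it can be installed on both servers) shows how much of the table and index currently sit in shared buffers:

-- Requires the pg_buffercache contrib module.
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Count the 8 kB shared buffers currently holding blocks of the table and index.
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
WHERE c.relname IN ('my_table', 'my_table_idx')
GROUP BY c.relname;

Note that this only reflects PostgreSQL's shared buffers, not the operating system cache.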

To get even more from your execution plans, set track_io_timing to on.
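
For example, it can be turned on cluster-wide without a restart (superuser required):

-- Enable I/O timing collection; EXPLAIN (ANALYSE, BUFFERS) will then include
-- "I/O Timings" lines.
ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();

-- Or only for the current (superuser) session:
SET track_io_timing = on;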

Configuration parameters that influence the plan choice here are effective_cache_size (high values favor index scans) and random_page_cost (low values favor index scans).
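
Both can be tried per session before re-running the query, so Machine 2 can be nudged toward the index-scan plan without touching the global configuration. A sketch with illustrative values (not a recommendation for permanent settings):

-- Session-only experiment: make random I/O look cheaper, then see whether the
-- planner switches from Bitmap Heap Scan + Sort to the index scan.
SET random_page_cost = 1.0;
SET effective_cache_size = '24GB';

EXPLAIN (ANALYSE TRUE, BUFFERS TRUE)
SELECT DISTINCT ON (group_by_id) *
FROM my_table
WHERE condition_id = 1
ORDER BY group_by_id, order_date DESC;

RESET random_page_cost;
RESET effective_cache_size;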

answered Nov 14 '22 by Laurenz Albe