Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is Postgres not using index on a simple GROUP BY?

I have created a 36M rows table with an index on type column:

CREATE TABLE items AS
  SELECT
    (random()*36000000)::integer AS id,
    (random()*10000)::integer AS type,
    md5(random()::text) AS s
  FROM
    generate_series(1,36000000);
CREATE INDEX items_type_idx ON items USING btree ("type");

I run this simple query and expect postgresql to use my index:

explain select count(*) from "items" group by "type";

But the query planner decides to use Seq Scan instead:

HashAggregate  (cost=734592.00..734627.90 rows=3590 width=12) (actual time=6477.913..6478.344 rows=3601 loops=1)
  Group Key: type
  ->  Seq Scan on items  (cost=0.00..554593.00 rows=35999800 width=4) (actual time=0.044..1820.522 rows=36000000 loops=1)
Planning time: 0.107 ms
Execution time: 6478.525 ms

Time without EXPLAIN: 5s 979ms

I have tried several solutions from here and here:

  • Run VACUUM ANALYZE or VACUUM ANALYZE
  • Configure default_statistics_target, random_page_cost, work_mem

but nothing helps apart from setting enable_seqscan = OFF:

SET enable_seqscan = OFF;
explain select count(*) from "items" group by "type";

GroupAggregate  (cost=0.56..1114880.46 rows=3590 width=12) (actual time=5.637..5256.406 rows=3601 loops=1)
  Group Key: type
  ->  Index Only Scan using items_type_idx on items  (cost=0.56..934845.56 rows=35999800 width=4) (actual time=0.074..2783.896 rows=36000000 loops=1)
        Heap Fetches: 0
Planning time: 0.103 ms
Execution time: 5256.667 ms

Time without EXPLAIN: 659ms

Query with index scan is about 10x faster on my machine.

Is there a better solution than setting enable_seqscan?

UPD1

My postgresql version is 9.6.3, work_mem = 4MB (tried 64MB), random_page_cost = 4 (tried 1.1), max_parallel_workers_per_gather = 0 (tried 4).

UPD2

I have tried to fill type column not with random numbers, but with i / 10000 to make pg_stats.correlation = 1 - still seqscan.

UPD3

@jgh is 100% right:

This typically only happens when the table's row width is much wider than some indexes

I've made large column data and now postgres use index. Thanks everyone!

like image 852
Denis Drozdov Avatar asked Jul 06 '17 17:07

Denis Drozdov


People also ask

Why is Postgres not using my index?

As such, if Postgres chooses not to use your index, it is likely that it is calculating the cost of that query plan to be higher than the one it has chosen.

Does PostgreSQL automatically index?

PostgreSQL automatically creates a unique index when a unique constraint or primary key is defined for a table. The index covers the columns that make up the primary key or unique constraint (a multicolumn index, if appropriate), and is the mechanism that enforces the constraint.

How does group by work in PostgreSQL?

The PostgreSQL GROUP BY clause is used in collaboration with the SELECT statement to group together those rows in a table that have identical data. This is done to eliminate redundancy in the output and/or compute aggregates that apply to these groups.

Do you need to index primary key Postgres?

PostgreSQL automatically creates an index for each unique constraint and primary key constraint to enforce uniqueness. Thus, it is not necessary to create an index explicitly for primary key columns.


1 Answers

The Index-only scans wiki says

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.

and

Index-only scans are only used when the planner surmises that that will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This all heavily depends on visibility of tuples, if an index would be used anyway (i.e. how selective a predicate is, etc), and if there is actually an index available that could be used by an index-only scan in principle

Accordingly, your index is not considered "significantly smaller" and the entire dataset is to be read, which leads the planner in using a seq scan

like image 132
JGH Avatar answered Oct 26 '22 02:10

JGH