I have a table in PostgreSQL 9.2 that has a text column. Let's call this text_col. The values in this column are fairly unique (may contain 5-6 duplicates at the most). The table has ~5 million rows. About half of these rows contain a null value for text_col. When I execute the following query I expect 1-5 rows. In most cases (>80%) I only expect 1 row.
explain analyze SELECT col1,col2.. colN
FROM table
WHERE text_col = 'my_value';
A btree index exists on text_col. This index is never used by the query planner and I am not sure why. This is the output of the query.
Seq Scan on two (cost=0.000..459573.080 rows=93 width=339) (actual time=1392.864..3196.283 rows=2 loops=1)
Filter: (victor = 'foxtrot'::text)
Rows Removed by Filter: 4077384
I added another partial index to try to leave out the rows where text_col is null, but that did not help (with or without text_pattern_ops; I do not need text_pattern_ops, considering no LIKE conditions are expressed in my queries, but they also match equality).
CREATE INDEX name_idx
ON table
USING btree
(text_col COLLATE pg_catalog."default" text_pattern_ops)
WHERE text_col IS NOT NULL;
Disabling sequential scans with set enable_seqscan = off; does not help either: the planner still picks the seq scan over an index scan.
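For reference, this is roughly how I toggled it in a session while re-testing (a sketch only, reusing the placeholder table and column names from above):

SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT col1, col2 FROM table WHERE text_col = 'my_value';
RESET enable_seqscan;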
. In summary...
A partial index is a good idea to exclude the half of the table's rows that you obviously do not need. Simpler:
CREATE INDEX name_idx ON table (text_col)
WHERE text_col IS NOT NULL;
Be sure to run ANALYZE table after creating the index. (Autovacuum does that automatically after some time if you don't do it manually, but if you test right after creation, your test will fail.)
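If you are unsure whether statistics are current yet, a quick catalog check along these lines can help (a sketch, not part of the original answer; relname uses the placeholder table name):

SELECT last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'table';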
Then, to convince the query planner that a particular partial index can be used, repeat the WHERE condition in the query - even if it seems completely redundant:
SELECT col1,col2, .. colN
FROM table
WHERE text_col = 'my_value'
AND text_col IS NOT NULL; -- repeat condition
Voilà.
Per documentation:
However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example "x < 1" implies "x < 2"; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index.
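To illustrate that implication rule with a hypothetical example (the table orders and column amount are made up for illustration, not from the question):

CREATE INDEX orders_small_idx ON orders (amount) WHERE amount < 2;

-- Can be used: "amount < 1" implies the index predicate "amount < 2".
SELECT * FROM orders WHERE amount < 1;

-- Cannot be used: "amount < 3" does not imply "amount < 2".
SELECT * FROM orders WHERE amount < 3;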
As for parameterized queries: again, add the (redundant) predicate of the partial index as an additional, constant WHERE condition, and it works just fine.
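A minimal sketch of what that looks like with a prepared statement (placeholder names as in the question):

PREPARE lookup(text) AS
SELECT col1, col2
FROM   table
WHERE  text_col = $1            -- parameterized: cannot imply the index predicate
AND    text_col IS NOT NULL;    -- constant: matches the partial index predicate

EXECUTE lookup('my_value');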
An important update in Postgres 9.6 largely improves the chances for index-only scans (which can make queries cheaper, and the query planner will more readily choose such plans).
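For example, if a query needs nothing beyond text_col, the partial index can in principle satisfy it without touching the heap (a sketch only; actual behavior depends on the Postgres version and the visibility map):

EXPLAIN (ANALYZE)
SELECT text_col
FROM   table
WHERE  text_col = 'my_value'
AND    text_col IS NOT NULL;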
A partial index is only used if the WHERE conditions match. Thus an index with WHERE text_col IS NOT NULL can only be used if you use the same condition in your SELECT. Collation mismatch could also cause harm.
Try the following:
CREATE INDEX foo ON table (text_col);
ANALYZE table;
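To follow up on the collation remark, this is one way to see which collation the column actually uses, so it can be compared with the index definition (a hypothetical sanity check, not part of the original answer):

SELECT a.attname, c.collname
FROM   pg_attribute a
LEFT   JOIN pg_collation c ON c.oid = a.attcollation
WHERE  a.attrelid = 'table'::regclass
AND    a.attname = 'text_col';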
I figured it out. Upon taking a closer look at the pg_stats view that analyze helps build, I came across this excerpt in the documentation.
Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is null if the column data type does not have a < operator.)
On my local box the correlation number is 0.97, while on production it was 0.05. Thus the planner estimates that it is cheaper to go through all those rows sequentially instead of looking up the index each time and doing a random access into the disk block. This is the query I used to peek at the correlation number.
select * from pg_stats where tablename = 'table_name' and attname = 'text_col';
This table also has a few updates performed on its rows. The avg_width of the rows is estimated to be 20 bytes. If an update stores a large value in the text column, it can exceed the average and also result in a slower update. My guess was that the physical and logical ordering were slowly moving apart with each update. To fix that I executed the following queries.
ALTER TABLE table_name SET (FILLFACTOR = 80);
VACUUM FULL table_name;
REINDEX TABLE table_name;
ANALYZE table_name;
The idea is that I could give each disk block a 20% buffer and vacuum full the table to reclaim lost space and maintain physical and logical order. After I did this, the query picks up the index.
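Re-running the earlier pg_stats query is an easy way to confirm the effect; the correlation value should move back toward +1 if the rewrite restored physical order:

SELECT correlation
FROM   pg_stats
WHERE  tablename = 'table_name'
AND    attname = 'text_col';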
explain analyze SELECT col1,col2... colN
FROM table_name
WHERE text_col IS NOT NULL
AND text_col = 'my_value';
Index Scan using tango on two (cost=0.000..165.290 rows=40 width=339) (actual time=0.083..0.086 rows=1 loops=1)
Index Cond: ((victor IS NOT NULL) AND (victor = 'delta'::text))
Excluding the NULL condition picks up the other index with a bitmap heap scan.
Bitmap Heap Scan on two (cost=5.380..392.150 rows=98 width=339) (actual time=0.038..0.039 rows=1 loops=1)
Recheck Cond: (victor = 'delta'::text)
-> Bitmap Index Scan on tango (cost=0.000..5.360 rows=98 width=0) (actual time=0.029..0.029 rows=1 loops=1)
Index Cond: (victor = 'delta'::text)
While it initially looked like correlation plays a major role in choosing the index scan, @Mike has observed that a correlation value close to 0 on his database still resulted in an index scan. Changing the fill factor and vacuuming fully helped, but I'm unsure why.