Postgres - Is this the right way to create a partial index on a boolean column?

I have the following table:

CREATE TABLE recipemetadata
(
  --Lots of columns
  diet_glutenfree boolean NOT NULL
);

Almost every row will be set to FALSE unless someone comes up with some crazy new gluten-free diet that sweeps the country.

I need to be able to very quickly query for rows where this value is true. I've created the index:

CREATE INDEX IDX_RecipeMetadata_GlutenFree ON RecipeMetadata(diet_glutenfree) WHERE diet_glutenfree;

It appears to work, however I can't figure out how to tell if indeed it's only indexing rows where the value is true. I want to make sure it's not doing something silly like indexing any rows with any value at all.

Should I add an operator to the WHERE clause, or is this syntax perfectly valid? Hopefully this isn't one of those super easy RTFM questions that will get downvoted 30 times.
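(For reference, my understanding is that in a boolean context a bare column reference and an explicit comparison are equivalent predicates, so the explicit spelling of the same index would presumably be:

CREATE INDEX IDX_RecipeMetadata_GlutenFree ON RecipeMetadata(diet_glutenfree) WHERE diet_glutenfree = true;

)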

UPDATE:

I've gone ahead and added 10,000 rows to RecipeMetadata with random values. I then did an ANALYZE on the table and a REINDEX just to be sure. When I run the query:

select recipeid from RecipeMetadata where diet_glutenfree;

I get:

'Seq Scan on recipemetadata  (cost=0.00..214.26 rows=5010 width=16)'
'  Filter: diet_glutenfree'

So, it appears to be doing a sequential scan on the table even though only about half the rows have this flag. The index is being ignored.

If I do:

select recipeid from RecipeMetadata where not diet_glutenfree;

I get:

'Seq Scan on recipemetadata  (cost=0.00..214.26 rows=5016 width=16)'
'  Filter: (NOT diet_glutenfree)'

So no matter what, this index is not being used.
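One sanity check I can think of is to temporarily disable sequential scans for the session, which forces the planner to use an index if one is usable at all (this just distinguishes "can't use the index" from "chooses not to"; not something for production):

SET enable_seqscan = off;
EXPLAIN select recipeid from RecipeMetadata where diet_glutenfree;
SET enable_seqscan = on;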

asked Dec 15 '11 by Mike Christensen

2 Answers

I've confirmed the index works as expected.

I re-created the random data, only this time set diet_glutenfree to random() > 0.9 so there's only a 10% chance of an on bit.

I then re-created the indexes and tried the query again.

SELECT RecipeId from RecipeMetadata where diet_glutenfree;

Returns:

'Index Scan using idx_recipemetadata_glutenfree on recipemetadata  (cost=0.00..135.15 rows=1030 width=16)'
'  Index Cond: (diet_glutenfree = true)'

And:

SELECT RecipeId from RecipeMetadata where NOT diet_glutenfree;

Returns:

'Seq Scan on recipemetadata  (cost=0.00..214.26 rows=8996 width=16)'
'  Filter: (NOT diet_glutenfree)'

It seems my first attempt was polluted, since Postgres estimates it's faster to scan the whole table than to hit the index if it has to load over half the rows anyway.

However, I think I would get these exact results on a full index of the column. Is there a way to verify the number of rows indexed in a partial index?

UPDATE

The index is around 40k. I created a full index of the same column and it's over 200k, so it looks like it's definitely partial.
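The sizes above can be compared with pg_relation_size (the name of the second, full index is whatever you created it as; idx_recipemetadata_glutenfree_full here is just my name for it):

SELECT pg_size_pretty(pg_relation_size('idx_recipemetadata_glutenfree'));
SELECT pg_size_pretty(pg_relation_size('idx_recipemetadata_glutenfree_full'));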

answered Sep 22 '22 by Mike Christensen

An index on a one-bit field makes no sense. To understand the decisions made by the planner, you must think in terms of pages, not in terms of rows.

For 8K pages and an (estimated) row size of 80 bytes, there are about 100 rows on every page. Assuming a random distribution, the chance that a page consists only of rows with a true value is negligible: pow(0.5, 100), about 8e-31, IIRC. (And the same for 'false', of course.) Thus for a query on diet_glutenfree = true, every page has to be fetched anyway and filtered afterwards. Using an index would only cause more pages (the index's) to be fetched.
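Those estimates can be checked against the planner's own statistics for the table (a sketch; run after ANALYZE so the numbers are current):

SELECT relpages, reltuples FROM pg_class WHERE relname = 'recipemetadata';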

answered Sep 21 '22 by wildplasser