Index for a WHERE clause with datetime, and more

I'm using Postgres 9.1 and have a very slow query.

The Query:

EXPLAIN ANALYZE
SELECT COUNT(DISTINCT email)
FROM   "invites"
WHERE  created_at < '2012-10-10 21:08:05.259200'
AND    invite_method = 'email'
AND    accept_count = 0
AND    reminded_count < 3
AND    (last_reminded_at IS NULL OR last_reminded_at < '2012-10-10 21:08:05.261483');

Results:

Aggregate  (cost=19828.24..19828.25 rows=1 width=21) (actual time=11395.903..11395.903 rows=1 loops=1)
  ->  Seq Scan on invites  (cost=0.00..18970.57 rows=343068 width=21) (actual time=0.036..353.121 rows=337143 loops=1)
        Filter: ((created_at < '2012-10-10 21:08:05.2592'::timestamp without time zone) AND (reminded_count < 3) AND ((last_reminded_at IS NULL) OR (last_reminded_at < '2012-10-10 21:08:05.261483'::timestamp without time zone)) AND ((invite_method)::text = 'email'::text) AND (accept_count = 0))
Total runtime: 11395.970 ms

As you can see, this takes about 11 seconds. How would I go about adding an index to optimize this query's performance?

AnApprentice asked Oct 16 '12 21:10


1 Answer

Just indexing "everything", as Jim advises, is not a very efficient strategy. Indexes carry a maintenance cost, and combining many individual indexes is more expensive (to maintain and to query) than one tailored index. It always depends on your complete situation.

The cost of indexes is low for read-only or rarely written tables, but high for volatile tables with lots of write operations. An additional downside is that indexes prevent HOT updates (Heap-Only Tuples) whenever an UPDATE changes a column involved in an index. See:

  • Redundant data in update statements
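To get a rough idea of how often updates on the table currently qualify as HOT updates, the statistics collector can be queried. This is only a sketch: it assumes the table is named invites as in the question and that statistics collection is enabled (the default).

```sql
-- Compare the total number of updates with the number of HOT updates.
-- A low hot_pct on a heavily updated table suggests that updates often
-- change indexed columns (or that pages lack free space).
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM   pg_stat_user_tables
WHERE  relname = 'invites';
```

Comparing the numbers before and after adding an index gives a hint of what the index costs for writes.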

If performance of the particular query is important, a partial multi-column index would be a good strategy. Specialized, but a lot cheaper and faster than individual indexes on all involved columns. The rule of thumb is to ...

  • put the columns for volatile conditions (those that vary between queries) into the index itself.
  • use stable conditions (the same for every query) in the WHERE clause of the index to narrow down the set of rows it covers.

Judging from your column names (for lack of information), accept_count = 0 seems to be the most selective (and stable) filter here, while created_at and last_reminded_at probably keep changing. So maybe something like this:

CREATE INDEX invites_special_idx
ON     invites (created_at, last_reminded_at)
WHERE  accept_count = 0
AND    invite_method = 'email'
AND    reminded_count < 3;

Sort created_at and last_reminded_at ascending to match the query perfectly - which happens to be the default anyway. This way, the system can get all relevant rows in a single scan from the top of the index. Should be very fast.
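To check whether the planner actually picks up the new index, re-run the query with EXPLAIN ANALYZE after refreshing statistics. A minimal sketch, reusing the timestamps from the question; the query has to repeat the conditions of the index's WHERE clause so the planner can prove that the partial index applies:

```sql
-- Refresh planner statistics so the new index is costed realistically.
ANALYZE invites;

EXPLAIN ANALYZE
SELECT count(DISTINCT email)
FROM   invites
WHERE  accept_count = 0            -- matches the index predicate
AND    invite_method = 'email'     -- matches the index predicate
AND    reminded_count < 3          -- matches the index predicate
AND    created_at < '2012-10-10 21:08:05.259200'
AND    (last_reminded_at IS NULL
        OR last_reminded_at < '2012-10-10 21:08:05.261483');
```

If the plan still shows a Seq Scan on invites, that is not necessarily a mistake: when a large share of the table qualifies, the planner may estimate a sequential scan to be cheaper than going through the index.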

As we discussed in one of your previous questions, it may also help to cluster the table on the index. Be sure to read the manual about CLUSTER first.
As @Craig pointed out, you can't CLUSTER on a partial index. Since CLUSTER is a one-time operation (its effect degrades with later write operations), you can work around this restriction by creating a full index, clustering the table on it and dropping the index again. Like:

CREATE INDEX invites_special_idx2 ON invites (created_at, last_reminded_at);
CLUSTER invites USING invites_special_idx2;
DROP INDEX invites_special_idx2;

CLUSTER is only useful while there aren't other important queries with contradicting requirements for data distribution.
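Since the effect of CLUSTER wears off with later writes, it can help to look at how well the physical row order still matches the column order. A sketch, assuming the table and column names from the question:

```sql
-- Refresh statistics first; pg_stats is only as current as the last ANALYZE.
ANALYZE invites;

-- correlation close to 1 or -1: physical order closely follows the column.
-- correlation close to 0: rows are scattered, clustering has degraded.
SELECT attname, correlation
FROM   pg_stats
WHERE  tablename = 'invites'
AND    attname IN ('created_at', 'last_reminded_at');
```

Keep in mind that CLUSTER takes an exclusive lock on the table while it runs, so re-clustering is best scheduled for a quiet period.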

PostgreSQL 9.2 has a couple of new features that would make your query faster, in particular index-only scans (the first item in the release notes). Consider upgrading.
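For an index-only scan, every column the query reads has to be available from the index, which for this query includes email. A possible variant, as a sketch only: the index name invites_covering_idx is made up for illustration, and whether the planner really chooses an index-only scan depends on the version. As far as I know, releases before 9.6 do not consider one when the query references columns that appear only in the WHERE clause of a partial index.

```sql
-- Hypothetical covering variant of the partial index: email is appended
-- so that count(DISTINCT email) can be answered from the index alone.
CREATE INDEX invites_covering_idx
ON     invites (created_at, last_reminded_at, email)
WHERE  accept_count = 0
AND    invite_method = 'email'
AND    reminded_count < 3;

-- Index-only scans rely on the visibility map, which VACUUM maintains.
VACUUM ANALYZE invites;
```

The wider index is more expensive to maintain, so it only pays off if this query matters enough.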

Erwin Brandstetter answered Oct 15 '22 11:10