 

What to index on queries with lots of columns in the WHERE clause

Building a search engine for an apartment site and I'm not sure how to index the apartments table.

Example of queries:

  • ...WHERE city_id = 1 AND size > 500 AND rooms = 2
  • ...WHERE area_id = 2 AND ad_type = 'agent' AND price BETWEEN 10000 AND 14000
  • ...WHERE area_id = 2 OR area_id = 4 AND published_at > '2016-01-01' AND ad_type = 1

As you can see, the columns can vary a lot, and the number of columns in the WHERE clause can be up to 10, or possibly even more.

  • Should I index all of them?
  • Only the most common ones?

asked Sep 19 '16 06:09 by Frexuz




2 Answers

You have to figure out which WHERE clauses you are going to use with this query, how often each will occur and how selective each condition will be.

  • Don't index for queries that occur seldom unless you have to.

  • Use multicolumn indexes, starting with those columns that will occur in an = comparison.

  • Concerning the order of columns in a multicolumn index, start with those columns that will be used in a query by themselves (an index can be used for a query with only some of its columns, provided they are at the beginning of the index).

  • You might omit columns with low selectivity, like gender.

For example, with your above queries, if they are all frequent and all columns are selective, these indexes would be good:

... ON apartments (city_id, rooms, size)

... ON apartments (area_id, ad_type, price)

... ON apartments (area_id, ad_type, published_at)

These indexes could also be used for WHERE clauses with only area_id or city_id in them.
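
Written out in full, the three definitions above might look like the following sketch. The index names are made up for illustration; only the column lists come from the suggestion above. The EXPLAIN at the end is one way to check whether the planner uses the first index for a query on its leading column alone.

```sql
-- Index names are hypothetical; the column lists are the ones suggested above.
CREATE INDEX apartments_city_rooms_size_idx
    ON apartments (city_id, rooms, size);

CREATE INDEX apartments_area_type_price_idx
    ON apartments (area_id, ad_type, price);

CREATE INDEX apartments_area_type_published_idx
    ON apartments (area_id, ad_type, published_at);

-- A query on the leading column alone can still use the first index.
EXPLAIN
SELECT * FROM apartments WHERE city_id = 1;
```

Whether the planner actually picks the index depends on table size and statistics, so it is worth checking the plan against realistic data.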

It is bad to have too many indexes: each one takes up space and slows down data modifications.

If the above method would lead to too many indexes, e.g. because the user can pick arbitrary columns for the WHERE clause, it is better to index individual columns or occasionally pairs of columns that regularly go together.

That way PostgreSQL can pick a bitmap index scan to combine several indexes for one query. That is less efficient than a regular index scan, but usually better than a sequential scan.
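
A minimal sketch of that approach, assuming the apartments table from the question; the index names are hypothetical and the choice of columns is only an example:

```sql
-- One index per column (names are hypothetical).
CREATE INDEX apartments_city_id_idx ON apartments (city_id);
CREATE INDEX apartments_rooms_idx   ON apartments (rooms);
CREATE INDEX apartments_price_idx   ON apartments (price);

-- Check how the planner combines them for a multi-column filter.
EXPLAIN
SELECT *
FROM   apartments
WHERE  city_id = 1
AND    rooms = 2
AND    price BETWEEN 10000 AND 14000;
```

If the planner decides to combine the indexes, the plan shows a BitmapAnd node over several Bitmap Index Scan nodes, feeding a Bitmap Heap Scan. It may just as well use a single index or a sequential scan if it estimates that to be cheaper.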

answered Oct 05 '22 19:10 by Laurenz Albe


Postgres 9.6 provides a new extension to address your conundrum precisely:

bloom index

From the same authors who brought trigram indexes or text search to Postgres (among other things).

A single bloom index on all involved columns works well for any combination of them in the WHERE clause - even if not as well as separate btree indexes on each column. But a single index is much smaller and cheaper to maintain than many indexes. You'll have to weigh costs and benefits.

A bloom index excels for many index columns that can be combined in many ways.

I might combine a bloom index as "catch-all" with some tailored multicolumn btree indexes to optimize the most common combinations (along the guidelines provided by @Laurenz) and some single column indexes on the most frequently queried columns.

Some more explanation:

  • Is a composite index also good for queries on the first field?

The feature is new and there are some important limitations. Quoting the manual:

  • Only operator classes for int4 and text are included with the module.

  • Only the = operator is supported for search. But it is possible to add support for arrays with union and intersection operations in the future.

So not for published_at, which looks like a date (but you could still extract an EPOCH and index that) and only for equality predicates.
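
A rough sketch of the epoch idea. It rests on several assumptions: published_at is of type date, area_id and city_id are plain integer columns, and bloom accepts an index expression the way other index access methods do. The index name is hypothetical.

```sql
-- Assumptions: tbl.published_at is a date, area_id and city_id are integer.
CREATE EXTENSION IF NOT EXISTS bloom;

-- Index name is hypothetical. The expression turns the date into an int4.
CREATE INDEX tbl_bloom_published_idx
ON tbl USING bloom (area_id, city_id, (extract(epoch FROM published_at)::int));

-- Only an equality test on the very same expression can use the index.
SELECT *
FROM   tbl
WHERE  area_id = 2
AND    extract(epoch FROM published_at)::int
     = extract(epoch FROM date '2016-01-01')::int;
```

This does nothing for a range predicate like published_at > '2016-01-01'; a plain btree index on published_at remains the tool for that.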

After creating the extension (once per DB):

CREATE EXTENSION bloom;

Create a bloom index:

CREATE INDEX tbl_bloomidx
ON tbl USING bloom (area_id, city_id, size, rooms, ad_type);  -- many more columns?
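
The bloom module also takes storage parameters: length (signature size in bits per index entry, default 80) and col1 ... col32 (bits generated per index column, default 2). The values below are arbitrary and only illustrate the syntax, and the index name is hypothetical; suitable settings depend on the data.

```sql
-- Hypothetical variant of the index above with explicit bloom parameters.
-- length: signature size in bits per index entry (default 80)
-- col1 .. col5: bits generated for each of the five index columns (default 2)
CREATE INDEX tbl_bloomidx_tuned
ON tbl USING bloom (area_id, city_id, size, rooms, ad_type)
WITH (length = 128, col1 = 4, col2 = 4, col3 = 2, col4 = 2, col5 = 2);

-- Any combination of equality predicates on the indexed columns can use it.
EXPLAIN
SELECT * FROM tbl WHERE city_id = 1 AND rooms = 2;
```

A longer signature lowers the rate of false positives that have to be rechecked against the table, at the price of a bigger index.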

And some others:

CREATE INDEX tbl_published_at ON tbl (published_at);
CREATE INDEX tbl_price ON tbl (price);
-- some popular combinations...

The manual has some examples comparing bloom, multicolumn and single-column btree indexes. Very insightful.

answered Oct 05 '22 18:10 by Erwin Brandstetter