Why does a slight change in the search term slow down the query so much?

Question

I have the following query in PostgreSQL (9.5.1):

select e.id, (select count(id) from imgitem ii where ii.tabid = e.id and ii.tab = 'esp') as imgs,
 e.ano, e.mes, e.dia, cast(cast(e.ano as varchar(4))||'-'||right('0'||cast(e.mes as varchar(2)),2)||'-'|| right('0'||cast(e.dia as varchar(2)),2) as varchar(10)) as data,
 pl.pltag, e.inpa, e.det, d.ano anodet, coalesce(p.abrev,'')||' ('||coalesce(p.prenome,'')||')' determinador, d.tax, coalesce(v.val,v.valf)||' '||vu.unit as altura,
 coalesce(v1.val,v1.valf)||' '||vu1.unit as DAP, d.fam, tf.nome família, d.gen, tg.nome gênero, d.sp, ts.nome espécie, d.inf, e.loc, l.nome localidade, e.lat, e.lon
from esp e
left join det d on e.det = d.id
left join tax tf on d.fam = tf.oldfam
left join tax tg on d.gen = tg.oldgen
left join tax ts on d.sp = ts.oldsp
left join tax ti on d.inf = ti.oldinf
left join loc l on e.loc = l.id
left join pess p on p.id = d.detby
left join var v on v.esp = e.id and v.key = 265
left join varunit vu on vu.id = v.unit
left join var v1 on v1.esp = e.id and v1.key = 264
left join varunit vu1 on vu1.id = v1.unit
left join pl on pl.id = e.pl
WHERE unaccent(TEXT(coalesce(p.abrev,'')||' ('||coalesce(p.prenome,'')||')')) ilike unaccent('%vicen%')

It takes 430ms to retrieve 1129 rows from a total of 9250 in esp table.

If I change the search term from %vicen% to %vicent% (adding a 't'), it takes 431ms to retrieve the same 1129 rows.

Ordering by the search column, ascending and descending, I see that all 1129 rows have exactly the same name in both cases.

Now the strange: if I change the search term from %vicent% to %vicenti% (adding an 'i'), now it takes unbelievable 24.4 seconds to retrieve the same 1129 rows!

The searched term is always in the first coalesce, i.e. coalesce(p.abrev,''). I expect the query to run slower or faster, depending on the size of the searched string, but not that much!! Anyone has any idea of what's going on?

Results of EXPLAIN ANALYZE (would exceed the 30k character limit here):

For %vicen%: http://explain.depesz.com/s/2XF

For %vicenti%: http://explain.depesz.com/s/dEc6

Erwin Brandstetter · Accepted Answer

Why?

The reason is this:

Fast query:

->  Hash Left Join  (cost=1378.60..2467.48 rows=15 width=79) (actual time=41.759..85.037 rows=1129 loops=1)
      ...
      Filter: (unaccent(((((COALESCE(p.abrev, ''::character varying))::text || ' ('::text) || (COALESCE(p.prenome, ''::character varying))::text) || ')'::text)) ~~* (...)

Slow query:

->  Hash Left Join  (cost=1378.60..2467.48 rows=1 width=79) (actual time=35.084..80.209 rows=1129 loops=1)
      ...
      Filter: (unaccent(((((COALESCE(p.abrev, ''::character varying))::text || ' ('::text) || (COALESCE(p.prenome, ''::character varying))::text) || ')'::text)) ~~* unacc (...)

Extending the search pattern by another character causes Postgres to assume yet fewer hits. (Typically, this is a reasonable estimate.) Postgres obviously does not have precise enough statistics (none, actually, keep reading) to expect the same number of hits that you really get.

This causes a switch to a different query plan, which is even less optimal for the actual number of hits rows=1129.

Solution

Assuming current Postgres 9.5 since it has not been declared.

One way to improve the situation is to create an expression index on the expression in the predicate. This makes Postgres gather statistics for the actual expression, which can help the query even if the index itself is not used for the query. Without the index, there are no statistics for the expression at all. And if done right the index can be used for the query, that's even much better. But there are multiple problems with your current expression:

~~unaccent(TEXT(coalesce(p.abrev,'')||' ('||coalesce(p.prenome,'')||')')) ilike unaccent('%vicen%')~~

Consider this updated query, based on some assumptions about your undisclosed table definitions:

SELECT e.id
     , (SELECT count(*) FROM imgitem
        WHERE tabid = e.id AND tab = 'esp') AS imgs -- count(*) is faster
     , e.ano, e.mes, e.dia
     , e.ano::text || to_char(e.mes2, 'FM"-"00')
                   || to_char(e.dia,  'FM"-"00') AS data    
     , pl.pltag, e.inpa, e.det, d.ano anodet
     , format('%s (%s)', p.abrev, p.prenome) AS determinador
     , d.tax
     , coalesce(v.val,v.valf)   || ' ' || vu.unit  AS altura
     , coalesce(v1.val,v1.valf) || ' ' || vu1.unit AS dap
     , d.fam, tf.nome família, d.gen, tg.nome AS gênero, d.sp
     , ts.nome AS espécie, d.inf, e.loc, l.nome localidade, e.lat, e.lon
FROM      pess    p                        -- reorder!
JOIN      det     d   ON d.detby   = p.id  -- INNER JOIN !
LEFT JOIN tax     tf  ON tf.oldfam = d.fam
LEFT JOIN tax     tg  ON tg.oldgen = d.gen
LEFT JOIN tax     ts  ON ts.oldsp  = d.sp
LEFT JOIN tax     ti  ON ti.oldinf = d.inf  -- unused, see @joop's comment
LEFT JOIN esp     e   ON e.det     = d.id
LEFT JOIN loc     l   ON l.id      = e.loc
LEFT JOIN var     v   ON v.esp     = e.id AND v.key  = 265
LEFT JOIN varunit vu  ON vu.id     = v.unit
LEFT JOIN var     v1  ON v1.esp    = e.id AND v1.key = 264
LEFT JOIN varunit vu1 ON vu1.id    = v1.unit
LEFT JOIN pl          ON pl.id     = e.pl
WHERE f_unaccent(p.abrev)   ILIKE f_unaccent('%' || 'vicenti' || '%') OR
      f_unaccent(p.prenome) ILIKE f_unaccent('%' || 'vicenti' || '%');

Major points

Why f_unaccent()? Because unaccent() can't be indexed. Read this:

Does PostgreSQL support "accent insensitive" collations?

I used the function outlined there to allow the following (recommended!) multicolumn functional trigram GIN index:

CREATE INDEX pess_unaccent_nome_trgm_idx ON pess
USING gin (f_unaccent(pess) gin_trgm_ops, f_unaccent(prenome) gin_trgm_ops);

If you are not familiar with trigram indexes, read this first:

PostgreSQL LIKE query performance variations

And possibly:

Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

Be sure to run the latest version of Postgres (currently 9.5). There have been substantial improvements to GIN indexes. And you'll be interested in improvements in pg_trgm 1.2, scheduled to be released with the upcoming Postgres 9.6:

Trigram search gets much slower as search string gets longer

Prepared statements are a common way to execute queries with parameters (especially with text from user input). Postgres has to find a plan that works best for any given parameter. Add wildcards as constants to the to the search term like this:

f_unaccent(p.abrev) ILIKE f_unaccent('%' || 'vicenti' || '%')

('vicenti' would be replaced with a parameter.) So Postgres knows we are dealing with a pattern that is neither anchored left nor right - which would allow different strategies. Related answer with more details:

Performance impact of empty LIKE in a prepared statement

Or maybe re-plan the query for every search term (possibly using dynamic SQL in a function). But make sure planning time isn't eating any possible performance gain.

The WHERE condition on columns in pess contradicts the ~~LEFT JOIN~~. Postgres is forced to convert that to an INNER JOIN. What's worse the join comes late in the join tree. And since Postgres cannot reorder your joins (see below), that can become very expensive. Move the table to the first position in the FROM clause to eliminate rows early. Following LEFT JOINs do not eliminate any rows by definition. But with that many tables it is important to move joins that might multiply rows to the end.

You are joining 13 tables, 12 of them with LEFT JOIN which leaves 12! possible combinations - or 11! * 2! if we take the one LEFT JOIN into account that's really an INNER JOIN. That's too many for Postgres to evaluate all possible permutations for the best query plan. Read about join_collapse_limit:

Sample Query to show Cardinality estimation error in PostgreSQL
SQL INNER JOIN over multiple tables equal to WHERE syntax

The default setting for join_collapse_limit is 8, which means that Postgres won't try to reorder tables in your FROM clause and the order of tables is relevant.

One way work around this would be to split the performance-critical part into a CTE like @joop commented. Don't set join_collapse_limit much higher or times for query planning involving many joined tables will deteriorate.

About your concatenated date named data:

~~cast(cast(e.ano as varchar(4))||'-'||right('0'||cast(e.mes as varchar(2)),2)||'-'|| right('0'||cast(e.dia as varchar(2)),2) as varchar(10)) as data~~

Assuming you build from three numeric columns for year, month and day, which are defined NOT NULL, use this instead:

e.ano::text || to_char(e.mes2, 'FM"-"00')
            || to_char(e.dia,  'FM"-"00') AS data

About the FM template pattern modifier:

Check for integer in string array

But really, you should store the date as data type date to begin with.

Also simplified:

format('%s (%s)', p.abrev, p.prenome) AS determinador

Won't make the query faster, but it's much cleaner. See format().

First things last, all the usual advice for performance optimization applies:

Keep PostgreSQL from sometimes choosing a bad query plan

If you get all of this right, you should see much faster queries for all patterns.

joop · Answer

A way to reduce the size of the range table is to squeeze out a trivial part of the query into a CTE, such AS:

WITH zzz AS (
        SELECT l.id, l.nome
        , coalesce(v.val,v.valf)||' '||vu.unit as altura
        , coalesce(v1.val,v1.valf)||' '||vu1.unit as DAP
        FROM loc l 
         left join var v on v.esp = l.id and v.key = 265
         left join varunit vu on vu.id = v.unit
         left join var v1 on v1.esp = l.id and v1.key = 264
         left join varunit vu1 on vu1.id = v1.unit
        )
select e.id, (select count(id) from imgitem ii
                where ii.tabid = e.id and ii.tab = 'esp'
                ) as imgs
        , e.ano, e.mes, e.dia
        , cast(cast(e.ano as varchar(4))||'-'||right('0'||cast(e.mes as varchar(2)),2)||'-'|| right('0'||cast(e.dia as varchar(2)),2) as varchar(10)) as data
        , pl.pltag, e.inpa, e.det, d.ano anodet
        , coalesce(p.abrev,'')||' ('||coalesce(p.prenome,'')||')' determinador
        , d.tax

        , zzz.altura as altura
        , zzz.DAP as DAP

        , d.fam, tf.nome família
        , d.gen, tg.nome gênero
        , d.sp , ts.nome espécie
        , d.inf, e.loc
        , zzz.nome AS localidade
        , e.lat, e.lon
from esp e
left join det d on e.det = d.id         -- these could possibly be
left join pess p on p.id = d.detby      -- plain joins
        -- 
left join tax tf on d.fam = tf.oldfam
left join tax tg on d.gen = tg.oldgen
left join tax ts on d.sp = ts.oldsp
 -- ### commented out, since it is never referred
 -- ### left join tax ti on d.inf = ti.oldinf
left join pl on pl.id = e.pl
left JOIN zzz ON zzz.id = e.loc
        -- 
WHERE unaccent(TEXT(coalesce(p.abrev,'')||' ('||coalesce(p.prenome,'')||')')) ilike unaccent('%vicen%')
        ;

[untested, since I don't have the table definitions]

Why does a slight change in the search term slow down the query so much?

Tags:

performance

pattern-matching

postgresql

query-performance

postgresql-performance

Rodrigo

2 Answers

Why?

Solution

Major points

Erwin Brandstetter

joop

Recent Activity

Donate For Us

Why does a slight change in the search term slow down the query so much?

Tags:

performance

pattern-matching

postgresql

query-performance

postgresql-performance

Rodrigo

2 Answers

Why?

Solution

Major points

Erwin Brandstetter

joop

Related questions

Recent Activity

Donate For Us