Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing a postgres similarity query (pg_trgm + gin index)

I have defined the following index:

CREATE INDEX
    users_search_idx
ON
    auth_user
USING
    gin(
        username gin_trgm_ops,
        first_name gin_trgm_ops,
        last_name gin_trgm_ops
    );

I am performing the following query:

PREPARE user_search (TEXT, INT) AS
    SELECT
        username,
        email,
        first_name,
        last_name,
        ( -- would probably do per-field weightings here
            s_username + s_first_name + s_last_name
        ) rank
    FROM
        auth_user,
        similarity(username, $1) s_username,
        similarity(first_name, $1) s_first_name,
        similarity(last_name, $1) s_last_name
    WHERE
        username % $1 OR
        first_name % $1 OR
        last_name % $1
    ORDER BY
        rank DESC
    LIMIT $2;

The auth_user table has 6.2 million rows.

The speed of the query seems to depend very heavily on the number of results potentially returned by the similarity query.

Increasing the similarity threshold via set_limit helps, but reduces usefulness of results by eliminating partial matches.

Some searches return in 200ms, others take ~ 10 seconds.

We have an existing implementation of this feature using Elasticsearch that returns in < 200ms for any query, while doing more complicated (better) ranking.

I would like to know if there is any way we could improve this to get more consistent performance?

It's my understanding that GIN index (inverted index) is the same basic approach used by Elasticsearch so I would have thought there is some optimization possible.

An EXPLAIN ANALYZE EXECUTE user_search('mel', 20) shows:

Limit  (cost=54099.81..54099.86 rows=20 width=52) (actual time=10302.092..10302.104 rows=20 loops=1)
  ->  Sort  (cost=54099.81..54146.66 rows=18739 width=52) (actual time=10302.091..10302.095 rows=20 loops=1)
        Sort Key: (((s_username.s_username + s_first_name.s_first_name) + s_last_name.s_last_name)) DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Nested Loop  (cost=382.74..53601.17 rows=18739 width=52) (actual time=118.164..10293.765 rows=8380 loops=1)
              ->  Nested Loop  (cost=382.74..53132.69 rows=18739 width=56) (actual time=118.150..10262.804 rows=8380 loops=1)
                    ->  Nested Loop  (cost=382.74..52757.91 rows=18739 width=52) (actual time=118.142..10233.990 rows=8380 loops=1)
                          ->  Bitmap Heap Scan on auth_user  (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380loops=1)"
                                Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text %'mel'::text))"
                                Rows Removed by Index Recheck: 2434523
                                Heap Blocks: exact=49337 lossy=53104
                                ->  BitmapOr  (cost=382.74..382.74 rows=18757 width=0) (actual time=107.436..107.436 rows=0 loops=1)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=40.200..40.200rows=88908 loops=1)"
                                            Index Cond: ((username)::text % 'mel'::text)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=43.847..43.847rows=102028 loops=1)"
                                            Index Cond: ((first_name)::text % 'mel'::text)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=23.387..23.387rows=58740 loops=1)"
                                            Index Cond: ((last_name)::text % 'mel'::text)
                          ->  Function Scan on similarity s_username  (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=8380)
                    ->  Function Scan on similarity s_first_name  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
              ->  Function Scan on similarity s_last_name  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
Execution time: 10302.559 ms

Server is Postgres 9.6.1 running on Amazon RDS

update

1.

Shortly after posting the question I found this info: https://www.postgresql.org/message-id/[email protected]

So I tried

-> SHOW work_mem;
4MB
-> SET work_mem='12MB';
-> EXECUTE user_search('mel', 20);
(results returned in ~1.5s)

This made a big improvement (previously > 10s)!

1.5s is still way slower than ES for similar query so I would still like to hear any suggestions for optimising the query.

2.

In response to comments, and after seeing this question (Postgresql GIN index slower than GIST for pg_trgm), I tried exactly the same set up with a GIST index in place of the GIN one.

Trying the same search above, it returned in ~3.5s, using default work_mem='4MB'. Increasing work_mem made no difference.

From this I conclude that GIST index is more memory efficient (did not hit pathological case like GIN did) but is slower than GIN when GIN is working properly. This is inline with what's described in the docs recommending GIN index.

3.

I still don't understand why so much time is spent in:

 ->  Bitmap Heap Scan on auth_user  (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380loops=1)"
     Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text %'mel'::text))"
     Rows Removed by Index Recheck: 2434523
     Heap Blocks: exact=49337 lossy=53104

I don't understand why this step is needed or what it's doing.

There are the three Bitmap Index Scan beneath it for each of the username % $1 clauses... these results then get combined with a BitmapOr step. These parts are all quite fast.

But even in the case where we don't run out of work mem, we still spend nearly a whole second in Bitmap Heap Scan.

like image 801
Anentropic Avatar asked May 09 '17 10:05

Anentropic


1 Answers

I expect much faster results with this approach:

1.

Create a GiST index with 1 column holding concatenated values:

CREATE INDEX users_search_idx ON auth_user
USING gist((username || ' ' || first_name || ' ' || last_name) gist_trgm_ops);

This assumes all 3 columns to be defined NOT NULL (you did not specify). Else you need to do more.
Why not simplify with concat_ws()?

  • Combine two columns and add into one new column
  • Faster query with pattern-matching on multiple text fields
  • Combine two columns and add into one new column

2.

Use a proper nearest-neighbor query, matching above index:

SELECT username, email, first_name, last_name
     , similarity(username  , $1) AS s_username
     , similarity(first_name, $1) AS s_first_name
     , similarity(last_name , $1) AS s_last_name
     , row_number() OVER () AS rank  -- greatest similarity first
FROM   auth_user
WHERE     (username || ' ' || first_name || ' ' || last_name) %   $1  -- !!
ORDER  BY (username || ' ' || first_name || ' ' || last_name) <-> $1  -- !!
LIMIT  $2;

Expressions in WHERE and ORDER BY must match index expression!

In particular ORDER BY rank (like you had it) will always perform poorly for a small LIMIT picking from a much larger pool of qualifying rows, because it cannot use an index directly: The sophisticated expression behind rank has to be calculated for every qualifying row, then all have to be sorted before the small selection of best matches can be returned. This is much, much more expensive than a true nearest-neighbor query that can pick the best results from the index directly without even looking at the rest.

row_number() with empty window definition just reflects the ordering produced by the ORDER BY of the same SELECT.

Related answers:

  • Best index for similarity function
  • Search in 300 million addresses with pg_trgm

As for your item 3., I added an answer to the question you referenced, that should explain it:

  • PostgreSQL GIN index slower than GIST for pg_trgm?
like image 182
Erwin Brandstetter Avatar answered Oct 19 '22 01:10

Erwin Brandstetter