Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgres pg_trgm - why ordering by similarity is very slow

I have table Users with column displayName (text) and pg_trgm gin index on this column.

CREATE INDEX "Users-displayName-pg-trgm-index"
  ON "Users"
  USING gin
  ("displayName" COLLATE pg_catalog."default" gin_trgm_ops);

Here is my query:

SELECT "User"."id"
    ,"User"."displayName"
    ,"User"."firstName"
    ,"User"."lastName"
    ,"User"."email"
    ,"User"."password"
    ,"User"."isVerified"
    ,"User"."isBlocked"
    ,"User"."verificationToken"
    ,"User"."birthDate"
    ,"User"."gender"
    ,"User"."isPrivate"
    ,"User"."role"
    ,"User"."coverImageUrl"
    ,"User"."profileImageUrl"
    ,"User"."facebookId"
    ,"User"."deviceType"
    ,"User"."deviceToken"
    ,"User"."coins"
    ,"User"."LocaleId"
    ,"User"."createdAt"
    ,"User"."updatedAt"
FROM "Users" AS "User"
WHERE (similarity("User"."displayName", 'John') > 0.2)
ORDER BY similarity("User"."displayName", 'John')
    ,"User"."id" ASC LIMIT 25;

Query above takes ~200ms to return results. When I remove

ORDER BY similarity("User"."displayName", 'John')

and order just by id then query speeds up to 30ms.

I am querying on table with 50k users.

Here is explain analyze: http://explain.depesz.com/s/lXC

For some reason I don't see any index usage (gin pg_trgm on displayName)


It seems that when I replace line

WHERE (similarity("User"."displayName", 'John') > 0.2)

with

WHERE ("User"."displayName" % 'John')

query is super-fast - can anyone tell me why? I thought that % operator just checks if similarity(...) is greater than treshold... so what is the difference?

like image 452
user606521 Avatar asked Feb 13 '15 14:02

user606521


People also ask

What is PG_TRGM in PostgreSQL?

F.33.7. Authors The pg_trgm module provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.

What is the default word similarity threshold for the <%> operator?

The threshold must be between 0 and 1 (default is 0.3). Sets the current word similarity threshold that is used by the <% and %> operators. The threshold must be between 0 and 1 (default is 0.6). Sets the current strict word similarity threshold that is used by the <<% and %>> operators. The threshold must be between 0 and 1 (default is 0.5).

How can I make the similarity query faster?

I'd like any hint to make the query faster. The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted.

What is the most similar extent of an ordered set?

The most similar extent of an ordered set of trigrams in the second string is {" w"," wo","wor","ord"}, and the similarity is 0.8. This function returns a value that can be approximately understood as the greatest similarity between the first string and any substring of the second string.


2 Answers

PostgreSQL doesn't use indexes for function, it uses indexes only for operators.

The query that orders by similarity() calls that function for every row and then orders the rows.

The query that uses the % uses the index and runs similarity function on those that match (no index only scans for functions).

If you want to order by least similarity (as in the question) those that have similarity greater than 0.2 you should use the distance operator <->.

Like so:

WHERE "User"."displayName" <-> 'John' < 0.8
ORDER BY "User"."displayName" <-> 'John' DESC

The distance is 1- similarity hence 0.8

like image 170
Jakub Kania Avatar answered Oct 24 '22 07:10

Jakub Kania


In my experience GIST index has been working better / faster for similarity ordering.

In this example I'm having customer table with ~500k rows.

select *,similarity(coalesce(details::text,'') || coalesce(name,''),'9') 
  from customer 
  order by (coalesce(details::text,'') || coalesce(name,'')) <-> '9' 
  asc limit 50;

Without any index query takes around 8,5s with query plan:

                              QUERY PLAN                                          
-----------------------------------------------------------------------------------
 Limit  (cost=47687.03..47687.16 rows=50 width=1144)
   ->  Sort  (cost=47687.03..49184.52 rows=598995 width=1144)
         Sort Key: (((COALESCE((details)::text, ''::text) ||
                     (COALESCE(name, ''::character varying))::text) <-> '9'::text))
         ->  Seq Scan on customer  (cost=0.00..27788.85 rows=598995 width=1144)
(4 rows)

When adding GIN index:

CREATE INDEX ON customer USING gin ((coalesce(details::text,'') || coalesce(name,'')) gin_trgm_ops);

Nothing happens. Query plan still looks the same and query still takes around 8.5 seconds to complete. No index is used for ordering.

After creating GIST index:

CREATE INDEX ON customer USING gist ((coalesce(details::text,'') || coalesce(name,'')) gist_trgm_ops);

Query takes around 240ms and query plan shows index being used

                     QUERY PLAN                         
--------------------------------------------------------------------------
 Limit  (cost=0.42..10.19 rows=50 width=1144)
   ->  Index Scan using customer_expr_idx1 on customer  (cost=0.42..117106.73 rows=598995 width=1144)
     Order By: ((COALESCE((details)::text, ''::text) || 
                (COALESCE(name, ''::character varying))::text) <-> '9'::text)
(3 rows) 

Just for curiosity rows returned looks like this:

   id   |           name           |        details         | similarity 
--------+--------------------------+------------------------+------------
     25 | Generic Company (9) Inc. |                        |  0.0909091
    125 | Generic Company (9) Inc. |                        |  0.0909091
 268649 | 9bg1ubTCYo7mMcDaHmCC     | { "fatty": "McDaddy" } |  0.0294118
 470217 | 9hSXtDmW9cXvKk4Q6McD     | { "fatty": "McDaddy" } |  0.0285714
 180775 | 9pRPi1w9nqV9999g2ceo     | { "fatty": "McDaddy" } |  0.0285714
 162931 | 9qMyYbWNJLZdv7uYYbOl     | { "fatty": "McDaddy" } |  0.0285714
 176961 | 9ow1NcTjAmCDyRsapDl4     | { "fatty": "McDaddy" } |  0.0285714
   ... etc ...
like image 37
Mikael Lepistö Avatar answered Oct 24 '22 08:10

Mikael Lepistö