I have table Users
with column displayName (text)
and pg_trgm gin index
on this column.
CREATE INDEX "Users-displayName-pg-trgm-index"
ON "Users"
USING gin
("displayName" COLLATE pg_catalog."default" gin_trgm_ops);
Here is my query:
SELECT "User"."id"
,"User"."displayName"
,"User"."firstName"
,"User"."lastName"
,"User"."email"
,"User"."password"
,"User"."isVerified"
,"User"."isBlocked"
,"User"."verificationToken"
,"User"."birthDate"
,"User"."gender"
,"User"."isPrivate"
,"User"."role"
,"User"."coverImageUrl"
,"User"."profileImageUrl"
,"User"."facebookId"
,"User"."deviceType"
,"User"."deviceToken"
,"User"."coins"
,"User"."LocaleId"
,"User"."createdAt"
,"User"."updatedAt"
FROM "Users" AS "User"
WHERE (similarity("User"."displayName", 'John') > 0.2)
ORDER BY similarity("User"."displayName", 'John')
,"User"."id" ASC LIMIT 25;
Query above takes ~200ms
to return results. When I remove
ORDER BY similarity("User"."displayName", 'John')
and order just by id
then query speeds up to 30ms
.
I am querying on table with 50k
users.
Here is explain analyze: http://explain.depesz.com/s/lXC
For some reason I don't see any index usage (gin pg_trgm
on displayName
)
It seems that when I replace line
WHERE (similarity("User"."displayName", 'John') > 0.2)
with
WHERE ("User"."displayName" % 'John')
query is super-fast - can anyone tell me why? I thought that %
operator just checks if similarity(...) is greater than treshold... so what is the difference?
F.33.7. Authors The pg_trgm module provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.
The threshold must be between 0 and 1 (default is 0.3). Sets the current word similarity threshold that is used by the <% and %> operators. The threshold must be between 0 and 1 (default is 0.6). Sets the current strict word similarity threshold that is used by the <<% and %>> operators. The threshold must be between 0 and 1 (default is 0.5).
I'd like any hint to make the query faster. The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted.
The most similar extent of an ordered set of trigrams in the second string is {" w"," wo","wor","ord"}, and the similarity is 0.8. This function returns a value that can be approximately understood as the greatest similarity between the first string and any substring of the second string.
PostgreSQL doesn't use indexes for function, it uses indexes only for operators.
The query that orders by similarity() calls that function for every row and then orders the rows.
The query that uses the %
uses the index and runs similarity function on those that match (no index only scans for functions).
If you want to order by least similarity (as in the question) those that have similarity greater than 0.2 you should use the distance operator <->
.
Like so:
WHERE "User"."displayName" <-> 'John' < 0.8
ORDER BY "User"."displayName" <-> 'John' DESC
The distance is 1- similarity hence 0.8
In my experience GIST index has been working better / faster for similarity ordering.
In this example I'm having customer table with ~500k rows.
select *,similarity(coalesce(details::text,'') || coalesce(name,''),'9')
from customer
order by (coalesce(details::text,'') || coalesce(name,'')) <-> '9'
asc limit 50;
Without any index query takes around 8,5s with query plan:
QUERY PLAN
-----------------------------------------------------------------------------------
Limit (cost=47687.03..47687.16 rows=50 width=1144)
-> Sort (cost=47687.03..49184.52 rows=598995 width=1144)
Sort Key: (((COALESCE((details)::text, ''::text) ||
(COALESCE(name, ''::character varying))::text) <-> '9'::text))
-> Seq Scan on customer (cost=0.00..27788.85 rows=598995 width=1144)
(4 rows)
When adding GIN index:
CREATE INDEX ON customer USING gin ((coalesce(details::text,'') || coalesce(name,'')) gin_trgm_ops);
Nothing happens. Query plan still looks the same and query still takes around 8.5 seconds to complete. No index is used for ordering.
After creating GIST index:
CREATE INDEX ON customer USING gist ((coalesce(details::text,'') || coalesce(name,'')) gist_trgm_ops);
Query takes around 240ms and query plan shows index being used
QUERY PLAN
--------------------------------------------------------------------------
Limit (cost=0.42..10.19 rows=50 width=1144)
-> Index Scan using customer_expr_idx1 on customer (cost=0.42..117106.73 rows=598995 width=1144)
Order By: ((COALESCE((details)::text, ''::text) ||
(COALESCE(name, ''::character varying))::text) <-> '9'::text)
(3 rows)
Just for curiosity rows returned looks like this:
id | name | details | similarity
--------+--------------------------+------------------------+------------
25 | Generic Company (9) Inc. | | 0.0909091
125 | Generic Company (9) Inc. | | 0.0909091
268649 | 9bg1ubTCYo7mMcDaHmCC | { "fatty": "McDaddy" } | 0.0294118
470217 | 9hSXtDmW9cXvKk4Q6McD | { "fatty": "McDaddy" } | 0.0285714
180775 | 9pRPi1w9nqV9999g2ceo | { "fatty": "McDaddy" } | 0.0285714
162931 | 9qMyYbWNJLZdv7uYYbOl | { "fatty": "McDaddy" } | 0.0285714
176961 | 9ow1NcTjAmCDyRsapDl4 | { "fatty": "McDaddy" } | 0.0285714
... etc ...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With