I need to create a ranking of similar strings in a table.
I have the following table:
create table names ( name character varying(255) );
Currently, I'm using the pg_trgm module, which offers the similarity
function, but I have an efficiency problem. I created an index as the Postgres manual suggests:
CREATE INDEX trgm_idx ON names USING gist (name gist_trgm_ops);
and I'm executing the following query:
select similarity(n1.name, n2.name) as sim, n1.name, n2.name
from   names n1, names n2
where  n1.name != n2.name
and    similarity(n1.name, n2.name) > .8
order  by sim desc;
The query works, but it is really slow when there are hundreds of names. Also, maybe I've forgotten a bit of SQL, but I don't understand why I cannot use the condition and sim > .8 instead,
without getting a "column sim does not exist" error.
I'd appreciate any hint on how to make the query faster.
In PostgreSQL, strings can be compared with the LIKE clause, and also with the =, !=, <>, <, >, <= and >= character string operators. These operators compare two strings and return a boolean result for the input specified in the query.
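For example, a quick sketch of those operators against the names table from the question (the literal values are just for illustration):
SELECT * FROM names WHERE name = 'Alice';   -- exact match
SELECT * FROM names WHERE name <> 'Alice';  -- everything else
SELECT * FROM names WHERE name >= 'M';      -- lexicographic comparison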
Use ILIKE: SELECT * FROM table WHERE columnName ILIKE 'R%'; or a case-insensitive regular expression: SELECT * FROM table WHERE columnName ~* '^R';
The PostgreSQL LIKE operator is used to match text values against a pattern using wildcards. If the search expression can be matched to the pattern expression, the LIKE operator returns true. The percent sign represents zero or more characters.
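For example, assuming the names table from the question, the percent wildcard matches any tail:
SELECT * FROM names WHERE name LIKE 'Jo%';  -- matches 'Jo', 'John', 'Joanna', ... (case-sensitive)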
The fuzzystrmatch module provides two functions for working with Soundex codes: soundex(text) returns text, and difference(text, text) returns int. The soundex function converts a string to its Soundex code; the difference function reports how closely the Soundex codes of two strings match.
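A short sketch of how the two functions can be used (the extension has to be installed first; the sample strings are arbitrary):
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
-- soundex() returns the 4-character Soundex code of each string;
-- difference() returns 0..4, the number of matching Soundex code positions.
SELECT soundex('Anne'), soundex('Ann'), difference('Anne', 'Ann');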
The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.
Use SET pg_trgm.similarity_threshold and the % operator instead. Both are provided by the pg_trgm module. This way, a trigram GiST index can be used to great effect.
The configuration parameter pg_trgm.similarity_threshold replaced the functions set_limit() and show_limit() in Postgres 9.6. The deprecated functions still work (as of Postgres 13). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.
Try instead:
SET pg_trgm.similarity_threshold = 0.8;  -- Postgres 9.6 or later

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;
Faster by orders of magnitude, but still slow.
pg_trgm.similarity_threshold is a "customized" option, which can be handled like any other option.
You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).
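A sketch of that idea, assuming a precondition on the first letter is acceptable for your data (the index name and the precondition itself are just an example; whether the planner actually uses the index depends on your data):
CREATE INDEX names_first_letter_idx ON names (left(name, 1));  -- functional index supporting the precondition

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON left(n1.name, 1) = left(n2.name, 1)
               AND n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;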
This does not work because you cannot refer to output columns in WHERE or HAVING clauses:
WHERE ... sim > 0.8
That's according to the SQL standard (which is handled rather loosely by certain other RDBMS). On the other hand:
ORDER BY sim DESC
works because output columns can be used in GROUP BY and ORDER BY.
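If you really want to filter on the alias, one common workaround (just a sketch; with the % operator above you don't need it) is to wrap the query in a subquery so the alias becomes a regular column in the outer query:
SELECT *
FROM  (
   SELECT similarity(n1.name, n2.name) AS sim
        , n1.name AS name1, n2.name AS name2
   FROM   names n1
   JOIN   names n2 ON n1.name <> n2.name
   ) sub
WHERE  sim > 0.8
ORDER  BY sim DESC;
-- Note: this still computes similarity for every pair, so it does not make the query faster by itself.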
I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE (best of 5).
CREATE TEMP TABLE t AS
SELECT some_col AS name FROM some_table LIMIT 1000;  -- real-life test strings
First round of tests with GIN index:
CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops); -- round1: with GIN index
Second round of tests with GiST index:
DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);
New query:
SELECT set_limit(0.8);

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   t n1
JOIN   t n2 ON n1.name <> n2.name
           AND n1.name % n2.name
ORDER  BY sim DESC;
GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms
Old query:
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   t n1, t n2
WHERE  n1.name != n2.name
AND    similarity(n1.name, n2.name) > 0.8
ORDER  BY sim DESC;
GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms
Otherwise identical results. Advice is good. And this is for just 1000 rows!
GIN often provides superior read performance, but not in this particular case!
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.