I need to create a ranking of similar strings in a table.
I have the following table:
create table names ( name character varying(255) );
Currently, I'm using the pg_trgm module, which offers the similarity
function, but I have an efficiency problem. I created an index as the Postgres manual suggests:
CREATE INDEX trgm_idx ON names USING gist (name gist_trgm_ops);
and I'm executing the following query:
select similarity(n1.name, n2.name) as sim, n1.name, n2.name
from   names n1, names n2
where  n1.name != n2.name
and    similarity(n1.name, n2.name) > .8
order  by sim desc;
The query works, but it is really slow when there are hundreds of names. Also, maybe I've forgotten a bit of SQL, but I don't understand why I cannot use the condition and sim > .8 instead,
without getting a "column sim does not exist" error.
I'd appreciate any hint on how to make the query faster.
In PostgreSQL, strings can be compared with the LIKE clause, and also with the =, !=, <>, <, >, <= and >= character string operators. These operators compare two strings and return a boolean result for the input specified in the query.
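For example, a quick sketch of those operators against the names table from the question (the literal values are just for illustration):
SELECT * FROM names WHERE name = 'Alice';   -- exact match
SELECT * FROM names WHERE name <> 'Alice';  -- everything else
SELECT * FROM names WHERE name >= 'M';      -- lexicographic comparison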
Use ILIKE: SELECT * FROM table WHERE columnName ILIKE 'R%'; or a case-insensitive regular expression: SELECT * FROM table WHERE columnName ~* '^R';
The PostgreSQL LIKE operator is used to match text values against a pattern using wildcards. If the search expression can be matched to the pattern expression, the LIKE operator returns true. The percent sign represents zero or more characters.
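For example, assuming the names table from the question, the percent wildcard matches any tail:
SELECT * FROM names WHERE name LIKE 'Jo%';  -- matches 'Jo', 'John', 'Joanna', ... (case-sensitive)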
The fuzzystrmatch module provides two functions for working with Soundex codes: soundex(text) returns text, and difference(text, text) returns int. The soundex function converts a string to its Soundex code; the difference function reports how closely the Soundex codes of two strings match.
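A short sketch of how the two functions can be used (the extension has to be installed first; the sample strings are arbitrary):
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
-- soundex() returns the 4-character Soundex code of each string;
-- difference() returns 0..4, the number of matching Soundex code positions.
SELECT soundex('Anne'), soundex('Ann'), difference('Anne', 'Ann');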
The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.
Use SET pg_trgm.similarity_threshold and the % operator instead. Both are provided by the pg_trgm module. This way, a trigram GiST index can be used to great effect.
The configuration parameter pg_trgm.similarity_threshold replaced the functions set_limit() and show_limit() in Postgres 9.6. The deprecated functions still work (as of Postgres 13). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.
Try instead:
SET pg_trgm.similarity_threshold = 0.8;  -- Postgres 9.6 or later

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;
Faster by orders of magnitude, but still slow.
pg_trgm.similarity_threshold is a "customized" option, which can be handled like any other option.
You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).
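A sketch of that idea, assuming a precondition on the first letter is acceptable for your data (the index name and the precondition itself are just an example; whether the planner actually uses the index depends on your data):
CREATE INDEX names_first_letter_idx ON names (left(name, 1));  -- functional index supporting the precondition

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON left(n1.name, 1) = left(n2.name, 1)
               AND n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;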
This does not work because you cannot refer to output columns in WHERE or HAVING clauses:
WHERE ... sim > 0.8
That's according to the SQL standard (which is handled rather loosely by certain other RDBMS). On the other hand:
ORDER BY sim DESC
works because output columns can be used in GROUP BY and ORDER BY.
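If you really want to filter on the alias, one common workaround (just a sketch; with the % operator above you don't need it) is to wrap the query in a subquery so the alias becomes a regular column in the outer query:
SELECT *
FROM  (
   SELECT similarity(n1.name, n2.name) AS sim
        , n1.name AS name1, n2.name AS name2
   FROM   names n1
   JOIN   names n2 ON n1.name <> n2.name
   ) sub
WHERE  sim > 0.8
ORDER  BY sim DESC;
-- Note: this still computes similarity for every pair, so it does not make the query faster by itself.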
I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE (best of 5).
CREATE TEMP TABLE t AS
SELECT some_col AS name FROM some_table LIMIT 1000;  -- real-life test strings
First round of tests with GIN index:
CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops); -- round1: with GIN index
Second round of tests with GiST index:
DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);
New query:
SELECT set_limit(0.8);

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   t n1
JOIN   t n2 ON n1.name <> n2.name
           AND n1.name % n2.name
ORDER  BY sim DESC;
GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms
Old query:
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   t n1, t n2
WHERE  n1.name != n2.name
AND    similarity(n1.name, n2.name) > 0.8
ORDER  BY sim DESC;
GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms
Otherwise identical results. Advice is good. And this is for just 1000 rows!
GIN often provides superior read performance, but not in this particular case!
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.