Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to make a fast pg_trgm DESC (descending)?

I have a list of 100.000 sentences in a table, with pg_trgm I can get the closest ones of my string "super cool" very fast with a GIN/GIST index. See the official example :

https://www.postgresql.org/docs/11/pgtrgm.html

Sadly, I want the opposite, I would like the most different one first, but the GIN/GIST indexes are not used when DESC, so it is very slow.

SELECT t, 'super cool' <-> t AS dist
  FROM test_trgm
  ORDER BY dist DESC LIMIT 10;

How could I do that ? Rebuild pg_trgm from source ? How ?

like image 744
Laurent Debricon Avatar asked Jul 04 '19 07:07

Laurent Debricon


1 Answers

I don't think this can be optimized at all, unless "t" is known beforehand or you can cache something. Even if you try to change Postgres sources, most probably you'll see no benefit at all.

In the docs, the <-> operator is shorthand for similarity(t1, t2). You can index such scores if both terms are known, so for example you can "CREATE INDEX" of this function for any t1,t2 combination and it will work. It will be a standard BTree index and you can perform less than, grater than or any checks or ordering you want.

But t2 is not known, and as such, you cannot create an index for any possible string. (Or you could forge all possible string combinations in a table if they are a reasonable amount)

If you don't know the other term, how sorting works? Well, because you can get for your word t1, extract all trigrams, and get which rows (tids) appear at least X times. This is fast because you have to check only N trigrams for the original word, retrieve the tuple ids in buckets, count and sort.

Now try to do it in reverse: you need all words that have no trigrams in common at all. So you have to scan the retrieved trigrams, obtain the tuple ids, and then fetch the whole table filtering out the tuple ids you got earlier. And after that continue with those with only 1 trigram, then 2, and so. This sounds really inefficient, like scanning the whole table and index one or two times.

The main problem relies on retrieving the matches with zero coincidences. No matter how you do it, you need to scan the whole table.

If you can skip at least those with zero coincidence, then you can speedup this search. For this you could use set_limit(0.0001) and use the "%" operator to filter them out. (But doesn't sound like this is what you wanted)

Even extracting the trigrams into an array or subtable doesn't seem to help. Your problem looks like a bloom filter, but reversed, and still I'm not sure it is possible at all to create an index like that.

Maybe if you add more information on what are you trying to accomplish we could find a different way without using trigrams.

like image 63
deavidsedice Avatar answered Nov 16 '22 08:11

deavidsedice