Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Longest matching substring

How would you search for the longest match within a varchar variable? For example, table GOB has entries as follows:

magic_word |  prize
===================
         sh|  $0.20
        sha|  $0.40
       shaz|  $0.60
      shaza|  $1.50

I would like to write a plpgsql function that takes amongst other arguments a string as input (e.g. shazam), and returns the 'prize' column on the row of GOB with the longest matching substring. In the example shown, that would be $1.50 on the row with magic_word shaza.

All the function format I can handle, it's just the matching bit. I can't think of an elegant solution. I'm guessing it's probably really easy, but I am scratching my head. I don't know the input string at the start, as it will be derived from the result of a query on another table.

Any ideas?

like image 616
Bren McGuire Avatar asked Oct 22 '25 05:10

Bren McGuire


1 Answers

Simple solution

SELECT magic_word
FROM   gob
WHERE  'shazam' LIKE (magic_word || '%')
ORDER  BY magic_word DESC
LIMIT  1;

This works because the longest match sorts last - so I sort DESC and pick the first match.

I am assuming from your example that you want to match left-anchored, from the beginning of the string. If you want to match anywhere in the string (which is more expensive and even harder to back up with an index), use:

...
WHERE  'shazam' LIKE ('%' || magic_word || '%')
...

fiddle
Old sqlfiddle

Performance

The query is not sargable. It might help quite a bit if you had additional information, like a minimum length that you could base an index on, to reduce the number of rows to consider. It needs to be criteria that gets you less than ~ 5% of the table to be effective. So, initials (a natural minimum pick) may or may not be useful. But two or three letters at the start might help quite a bit.

In fact you could optimize this iteratively. Something along the line of:

  • Try a partial index of words with 15 letters+
  • If not found, try 12 letters+
  • If not found, try 9 letters+
  • ...

A simple case of what I outlined in this related answer on dba.SE:

  • Can spatial index help a “range - order by - limit” query

Another approach would be to use a trigram index. You'd need the additional module pg_trgm for that. Normally you would search with a short pattern in a table with longer strings. But trigrams work for your reverse approach, too, with some limitations. Obviously you couldn't match a string with just two characters in the middle of a longer string using trigrams ... Test for corner cases.
There are a number of answers here on SO with more information. Example:

  • Effectively query on column that includes a substring

Advanced solution

Consider the solution under this closely related question for a whole table of search strings. Implemented with a recursive CTE:

  • Longest Prefix Match
like image 88
Erwin Brandstetter Avatar answered Oct 23 '25 18:10

Erwin Brandstetter