
pg_trgm: how to give a higher similarity score when only accents vary

Tags:

postgresql

pg_trgm gives me a score of 0.4 for both of these comparisons:

SELECT similarity('Noemie','Noémie');
0.4 

SELECT similarity('Noemie','NoXmie');
0.4 

Obviously the first pair is more "similar" than the second. Accents are often omitted in data entry, so it would be quite useful to have a score that gives high similarity to letters that differ only by the presence or absence of an accent.

Is there a way to tweak pg_trgm to give a higher similarity score for words that differ only by accents?

Max L. asked Sep 06 '25 23:09



1 Answer

I would start by suggesting that you strip the accents from both strings before comparing them. Postgres offers a function to do this, unaccent(), but it ships in a separate extension that has to be enabled in the database first. Here is information on the topic.
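As a minimal sketch of that setup, assuming the contrib modules are installed on the server, the database encoding is UTF8, and you have the privileges to create extensions:

```sql
-- Enable both extensions in the current database (no-op if already present).
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Quick check that accents are stripped; with the default unaccent
-- rules this should return 'Noemie'.
SELECT unaccent('Noémie');
```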

With this function (or a similar function), you could do:

SELECT similarity(unaccent('Noemie'), unaccent('Noémie'));

Treating accented and unaccented letters as identical might be going too far. A weighted average of the two scores might be more appropriate:

SELECT (alpha * similarity(unaccent('Noemie'), unaccent('Noémie')) + 
        (1 - alpha) * similarity('Noemie', 'Noémie')
       )

alpha is a value between 0 and 1 that sets how much weight the accent-insensitive similarity gets; the remaining (1 - alpha) goes to the exact, accent-sensitive similarity.
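Since alpha is only a placeholder in the expression above, here is one way it might be made runnable. This is a sketch, not a tuned solution: the value 0.8 is an arbitrary assumption, and the params and candidates names are made up for illustration. It needs both pg_trgm and unaccent enabled.

```sql
-- alpha = 0.8 is an assumed weight; tune it for your own data.
WITH params(alpha) AS (
    VALUES (0.8::double precision)
),
candidates(word) AS (
    VALUES ('Noémie'), ('NoXmie')
)
SELECT word,
       -- accent-insensitive part, weighted by alpha
       alpha * similarity(unaccent('Noemie'), unaccent(word))
       -- accent-sensitive part, weighted by the remainder
       + (1 - alpha) * similarity('Noemie', word) AS score
FROM params, candidates
ORDER BY score DESC;
```

With the 0.4 scores reported in the question, the accent-only pair works out to about 0.8 * 1 + 0.2 * 0.4 = 0.88, while 'NoXmie' stays at 0.8 * 0.4 + 0.2 * 0.4 = 0.4, so the two cases are now ranked apart.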

Here is a good discussion of this issue.

Gordon Linoff answered Sep 10 '25 12:09
