pg_trgm gives me a score of 0.4 for both of these comparisons:
SELECT similarity('Noemie','Noémie');
0.4
SELECT similarity('Noemie','NoXmie');
0.4
Obviously the first pair is more "similar" than the second. Accents are often omitted in data entry, so it would be quite useful to have a score that gives high similarity to words whose letters differ only by the presence or absence of an accent.
Is there a way to tweak pg_trgm to give a higher similarity score for words that differ only by accents?
I would start by suggesting that you remove the accents from the strings before comparing them. Postgres offers a function to do this, unaccent(), but it is part of a separate extension that you need to install. Here is information on the topic.
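As a minimal sketch of the installation step, assuming you have the privileges to create extensions in the current database and that the contrib modules are available on the server:

```sql
-- Both extensions ship with the standard PostgreSQL contrib package.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Quick check: with the default unaccent rules this returns 'Noemie'.
SELECT unaccent('Noémie');
```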
With this function (or a similar function), you could do:
SELECT similarity(unaccent('Noemie'), unaccent('Noémie'));
Treating the accented and unaccented spellings as exactly the same might be going too far. A weighted average of the unaccented similarity and the plain similarity might be more appropriate:
SELECT (alpha * similarity(unaccent('Noemie'), unaccent('Noémie')) +
        (1 - alpha) * similarity('Noemie', 'Noémie')
       );
Here alpha is a placeholder for a value between 0 and 1 that sets how much weight the accent-insensitive similarity gets; the remaining (1 - alpha) goes to the plain, accent-sensitive similarity.
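To make the weighting concrete, here is a sketch that runs the formula on both pairs from the question. It assumes pg_trgm and unaccent are installed; the value alpha = 0.7 and the CTE names params and pairs are arbitrary choices for illustration, not recommendations.

```sql
-- alpha = 0.7 is an illustrative weight only; tune it for your data.
WITH params(alpha) AS (VALUES (0.7::float8)),
     pairs(a, b)   AS (VALUES ('Noemie', 'Noémie'),
                              ('Noemie', 'NoXmie'))
SELECT a, b,
       alpha * similarity(unaccent(a), unaccent(b))
       + (1 - alpha) * similarity(a, b) AS weighted_similarity
FROM pairs CROSS JOIN params
ORDER BY weighted_similarity DESC;
```

With these inputs the accent-only pair should score about 0.7 * 1 + 0.3 * 0.4 = 0.82, while the 'NoXmie' pair stays at about 0.4, because unaccent() leaves it unchanged and both terms are then the same plain similarity.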
Here is a good discussion of this issue.