
Ruby super-insensitive Regex to match school names with accents and other diacritics

The question has been asked for other programming languages, but how would you perform an accent-insensitive regex match in Ruby?

My current code is something like this:

scope :by_registered_name, ->(regex){
  where(:name => /#{Regexp.escape(regex)}/i)
}

I thought maybe I could replace every character that is not alphanumeric or whitespace with a dot and remove the escape, but is there a better way? I'm afraid I could match weird things if I do that...

I am targeting French right now, but if I could also fix it for other languages that would be cool.

I am using Ruby 2.3 if that can help.


I realize my requirements are actually a bit stronger: I also need to handle things like dashes. I am importing a school database (URL here, the tag is <nom>), and I want people to be able to find their school by typing its name. Both the stored names and the search query may contain accents, so I believe the easiest way would be to make both insensitive.

  • "Télécom" should be matched by "Telecom"
  • "établissement" should be matched by "etablissement"
  • "Institut supérieur national de l'artisanat - Chambre de métiers et de l'Artisanat en Moselle" should be matched by "artisanat chambre de métiers"
  • "Ecole hôtelière d'Avignon (CCI du Vaucluse)" should be matched by "Ecole hoteliere d'avignon" (it's okay to skip the parenthesized part)
  • "Ecole française d'hôtesses" should be matched by "ecole francaise d'hot"

I also found some crazy stuff in that DB, so I think I will consider sanitizing this input:

  • "Académie internationale de management - Hotel & Tourism Management Academy" should be matched by "Hotel Tourism" (note the & is actually written &amp; in the XML)
Cyril Duchon-Doris asked Dec 05 '25 07:12



1 Answer

It looks like the solution for MongoDB is to use a text index, which is diacritic insensitive. French is supported.

It's been a long time since I last used MongoDB, but if you're using Mongoid I think you would create a text index in your model like this:

index(name: "text")

...and then search like this:

scope :by_registered_name, ->(str) {
  where(:$text => { :$search => str })
}

Consult the documentation for the $text query operator for more information.

Original (wrong) answer

As it turns out I was thinking about the question backwards, and wrote this answer initially. I'm preserving it since it might still come in handy. If you were using a database that didn't offer this kind of functionality (like, it seems, MongoDB does), a possible workaround would be to use the following technique to store a sanitized name along with the original name in the database, and then likewise sanitize queries.
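As a rough illustration of that workaround, here is a minimal sketch in plain Ruby (no Rails required). The helper names search_key and name_matches? are made up for this example, and it assumes that stripping combining marks after NFD decomposition is good enough for French. Letters that do not decompose (such as "œ" or "ß") are simply treated as separators here, and XML entities like &amp; would need to be decoded before this step.

```ruby
# Hypothetical helper (not part of the original answer): reduce a name to a
# lowercase, accent-free, space-separated key suitable for storing and searching.
def search_key(str)
  str.unicode_normalize(:nfd)   # split "é" into "e" + combining accent
     .gsub(/\p{Mn}/, "")        # drop the combining marks
     .downcase
     .gsub(/[^a-z0-9]+/, " ")   # dashes, apostrophes, "&", parentheses -> one space
     .strip
end

# Hypothetical helper: both sides go through the same sanitization.
def name_matches?(stored_name, query)
  search_key(stored_name).include?(search_key(query))
end

name_matches?("Ecole française d'hôtesses", "ecole francaise d'hot")
# => true
```

In a real application the sanitized key would be computed once and stored next to the original name, so that only the query has to be sanitized at search time.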

Since you're using Rails you can use the handy ActiveSupport::Inflector.transliterate:

regex = /aäoöuü/
transliterated = ActiveSupport::Inflector.transliterate(regex.source, '\?')
# => "aaoouu"
new_regex = Regexp.new(transliterated)
# => /aaoouu/

Or simply:

Regexp.new(ActiveSupport::Inflector.transliterate(regex.source, '\?'))

You'll note that I supplied '\?' as the second argument, which is the replacement string used for any character that cannot be transliterated. This is because the default replacement string is "?", which as you know has special meaning in a regular expression.
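To see why the escaping matters, here is a small plain-Ruby sketch. The patterns are invented for illustration; they stand in for a transliterated string in which a replacement character ended up right after the "f".

```ruby
# An unescaped "?" is a quantifier: it makes the preceding "f" optional.
loose  = Regexp.new("caf?")
# '\?' (backslash + "?") matches a literal question mark instead.
strict = Regexp.new('caf\?')

loose_hit   = !loose.match("ca").nil?     # => true, the "f" was optional
strict_hit  = !strict.match("ca").nil?    # => false
literal_hit = !strict.match("caf?").nil?  # => true
```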

Also note that ActiveSupport::Inflector.transliterate does a little bit more than the similar I18n.transliterate. Here's its source:

def transliterate(string, replacement = "?")
  I18n.transliterate(ActiveSupport::Multibyte::Unicode.normalize(
    ActiveSupport::Multibyte::Unicode.tidy_bytes(string), :c),
      :replacement => replacement)
end

The innermost method call, ActiveSupport::Multibyte::Unicode.tidy_bytes, cleans up any invalid UTF-8 characters.

More importantly, ActiveSupport::Multibyte::Unicode.normalize "normalizes" the characters. For example, a decomposed ê looks like one character but it's actually two: LATIN SMALL LETTER E and COMBINING CIRCUMFLEX ACCENT. Calling I18n.transliterate on that decomposed form would yield e?, which probably isn't what you want, so normalize is called first to turn it into the composed ê, which is just one character: LATIN SMALL LETTER E WITH CIRCUMFLEX. That normalize step before transliterate is therefore important. (If you're interested in how that works, read about Unicode equivalence and normalization.)
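The difference between the two forms can be checked with Ruby's built-in String#unicode_normalize (available since Ruby 2.2). This is only a sketch of the composition step; it assumes that the :c form used in the source above corresponds to NFC, and it does not call ActiveSupport at all.

```ruby
decomposed = "e\u0302"                         # "e" + COMBINING CIRCUMFLEX ACCENT
composed   = decomposed.unicode_normalize(:nfc)

decomposed.length      # => 2
composed.length        # => 1
composed == "\u00EA"   # => true (LATIN SMALL LETTER E WITH CIRCUMFLEX)
```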

Jordan Running answered Dec 07 '25 04:12



