The question has been asked for other programming languages, but how would you perform an accent-insensitive regex match in Ruby?
My current code is something like
scope :by_registered_name, ->(regex) {
  where(:name => /#{Regexp.escape(regex)}/i)
}
I thought maybe I could replace every character that is not alphanumeric or whitespace with a dot and remove the escape, but isn't there a better way? I'm afraid I could catch weird things if I do that...
I am targeting French right now, but if I could also fix it for other languages that would be cool.
I am using Ruby 2.3 if that can help.
I realize my requirements are actually a bit stronger: I also need to catch things like dashes, etc. I am basically importing a school database (URL here, the tag is <nom>), and I want people to be able to find their school by typing its name. Both the search query and the names being searched may contain accents, so I believe the easiest way would be to make "both" insensitive.
Also, I found some crazy stuff in that DB (& in the XML), so I think I will consider sanitizing this input.
It looks like the solution for MongoDB is to use a text index, which is diacritic insensitive. French is supported.
It's been a long time since I last used MongoDB, but if you're using Mongoid I think you would create a text index in your model like this:
index(name: "text")
...and then search like this:
scope :by_registered_name, ->(str) {
  where(:$text => { :$search => str })
}
Consult the documentation for the $text query operator for more information.
As it turns out I was thinking about the question backwards, and wrote this answer initially. I'm preserving it since it might still come in handy. If you were using a database that didn't offer this kind of functionality (like, it seems, MongoDB does), a possible workaround would be to use the following technique to store a sanitized name along with the original name in the database, and then likewise sanitize queries.
Since you're using Rails you can use the handy ActiveSupport::Inflector.transliterate:
regex = /aäoöuü/
transliterated = ActiveSupport::Inflector.transliterate(regex.source, '\?')
# => "aaoouu"
new_regex = Regexp.new(transliterated)
# => /aaoouu/
Or simply:
Regexp.new(ActiveSupport::Inflector.transliterate(regex.source, '\?'))
You'll note that I supplied '\?' as the second argument, which is the replacement string used for any character that cannot be transliterated. This is because the default replacement string is "?", which as you know has special meaning in a regular expression.
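To make the workaround itself more concrete (a sanitized copy stored next to the original name, with queries sanitized the same way), here is a rough sketch in plain Ruby. It is not the ActiveSupport approach shown above: it swaps in Ruby's built-in String#unicode_normalize so that it runs without Rails. The names fold_accents, SCHOOLS and find_schools are made up for illustration, and the array merely stands in for the database.

```ruby
# Hypothetical helper: folds accents and case so that stored names and
# user queries can be compared in the same sanitized form.
def fold_accents(str)
  decomposed = str.unicode_normalize(:nfd) # "é" becomes "e" + combining accent
  stripped = decomposed.gsub(/\p{Mn}/, "") # drop the combining marks
  stripped.downcase
end

# Stand-in for the database: each record keeps the original name plus a
# sanitized copy computed once at import time.
SCHOOLS = ["École Jean Jaurès", "Lycée Molière"].map do |name|
  { name: name, sanitized_name: fold_accents(name) }
end

# Sanitize the query the same way, then match it against the sanitized copy.
def find_schools(query)
  pattern = /#{Regexp.escape(fold_accents(query))}/
  SCHOOLS.select { |school| school[:sanitized_name] =~ pattern }
         .map { |school| school[:name] }
end

find_schools("ecole")   # => ["École Jean Jaurès"]
find_schools("MOLIÈRE") # => ["Lycée Molière"]
```

One limitation of this sketch: characters that have no Unicode decomposition, such as œ, pass through unchanged, so they would need an explicit mapping of their own.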
Also note that ActiveSupport::Inflector.transliterate does a little bit more than the similar I18n.transliterate. Here's its source:
def transliterate(string, replacement = "?")
  I18n.transliterate(ActiveSupport::Multibyte::Unicode.normalize(
    ActiveSupport::Multibyte::Unicode.tidy_bytes(string), :c),
      :replacement => replacement)
end
The innermost method call, ActiveSupport::Multibyte::Unicode.tidy_bytes, cleans up any invalid UTF-8 characters.
More importantly, ActiveSupport::Multibyte::Unicode.normalize "normalizes" the characters. For example, ê can look like one character while actually being two: LATIN SMALL LETTER E followed by COMBINING CIRCUMFLEX ACCENT. Calling I18n.transliterate on that two-character form would yield e?, which probably isn't what you want, so normalize is called first to turn it into ê as just one character: LATIN SMALL LETTER E WITH CIRCUMFLEX. That is why the normalize step before transliterate is important. (If you're interested in how that works, read about Unicode equivalence and normalization.)
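The difference between the two forms of ê is easy to see in plain Ruby. The snippet below is only an illustration and uses the built-in String#unicode_normalize (available since Ruby 2.2) rather than the ActiveSupport method from the source above; the :nfc form corresponds to the :c argument passed to normalize there.

```ruby
decomposed = "e\u0302" # LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT
composed   = "\u00EA"  # LATIN SMALL LETTER E WITH CIRCUMFLEX

puts decomposed.length                              # 2
puts composed.length                                # 1
puts decomposed == composed                         # false
puts decomposed.unicode_normalize(:nfc) == composed # true
```

Both strings render as ê, but only the composed one is a single code point, which is the form a per-character lookup table can recognize.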