Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to match accented characters with a regex?

I am running Ruby on Rails 3.0.10 and Ruby 1.9.2. I am using the following Regex in order to match names:

NAME_REGEX = /^[\w\s'"\-_&@!?()\[\]-]*$/u  validates :name,   :presence   => true,   :format     => {     :with     => NAME_REGEX,     :message  => "format is invalid"   } 

However, if I try to save some words like the followings:

Oilalà Pì Rùby ...  # In few words, those with accented characters 

I have a validation error "Name format is invalid..

How can I change the above Regex so to match also accented characters like à, è, é, ì, ò, ù, ...?

like image 464
user502052 Avatar asked Sep 03 '11 09:09

user502052


People also ask

How do you represent special characters in regex?

To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).

What does '$' mean in regex?

$ means "Match the end of the string" (the position after the last character in the string).

How do you change an accented character to a regular character?

replace(/[^a-z0-9]/gi,'') . However a more intuitive solution (at least for the user) would be to replace accented characters with their "plain" equivalent, e.g. turn á , á into a , and ç into c , etc.

What does regex 0 * 1 * 0 * 1 * Mean?

Basically (0+1)* mathes any sequence of ones and zeroes. So, in your example (0+1)*1(0+1)* should match any sequence that has 1. It would not match 000 , but it would match 010 , 1 , 111 etc. (0+1) means 0 OR 1.


1 Answers

Instead of \w, use the POSIX bracket expression [:alpha:]:

"blåbær dèjá vu".scan /[[:alpha:]]+/  # => ["blåbær", "dèjá", "vu"]  "blåbær dèjá vu".scan /\w+/  # => ["bl", "b", "r", "d", "j", "vu"] 

In your particular case, change the regex to this:

NAME_REGEX = /^[[:alpha:]\s'"\-_&@!?()\[\]-]*$/u 

This does match much more than just accented characters, though. Which is a good thing. Make sure you read this blog entry about common misconceptions regarding names in software applications.

like image 58
Lars Haugseth Avatar answered Sep 29 '22 11:09

Lars Haugseth