
Matching accented characters with JavaScript regexes

Here's a fun snippet I ran into today:

/\ba/.test("a") --> true
/\bà/.test("à") --> false

However,

/à/.test("à") --> true 

Firstly, wtf?

Secondly, if I want to match an accented character at the start of a word, how can I do that? (I'd really like to avoid using over-the-top selectors like /(?:^|\s|'|\(\) ....)

nickf asked Mar 25 '11 18:03




2 Answers

This worked for me:

/^[a-z\u00E0-\u00FC]+$/i 

With help from here
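For context, here is a small sketch of what that pattern accepts. The range \u00E0-\u00FC runs from "à" to "ü" in the Latin-1 Supplement block, so it also contains the division sign (U+00F7) and excludes letters outside that block. Because of the ^ and $ anchors it tests whole strings rather than the start of a word. The names latinWord and checks are made up for this illustration.

```javascript
// The pattern from the answer above: whole strings made of ASCII letters
// plus the code points U+00E0 through U+00FC, case-insensitively.
const latinWord = /^[a-z\u00E0-\u00FC]+$/i;

const checks = [
  ["d\u00E9j\u00E0", latinWord.test("d\u00E9j\u00E0")], // "déjà"
  ["\u00C0", latinWord.test("\u00C0")],                 // "À", matched via the i flag
  ["\u00F7", latinWord.test("\u00F7")],                 // "÷" sits inside the range
  ["\u0142", latinWord.test("\u0142")],                 // "ł" is outside the range
  ["\u00E0 la", latinWord.test("\u00E0 la")],           // a space is not in the class
];
console.log(checks);
```

If the division sign matters for your input, splitting the range around U+00F7 is one way to exclude it.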

Wak answered Oct 01 '22 23:10


/\bà/.test("à") doesn't match because "à" is not a word character. The escape sequence \b matches only at a boundary between a word character and a non-word character. /\ba/.test("a") matches because "a" is a word character, so there is a boundary between the beginning of the string (which counts as a non-word position) and the letter "a". With "à", both sides of every position are non-word, so no boundary exists anywhere in the string.

Word characters in JavaScript regexes are defined as [a-zA-Z0-9_].
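A quick way to see this for yourself is the sketch below. It writes "à" as the escape \u00E0 so the result does not depend on how the source file encodes the character; isWordChar and results are names invented for this example.

```javascript
// "à" as a single precomposed code point, U+00E0.
const aGrave = "\u00E0";

// \w covers only ASCII letters, digits, and underscore.
const isWordChar = (ch) => /^\w$/.test(ch);

const results = {
  asciiIsWord: isWordChar("a"),
  accentedIsWord: isWordChar(aGrave),
  // \b before "a" at the start of the string: boundary exists.
  boundaryBeforeAscii: /\ba/.test("a"),
  // \b before "à" at the start of the string: no word character on either side.
  boundaryBeforeAccented: new RegExp("\\b" + aGrave).test(aGrave),
  // "à" followed by "b": \b sees non-word then word, so it matches here.
  boundaryAfterAccented: new RegExp(aGrave + "\\b").test(aGrave + "b"),
};
console.log(results);
```

The last entry shows the flip side: \b treats "à" like punctuation, so it reports a boundary in the middle of what a human reader would call a single word.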

To match an accented character at the start of a string, put ^ at the beginning of the regex (e.g. /^à/). ^ matches the beginning of the string (unlike \b, which matches at any word boundary within the string). It is one of the most basic, standard regex constructs, so it's definitely not over the top.
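If you need the start of a word rather than the start of the string, one possible approach is to build your own boundary from a negative lookbehind and Unicode property escapes. This is only a sketch. It assumes an engine with ES2018 support (lookbehind and \p{...} with the u flag) and input in precomposed (NFC) form. startsWordWith is a hypothetical helper, not a built-in.

```javascript
// Hypothetical helper: true when `letter` occurs in `text` at a position
// that is not preceded by a Unicode letter, digit, or underscore.
function startsWordWith(letter, text) {
  // Escape regex syntax characters so `letter` is matched literally.
  const escaped = letter.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // (?<!...) is a negative lookbehind; \p{L} and \p{N} need the u flag.
  const re = new RegExp("(?<![\\p{L}\\p{N}_])" + escaped, "u");
  return re.test(text);
}

console.log(startsWordWith("\u00E0", "\u00E0 la carte"));       // "à la carte"
console.log(startsWordWith("\u00E0", "voil\u00E0"));            // "voilà"
console.log(startsWordWith("\u00E9", "caf\u00E9 \u00E9t\u00E9")); // "café été"
```

The lookbehind plays the role of the left half of \b, but with "word character" widened to all Unicode letters and digits. Text that uses combining marks (for example "a" followed by U+0300) would need String.prototype.normalize("NFC") first.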

Riimu answered Oct 01 '22 21:10