I'm trying to make a dynamic regex that matches a person's name. It works without problems on most names, until I ran into accented characters at the end of the name.
Example: Some Fancy Namé
The regex I've used so far is:
/\b(Fancy Namé|Namé)\b/i
Used like this:
"Goal: Some Fancy Namé. Awesome.".replace(/\b(Fancy Namé|Namé)\b/i, '<a href="#">$1</a>');
This simply won't match. If I replace the é with an e, it matches just fine. If I try to match a name such as "Some Fancy Naméa", it works just fine. If I remove the last word boundary anchor, it works just fine.
Why doesn't the word boundary anchor work here? Any suggestions on how I would get around this problem?
I have considered using something like this, but I'm not sure what the performance penalties would be like:
"Some fancy namé. Allow me to ellaborate.".replace(/([\s.,!?])(fancy namé|namé)([\s.,!?]|$)/g, '$1<a href="#">$2</a>$3')
Suggestions? Ideas?
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”.
The word boundary \b matches positions where one side is a word character (usually a letter, digit or underscore, though the exact set varies across engines) and the other side is not a word character (for instance, it may be the beginning of the string or a space character).
A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ( [0-9A-Za-z_] ). So, in the string "-12" , it would match before the 1 or after the 2. The dash is not a word character.
\b is a zero-width match of a word boundary (either the start or the end of a word, where "word" is defined as \w+). Note: "zero width" means that if the \b is within a regex that matches, it does not add any characters to the text captured by that match.
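To make the "-12" example concrete, here is a small sketch that marks every position where \b matches. The helper name markWordBoundaries is made up for illustration; it relies only on String.prototype.replace with a global zero-width pattern.

```javascript
// Hypothetical helper: inserts "|" at every position where \b matches.
function markWordBoundaries(text) {
  return text.replace(/\b/g, "|");
}

// The dash is not a word character, so the boundaries sit
// before the 1 and after the 2.
const markedNumber = markWordBoundaries("-12");
console.log(markedNumber); // "-|12|"
```

Because \b is zero-width, the replacement only inserts markers and never consumes any of the original characters.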
JavaScript's regex implementation is not Unicode-aware. It only knows the ‘word characters’ in standard low-byte ASCII, which does not include é or any other accented or non-English letters.
Because é is not a word character to JS, é followed by a space can never be considered a word boundary. (It would match \b if used in the middle of a word, like Namés.)
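A quick way to see this for yourself, using the pattern from the question (the variable names are just for illustration):

```javascript
const pattern = /\b(Fancy Namé|Namé)\b/i;

// é followed by "." : neither side is a word character, so the
// trailing \b cannot match and the whole pattern fails.
const endsWithAccent = pattern.test("Goal: Some Fancy Namé. Awesome.");

// é followed by "a" : non-word char next to a word char, so \b matches.
const accentMidWord = pattern.test("Goal: Some Fancy Naméa. Awesome.");

// Marking the boundaries shows where JS thinks the word ends.
const markedName = "Namé.".replace(/\b/g, "|");

console.log(endsWithAccent); // false
console.log(accentMidWord);  // true
console.log(markedName);     // "|Nam|é."
```

Note how the boundary lands between the m and the é, not after the é.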
/([\s.,!?])(fancy namé|namé)([\s.,!?]|$)/
Yeah, that would be the usual workaround for JS (though probably with more punctuation characters). For other languages you'd generally use lookahead/lookbehind to avoid matching the pre and post boundary characters, but lookbehind in particular has historically been missing or unreliable in JS engines (it was only standardized in ES2018), so it is best avoided if older browsers matter.
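A rough sketch of that workaround as a reusable function. The name linkNames, the escaping step, and the added ^ alternative (so a name at the very start of the string also matches) are my own assumptions, not part of the original regex; extend the delimiter class as needed.

```javascript
// Hypothetical helper: wraps each listed name in a link, using explicit
// delimiter groups instead of \b.
function linkNames(text, names) {
  // Escape regex metacharacters so names are matched literally.
  const escaped = names.map((n) => n.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  const pattern = new RegExp(
    "(^|[\\s.,!?])(" + escaped.join("|") + ")([\\s.,!?]|$)",
    "gi"
  );
  // $1 and $3 put the consumed delimiters back around the link.
  return text.replace(pattern, '$1<a href="#">$2</a>$3');
}

const linked = linkNames("Goal: Some Fancy Namé. Awesome.", ["Fancy Namé", "Namé"]);
console.log(linked); // 'Goal: Some <a href="#">Fancy Namé</a>. Awesome.'
```

One caveat: the delimiters are consumed by the match, so two names separated by a single space will not both be linked in one pass. List longer names first so "Fancy Namé" wins over "Namé".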
Rob is correct. Quoted from the ECMAScript 3rd edition:
15.10.2.6 Assertion:
The production Assertion \b evaluates by ...
2. Call IsWordChar(e−1) and let a be the boolean result
3. Call IsWordChar(e) and let b be the boolean result
and
The internal helper function IsWordChar ... performs the following:
3. If c is one of the sixty-three characters in the table below, return true.
a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _
Since é is not one of these 63 characters, the location between é and a will be considered a word boundary.
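Here is how the quoted steps might look in plain JavaScript. This is only a sketch: the quote elides the remaining steps, so I am assuming (in line with the usual definition of \b) that a boundary exists exactly when the two IsWordChar results differ, and that positions outside the string count as non-word. The function names mirror the spec's wording but are my own.

```javascript
// True only for the 63 characters listed in the spec table.
// Assumption: indexes outside the string are treated as non-word.
function isWordChar(text, index) {
  if (index < 0 || index >= text.length) return false;
  return /[A-Za-z0-9_]/.test(text[index]);
}

// Assumption: position e is a boundary when exactly one neighbour
// is a word character.
function isWordBoundary(text, e) {
  const a = isWordChar(text, e - 1);
  const b = isWordChar(text, e);
  return a !== b;
}

console.log(isWordBoundary("Naméa", 4)); // true:  between é and a
console.log(isWordBoundary("Namé.", 4)); // false: between é and .
console.log(isWordBoundary("Namé.", 3)); // true:  between m and é
```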
If you know the class of characters, you may use a negative lookahead assertion, e.g.
/(^|[^\wÀ-ÖØ-öø-ſ])(Fancy Namé|Namé)(?![\wÀ-ÖØ-öø-ſ])/
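Used with replace, that could look something like the sketch below. The g flag and the linkName wrapper are my additions; the leading group still consumes one character, so it is put back with $1, while the lookahead consumes nothing.

```javascript
// Same pattern as above, with the g flag added (an assumption) so that
// every occurrence is replaced.
const namePattern = /(^|[^\wÀ-ÖØ-öø-ſ])(Fancy Namé|Namé)(?![\wÀ-ÖØ-öø-ſ])/g;

// Hypothetical wrapper: $1 restores the character before the name.
function linkName(text) {
  return text.replace(namePattern, '$1<a href="#">$2</a>');
}

console.log(linkName("Goal: Some Fancy Namé. Awesome."));
// 'Goal: Some <a href="#">Fancy Namé</a>. Awesome.'
console.log(linkName("Some Fancy Naméa"));
// 'Some Fancy Naméa' (unchanged, the name is followed by a letter)
```

The character ranges only cover Latin letters up to U+017F, so names in other scripts would need a wider class.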