
Why can't I use accented characters next to a word boundary?

I'm trying to build a dynamic regex that matches a person's name. It worked without problems on most names until I ran into accented characters at the end of the name.

Example: Some Fancy Namé

The regex I've used so far is:

/\b(Fancy Namé|Namé)\b/i

Used like this:

"Goal: Some Fancy Namé. Awesome.".replace(/\b(Fancy Namé|Namé)\b/i, '<a href="#">$1</a>');

This simply won't match. If I replace the é with an e, it matches just fine. If I try to match a name such as "Some Fancy Naméa", it works just fine. If I remove the last word boundary anchor, it works just fine.

Why doesn't the word boundary anchor work here? Any suggestions on how I would get around this problem?

I have considered using something like this, but I'm not sure what the performance penalties would be like:

"Some fancy namé. Allow me to elaborate.".replace(/([\s.,!?])(fancy namé|namé)([\s.,!?]|$)/g, '$1<a href="#">$2</a>$3')

Suggestions? Ideas?

Rexxars asked Mar 15 '10 19:03



2 Answers

JavaScript's regex implementation is not Unicode-aware. It only knows the ‘word characters’ in standard low-byte ASCII, which does not include é or any other accented or non-English letters.

Because é is not a word character to JS, é followed by a space can never be considered a word boundary. (It would match \b if used in the middle of a word, like Namés.)
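A quick way to see this for yourself is the sketch below. It assumes the é is the single precomposed character U+00E9, and the variable names are only for illustration.

```javascript
// \w and \b only know the ASCII word characters [A-Za-z0-9_].
const isWord = (ch) => /\w/.test(ch);

console.log(isWord("e")); // true
console.log(isWord("é")); // false

// "é" followed by "." is non-word next to non-word: no boundary, so no match.
const endsWithAccent = /\bNamé\b/.test("Some Fancy Namé. Awesome.");
// "é" followed by "s" is non-word next to word: a boundary, so this matches.
const accentMidWord = /\bNamé\b/.test("Some Fancy Namés");
// With a plain "e" the usual word-to-punctuation boundary exists.
const plainAscii = /\bName\b/.test("Some Fancy Name. Awesome.");

console.log(endsWithAccent, accentMidWord, plainAscii); // false true true
```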

/([\s.,!?])(fancy namé|namé)([\s.,!?]|$)/

Yeah, that would be the usual workaround for JS (though probably with more punctuation characters). For other languages you'd generally use lookahead/lookbehind to avoid matching the pre and post boundary characters, but these are poorly supported/buggy in JS so best avoided.
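Since the regex in the question is built dynamically, here is a rough sketch of that workaround wrapped in a function. The names escapeRegex and linkNames are made up for this example, the delimiter list is just the one from the question, and I've added ^ so a name at the very start of the string is also found.

```javascript
// Escape regex metacharacters so a name can be embedded in a pattern safely.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// linkNames is a hypothetical helper: it wraps each listed name in a link,
// using explicit delimiter classes in place of \b.
function linkNames(text, names) {
  // Longest first, so "Fancy Namé" wins over "Namé" in the alternation.
  const alternation = names
    .slice()
    .sort((a, b) => b.length - a.length)
    .map(escapeRegex)
    .join("|");
  const pattern = new RegExp(
    "(^|[\\s.,!?])(" + alternation + ")([\\s.,!?]|$)",
    "gi"
  );
  // $1 and $3 put back the delimiters that the match consumed.
  return text.replace(pattern, '$1<a href="#">$2</a>$3');
}

console.log(linkNames("Goal: Some Fancy Namé. Awesome.", ["Namé", "Fancy Namé"]));
// Goal: Some <a href="#">Fancy Namé</a>. Awesome.
```

Because the trailing delimiter is consumed, two names separated by a single space will not both be linked in one pass. That is the price of not using lookahead.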

bobince answered Oct 19 '22 23:10


Rob is correct. Quoted from the ECMAScript 3rd edition:

15.10.2.6 Assertion:

The production Assertion \b evaluates by ...

2. Call IsWordChar(e−1) and let a be the boolean result
3. Call IsWordChar(e) and let b be the boolean result

and

The internal helper function IsWordChar ... performs the following:

3. If c is one of the sixty-three characters in the table below, return true.

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9 _

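Since é is not one of these 63 characters, the location between é and the letter "a" will be considered a word boundary, while the location between é and a space or a period will not, because neither side is a word character.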
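The quoted steps can be mimicked in plain JavaScript. This is only an illustration of the logic, not how any engine implements it; isWordChar and isWordBoundary are names I made up, and the é is assumed to be the single character U+00E9.

```javascript
// The sixty-three characters from the spec table.
const WORD_CHARS =
  "abcdefghijklmnopqrstuvwxyz" + "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + "0123456789_";

// Sketch of IsWordChar: positions outside the string are not word characters.
function isWordChar(input, e) {
  if (e < 0 || e >= input.length) return false;
  return WORD_CHARS.includes(input[e]);
}

// \b holds at position e when exactly one neighbour is a word character.
function isWordBoundary(input, e) {
  const a = isWordChar(input, e - 1);
  const b = isWordChar(input, e);
  return a !== b;
}

console.log(isWordBoundary("Name.", 4)); // true: "e" is a word char, "." is not
console.log(isWordBoundary("Namé.", 4)); // false: neither "é" nor "." is
console.log(isWordBoundary("Naméa", 4)); // true: "é" is not, "a" is
```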

If you know which characters can appear in the names, you can use a negative lookahead assertion together with an explicit character class, e.g.

/(^|[^\wÀ-ÖØ-öø-ſ])(Fancy Namé|Namé)(?![\wÀ-ÖØ-öø-ſ])/
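Applied to the string from the question, it could look roughly like this. linkName and linkNameUnicode are names invented for the example, and the accented ranges are written as \u escapes purely to avoid file-encoding surprises. The second pattern is an extra, not part of the pattern above: it assumes an engine with ES2018 lookbehind and Unicode property escapes.

```javascript
// The pattern above, with the accented ranges written as escapes:
// À-Ö is \u00C0-\u00D6, Ø-ö is \u00D8-\u00F6, ø-ſ is \u00F8-\u017F.
const LETTERS = "\\w\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u017F";
const namePattern = new RegExp(
  "(^|[^" + LETTERS + "])(Fancy Namé|Namé)(?![" + LETTERS + "])"
);

function linkName(text) {
  // $1 puts back the delimiter consumed before the name.
  return text.replace(namePattern, '$1<a href="#">$2</a>');
}

// Assumption: ES2018+ engine. Nothing is consumed on either side here.
const unicodePattern = /(?<![\p{L}\p{N}_])(Fancy Namé|Namé)(?![\p{L}\p{N}_])/u;

function linkNameUnicode(text) {
  return text.replace(unicodePattern, '<a href="#">$1</a>');
}

console.log(linkName("Goal: Some Fancy Namé. Awesome."));
// Goal: Some <a href="#">Fancy Namé</a>. Awesome.
console.log(linkName("Some Fancy Naméa"));
// Some Fancy Naméa (unchanged, the name is followed by a letter)
```

Add the i and g flags if you need case-insensitive or repeated matching, as in the question.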
kennytm answered Oct 20 '22 00:10