
Javascript RegExp + Word boundaries + unicode characters

I am building a search feature and plan to use a JavaScript autocomplete with it. I am from Finland, so I have to deal with Finnish special characters such as ä, ö and å.

When the user types text into the search input field, I try to match the text against the data.

Here is a simple example that does not work correctly if the user types, for example, "ää". The same happens with "äl".

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

// Does not work
var searchterm = "äl";

// does not work
//var searchterm = "ää";

// Works
//var searchterm = "wi";

if ( new RegExp("\\b"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);
}

http://jsfiddle.net/7TsxB/

So how can I get the characters ä, ö and å to work with JavaScript regex?

I think I should use Unicode escapes, but how should I do that? The codes for those characters are: [\u00C4,\u00E4,\u00C5,\u00E5,\u00D6,\u00F6]

=> ÄäÅåÖö

user1394520 asked May 14 '12 19:05





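1 Answer

The problem is how JavaScript defines the word boundary \b: it is a position between a \w character and a non-\w character, and \w covers only the ASCII set [A-Za-z0-9_]. Characters such as ä, ö and å therefore count as non-word characters, so there is no boundary between a space and a following ä, and \b fails at the start of words like "älkää".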
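A quick way to see this behaviour, as a minimal sketch assuming any standard JavaScript engine (a browser console or Node.js). The variable names are only illustrative, and ä is written as the escape \u00E4 so the result does not depend on file encoding:

```javascript
// "\u00E4" is ä. In JavaScript, \w is ASCII-only: [A-Za-z0-9_].
var isWordChar = /\w/.test("\u00E4");

// Space and ä are both non-word characters, so no boundary lies between them.
var boundaryAfterSpace = /\b\u00E4l/.test(" \u00E4lk\u00E4\u00E4"); // " älkää"

// A boundary IS reported between ASCII "k" and ä, in the middle of a word.
var boundaryInsideWord = /\b\u00E4\u00E4/.test("\u00E4lk\u00E4\u00E4"); // "älkää"

// Plain ASCII words behave as expected.
var asciiWordStart = /\bwi/.test("string with");

console.log(isWordChar, boundaryAfterSpace, boundaryInsideWord, asciiWordStart);
// prints: false false true true
```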

Instead of using \b, try using (?:^|\\s)

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

// Does not work
var searchterm = "äl";

// does not work
//var searchterm = "ää";

// Works
//var searchterm = "wi";

if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);
}

Breakdown:

(?: Parentheses () form a capturing group in a regex. Parentheses that start with a question mark and colon, (?: ... ), form a non-capturing group; they just group the terms together.

^ the caret symbol matches the beginning of a string

| the bar is the "or" operator.

\s matches whitespace (appears as \\s in the string because we have to escape the backslash)

) closes the group

So instead of using \b, which is defined in terms of ASCII word characters and therefore misbehaves around characters such as ä, we use a non-capturing group that matches either the beginning of the string or a whitespace character.
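For the autocomplete case, the pattern above could be wrapped in a small helper. The following is only a sketch: the function names escapeRegExp, matchesWordStart and matchesWordStartUnicode are hypothetical, and escaping the search term is an addition not present in the code above (without it, input such as "c++" would be read as regex syntax). Note that the whitespace version does not treat punctuation as a separator, so a word directly after "(" is not found. The second variant assumes an engine with ES2018 support for the u flag, \p{...} property escapes and lookbehind.

```javascript
// Hypothetical helper names; not part of the original answer.

// Escape regex metacharacters so user input is matched literally.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True if a word at the start of `title`, or after whitespace, begins with `searchterm`.
// The "g" flag is unnecessary for a single test() call, so only "i" is used.
function matchesWordStart(title, searchterm) {
    return new RegExp("(?:^|\\s)" + escapeRegExp(searchterm), "i").test(title);
}

// Variant assuming an ES2018+ engine: the term must not be preceded by
// a Unicode letter, a Unicode digit or an underscore.
function matchesWordStartUnicode(title, searchterm) {
    return new RegExp("(?<![\\p{L}\\p{N}_])" + escapeRegExp(searchterm), "iu").test(title);
}

// Same sentence as in the question, with ä = \u00E4 and ö = \u00F6.
var title = "this is simple string with finnish word t\u00E4m\u00E4 on " +
            "\u00E4\u00E4kk\u00F6stesti \u00E4lk\u00E4\u00E4 ihmetelk\u00F6";

console.log(matchesWordStart(title, "\u00E4l"));        // "äl" starts "älkää"
console.log(matchesWordStart(title, "m\u00E4"));        // "mä" occurs only inside "tämä"
console.log(matchesWordStartUnicode("(\u00E4lk\u00E4\u00E4)", "\u00E4l")); // word after "("
```

The first and third calls print true and the second prints false. The whitespace version is the closest to the answer above; the Unicode variant is worth considering only if the target browsers are known to support it.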

mowwwalker answered Sep 23 '22 05:09
