I would like to use the regular expression new RegExp("\\b" + pat + "\\b") on Greek text, but the \b metacharacter only treats ASCII characters as word characters.
I tried the XRegExp library, but I didn't manage to solve the issue.
Any suggestions would be greatly appreciated.
JavaScript, which does not offer any Unicode support through its RegExp class, does support \uFFFF for matching a single Unicode code point as part of its string syntax. XML Schema and XPath do not have a regex token for matching Unicode code points.
\p{L} matches a single code point in the category "letter". \p{N} matches any kind of numeric character in any script. Source: regular-expressions.info.
\u000d: carriage return (\r). \u2028: line separator. \u2029: paragraph separator.
Flag u enables the support of Unicode in regular expressions. That means two things: Characters of 4 bytes are handled correctly: as a single character, not two 2-byte characters. Unicode properties can be used in the search: \p{…} .
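With the u flag, one way to get a Unicode-aware replacement for \b is to use lookarounds built from \p{…} classes. The following is only a sketch, assuming an engine that supports both Unicode property escapes and lookbehind (ES2018 or later); the helper names escapeRegExp and unicodeWordRegExp are hypothetical, and the choice of "letters, marks, digits and underscore" as word characters is an assumption you may want to adjust.

```javascript
// Escape regex metacharacters so that `pat` is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds a "whole word" regex in which any Unicode
// letter, combining mark, digit, or underscore counts as a word character.
function unicodeWordRegExp(pat) {
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  return new RegExp(
    // Not preceded by a word character, and not followed by one.
    "(?<!" + wordChar + ")" + escapeRegExp(pat) + "(?!" + wordChar + ")",
    "u"
  );
}

const re = unicodeWordRegExp("και");
console.log(re.test("εγώ και εσύ")); // true
console.log(re.test("καιρός"));      // false
```

The lookarounds do not consume any characters, so the match itself is still just pat, as it would be with \b.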
I think this is helpful for your question.
<script src="xregexp.js"></script>
<script src="xregexp-unicode-base.js"></script>
<script>
var unicodeWord = XRegExp("^\\p{L}+$");
unicodeWord.test("Русский"); // true
unicodeWord.test("日本語"); // true
unicodeWord.test("العربية"); // true
</script>
<!-- \p{L} is included in the base script, but other categories, scripts,
and blocks require token packages -->
<script src="xregexp-unicode-scripts.js"></script>
<script>
XRegExp("^\\p{Katakana}+$").test("カタカナ"); // true
</script>
Please refer to the following page: http://xregexp.com/plugins/
So the answer is simply that you cannot use JavaScript's native mechanisms, or any library that relies on them, to match words the way you want. As you already stated, \b matches at word boundaries. Words must consist of word characters, and in JavaScript (and actually in other regex implementations) the word characters are a-z, A-Z, 0-9 and _. Many other languages, however, implement the \b metacharacter differently from JavaScript.
The answer "JavaScript does not support Unicode" is a bit too easy and in fact completely wrong. JavaScript just doesn't use Unicode for the character classes. If JavaScript didn't support Unicode, you couldn't even use Unicode characters in string literals, and of course that is possible in JavaScript.
According to the ECMA 262 Standard (ECMAScript) (Section 15.10.2.6):
[...] The production Assertion :: \ b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following:
The abstract operation IsWordChar takes an integer parameter e and performs the following:
This just shows that \b uses the IsWordChar algorithm to check whether a position is a word boundary. In the definition of IsWordChar you can see exactly which characters return true.
In my opinion this has absolutely nothing to do with the character set being used. It's neither ASCII nor Unicode compliant here. It's just these 63 characters.
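To make the point about the 63 characters concrete, here is a rough sketch. The isWordChar function below is a simplified, hypothetical stand-in that takes a character (the abstract operation in the standard takes an index into the input), and it only reflects the behaviour of \b without any flags.

```javascript
// The 63 characters treated as word characters: a-z, A-Z, 0-9 and _.
const WORD_CHARS = new Set(
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
);

// Simplified, hypothetical stand-in for IsWordChar on a single character.
function isWordChar(ch) {
  return WORD_CHARS.has(ch);
}

console.log(WORD_CHARS.size); // 63
console.log(isWordChar("a")); // true
console.log(isWordChar("α")); // false

// Greek letters are not word characters, so there is no boundary between
// a space and "κ", but there is one between "a" and "κ".
console.log(/\bκαι\b/.test("εγώ και εσύ")); // false
console.log(/\bκαι\b/.test("aκαιb"));       // true
```

The last two lines show why the original pattern fails on Greek text: \b only fires where exactly one of the two neighbouring characters is in this set.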