
Regular expression \b in Java and JavaScript

Is there any difference in how the regular expression \b behaves in Java and JavaScript?
I tried the test below.
In JavaScript:

console.log(/\w+\b/.test("test中文")); // true

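In Java:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regEx = "\\w+\\b";
String text = "test中文"; // declaration was missing in the original snippet
Pattern pattern = Pattern.compile(regEx);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println("matched"); // never executed
}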

Why do the two examples above give different results?

Gary Chen asked May 15 '26 10:05



1 Answer

That is because, by default, Java supports Unicode for \b but not for \w, while JavaScript supports Unicode for neither.

So \w can only match [a-zA-Z0-9_] characters (in our case, test), but \b does not accept the position marked with | in

test|中文

as a boundary between an alphabetic and a non-alphabetic character, because Unicode considers both t and 中 to be alphabetic characters.

If you want a \b that ignores Unicode, you can use the look-around mechanism and rewrite it as (?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)). For this example, a simple (?!\\w) in place of \\b will also work.
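As a rough sketch of the look-around rewrite, the snippet below applies both forms to the string from the question. The class name AsciiBoundaryDemo, the constant ASCII_BOUNDARY, and the helper firstMatch are made up for illustration, and the input is written with Unicode escapes only to avoid source-encoding problems.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative, not part of any API.
class AsciiBoundaryDemo {
    // Word boundary built from look-arounds. It relies on \w being
    // ASCII-only when no Unicode flag is set.
    static final String ASCII_BOUNDARY = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Returns the first match of regex in text, or null if nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "test\u4e2d\u6587"; // "test" followed by the two Chinese characters

        // Full replacement for \b: matches "test"
        System.out.println(firstMatch("\\w+" + ASCII_BOUNDARY, text));

        // Shorter form that is enough after \w+: also matches "test"
        System.out.println(firstMatch("\\w+(?!\\w)", text));
    }
}
```

Both look-arounds test only \w, so the boundary they describe follows the same ASCII definition that \w+ itself uses.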

If you want \w to support Unicode as well, compile your pattern with the Pattern.UNICODE_CHARACTER_CLASS flag (which can also be written as the flag expression (?U)).
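A minimal sketch of the Unicode option, showing the flag passed to Pattern.compile and the equivalent inline (?U) form. The class name UnicodeWordDemo and its helper methods are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern and Matcher come from the JDK.
class UnicodeWordDemo {
    // Returns the first match of the pattern in text, or null if nothing matches.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    // Unicode-aware \w and \b via the compile flag.
    static String matchWithFlag(String text) {
        return firstMatch(Pattern.compile("\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS), text);
    }

    // The same setting written as an embedded flag expression.
    static String matchWithInlineFlag(String text) {
        return firstMatch(Pattern.compile("(?U)\\w+\\b"), text);
    }

    public static void main(String[] args) {
        String text = "test\u4e2d\u6587"; // "test" followed by the two Chinese characters

        // With the flag set, \w also accepts the Chinese characters,
        // so each call matches the whole string.
        System.out.println(matchWithFlag(text));
        System.out.println(matchWithInlineFlag(text));
    }
}
```

With the flag set, \w and \b use the same Unicode definition of a word character, so the match runs to the end of the input.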

Pshemo answered May 17 '26 22:05



