Is there any difference in how the regular expression \b behaves in Java and in JavaScript?
I tried the test below.
In JavaScript:
console.log(/\w+\b/.test("test中文"));//true
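In Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regEx = "\\w+\\b";
String text = "test中文"; // the declaration was missing its type
Pattern pattern = Pattern.compile(regEx);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println("matched"); // never executed
}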
Why are the results of the two examples above not the same?
That is because, by default, Java's \b is Unicode-aware but its \w is not, while in JavaScript neither \b nor \w is Unicode-aware.
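So \w can match only [a-zA-Z0-9_] characters (in our case, test), but \b does not accept the position marked with | in
test|中文
as a boundary between alphabetic and non-alphabetic characters, because Unicode considers both t and 中 to be alphabetic. Backtracking to a shorter match does not help either, since every position inside test lies between two word characters, so find() never succeeds.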
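To make the mismatch concrete, here is a minimal sketch that checks the two characters on either side of the marked position. The class and method names are my own, and Character.isLetterOrDigit is used only as an approximation of the Unicode test that the default \b applied on the JDK the question ran on. As far as I know, JDK 19 and later changed the default \b to be ASCII-only, so the original snippet may behave differently there; the checks below do not depend on that.

```java
import java.util.regex.Pattern;

public class WordCharDemo {
    // "\u4e2d" is the character 中, written as an escape so that the
    // source file encoding does not matter.
    static final String ZHONG = "\u4e2d";

    // Does the default (non-Unicode) \w accept this single character?
    static boolean defaultWordChar(String ch) {
        return Pattern.matches("\\w", ch);
    }

    // Is the character a letter or digit according to Unicode?
    // (Assumption: this approximates what a Unicode-aware \b looks at.)
    static boolean unicodeLetterOrDigit(String ch) {
        return Character.isLetterOrDigit(ch.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(defaultWordChar("t"));        // true
        System.out.println(defaultWordChar(ZHONG));      // false
        System.out.println(unicodeLetterOrDigit("t"));   // true
        System.out.println(unicodeLetterOrDigit(ZHONG)); // true
    }
}
```

Both characters count as letters for Unicode, so a Unicode-aware \b sees no boundary between them, while the default \w stops right before 中.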
If you want a \b that ignores Unicode, you can use look-around and rewrite it as (?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)); in this example, simply using (?!\\w) instead of \\b also works.
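A minimal sketch of the look-around replacement, assuming the same test string as in the question. The class name, the ASCII_BOUNDARY constant and the firstMatch helper are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AsciiBoundaryDemo {
    // Look-around replacement for \b built only from the default,
    // ASCII-only \w, so it does not depend on how \b treats Unicode.
    static final String ASCII_BOUNDARY = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "test\u4e2d\u6587"; // "test中文", escaped for portability

        // Full replacement for \b.
        System.out.println(firstMatch("\\w+" + ASCII_BOUNDARY, text)); // test

        // Shorter form that is enough after \w+.
        System.out.println(firstMatch("\\w+(?!\\w)", text)); // test
    }
}
```

The shorter form works here because \w+ has already consumed a word character, so only the "no word character follows" half of the boundary test is still needed.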
If you want \w to support Unicode as well, compile your pattern with the Pattern.UNICODE_CHARACTER_CLASS flag (which can also be written as the embedded flag expression (?U)).
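A minimal sketch of both spellings of the flag, again using the string from the question. The class name and the firstMatch helper are hypothetical. With the flag set, \w accepts 中 and 文, so the whole string should match as one word.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    // The same pattern, once with the compile-time flag and once with
    // the embedded flag expression (?U).
    static final Pattern BY_FLAG =
            Pattern.compile("\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);
    static final Pattern BY_INLINE = Pattern.compile("(?U)\\w+\\b");

    // Returns the first match of the pattern in text, or null if none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "test\u4e2d\u6587"; // "test中文", escaped for portability

        // Both patterns match the entire string, so both lines print true.
        System.out.println(text.equals(firstMatch(BY_FLAG, text)));
        System.out.println(text.equals(firstMatch(BY_INLINE, text)));
    }
}
```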