To match A to Z, we will use regex:
[A-Za-z]
How to allow regex to match utf8 characters entered by user? For example Chinese words like 环保部
RegexBuddy's regex engine is fully Unicode-based starting with version 2.0. 0.
Unicode sequences can be used everywhere in Java code. As long as it contains Unicode characters, it can be used as an identifier. You may use Unicode to convey comments, ids, character content, and string literals, as well as other information. However, note that they are interpreted by the compiler early.
\u000a — Line feed — \n. \u000d — Carriage return — \r. \u2028 — Line separator. \u2029 — Paragraph separator.
$ means "Match the end of the string" (the position after the last character in the string). Both are called anchors and ensure that the entire string is matched instead of just a substring.
What you are looking for are Unicode properties.
e.g. \p{L}
is any kind of letter from any language
So a regex to match such a Chinese word could be something like
\p{L}+
There are many such properties, for more details see regular-expressions.info
Another option is to use the modifier
Pattern.UNICODE_CHARACTER_CLASS
In Java 7 there is a new property Pattern.UNICODE_CHARACTER_CLASS
that enables the Unicode version of the predefined character classes see my answer here for some more details and links
You could do something like this
Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
and \w
would match all letters and all digits from any languages (and of course some word combining characters like _
).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With