
Java regex with Unicode support?

To match A to Z, we will use regex:

[A-Za-z]

How can I make the regex match UTF-8 characters entered by the user, for example Chinese words like 环保部?

asked Jun 05 '12 08:06 by cometta



1 Answer

What you are looking for are Unicode properties.

e.g. \p{L} is any kind of letter from any language

So a regex to match such a Chinese word could be something like

\p{L}+ 

There are many such properties. For more details, see regular-expressions.info.
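As a rough illustration of the property approach, here is a minimal sketch. The class and method names are hypothetical, and the \p{IsHan} script property is an addition not mentioned above (it is one of the "many such properties" and needs Java 7 or later). The sample word 环保部 is written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class UnicodeLetters {
    // \p{L} matches a letter from any language; + requires at least one.
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // \p{IsHan} narrows the match to characters of the Han script (Java 7+).
    private static final Pattern HAN = Pattern.compile("\\p{IsHan}+");

    // True if the whole input consists of Unicode letters.
    static boolean isAllLetters(String s) {
        return LETTERS.matcher(s).matches();
    }

    // True if the whole input consists of Han characters.
    static boolean isAllHan(String s) {
        return HAN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "\u73AF\u4FDD\u90E8"; // the word 环保部 from the question
        System.out.println(isAllLetters(word));     // true
        System.out.println(isAllLetters("abc"));    // true
        System.out.println(isAllLetters("abc123")); // false: digits are not letters
        System.out.println(isAllHan(word));         // true
        System.out.println(isAllHan("abc"));        // false: Latin, not Han
    }
}
```

Note that matches() tests the whole input; use find() instead if you only need to locate a letter run inside a longer string.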

Another option is to use the flag

Pattern.UNICODE_CHARACTER_CLASS

Java 7 added the flag Pattern.UNICODE_CHARACTER_CLASS, which enables the Unicode versions of the predefined character classes. See my answer here for more details and links.

You could do something like this

Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS); 

With this flag, \w matches all letters and digits from any language (and, of course, connector punctuation such as _).
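To make the difference visible, here is a small sketch comparing \w with and without the flag. The class and method names are hypothetical, and 环保部 is again written with Unicode escapes so the file encoding does not matter.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class UnicodeWord {
    // Default behaviour: \w is limited to [a-zA-Z_0-9].
    private static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // With the flag (Java 7+), \w follows the Unicode definition of a word character.
    private static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // True if the whole input matches the ASCII-only \w+.
    static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    // True if the whole input matches the Unicode-aware \w+.
    static boolean isUnicodeWord(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "\u73AF\u4FDD\u90E8"; // the word 环保部 from the question
        System.out.println(isAsciiWord(word));      // false
        System.out.println(isUnicodeWord(word));    // true
        System.out.println(isUnicodeWord("foo_1")); // true
        System.out.println(isUnicodeWord("a b"));   // false: the space is not a word character
    }
}
```

The same behaviour can also be switched on inside the pattern itself with the embedded flag (?U), for example "(?U)\\w+".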

answered Oct 24 '22 16:10 by stema