
Using Java regexes to match a range of Unicode code points _outside_ the BMP: is it possible at all?

While totally unrelated at first, this question made me wonder...

Java's regexes are based on Strings; Strings are sequences (arrays) of chars; and chars are ultimately UTF-16 code units.

The latter means that a single char can represent any Unicode code point inside the BMP, i.e. from U+0000 to U+FFFF.

Outside the BMP, however, two chars are required for a single code point (one for the leading surrogate, another for the trailing surrogate). Short of a dedicated grammar engine, I don't see a way for Java regexes (as defined by java.util.regex.Pattern) to define "character classes" for such code points, since there is no String literal for code points outside the BMP.
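To make the char/code point distinction concrete, here is a minimal sketch using only standard `Character` and `String` methods. The class name `SurrogateDemo` and the helper `charCount` are made up for this illustration; U+00E9 and U+1F601 are arbitrary sample code points.

```java
final class SurrogateDemo {
    // Hypothetical helper: number of UTF-16 code units (chars) needed for one code point.
    static int charCount(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        // One code point inside the BMP, one outside it.
        String bmp = new String(Character.toChars(0x00E9));
        String supplementary = new String(Character.toChars(0x1F601));

        System.out.println(bmp.length());            // 1
        System.out.println(supplementary.length());  // 2

        // The two chars are a leading (high) and a trailing (low) surrogate...
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(supplementary.charAt(1)));  // true

        // ...which together still count as a single code point.
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
    }
}
```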

Leaving aside that code can be written to produce regexes (well, string literals used as regexes) for such ranges, is there an existing, undocumented mechanism in Pattern that allows this?

fge asked Nov 12 '14 22:11

1 Answer

OK, so, answering my own question; the information comes from the question and associated answer that @DavidWallace pointed to.

It is possible. To paraphrase the answer, in such a character class as:

"[\uD83D\uDE01-\uD83D\uDE4F]"

the Java regex engine will be smart enough to notice that you specify a surrogate pair on both ends of the interval, and "compile" the regex accordingly.

In addition, starting with Java 7, you can also use \x{foo}, where foo is the hexadecimal representation of the code point. With the backslashes doubled, as Java string literals require, the above can therefore be written:

"[\\x{1F601}-\\x{1F64F}]"
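As a quick check, the following sketch compiles both forms and tests a few code points against them. It assumes Java 7 or later for the `\x{...}` syntax; the class name `SupplementaryRangeDemo` and the helper `inRange` are made up for this illustration.

```java
import java.util.regex.Pattern;

final class SupplementaryRangeDemo {
    // Range written with surrogate pairs on both ends (the first form above).
    static final Pattern SURROGATE_FORM = Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    // Same range written with hexadecimal code point escapes (Java 7+).
    static final Pattern HEX_FORM = Pattern.compile("[\\x{1F601}-\\x{1F64F}]");

    // Hypothetical helper: does the pattern match a string made of this single code point?
    static boolean inRange(Pattern pattern, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return pattern.matcher(s).matches();
    }

    public static void main(String[] args) {
        int[] samples = {0x1F600, 0x1F601, 0x1F64F, 0x1F650};
        for (int cp : samples) {
            System.out.printf("U+%X surrogate form: %b, hex form: %b%n",
                    cp, inRange(SURROGATE_FORM, cp), inRange(HEX_FORM, cp));
        }
    }
}
```

Both patterns should accept U+1F601 and U+1F64F (the two ends of the range) and reject U+1F600 and U+1F650, which lie just outside it.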
fge answered Sep 23 '22 00:09