While totally unrelated at first, this question made me wonder...
Java's regexes are based on Strings; Strings are sequences (arrays) of chars; and chars are ultimately UTF-16 code units.
The latter means that a single char can match any Unicode code point inside the BMP, i.e. from U+0000 to U+FFFF.
Outside the BMP, however, two chars are required for a single code point (one for the leading surrogate, another for the trailing surrogate). As far as I can tell, short of a dedicated grammar engine, there is no way for Java regexes (as defined by java.util.regex.Pattern) to define "character classes" for such code points, since there is no String literal for code points outside the BMP.
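To make the premise concrete, here is a minimal sketch of the difference between UTF-16 code units and code points. The class and method names (CodeUnitDemo, codeUnits, codePoints) are made up for illustration; the calls themselves are plain java.lang.String and java.lang.Character methods.

```java
public class CodeUnitDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+00E9 lies inside the BMP: one char, one code point.
        String bmp = new String(Character.toChars(0x00E9));
        // U+1F601 lies outside the BMP: two chars (a surrogate pair), one code point.
        String astral = new String(Character.toChars(0x1F601));

        System.out.println(codeUnits(bmp) + " / " + codePoints(bmp));       // 1 / 1
        System.out.println(codeUnits(astral) + " / " + codePoints(astral)); // 2 / 1
        System.out.println(Character.isHighSurrogate(astral.charAt(0)));    // true
        System.out.println(Character.isLowSurrogate(astral.charAt(1)));     // true
    }
}
```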
Notwithstanding that code can be written to produce regexes (well, string literals used as regexes) for such ranges, is there an existing mechanism in Pattern which is not documented and allows doing that?
OK, so, answer to self; data extracted from this question and the associated answer which @DavidWallace pointed to.
It is possible. To paraphrase the answer, in such a character class as:
"[\uD83D\uDE01-\uD83D\uDE4F]"
the Java regex engine will be smart enough to notice that you specify a surrogate pair on both ends of the interval, and "compile" the regex accordingly.
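A quick way to check this for yourself is sketched below. The class name SurrogateRangeDemo and the helper countMatches are invented for the example. The test string mixes ASCII letters, one code point just past the range (U+1F650) and two occurrences of U+1F601; if the class were treated as a set of individual chars rather than a range of code points, the count would differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurrogateRangeDemo {
    // Range U+1F601..U+1F64F, written with a surrogate pair on each end.
    static final Pattern EMOTICONS = Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    // Counts how many times the character class matches in the input.
    static int countMatches(String input) {
        Matcher m = EMOTICONS.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String inside = new String(Character.toChars(0x1F601));  // first code point of the range
        String outside = new String(Character.toChars(0x1F650)); // just past the end of the range
        String input = "a" + inside + "b" + outside + inside;

        System.out.println(countMatches(input)); // 2
    }
}
```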
In addition, starting with Java 7, you can also use \x{foo} where foo is the hexadecimal representation of the code point. Not forgetting the quoting necessary in Java string literals, the above can therefore be written:
"[\\x{1F601}-\\x{1F64F}]"
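As a sanity check, the sketch below compiles both spellings and compares them around the edges of the range. The names HexRangeDemo, str, inRange and agree are hypothetical helpers, not part of any API; only Pattern, Matcher and Character come from the JDK, and the \x{...} form assumes Java 7 or later.

```java
import java.util.regex.Pattern;

public class HexRangeDemo {
    // The range spelled with surrogate pairs in the string literal.
    static final Pattern BY_SURROGATES = Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");
    // The same range spelled with the hexadecimal code point escape (Java 7+).
    static final Pattern BY_HEX = Pattern.compile("[\\x{1F601}-\\x{1F64F}]");

    // Builds a one-code-point string.
    static String str(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // True if the hex-escape pattern matches the given code point.
    static boolean inRange(int codePoint) {
        return BY_HEX.matcher(str(codePoint)).matches();
    }

    // True if both patterns give the same answer for the given code point.
    static boolean agree(int codePoint) {
        String s = str(codePoint);
        return BY_SURROGATES.matcher(s).matches() == BY_HEX.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inRange(0x1F600)); // false
        System.out.println(inRange(0x1F601)); // true
        System.out.println(inRange(0x1F64F)); // true
        System.out.println(inRange(0x1F650)); // false

        // Compare the two spellings on a window around the range.
        boolean allAgree = true;
        for (int cp = 0x1F5F0; cp <= 0x1F660; cp++) {
            allAgree &= agree(cp);
        }
        System.out.println(allAgree);
    }
}
```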