Using Java regexes to match a range of Unicode code points _outside_ the BMP: is it possible at all?

Tags: java, regex, unicode

While totally unrelated at first, this question made me wonder...

Java's regexes are based on Strings; Strings are sequences (arrays) of chars; and chars are ultimately UTF-16 code units.

The latter means that a single char can represent any Unicode code point inside the BMP, i.e. from U+0000 to U+FFFF.

Outside the BMP, however, two chars are required for a single code point (one for the leading surrogate, another for the trailing surrogate). Short of using a dedicated grammar engine, I don't see a way for Java regexes (as defined by java.util.regex.Pattern) to define "character classes" for such code points, since a code point outside the BMP cannot be written as a single char.
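
To make the char versus code point distinction concrete, here is a minimal sketch using only java.lang.Character and String. The class name SurrogateDemo and the helper charsNeeded are made up for this illustration:

```java
class SurrogateDemo {
    /** Returns how many UTF-16 code units (chars) one code point occupies. */
    static int charsNeeded(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int bmp = 0x00E9;            // U+00E9, inside the BMP
        int supplementary = 0x1F601; // U+1F601, outside the BMP

        System.out.println(charsNeeded(bmp));           // 1
        System.out.println(charsNeeded(supplementary)); // 2

        // One code point, but two chars in the String.
        String s = new String(Character.toChars(supplementary));
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1

        // The two chars are the leading and trailing surrogates.
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D83D DE01
    }
}
```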

Leaving aside the fact that code can be written to generate regexes (well, string literals used as regexes) for such ranges, is there an existing, perhaps undocumented, mechanism in Pattern that allows this?

387

asked Nov 12 '14 22:11

fge

1 Answer

OK, so, answer to self; data extracted from this question and the associated answer which @DavidWallace pointed to.

It is possible. To paraphrase the answer, in such a character class as:

"[\uD83D\uDE01-\uD83D\uDE4F]"

the Java regex engine is smart enough to notice that you specified a surrogate pair at both ends of the interval, and "compiles" the regex as a range of code points, here from U+1F601 to U+1F64F.

In addition, starting with Java 7, you can also use \x{foo}, where foo is the hexadecimal representation of the code point. Allowing for the escaping that Java string literals require, the above can therefore be written as:

"[\\x{1F601}-\\x{1F64F}]"

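As a quick check that the two spellings above behave the same, here is a small sketch, assuming Java 7 or later for the \x{...} form. The class name AstralRangeDemo, the field names and the helper cp are made up for this illustration; only Pattern.compile, Matcher.matches and Character.toChars come from the JDK:

```java
import java.util.regex.Pattern;

class AstralRangeDemo {
    // Range written with surrogate pairs; the compiler turns each \uXXXX into a char.
    static final Pattern SURROGATES = Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    // Same range written with the \x{...} code point syntax (Java 7+).
    static final Pattern HEX = Pattern.compile("[\\x{1F601}-\\x{1F64F}]");

    /** Builds a one-code-point String, using two chars when the code point is outside the BMP. */
    static String cp(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String inside = cp(0x1F610);  // within U+1F601..U+1F64F
        String outside = cp(0x1F650); // just past the upper bound

        System.out.println(SURROGATES.matcher(inside).matches());  // true
        System.out.println(HEX.matcher(inside).matches());         // true
        System.out.println(SURROGATES.matcher(outside).matches()); // false
        System.out.println(HEX.matcher(outside).matches());        // false
    }
}
```

Note that matches() succeeds on a String of length 2 here: the class consumes one code point, not one char.
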
159

answered Sep 23 '22 00:09

fge
