C# Regular Expressions with \Uxxxxxxxx characters in the pattern

Question

Regex.IsMatch( "foo", "[\U00010000-\U0010FFFF]" )

Throws: System.ArgumentException: parsing "[-]" - [x-y] range in reverse order.

Looking at the hex values for \U00010000 and \U0010FFFF, I get 0xd800 0xdc00 for the first character and 0xdbff 0xdfff for the second.

So I guess I really have one problem: why are the Unicode characters formed with \U split into two chars in the string?

Jon Skeet · Accepted Answer

They're surrogate pairs. Look at the values: they're over 65535. A char is only a 16-bit value. How would you express 65536 in only 16 bits?
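To make this concrete, here is a minimal sketch that prints the UTF-16 code units behind the two escapes from the question. The class and method names (SurrogateDemo, CodeUnits) are made up for illustration. It also suggests why the exception mentions a reversed range: the character class is seen as the four chars D800, DC00, DBFF, DFFF with a hyphen in the middle, so the range that gets parsed is DC00-DBFF, which runs backwards.

```csharp
using System;
using System.Linq;

static class SurrogateDemo
{
    // Hypothetical helper: formats each UTF-16 code unit of s as four hex digits.
    public static string CodeUnits(string s) =>
        string.Join(" ", s.Select(c => ((int)c).ToString("X4")));

    static void Main()
    {
        string first = "\U00010000"; // lowest code point outside the BMP
        string last = "\U0010FFFF";  // highest Unicode code point

        Console.WriteLine(first.Length);                      // 2
        Console.WriteLine(CodeUnits(first));                  // D800 DC00
        Console.WriteLine(CodeUnits(last));                   // DBFF DFFF
        Console.WriteLine(char.IsSurrogatePair(first, 0));    // True
        Console.WriteLine(char.ConvertToUtf32(first, 0).ToString("X")); // 10000
    }
}
```

char.ConvertToUtf32 and char.ConvertFromUtf32 are the standard way to move between a surrogate pair and its code point.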

Unfortunately, it's not clear from the documentation how (or whether) the regular expression engine in .NET copes with characters which aren't in the Basic Multilingual Plane. (The \uxxxx pattern in the regular expression documentation only covers 0-65535, just like \uxxxx as a C# escape sequence.)

Is your real regular expression bigger, or are you actually just trying to see if there are any non-BMP characters in there?

Andriy K · Answer

To work around this with the .NET regex engine, I use the following trick: "[\U00010000-\U0010FFFF]" is replaced with [\uD800-\uDBFF][\uDC00-\uDFFF]. The idea is that, because .NET regexes operate on UTF-16 code units rather than code points, we give the engine the surrogate ranges as ordinary characters. Narrower ranges can also be expressed by handling the edges separately, e.g. [\U00011DEF-\U00013E07] is the same as (?:\uD807[\uDDEF-\uDFFF])|(?:[\uD808-\uD80E][\uDC00-\uDFFF])|(?:\uD80F[\uDC00-\uDE07])

It's harder to read and maintain, and it's not as flexible, but it still works as a workaround.
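As a rough sketch of how the two patterns above could be used (the class, constant, and method names here are hypothetical, and the sample inputs are my own):

```csharp
using System;
using System.Text.RegularExpressions;

static class AstralRegexDemo
{
    // Verbatim strings, so the regex engine (not the C# compiler) interprets \uXXXX.
    // Any high surrogate followed by any low surrogate: U+10000 to U+10FFFF.
    const string AnyAstral = @"[\uD800-\uDBFF][\uDC00-\uDFFF]";

    // The narrower range U+11DEF to U+13E07, split at its edges.
    const string NarrowRange =
        @"(?:\uD807[\uDDEF-\uDFFF])|(?:[\uD808-\uD80E][\uDC00-\uDFFF])|(?:\uD80F[\uDC00-\uDE07])";

    public static bool ContainsAstral(string input) => Regex.IsMatch(input, AnyAstral);

    public static bool ContainsNarrow(string input) => Regex.IsMatch(input, NarrowRange);

    static void Main()
    {
        Console.WriteLine(ContainsAstral("foo"));            // False
        Console.WriteLine(ContainsAstral("foo\U0001F600"));  // True
        Console.WriteLine(ContainsNarrow("\U00012000"));     // True  (D808 DC00)
        Console.WriteLine(ContainsNarrow("\U0001F600"));     // False (D83D DE00)
    }
}
```

Note that AnyAstral only matches well-formed pairs; a lone surrogate in the input will not match it.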

Tags:

c#

regex

unicode

astral-plane

Ben McNiel
