I am trying to find out why this regex in Java, ([\ud800-\udbff\udc00-\udfff]), used in replaceAll(regexp, ""), also removes the hyphen-minus character along with the surrogate characters. The Unicode code point for hyphen-minus is \u002d, so it does not seem to be inside any of those ranges.
I could easily suppress this behaviour by adding &&[^\u002d], resulting in ([\ud800-\udbff\udc00-\udfff&&[^\u002d]]). But since I do not know why \u002d is removed, I suspect other characters may be removed without my noticing.
Example:
String text = "A\u002dB";
System.out.println(text);
String regex = "([\ud800-\udbff\udc00-\udfff])";
System.out.println(text.replaceAll(regex, "X"));
prints:
A-B
AXB
Matching characters in astral planes (code points U+10000 to U+10FFFF) has been an under-documented feature in Java regex.
This answer mainly deals with Oracle's implementation (reference implementation, which is also used in OpenJDK) for Java version 6 and above.
Please test the code yourself if you happen to use GNU Classpath or Android, since they use their own implementation.
Assuming that you are running your regex on Oracle's implementation, your regex
"([\ud800-\udbff\udc00-\udfff])"
is compiled as such:
StartS. Start unanchored match (minLength=1)
java.util.regex.Pattern$GroupHead
Pattern.union. A ∪ B:
Pattern.union. A ∪ B:
Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
BitClass. Match any of these 1 character(s):
[U+002D]
SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
java.util.regex.Pattern$GroupTail
java.util.regex.Pattern$LastNode
Node. Accept match
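The character class is parsed as three items: \ud800-\udbff\udc00, then -, then \udfff. Since \udbff\udc00 forms a valid surrogate pair, it represents the code point U+10FC00. The first item is therefore the range U+D800 to U+10FC00, the hyphen that follows a completed range is taken as a literal character, and \udfff is a single character.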
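To see the pairing for yourself, here is a minimal sketch (the class and method names are made up for illustration). It only uses Character.toCodePoint and String.replaceAll, and assumes an Oracle/OpenJDK runtime; other implementations may behave differently.

```java
class SurrogateParseDemo {
    // Regex from the question, written with string-literal escapes.
    static final String REGEX = "([\ud800-\udbff\udc00-\udfff])";

    // Code point represented by the adjacent high/low surrogates in the regex.
    static int pairedCodePoint() {
        return Character.toCodePoint('\udbff', '\udc00');
    }

    // Same call as in the question, with "X" as the replacement.
    static String replaceWithX(String text) {
        return text.replaceAll(REGEX, "X");
    }

    public static void main(String[] args) {
        // The pair is read as one supplementary code point.
        System.out.printf("U+%X%n", pairedCodePoint()); // prints U+10FC00
        // The question reports AXB here, because the second hyphen is a literal.
        System.out.println(replaceWithX("A-B"));
    }
}
```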
There is no point in writing:
"[\ud800-\udbff][\udc00-\udfff]"
Oracle's implementation matches by code point, and a valid surrogate pair is converted to a single code point before matching. The regex above therefore can't match anything: it looks for two consecutive lone surrogates that would form a valid pair, and such a sequence is always read as one code point.
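A quick way to check this claim is the sketch below. The class name and the sample character (U+1F600) are my own choices, not from the question, and the result is what I would expect on Oracle/OpenJDK only.

```java
import java.util.regex.Pattern;

class LoneSurrogateDemo {
    // Two-class pattern: a high surrogate followed by a low surrogate.
    static final Pattern TWO_CLASSES =
            Pattern.compile("[\ud800-\udbff][\udc00-\udfff]");

    // U+1F600 stored as one valid surrogate pair (sample input, chosen for illustration).
    static final String SAMPLE = "\ud83d\ude00";

    static boolean twoClassesFind(String s) {
        return TWO_CLASSES.matcher(s).find();
    }

    public static void main(String[] args) {
        // Two chars in the string, but only one code point.
        System.out.println(SAMPLE.length());
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));
        // The matcher sees the single code point, so neither class matches it.
        System.out.println(twoClassesFind(SAMPLE));
    }
}
```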
If you want to match and remove all code points above U+FFFF in the astral planes (formed by a valid surrogate pair), plus the lone surrogates (which can't form a valid surrogate pair), you should write:
input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");
This solution has been tested to work in Java 6 and 7 (Oracle implementation).
The regex above compiles to:
StartS. Start unanchored match (minLength=1)
Pattern.union. A ∪ B:
Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
java.util.regex.Pattern$LastNode
Node. Accept match
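As a small usage sketch of the replacement above (the class name, method name, and sample string are hypothetical), this strips one astral character and one lone surrogate while leaving the hyphen alone. It assumes the Oracle/OpenJDK behaviour described in this answer.

```java
class AstralStripper {
    // Regex from this answer: U+10000..U+10FFFF plus the lone surrogates.
    static final String REGEX = "[\ud800\udc00-\udbff\udfff\ud800-\udfff]";

    static String strip(String input) {
        return input.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        // Sample: "A-B", then U+1F600 as a surrogate pair, then "C", then a lone high surrogate.
        String sample = "A-B\ud83d\ude00C\ud800";
        // Expected to print A-BC: the pair and the lone surrogate go, the hyphen stays.
        System.out.println(strip(sample));
    }
}
```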
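Note that I am specifying the characters with Java string-literal Unicode escape sequences (a single backslash in the source), not with the escape sequences of the regex syntax (a doubled backslash in the source). The regex-syntax version looks like this: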
// Only works in Java 7
input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "")
Java 6 doesn't recognize surrogate pairs when they are specified with regex syntax, so it reads \\ud800 as a single character and then tries to compile the range \\udc00-\\udbff, which fails because the start of the range is greater than its end. We are lucky that it throws an exception for this input; otherwise, the error would go undetected. Java 7 parses this regex correctly and compiles it to the same structure as above.
Since Java 7, the syntax \x{h..h} is available for specifying characters beyond the BMP (Basic Multilingual Plane), and it is the recommended way to specify characters in the astral planes.
input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");
This regex also compiles to the same structure as above.
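If you want to convince yourself that the two spellings behave the same, a rough comparison could look like this. The class name, helper names, and sample inputs are made up for illustration, and it needs Java 7 or later because of the \x{h..h} syntax.

```java
import java.util.regex.Pattern;

class HexSyntaxDemo {
    // String-literal form: the surrogate pairs are already in the pattern string.
    static final Pattern LITERAL =
            Pattern.compile("[\ud800\udc00-\udbff\udfff\ud800-\udfff]");
    // Regex hex form for the astral range, string-literal form for the lone surrogates.
    static final Pattern HEX =
            Pattern.compile("[\\x{10000}-\\x{10ffff}\ud800-\udfff]");

    static String stripHex(String input) {
        return HEX.matcher(input).replaceAll("");
    }

    // True when both patterns remove exactly the same characters from the input.
    static boolean sameResult(String input) {
        return LITERAL.matcher(input).replaceAll("").equals(stripHex(input));
    }

    public static void main(String[] args) {
        String[] samples = {
            "A-B",                    // BMP only, nothing to remove
            "A-B\ud83d\ude00C",       // one astral character (U+1F600)
            "\udc00x\ud800"           // lone low and lone high surrogate
        };
        for (String s : samples) {
            System.out.println(stripHex(s) + " / same as literal form: " + sameResult(s));
        }
    }
}
```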
If you make the range [\ud800-\udfff] or [\ud800-\udbff\udbff-\udfff], it will leave the hyphen untouched. Seems like a bug to me.
Note there is no reason for the double range: in your example, \udc00 is just the next code point after \udbff, so you could skip it. If you make the two ranges overlap by one or more code points, it works again, but you could just as well leave the second range out (see my first example above).