I'd like to remove 4 byte UTF8 characters which starts with \xF0 (the char with the ASCII code 0xF0) from a string and tried
sText = Regex.Replace (sText, "\xF0...", "");
This doesn't work. Using two backslashes did not work neither.
The exact input is the content of https://de.wikipedia.org/w/index.php?title=Spezial:Exportieren&action=submit&pages=Unicode The 4 byte character ist the one after the text "[[Violinschlüssel]] ", in hex notation: .. 0x65 0x6c 0x5d 0x5d 0x20 0xf0 0x9d 0x84 0x9e 0x20 .. The expected output is 0x65 0x6c 0x5d 0x5d 0x20 0x20 ..
What's wrong?
Such characters will be surrogate pairs in .NET which uses UTF-16. Each of them will be two UTF-16 code units, that is two char
values.
To just remove them, you can do (using System.Linq;
):
sText = string.Concat(sText.Where(x => !char.IsSurrogate(x)));
(uses an overload of Concat
introduced in .NET 4.0 (Visual Studio 2010)).
Late addition: It may give better performance to use:
sText = new string(sText.Where(x => !char.IsSurrogate(x)).ToArray());
even if it looks worse. (Works in .NET 3.5 (Visual Studio 2008).)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With