
remove 4 byte UTF8 characters

Tags: c#, regex, utf-8

I'd like to remove 4-byte UTF-8 characters that start with \xF0 (the byte with the value 0xF0) from a string, and tried

sText = Regex.Replace (sText, "\xF0...", "");

This doesn't work. Using two backslashes did not work either.

The exact input is the content of https://de.wikipedia.org/w/index.php?title=Spezial:Exportieren&action=submit&pages=Unicode. The 4-byte character is the one after the text "[[Violinschlüssel]] ", in hex notation: .. 0x65 0x6c 0x5d 0x5d 0x20 0xf0 0x9d 0x84 0x9e 0x20 .. The expected output is .. 0x65 0x6c 0x5d 0x5d 0x20 0x20 ..

What's wrong?

André asked Mar 11 '23 13:03



1 Answer

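Such characters will be surrogate pairs in .NET, which uses UTF-16 for strings. Each of them is stored as two UTF-16 code units, that is, two char values. Once the UTF-8 input has been decoded into a string, the byte 0xF0 no longer exists in it, so the pattern "\xF0..." can only match the character U+00F0 (ð) followed by three other characters, never the 4-byte sequence.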
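To make this concrete, here is a minimal sketch that decodes the bytes quoted in the question and inspects the result. The class and method names (SurrogateDemo, Decode, ContainsCharF0) are made up for illustration; only Encoding.UTF8.GetString, char.IsHighSurrogate and char.IsLowSurrogate are framework calls.

```csharp
using System;
using System.Text;

public static class SurrogateDemo
{
    // Hypothetical helper: decodes the byte sequence from the question,
    // "el]] " + F0 9D 84 9E (U+1D11E, the violin clef) + " ".
    public static string Decode()
    {
        byte[] utf8 = { 0x65, 0x6C, 0x5D, 0x5D, 0x20, 0xF0, 0x9D, 0x84, 0x9E, 0x20 };
        return Encoding.UTF8.GetString(utf8);
    }

    // Hypothetical helper: true if the decoded string contains the char U+00F0,
    // which is what the pattern "\xF0..." would need in order to match.
    public static bool ContainsCharF0(string text)
    {
        return text.IndexOf('\u00F0') >= 0;
    }
}
```

With this, SurrogateDemo.Decode() yields a string of length 8 (ten bytes in, eight char values out): index 5 holds the high surrogate U+D834, index 6 the low surrogate U+DD1E, and ContainsCharF0 returns false.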

To just remove them, you can do the following (requires using System.Linq;):

sText = string.Concat(sText.Where(x => !char.IsSurrogate(x)));

(This uses an overload of Concat introduced in .NET 4.0 (Visual Studio 2010).)


Late addition: It may give better performance to use:

sText = new string(sText.Where(x => !char.IsSurrogate(x)).ToArray());

even if it looks worse. (Works in .NET 3.5 (Visual Studio 2008).)
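Since the question asked for a regex, here is a hedged alternative sketch, not part of the answer above: it matches a high surrogate immediately followed by a low surrogate, so only well-formed pairs are removed and any stray lone surrogate is left untouched. The names AstralStripper and RemoveAstral are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class AstralStripper
{
    // A high surrogate (U+D800..U+DBFF) followed by a low surrogate
    // (U+DC00..U+DFFF) is one code point above U+FFFF, i.e. one
    // character that takes 4 bytes in UTF-8.
    private static readonly Regex SurrogatePair =
        new Regex(@"[\uD800-\uDBFF][\uDC00-\uDFFF]");

    // Hypothetical helper: removes every well-formed surrogate pair.
    public static string RemoveAstral(string text)
    {
        return SurrogatePair.Replace(text, "");
    }
}
```

Usage would be sText = AstralStripper.RemoveAstral(sText);. Like the LINQ versions, this removes every character outside the Basic Multilingual Plane, including those whose UTF-8 form starts with 0xF1 to 0xF4. If only the 0xF0 group (U+10000 to U+3FFFF) should go, narrowing the first class to [\uD800-\uD8BF] should do it.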

Jeppe Stig Nielsen answered Mar 23 '23 01:03
