
Simplest way to get rid of a zero-width space in a C# string

I am parsing emails using a regex in a C# VSTO project. Once in a while, the regex does not seem to work (although if I paste the text and regex into RegexBuddy, the regex correctly matches the text). If I look at the email in Gmail, I see

=E2=80=8B

at the beginning and end of some lines (which I understand is the UTF-8 encoding of the zero-width space); this appears to be what is messing up the regex. This seems to be the only such sequence showing up.

What is the easiest way to get rid of this exact sequence? I cannot do the obvious

MailItem.Body.Replace("=E2=80=8B", "")

because those characters don't show up in the C# string.

I also tried

byte[] bytes = Encoding.Default.GetBytes(MailItem.TextBody);
string myString = Encoding.UTF8.GetString(bytes);

But the zero-width spaces just show up as ?. I suppose I could go through the bytes array and remove the bytes comprising the zero width space, but I don't know what the bytes would look like (it does not seem as simple as converting E2 80 8B to decimal and searching for that).
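For reference, the bytes in question can be inspected directly. The following is a minimal sketch, not part of the original post: the class and method names (`ZeroWidthBytes`, `Utf8Hex`, `ThroughAscii`) are hypothetical, and `Encoding.ASCII` is used as a stand-in for a non-Unicode encoding because what `Encoding.Default` resolves to depends on the runtime and system settings. It suggests why the round trip above produces `?`: the character is lost when encoding to bytes, before the UTF-8 decoding step ever runs.

```csharp
using System;
using System.Text;

public static class ZeroWidthBytes
{
    // Hex dump of the UTF-8 bytes of a string, e.g. "E2-80-8B".
    public static string Utf8Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    // Round-trips a string through ASCII, which cannot represent U+200B;
    // unencodable characters are replaced with '?'.
    public static string ThroughAscii(string s)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
    }

    public static void Main()
    {
        string zws = "\u200B";                 // one UTF-16 code unit in a C# string
        Console.WriteLine(zws.Length);         // 1
        Console.WriteLine(Utf8Hex(zws));       // E2-80-8B
        Console.WriteLine(ThroughAscii(zws));  // ?
    }
}
```

So `=E2=80=8B` in the raw message is the quoted-printable form of three UTF-8 bytes that decode to a single `char` in the C# string.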

Jimmy asked Jul 24 '14 19:07


2 Answers

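As strings in C# are stored as UTF-16 (not UTF-8), the zero-width space is the single character U+200B, so the following might do the trick:

// Replace returns a new string, so the result has to be kept.
string body = MailItem.Body.Replace("\u200B", "");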
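To illustrate the effect on matching, here is a small sketch under stated assumptions: the pattern `^Order: (\d+)$` and the sample line are made up for the example (the asker's actual regex is not shown), and `ReplaceDemo` is a hypothetical name. It uses the same `Replace` call as above on a line that has a zero-width space at both ends.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Hypothetical pattern standing in for the asker's regex.
    private static readonly Regex OrderLine = new Regex(@"^Order: (\d+)$");

    public static bool Matches(string line)
    {
        return OrderLine.IsMatch(line);
    }

    public static string StripZeroWidthSpace(string text)
    {
        // Replace returns a new string; the original is left unchanged.
        return text.Replace("\u200B", "");
    }

    public static void Main()
    {
        // Invisible U+200B at the start and end, as described in the question.
        string line = "\u200BOrder: 12345\u200B";
        Console.WriteLine(Matches(line));                       // False
        Console.WriteLine(Matches(StripZeroWidthSpace(line)));  // True
    }
}
```

The anchors `^` and `$` fail on the untouched line because the first and last characters are the invisible U+200B rather than the visible text.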
Robert S. answered Oct 15 '22 21:10


As all the Regex.Replace() methods operate on strings, that's not going to be useful here.

The string indexer returns a char, so for want of a better solution (and if you can't predict where these characters are going to be), as long-winded as it seems, you may be best off with:

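        // Requires: using System.Text;
        string a = MailItem.Body;  // read the body once and index into the copy
        StringBuilder newText = new StringBuilder(a.Length);

        for (int i = 0; i < a.Length; i++)
        {
            // Copy every character except the zero-width space (U+200B).
            if (a[i] != '\u200b')
            {
                newText.Append(a[i]);
            }
        }

        string cleaned = newText.ToString();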
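The loop above can be wrapped in a reusable helper. This is only a sketch: `InvisibleChars.Strip` is a hypothetical name, and filtering U+FEFF alongside U+200B is an assumption for illustration, since the question reports seeing only U+200B.

```csharp
using System;
using System.Text;

public static class InvisibleChars
{
    // Returns a copy of input without any of the characters listed in unwanted.
    public static string Strip(string input, params char[] unwanted)
    {
        var result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Keep the character only if it is not in the unwanted list.
            if (Array.IndexOf(unwanted, c) < 0)
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }

    public static void Main()
    {
        string body = "\u200Bfoo\uFEFFbar\u200B";
        string cleaned = Strip(body, '\u200B', '\uFEFF');
        Console.WriteLine(cleaned);         // foobar
        Console.WriteLine(cleaned.Length);  // 6
    }
}
```

Passing the unwanted characters as parameters keeps the helper usable if other invisible characters turn up later.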
barrick answered Oct 15 '22 23:10