
Decoding a special character in C#

Tags: html, c#

I am wondering how I could decode the character sequence â€¢ so that it displays as the bullet character • in HTML?

I have tried using System.Web.HttpUtility.HtmlDecode, but no luck yet.

user2388013 asked May 16 '13 01:05


1 Answer

The issue here is not HTML decoding. Rather, the text was encoded in one character set (UTF-8), its bytes were then read as if they were in a second (e.g., windows-1252), and the result was encoded again as UTF-8.

In UTF-8, • is encoded as E2 80 A2. When this byte sequence is read using the windows-1252 encoding, E2 80 A2 decodes as â€¢. (Saved again as UTF-8, â€¢ Test becomes C3 A2 E2 82 AC C2 A2 20 54 65 73 74.)
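The byte values above can be reproduced with a short program. This is only a sketch: the class and method names (MojibakeDemo, Hex, Misread) are made up for illustration, and the call to Encoding.RegisterProvider assumes .NET Core 3.0 or later / .NET 5+, where windows-1252 is not available until the code pages provider is registered.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Formats bytes as space-separated hex, e.g. "E2 80 A2".
    public static string Hex(byte[] bytes) =>
        BitConverter.ToString(bytes).Replace("-", " ");

    // Encodes text as UTF-8, then (incorrectly) decodes those bytes as windows-1252.
    public static string Misread(string text)
    {
        // Assumption: on .NET Core / .NET 5+, code page encodings must be
        // registered before "windows-1252" can be looked up by name.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding win1252 = Encoding.GetEncoding("windows-1252");
        return win1252.GetString(Encoding.UTF8.GetBytes(text));
    }

    public static void Main()
    {
        // The bullet is written as an escape so the source file's own
        // encoding cannot interfere with the demonstration.
        string bullet = "\u2022";
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(bullet)));
        // prints: E2 80 A2

        string garbled = Misread(bullet); // U+00E2 U+20AC U+00A2
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(garbled + " Test")));
        // prints: C3 A2 E2 82 AC C2 A2 20 54 65 73 74
    }
}
```

Writing the characters as \u escapes keeps the example independent of how the .cs file itself is saved, which matters when the topic is encoding errors.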

If the file is a windows-1252-encoded file, it can simply be read with the correct encoding (e.g., passed as an argument to a StreamReader constructor):

new StreamReader(..., Encoding.GetEncoding("windows-1252"));

If the file was saved with an incorrect encoding, the encoding can be reversed in some cases. For instance, for the character sequence in your question, you can write:

string s = "â€¢"; // the character sequence that is not properly encoded
var b = Encoding.GetEncoding("windows-1252").GetBytes(s); // b = E2 80 A2
string c = Encoding.UTF8.GetString(b);  // c = •

Note that many common punctuation characters are in the range U+2000 to U+2044 (Reference), such as "smart quotes", bullets, and dashes. Characters from U+2000 to U+203F are encoded in UTF-8 as E2 80 followed by one more byte, and E2 80 reads as â€ in windows-1252. Thus, the sequence â€?, where ? is any character, will typically signify this type of encoding error. This allows this type of error to be corrected more broadly:

static string CorrectText(string input)
{
    var winencoding = Encoding.GetEncoding("windows-1252");
    return Regex.Replace(input, "â€.",
        m => Encoding.UTF8.GetString(winencoding.GetBytes(m.Value)));
}

Calling this function with text malformed in this way will correct some (but not all) errors. For instance, CorrectText("â€¢Testâ€“orâ€œ") will return the intended •Test–or“.
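Because only some errors are reversible, it can help to know whether a given string survives the round trip at all. The following is a hedged sketch, not part of the approach above: it reverses the whole string rather than individual matches, and uses strict (exception-throwing) encodings so that a failed conversion is reported instead of silently producing replacement characters. The names MojibakeRepair and TryRepair are hypothetical, and the provider registration assumes .NET Core 3.0 or later / .NET 5+.

```csharp
using System;
using System.Text;

static class MojibakeRepair
{
    // UTF-8 decoder that throws on invalid byte sequences (second argument).
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // windows-1252 that throws instead of substituting '?' for unmappable text.
    private static Encoding CreateStrictWin1252()
    {
        // Assumption: required on .NET Core / .NET 5+ before the lookup below.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(
            "windows-1252",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }

    // Returns true and the repaired text if the input can be re-encoded as
    // windows-1252 and the resulting bytes form valid UTF-8; otherwise
    // returns false and leaves the text unchanged.
    public static bool TryRepair(string input, out string repaired)
    {
        repaired = input;
        try
        {
            byte[] bytes = CreateStrictWin1252().GetBytes(input);
            repaired = StrictUtf8.GetString(bytes);
            return true;
        }
        catch (EncoderFallbackException) { return false; } // not representable in windows-1252
        catch (DecoderFallbackException) { return false; } // bytes are not valid UTF-8
    }

    public static void Main()
    {
        // "\u00E2\u20AC\u00A2" is the garbled bullet sequence, written as escapes.
        bool ok = TryRepair("\u00E2\u20AC\u00A2Test", out string fixedText);
        Console.WriteLine(ok && fixedText == "\u2022Test"); // prints: True

        // Correctly encoded text such as "café" is rejected, not altered.
        Console.WriteLine(TryRepair("caf\u00E9", out _));   // prints: False
    }
}
```

Note that plain ASCII input passes through unchanged and reports success, so a true result does not by itself prove that the input was garbled.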

drf answered Sep 30 '22 15:09