I am wondering how I could decode the special character sequence • to HTML?
I have tried using System.Web.HttpUtility.HtmlDecode, but no luck yet.
The issue here is not HTML decoding. The text was encoded in one character set (UTF-8) and then decoded using a different one (e.g., windows-1252).
In UTF-8, • is encoded as E2 80 A2. When this byte sequence is read using windows-1252 encoding, E2 80 A2 decodes as •. (Saved again as UTF-8, "• Test" becomes C3 A2 E2 82 AC C2 A2 20 54 65 73 74.)
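The round trip described above can be reproduced in a few lines. This is a minimal sketch, not part of the original answer: the class and method names (MojibakeDemo, Misread, Hex) are made up for illustration, and the RegisterProvider call assumes .NET Core 3.0 or later, where code-page encodings such as windows-1252 are opt-in.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
static class MojibakeDemo
{
    // Formats bytes as space-separated hex, e.g. "E2 80 A2".
    public static string Hex(byte[] bytes) =>
        BitConverter.ToString(bytes).Replace("-", " ");

    // Encodes text as UTF-8, then (wrongly) decodes those bytes as windows-1252.
    public static string Misread(string text)
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+, where this provider
        // must be registered before windows-1252 can be looked up.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        var win1252 = Encoding.GetEncoding("windows-1252");
        return win1252.GetString(Encoding.UTF8.GetBytes(text));
    }

    public static void Main()
    {
        string original = "\u2022 Test";            // bullet, space, "Test"
        string misread = Misread(original);         // starts with U+00E2 U+20AC U+00A2

        // E2 80 A2 20 54 65 73 74
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(original)));
        // C3 A2 E2 82 AC C2 A2 20 54 65 73 74
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(misread)));
    }
}
```

Printing the hex rather than the strings avoids depending on the console's own output encoding.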
If the file is actually windows-1252-encoded, it can simply be read with the correct encoding (e.g., by passing the encoding as an argument to a StreamReader constructor):
new StreamReader(..., Encoding.GetEncoding("windows-1252"));
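A fuller version of that call might look like the following. It is only a sketch: LegacyFile and ReadWindows1252 are hypothetical names, the path comes from the command line, and the provider registration again assumes .NET Core 3.0 or later (on .NET Framework, windows-1252 should be available without it).

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class, for illustration only.
static class LegacyFile
{
    // Reads the whole file, interpreting its bytes as windows-1252.
    public static string ReadWindows1252(string path)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where code pages are opt-in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        var win1252 = Encoding.GetEncoding("windows-1252");

        using (var reader = new StreamReader(path, win1252))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length == 1)
        {
            Console.WriteLine(ReadWindows1252(args[0]));
        }
    }
}
```

In windows-1252 the bullet is the single byte 95, so a file containing that byte should come back as • rather than as a replacement character.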
If the file was saved with an incorrect encoding, the encoding can be reversed in some cases. For instance, for the string sequence in your question, you can write:
string s = "•"; // the string sequence that is not properly encoded
var b = Encoding.GetEncoding("windows-1252").GetBytes(s); // b = `E2 80 A2`
string c = Encoding.UTF8.GetString(b); // c = `•`
Note that many common typographic characters, such as "smart quotes", bullets, and dashes, are in the range U+2000 to U+2044 (Reference). Most of these share the UTF-8 prefix E2 80, which reads as â€ in windows-1252. Thus, the sequence â€?, where ? is any character, will typically signify this type of encoding error. This allows this type of error to be corrected more broadly:
static string CorrectText(string input)
{
    var winencoding = Encoding.GetEncoding("windows-1252");
    return Regex.Replace(input, "â€.",
        m => Encoding.UTF8.GetString(winencoding.GetBytes(m.Value)));
}
Calling this function with text malformed in this way will correct some (but not all) errors. For instance, CorrectText("•Test–or“") will return the intended •Test–or“.
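Where the whole string is suspected of being misread, a stricter variant is to attempt the round trip on the entire input and keep the result only if every step succeeds. The sketch below is an assumption-laden alternative, not the method above: MojibakeRepair and TryRepair are hypothetical names, and it relies on exception fallbacks so that unmappable characters or invalid UTF-8 cause the input to be returned unchanged.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
static class MojibakeRepair
{
    // Re-encodes input as windows-1252 and decodes the bytes as UTF-8.
    // Returns the input unchanged if either step fails.
    public static string TryRepair(string input)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where code pages are opt-in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Strict encodings: throw instead of substituting '?' or U+FFFD.
        var strictWin1252 = Encoding.GetEncoding("windows-1252",
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

        try
        {
            byte[] bytes = strictWin1252.GetBytes(input);
            return strictUtf8.GetString(bytes);
        }
        catch (EncoderFallbackException)
        {
            return input; // contains characters outside windows-1252
        }
        catch (DecoderFallbackException)
        {
            return input; // bytes are not valid UTF-8, so probably not this error
        }
    }

    public static void Main()
    {
        // U+00E2 U+20AC U+00A2 is the misread form of the bullet U+2022.
        Console.WriteLine(TryRepair("\u00E2\u20AC\u00A2Test") == "\u2022Test"); // True
    }
}
```

The trade-off is that this works on the whole string at once: text that mixes correctly decoded non-ASCII characters with misread ones is left untouched, whereas the regex-based CorrectText repairs each matching sequence individually.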