Why are HtmlEncode and HtmlDecode not isomorphic in .NET?

I find this surprising, and rather annoying.

Example:

Decode(&rdquo;) => ”
Encode(”)       => ”

Relevant classes:

.NET 4:   System.Net.WebUtility
.NET 3.5: System.Web.HttpUtility

I can understand that a web page can be Unicode, but in my case the output cannot be UTF-8.

Is there something (perhaps a HtmlWriter class) that could do this without me having to re-invent the wheel?

Alternative solution:

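// Requires: using System.Text;
string HtmlUnicodeEncode(string input)
{
    var sb = new StringBuilder();

    for (int i = 0; i < input.Length; i++)
    {
        if (char.IsSurrogatePair(input, i))
        {
            // Emit one reference for the whole code point, not one per surrogate.
            sb.AppendFormat("&#x{0:X};", char.ConvertToUtf32(input, i));
            i++; // skip the low surrogate
        }
        else if (input[i] > 127)
        {
            // Non-ASCII character in the Basic Multilingual Plane.
            sb.AppendFormat("&#x{0:X4};", (int)input[i]);
        }
        else
        {
            sb.Append(input[i]);
        }
    }

    return sb.ToString();
}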
leppie asked Jan 13 '23 15:01


1 Answer

It is impossible to produce an isomorphic HTML codec pair. Consider:

HtmlDecode("&rdquo;&#8221;&#x201D;&#x201d;”") -> ”””””

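How do you get back from ””””” to the original string? Decoding maps several different spellings to the same character, so no encoder can invert it for every input.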

HtmlEncode has to pick one encoding for ”, and it goes for the literal ” as the shortest, most readable alternative. As long as you've got working Unicode, that's almost certainly the best choice.
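To make the many-to-one point concrete, here is a minimal sketch that pushes several spellings of U+201D through System.Net.WebUtility and checks whether encoding the decoded value reproduces the input. The class name RoundTripDemo and the helper RoundTrips are hypothetical names made up for this illustration. Assuming HtmlEncode leaves ” as a literal, as described above, only the literal spelling should survive the round trip.

```csharp
using System;
using System.Net;

static class RoundTripDemo
{
    // Hypothetical helper: true when encode(decode(html)) reproduces html exactly.
    public static bool RoundTrips(string html)
    {
        return WebUtility.HtmlEncode(WebUtility.HtmlDecode(html)) == html;
    }

    public static void Main()
    {
        // Four spellings of U+201D: named entity, decimal reference,
        // hexadecimal reference, and the literal character.
        string[] spellings = { "&rdquo;", "&#8221;", "&#x201D;", "\u201D" };

        foreach (string s in spellings)
        {
            Console.WriteLine("{0}: round-trips = {1}", s, RoundTrips(s));
        }
    }
}
```

Whichever spelling the encoder picks, the other spellings cannot round-trip, which is why the pair cannot be isomorphic.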

If you don't, that's another argument... the advantage of &rdquo; is that it's slightly more readable than &#x201D;, but it only works in HTML (not XML) and you still have to fall back to character references for all the Unicode characters that don't have built-in entity names, so it's less consistent. For a character-reference encoder, create an XmlTextWriter using the ASCII encoding and call WriteString on it.
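A sketch of that idea follows, with one substitution that is mine rather than the answer's: it builds the writer through XmlWriter.Create with XmlWriterSettings instead of constructing XmlTextWriter directly. It assumes that a writer configured with the ASCII encoding falls back to numeric character references for characters the encoding cannot represent, so verify the output on your framework version before relying on it. CharRefEncoder and Encode are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class CharRefEncoder
{
    // Hypothetical helper: escapes markup characters and, under the assumption
    // stated above, writes non-ASCII characters as numeric character references.
    public static string Encode(string input)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.ASCII,                    // target cannot represent non-ASCII
            ConformanceLevel = ConformanceLevel.Fragment, // allow bare text with no root element
            OmitXmlDeclaration = true
        };

        using (var buffer = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(buffer, settings))
            {
                writer.WriteString(input);
            } // disposing the writer flushes it into the buffer

            return Encoding.ASCII.GetString(buffer.ToArray());
        }
    }

    public static void Main()
    {
        Console.WriteLine(Encode("say \u201Dhi\u201D & <bye>"));
    }
}
```

Because the writer follows XML rules, it also escapes &, < and > in the text, and it rejects characters that are not legal in XML, which may or may not be what you want for HTML output.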

bobince answered Jan 31 '23 11:01