Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# HtmlEncode - ISO-8859-1 Entity Names vs Numbers

According to the following table for the ISO-8859-1 standard, there seems to be an entity name and an entity number associated with each reserved HTML character.

So for example, for the character é :

Entity Name : é

Entity Number : é

Similarly, for the character > :

Entity Name : >

Entity Number : >

For a given string, the HttpUtility.HtmlEncode returns an HTML encoded String, but I can't figure out how it works. Here is what I mean :

Console.WriteLine(HtmlEncode("é>"));
//Outputs é>

It seems to be using the entity number for the é character but the entity name for the > character.

So does the HtmlEncode method really work with the ISO-8859-1 standard? If it does, is there a reason why it sometimes uses the entity name and other times the entity number? More importantly, can I force it to give me the entity name reliably?

EDIT : Thanks for the answers guys. I cannot decode the string before I perform the search though. Without getting into too many details, the text is stored in a SharePoint List and the "search" is done by SharePoint itself (using a CAML query). So basically, I can't.

I'm trying to think of a way to convert the entity numbers into names, is there a function in .NET that does that? Or any other idea?

like image 987
Hugo Migneron Avatar asked Jan 31 '11 17:01

Hugo Migneron


People also ask

What C is used for?

C programming language is a machine-independent programming language that is mainly used to create many types of applications and operating systems such as Windows, and other complicated programs such as the Oracle database, Git, Python interpreter, and games and is considered a programming foundation in the process of ...

What is the full name of C?

In the real sense it has no meaning or full form. It was developed by Dennis Ritchie and Ken Thompson at AT&T bell Lab. First, they used to call it as B language then later they made some improvement into it and renamed it as C and its superscript as C++ which was invented by Dr.

What is C in C language?

What is C? C is a general-purpose programming language created by Dennis Ritchie at the Bell Laboratories in 1972. It is a very popular language, despite being old. C is strongly associated with UNIX, as it was developed to write the UNIX operating system.

Is C language easy?

C is a general-purpose language that most programmers learn before moving on to more complex languages. From Unix and Windows to Tic Tac Toe and Photoshop, several of the most commonly used applications today have been built on C. It is easy to learn because: A simple syntax with only 32 keywords.


1 Answers

That's how the method has been implemented. For some known characters it uses the corresponding entity and for everything else it uses the corresponding hex value and there is not much you could do to modify this behavior. Excerpt from the implementation of System.Net.WebUtility.HtmlEncode (as seen with reflector):

...
if (ch <= '>')
{
    switch (ch)
    {
        case '&':
        {
            output.Write("&amp;");
            continue;
        }
        case '\'':
        {
            output.Write("&#39;");
            continue;
        }
        case '"':
        {
            output.Write("&quot;");
            continue;
        }
        case '<':
        {
            output.Write("&lt;");
            continue;
        }
        case '>':
        {
            output.Write("&gt;");
            continue;
        }
    }
    output.Write(ch);
    continue;
}
if ((ch >= '\x00a0') && (ch < 'Ā'))
{
    output.Write("&#");
    output.Write(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
    output.Write(';');
}
...

This being said you shouldn't care as this method will always produce valid, safe and correctly encoded HTML.

like image 121
Darin Dimitrov Avatar answered Sep 28 '22 03:09

Darin Dimitrov