
Unescaping XML entities using XmlReader in .NET?

I'm trying to unescape XML entities in a string in .NET (C#), but I don't seem to get it to work correctly.

For example, if I have the string AT&amp;T, it should be translated to AT&T.

One way is to use HttpUtility.HtmlDecode(), but that's for HTML.

So I have two questions about this:

  1. Is it safe to use HttpUtility.HtmlDecode() for decoding XML entities?

  2. How do I use XmlReader (or something similar) to do this? I have tried the following, but that always returns an empty string:

    static string ReplaceEscapes(string text)
    {
        StringReader reader = new StringReader(text);
    
        XmlReaderSettings settings = new XmlReaderSettings();
    
        settings.ConformanceLevel = ConformanceLevel.Fragment;
    
        using (XmlReader xmlReader = XmlReader.Create(reader, settings))
        {
            return xmlReader.ReadString();
        }
    }
    
Philippe Leybaert asked Mar 14 '11 20:03


2 Answers

HTML escaping and XML escaping are closely related. As you have said, HttpUtility has both HtmlEncode and HtmlDecode methods. These will also work on XML, as only a few characters need escaping in both HTML and XML: <, >, ", ' and &.

The downside of using the HttpUtility class is that you need a reference to the System.Web dll, which also brings in a lot of other stuff that you probably don't want.

Specifically for XML, the SecurityElement class has an Escape method that will do the encoding, but does not have a corresponding Unescape method. You therefore have a few options:

  1. use HttpUtility.HtmlDecode() and put up with a reference to System.Web

  2. roll your own decode method that takes care of the special characters (there are only a handful; look at the static constructor of SecurityElement in Reflector to see the full list)

  3. use a (hacky) solution like:

    public static string Unescape(string text)
    {
        XmlDocument doc = new XmlDocument();
        string xml = string.Format("<dummy>{0}</dummy>", text);
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }

Personally, I would use HttpUtility.HtmlDecode() if I already had a reference to System.Web, or roll my own if not. I don't like your XmlReader approach because XmlReader is disposable, which usually indicates that it holds resources that need to be released, and so it may be a costly operation.
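
For the "roll your own" option, a minimal sketch might look like the following. It assumes only the five predefined XML entities need decoding and deliberately ignores numeric character references such as &#38; or &#x26;. The class name XmlEntityDecoder is made up for illustration.

```csharp
using System;

// Hypothetical helper, not part of the framework.
public static class XmlEntityDecoder
{
    // Decodes the five predefined XML entities only.
    // Numeric character references (&#38;, &#x26;) are left untouched.
    public static string Unescape(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        return text
            .Replace("&lt;", "<")
            .Replace("&gt;", ">")
            .Replace("&quot;", "\"")
            .Replace("&apos;", "'")
            // &amp; must be replaced last, so that "&amp;lt;" decodes
            // to the literal text "&lt;" rather than to "<".
            .Replace("&amp;", "&");
    }
}
```

The order of the replacements matters: decoding &amp; first would turn doubly escaped input into a second round of entities and decode it twice.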

adrianbanks answered Sep 28 '22 06:09


Your #2 solution can work, but you need to call xmlReader.Read(); (or xmlReader.MoveToContent();) prior to ReadString.
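
Applied to the code from the question, that fix could look like the sketch below. The only functional change is the Read() call; the class name XmlReaderUnescaper is made up for illustration. As far as I can tell, ReadString returns an empty string while the reader has not yet been advanced to a node, which would explain the behaviour described in the question.

```csharp
using System.IO;
using System.Xml;

// Hypothetical wrapper class around the method from the question.
public static class XmlReaderUnescaper
{
    public static string ReplaceEscapes(string text)
    {
        var settings = new XmlReaderSettings
        {
            // Allow input that has no single root element.
            ConformanceLevel = ConformanceLevel.Fragment
        };

        using (var reader = new StringReader(text))
        using (XmlReader xmlReader = XmlReader.Create(reader, settings))
        {
            // Advance from the initial state to the first node;
            // without this call ReadString has nothing to read.
            xmlReader.Read();
            return xmlReader.ReadString();
        }
    }
}
```

Note that this still treats the input as XML, so text containing unescaped markup characters or undeclared entities will throw an XmlException.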

I guess #1 would also be acceptable, even though there are edge cases like &reg;, which is a valid HTML entity but not an XML entity – what should your unescaper do with it? Throw an exception as a proper XML parser would, or just return “®” as the HTML parser would do?
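
To make the difference concrete, here is a small sketch that runs the same input through both approaches. It assumes a reference to System.Web (or the System.Web.HttpUtility assembly on newer runtimes); the class and method names are made up for illustration.

```csharp
using System;
using System.Web;
using System.Xml;

// Hypothetical helper for comparing the HTML and XML behaviour.
public static class EntityEdgeCase
{
    // HTML decoding knows the named HTML entities, including &reg;.
    public static string DecodeAsHtml(string text)
    {
        return HttpUtility.HtmlDecode(text);
    }

    // Returns false when an XML parser rejects the text,
    // for example because it references an undeclared entity.
    public static bool XmlParserAccepts(string text)
    {
        try
        {
            var doc = new XmlDocument();
            doc.LoadXml("<dummy>" + text + "</dummy>");
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With the input &reg;, DecodeAsHtml returns the ® character, while XmlParserAccepts returns false because the XML parser reports an undeclared entity. Which of the two behaviours is preferable depends on whether the input is guaranteed to be XML-escaped.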

Mormegil answered Sep 28 '22 05:09