Unescaping XML entities using XmlReader in .NET?

Question

I'm trying to unescape XML entities in a string in .NET (C#), but I don't seem to get it to work correctly.

For example, if I have the string AT&amp;T, it should be translated to AT&T.

One way is to use HttpUtility.HtmlDecode(), but that's for HTML.

So I have two questions about this:

1. Is it safe to use HttpUtility.HtmlDecode() for decoding XML entities?

2. How do I use XmlReader (or something similar) to do this? I have tried the following, but that always returns an empty string:

static string ReplaceEscapes(string text)
{
    StringReader reader = new StringReader(text);

    XmlReaderSettings settings = new XmlReaderSettings();

    settings.ConformanceLevel = ConformanceLevel.Fragment;

    using (XmlReader xmlReader = XmlReader.Create(reader, settings))
    {
        return xmlReader.ReadString();
    }
}

adrianbanks · Accepted Answer

HTML escaping and XML escaping are closely related. As you have said, HttpUtility has both HtmlEncode and HtmlDecode methods. These will also operate on XML, as only a few characters need escaping in both HTML and XML: <, >, ", ' and &.

The downside of using the HttpUtility class is that you need a reference to the System.Web dll, which also brings in a lot of other stuff that you probably don't want.
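As a rough illustration of the HttpUtility route, here is a minimal sketch. The class name HtmlDecodeExample is made up for this example, and it assumes the project can reference System.Web (or a runtime that ships System.Web.HttpUtility).

```csharp
using System;
using System.Web; // HttpUtility; on .NET Framework this needs a reference to System.Web.dll

// Hypothetical wrapper class, named only for this example.
public static class HtmlDecodeExample
{
    // Decodes HTML entities; the basic XML escapes such as &amp; and &lt; are covered too.
    public static string Decode(string text)
    {
        return HttpUtility.HtmlDecode(text);
    }
}
```

Calling HtmlDecodeExample.Decode("AT&amp;T") returns "AT&T".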

Specifically for XML, the SecurityElement class has an Escape method that will do the encoding, but does not have a corresponding Unescape method. You therefore have a few options:

- use the HttpUtility.HtmlDecode() and put up with a reference to System.Web
- roll your own decode method that takes care of the special characters (as there are only a handful - look at the static constructor of SecurityElement in Reflector to see the full list)
- use a (hacky) solution like:

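    // Requires: using System.Xml;
    public static string Unescape(string text)
    {
        // Wrap the text in a dummy root element so it parses as a well-formed document.
        XmlDocument doc = new XmlDocument();
        string xml = string.Format("<dummy>{0}</dummy>", text);
        doc.LoadXml(xml);

        // InnerText returns the element's text with entities already resolved.
        return doc.DocumentElement.InnerText;
    }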
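For the "roll your own" option, a minimal sketch could look like the following. The names XmlEntityDecoder and UnescapePredefined are hypothetical, and the sketch assumes that only the five predefined XML entities need handling; numeric character references such as &#39; are left untouched.

```csharp
using System;

// Hypothetical helper, not part of the framework.
public static class XmlEntityDecoder
{
    // Decodes only the five predefined XML entities.
    public static string UnescapePredefined(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        return text
            .Replace("&lt;", "<")
            .Replace("&gt;", ">")
            .Replace("&quot;", "\"")
            .Replace("&apos;", "'")
            // &amp; is decoded last so that "&amp;lt;" becomes "&lt;" rather than "<".
            .Replace("&amp;", "&");
    }
}
```

With this, XmlEntityDecoder.UnescapePredefined("AT&amp;T") returns "AT&T". Unlike a real XML parser, it silently passes through anything it does not recognise.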

Personally, I would use HttpUtility.HtmlDecode() if I already had a reference to System.Web, or roll my own if not. I don't like your XmlReader approach as it is disposable, which usually indicates that it is using resources that need to be disposed, and so may be a costly operation.

Mormegil · Answer

Your #2 solution can work, but you need to call xmlReader.Read(); (or xmlReader.MoveToContent();) prior to ReadString.
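Applied to the code from the question, the change might look like this. It is a sketch that assumes the input is plain text containing only entities and no markup; the wrapper class name XmlReaderUnescape is made up for the example.

```csharp
using System.IO;
using System.Xml;

// Hypothetical wrapper class, named only for this example.
public static class XmlReaderUnescape
{
    public static string ReplaceEscapes(string text)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        using (StringReader reader = new StringReader(text))
        using (XmlReader xmlReader = XmlReader.Create(reader, settings))
        {
            // A freshly created reader is not positioned on any node yet,
            // so ReadString() returns an empty string until Read() is called.
            xmlReader.Read();
            return xmlReader.ReadString();
        }
    }
}
```

XmlReaderUnescape.ReplaceEscapes("AT&amp;T") then returns "AT&T".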

I guess #1 would also be acceptable, even though there are edge cases like &reg;, which is a valid HTML entity but not an XML entity. What should your unescaper do with it? Throw an exception as a proper XML parser would, or just return "®" as the HTML parser would?
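To make the difference concrete, here is a small sketch comparing the two behaviours. The class and method names are invented for illustration; DecodeAsXml reuses the dummy-element trick from the accepted answer, and DecodeAsHtml assumes System.Web.HttpUtility is available.

```csharp
using System;
using System.Web;
using System.Xml;

// Hypothetical comparison helpers, named only for this example.
public static class EntityEdgeCase
{
    // HTML decoding knows the HTML entity set, so &reg; becomes the registered sign.
    public static string DecodeAsHtml(string text)
    {
        return HttpUtility.HtmlDecode(text);
    }

    // XML parsing only knows the predefined entities unless a DTD declares more,
    // so an undeclared entity such as &reg; causes an XmlException.
    public static string DecodeAsXml(string text)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<dummy>" + text + "</dummy>");
        return doc.DocumentElement.InnerText;
    }
}
```

EntityEdgeCase.DecodeAsHtml("&reg;") returns "®", while EntityEdgeCase.DecodeAsXml("&reg;") throws an XmlException. Both return "AT&T" for "AT&amp;T".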

Tags: .net, xml, entities, translate

Philippe Leybaert
