I'm trying to unescape XML entities in a string in .NET (C#), but I don't seem to get it to work correctly.
For example, if I have the string AT&amp;T, it should be translated to AT&T.
One way is to use HttpUtility.HtmlDecode(), but that's for HTML.
So I have two questions about this:
1. Is it safe to use HttpUtility.HtmlDecode() for decoding XML entities?
2. How do I use XmlReader (or something similar) to do this? I have tried the following, but it always returns an empty string:
static string ReplaceEscapes(string text)
{
    StringReader reader = new StringReader(text);
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ConformanceLevel = ConformanceLevel.Fragment;
    using (XmlReader xmlReader = XmlReader.Create(reader, settings))
    {
        return xmlReader.ReadString();
    }
}
HTML escaping and XML escaping are closely related. As you have said, HttpUtility has both HtmlEncode and HtmlDecode methods. These will also operate on XML, as there are only a few characters that need escaping in both HTML and XML: <, >, ", ' and &.
The downside of using the HttpUtility class is that you need a reference to the System.Web dll, which also brings in a lot of other stuff that you probably don't want.
Specifically for XML, the SecurityElement class has an Escape method that will do the encoding, but it does not have a corresponding Unescape method. You therefore have a few options:
- use HttpUtility.HtmlDecode() and put up with a reference to System.Web
- roll your own decode method that takes care of the special characters (there are only a handful; look at the static constructor of SecurityElement in Reflector to see the full list)
- use a (hacky) solution like:
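public static string Unescape(string text)
{
    // Wrap the text in a dummy root element so the XML parser resolves the entities.
    // LoadXml throws XmlException if the text is not well-formed XML content.
    XmlDocument doc = new XmlDocument();
    string xml = string.Format("<dummy>{0}</dummy>", text);
    doc.LoadXml(xml);
    return doc.DocumentElement.InnerText;
}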
Personally, I would use HttpUtility.HtmlDecode() if I already had a reference to System.Web, or roll my own if not. I don't like your XmlReader approach because XmlReader is IDisposable, which usually indicates that it is using resources that need to be disposed, and so it may be a costly operation.
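For the "roll your own" option, a minimal sketch might look like the following. The class name XmlEntityDecoder is made up for illustration, and the sketch assumes that only the five predefined XML entities need handling; numeric character references such as &#38; are deliberately left untouched. It decodes in a single pass so that already-decoded output is never scanned again.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the framework.
static class XmlEntityDecoder
{
    // The five entities predefined by XML, mapped to the characters they stand for.
    private static readonly Dictionary<string, string> Predefined =
        new Dictionary<string, string>
        {
            { "lt", "<" },
            { "gt", ">" },
            { "amp", "&" },
            { "quot", "\"" },
            { "apos", "'" }
        };

    // Matches exactly one of the predefined entity references.
    private static readonly Regex EntityPattern =
        new Regex("&(lt|gt|amp|quot|apos);");

    public static string Unescape(string text)
    {
        // Regex.Replace does not rescan replaced text, so "&amp;lt;"
        // becomes "&lt;" rather than being decoded a second time to "<".
        return EntityPattern.Replace(text, m => Predefined[m.Groups[1].Value]);
    }
}
```

Chaining string.Replace calls would also work, but then &amp; has to be replaced last, otherwise input such as &amp;lt; gets decoded twice.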
Your #2 solution can work, but you need to call xmlReader.Read() (or xmlReader.MoveToContent()) prior to ReadString().
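As a rough sketch of that fix, here is the method from the question with the missing Read() call added. The wrapper class name XmlReaderUnescape is only there to make the snippet self-contained. ReadString() returns an empty string while the reader is still in its initial state, which would explain the empty result in the question.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical wrapper class so the snippet compiles on its own.
static class XmlReaderUnescape
{
    public static string ReplaceEscapes(string text)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        // Fragment mode lets the reader accept bare text without a root element.
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        using (StringReader reader = new StringReader(text))
        using (XmlReader xmlReader = XmlReader.Create(reader, settings))
        {
            // Advance from the initial state onto the first node;
            // without this, ReadString() has nothing to read.
            xmlReader.Read();
            return xmlReader.ReadString();
        }
    }
}
```

With this change, ReplaceEscapes("AT&amp;T") should return AT&T. Input that is not well-formed XML content, such as a bare & character, will still make the reader throw an XmlException.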
I guess #1 would also be acceptable, even though there are edge cases like &reg;, which is a valid HTML entity but not an XML entity. What should your unescaper do with it? Throw an exception as a proper XML parser would, or just return "®" as the HTML parser would do?
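To make the edge case concrete, the sketch below compares the two behaviours. It is only an illustration: the class and method names are invented, and it uses System.Net.WebUtility.HtmlDecode (available since .NET Framework 4) as a stand-in for HttpUtility.HtmlDecode so that no System.Web reference is needed.

```csharp
using System;
using System.Net;
using System.Xml;

// Hypothetical helper for comparing HTML-style and XML-style decoding.
static class EntityEdgeCase
{
    // HTML decoding knows the named HTML entities, including &reg;.
    public static string DecodeAsHtml(string text)
    {
        return WebUtility.HtmlDecode(text);
    }

    // Returns false when an XML parser rejects the text, for example
    // because it references an entity that XML does not declare.
    public static bool XmlParserAccepts(string text)
    {
        try
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml("<dummy>" + text + "</dummy>");
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Here DecodeAsHtml("&reg;") yields the ® character, while XmlParserAccepts("&reg;") is false because the XML parser reports an undeclared entity. Which behaviour is right depends on whether the input is really XML or may contain HTML entities.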