I have a string which contains XML, I just want to parse it into Xelement, but it has an ampersand. I still have a problem parseing it with HtmlDecode. Any suggestions?
string test = " <MyXML><SubXML><XmlEntry Element="test" value="wow&" /></SubXML></MyXML>";
XElement.Parse(HttpUtility.HtmlDecode(test));
I also added these methods to replace those characters, but I am still getting XMLException.
string encodedXml = test.Replace("&", "&").Replace("<", "<").Replace(">", ">").Replace("\"", """).Replace("'", "'");
XElement myXML = XElement.Parse(encodedXml);
t or Even tried it with this:
string newContent= SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);
Yes, the ampersand symbol may be used in Food Trust XML files as long as it is in XML Predefined Entity format. XML has a few special characters or symbols which are not available to be typed directly from the keyboard.
xml version="1.0"?> An ampersand a character reference can also be escaped as & in element content of XML.
The ampersand & character must be escaped. The greater than and less than characters do no have to be escaped but its good practice to do it. The double quotes in the data must be escaped.
To include special characters inside XML files you must use the numeric character reference instead of that character. The numeric character reference must be UTF-8 because the supported encoding for XML files is defined in the prolog as encoding="UTF-8" and should not be changed.
Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.
For example, "wow&".Replace("&", "&")
results in wow&amp;
which is clearly undesirable.
Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of other characters, such as <
, something like:
string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&");
The above works, but admittedly it doesn't cover the variety of other characters that start with an ampersand, such as
and the list can grow.
A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&"
the decode process would return "&wow&"
then re-encoding it would return "&wow&"
, which is desirable. To pull this off you could use this:
string result = Regex.Replace(test, @"value=\""(.*?)\""", m => "value=\"" +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
"\"");
var doc = XElement.Parse(result);
Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
m.Groups["start"].Value +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
m.Groups["end"].Value);
var doc = XElement.Parse(result);
Your string doesn't contain valid XML, that's the issue. You need to change your string to:
<MyXML><SubXML><XmlEntry Element="test" value="wow&" /></SubXML></MyXML>"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With