I am parsing XML files from a third-party provider, and unfortunately the XML is not always well-formed: some elements contain duplicate attributes.
I don't control the source, and I don't know in advance which elements may have duplicate attributes or what the duplicate attribute names are.
Loading the content into an XmlDocument object raises an XmlException on the duplicate attributes, so I thought I could use an XmlReader to step through the XML element by element and deal with the duplicate attributes when I reach the offending element.
However, the XmlException is raised by reader.Read(), before I get a chance to inspect the element's attributes.
Here's a sample method to demonstrate the issue:
using System.IO;
using System.Text;
using System.Xml;

public static void ParseTest()
{
    const string xmlString =
@"<?xml version='1.0'?>
<!-- This is a sample XML document -->
<Items dupattr=""10"" id=""20"" dupattr=""33"">
<Item>test with a child element <more/> stuff</Item>
</Items>";
    var output = new StringBuilder();
    using (XmlReader reader = XmlReader.Create(new StringReader(xmlString)))
    {
        XmlWriterSettings ws = new XmlWriterSettings();
        ws.Indent = true;
        using (XmlWriter writer = XmlWriter.Create(output, ws))
        {
            while (reader.Read()) /* Exception thrown here when the Items element is encountered */
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        writer.WriteStartElement(reader.Name);
                        if (reader.HasAttributes) { /* CopyNonDuplicateAttributes(); */ }
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.XmlDeclaration:
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                }
            }
        }
    }
    string str = output.ToString();
}
Is there another way to parse the input and remove the duplicate attributes without having to use regular expressions and string manipulation?
I found a solution by treating the XML as an HTML document. Using the open-source Html Agility Pack library, I was able to get valid XML.
The trick was to save the XML with an HTML header first.
So replace the XML declaration <?xml version="1.0" encoding="utf-8" ?>
with an HTML doctype declaration like this: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
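As a rough sketch of that header swap, something like the following could be used. The class and method names (XmlHeaderSwap, ReplaceXmlDeclaration, SaveWithHtmlHeader) are made up for illustration and are not part of Html Agility Pack. It touches only a leading XML declaration with plain string operations and leaves the rest of the content as-is.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of any library: swaps a leading XML
// declaration for the XHTML doctype and saves the result to a file.
public static class XmlHeaderSwap
{
    private const string HtmlDoctype =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
        "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">";

    public static string ReplaceXmlDeclaration(string xml)
    {
        string trimmed = xml.TrimStart();

        // Only treat "<?xml" followed by whitespace as the declaration,
        // so other processing instructions such as "<?xml-stylesheet" are left alone.
        bool hasDeclaration =
            trimmed.StartsWith("<?xml", StringComparison.Ordinal) &&
            trimmed.Length > 5 &&
            char.IsWhiteSpace(trimmed[5]);

        if (hasDeclaration)
        {
            int end = trimmed.IndexOf("?>", StringComparison.Ordinal);
            if (end >= 0)
            {
                // Drop everything up to and including "?>" and put the doctype there.
                return HtmlDoctype + trimmed.Substring(end + 2);
            }
        }

        // No declaration found: just put the doctype in front.
        return HtmlDoctype + Environment.NewLine + trimmed;
    }

    public static void SaveWithHtmlHeader(string xml, string path)
    {
        File.WriteAllText(path, ReplaceXmlDeclaration(xml));
    }
}
```

For example, XmlHeaderSwap.SaveWithHtmlHeader(xmlString, "input.html") would write the modified content to a file named input.html (the file name is arbitrary here).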
Once the contents are saved to a file, the following method returns a valid XML document.
// Requires a reference to HtmlAgilityPack
using System.IO;
using System.Xml;
using HtmlAgilityPack;

public XmlDocument LoadHtmlAsXml(string url)
{
    var web = new HtmlWeb();
    var m = new MemoryStream();
    var xtw = new XmlTextWriter(m, null);
    // Load the content into the writer
    web.LoadHtmlAsXml(url, xtw);
    // Flush any buffered output, then rewind the memory stream
    xtw.Flush();
    m.Position = 0;
    // Create, fill, and return the XML document
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.LoadXml((new StreamReader(m)).ReadToEnd());
    return xmlDoc;
}
The duplicate attribute nodes are removed automatically, with later attribute values overwriting earlier ones.