Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove duplicate attributes from XML with C#

I am parsing some XML files from a third party provider and unfortunately it's not always well-formed XML as sometimes some elements contain duplicate attributes.

I don't have control over the source and I don't know which elements may have duplicate attributes nor do I know the duplicate attribute names in advance.

Obviously, loading the content into an XMLDocument object raises an XmlException on the duplicate attributes so I though I could use an XmlReader to step though the XML element by element and deal with the duplicate attributes when I get to the offending element.

However, the XmlException is raised on reader.Read() - before I get a chance to insepct the element's attributes.

Here's a sample method to demonstrate the issue:

public static void ParseTest()
{
    const string xmlString = 
        @"<?xml version='1.0'?>
        <!-- This is a sample XML document -->
        <Items dupattr=""10"" id=""20"" dupattr=""33"">
            <Item>test with a child element <more/> stuff</Item>
        </Items>";

    var output = new StringBuilder();
    using (XmlReader reader = XmlReader.Create(new StringReader(xmlString)))
    {
        XmlWriterSettings ws = new XmlWriterSettings();
        ws.Indent = true;
        using (XmlWriter writer = XmlWriter.Create(output, ws))
        {
            while (reader.Read())   /* Exception throw here when Items element encountered */
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        writer.WriteStartElement(reader.Name);
                        if (reader.HasAttributes){ /* CopyNonDuplicateAttributes(); */}
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.XmlDeclaration:
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                }
            }

        }
    }
    string str = output.ToString();
}

Is there another way to parse the input and remove the duplicate attributes without having to use regular expressions and string manipulation?

like image 804
Catch22 Avatar asked Jul 07 '11 11:07

Catch22


1 Answers

I found a solution by thinking of the XML as an HTML document. Then using the open-source Html Agility Pack library, I was able to get valid XML.

The trick was to save the xml with a HTML header first.
So replace the XML declaration
<?xml version="1.0" encoding="utf-8" ?>
with an HTML declaration like this:
!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Once the contents are saved to file, this method will return a valid XML Document.

// Requires reference to HtmlAgilityPack
public XmlDocument LoadHtmlAsXml(string url)
{
    var web = new HtmlWeb();

    var m = new MemoryStream();
    var xtw = new XmlTextWriter(m, null);

    // Load the content into the writer
    web.LoadHtmlAsXml(url, xtw);

    // Rewind the memory stream
    m.Position = 0;

    // Create, fill, and return the xml document
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.LoadXml((new StreamReader(m)).ReadToEnd());
    return xmlDoc;
}

The duplicate attribute nodes are automatically removed with the later attribute values overwriting the earlier ones.

like image 124
Catch22 Avatar answered Oct 10 '22 01:10

Catch22