How to correctly parse an XML document with arbitrary namespaces

Tags: namespaces, c#, xml

I am trying to parse somewhat standard XML documents that use a schema called MARCXML from various sources.

Here are the first few lines of an example XML file that needs to be handled...

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
  <marc:record>
    <marc:leader>00925njm  22002777a 4500</marc:leader>

and one without namespace prefixes...

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>01142cam  2200301 a 4500</leader>

Key point: to get the XPath expressions to resolve later in the program, I have to run a regex over the raw XML and register each declared namespace with an XmlNamespaceManager (XmlDocument does not register them for XPath by default). This seems unnecessary to me.

Regex xmlNamespace = new Regex("xmlns:(?<PREFIX>[^=]+)=\"(?<URI>[^\"]+)\"", RegexOptions.Compiled);

XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xmlRecord);
XmlNamespaceManager nsMgr = new XmlNamespaceManager(xmlDoc.NameTable);

MatchCollection namespaces = xmlNamespace.Matches(xmlRecord);
foreach (Match n in namespaces)
{
    nsMgr.AddNamespace(n.Groups["PREFIX"].ToString(), n.Groups["URI"].ToString());
}

The XPath call looks something like this...

XmlNode leaderNode = xmlDoc.SelectSingleNode(".//" + LeaderNode, nsMgr);

Where LeaderNode is a configurable value and would equal "marc:leader" in the first example and "leader" in the second example.

Is there a better, more efficient way to do this? Note: suggestions for solving this using LINQ are welcome, but I would mainly like to know how to solve this using XmlDocument.

EDIT: I took GrayWizardx's advice and now have the following code...

if (LeaderNode.Contains(":"))
{
    string prefix = LeaderNode.Substring(0, LeaderNode.IndexOf(':'));
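    // DocumentElement is the root element; FirstChild can be the <?xml ...?> declaration instead.
    XmlNode root = xmlDoc.DocumentElement;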
    string nameSpace = root.GetNamespaceOfPrefix(prefix);
    nsMgr.AddNamespace(prefix, nameSpace);
}

Now there's no more dependency on Regex!
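One caveat for the second example: in XPath 1.0 an unprefixed name such as "leader" refers to an element in no namespace, so it is not expected to match a <leader> element that sits in a default namespace. A minimal sketch of one way around this, assuming the root element is always in the MARCXML namespace, is to bind a prefix of your own choosing to the root's namespace URI and always query with that prefix. The class name, method name, and the prefix "m" below are hypothetical, chosen only for illustration.

```csharp
using System.Xml;

public static class DefaultNamespaceSketch
{
    // Hypothetical helper: finds the first <leader> element whether the
    // document uses a prefix (marc:leader) or a default namespace (leader).
    public static XmlNode SelectLeader(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var nsMgr = new XmlNamespaceManager(doc.NameTable);

        // "m" exists only inside the XPath expression. It does not have to
        // match any prefix used in the document; only the URI matters.
        // Assumption: the root element is in a non-empty namespace.
        nsMgr.AddNamespace("m", doc.DocumentElement.NamespaceURI);

        return doc.SelectSingleNode("//m:leader", nsMgr);
    }
}
```

With this approach the configurable value can be just the local name ("leader") for both kinds of document, since the prefix in the XPath is independent of the prefixes in the source XML.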

asked Oct 20 '10 19:10 by Ryan Berger


1 Answer

If you know there is going to be a given element in the document (for instance the root element) you could try using GetNamespaceOfPrefix.
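As a rough sketch of that idea, assuming the prefix is declared on the root element: look the prefix up with GetNamespaceOfPrefix on DocumentElement and register the result with an XmlNamespaceManager before running the XPath. The class and method names here are hypothetical, not part of any library.

```csharp
using System.Xml;

public static class MarcNamespaceSketch
{
    // Hypothetical helper: selects the first node matching a configured,
    // possibly prefixed, element name such as "marc:leader".
    public static XmlNode SelectByConfiguredName(XmlDocument doc, string qualifiedName)
    {
        var nsMgr = new XmlNamespaceManager(doc.NameTable);

        int colon = qualifiedName.IndexOf(':');
        if (colon > 0)
        {
            string prefix = qualifiedName.Substring(0, colon);

            // Ask the root element which URI the prefix is bound to.
            // An empty string means the prefix is not in scope there.
            string uri = doc.DocumentElement.GetNamespaceOfPrefix(prefix);
            if (!string.IsNullOrEmpty(uri))
            {
                nsMgr.AddNamespace(prefix, uri);
            }
        }

        return doc.SelectSingleNode(".//" + qualifiedName, nsMgr);
    }
}
```

If the prefix is not declared on the root element, nothing is registered and the XPath call is expected to fail with an XPathException, so callers may want to handle that case explicitly.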

answered Nov 05 '22 12:11 by GrayWizardx