
How do I use xpath on nodes with a prefix but without a namespace?

I have an XML file that I need to parse. I have no control over the format of the file and cannot change it.

The file uses a prefix (call it a), but it never declares a namespace for that prefix. I can't seem to use XPath to query for nodes with the a prefix.

Here are the contents of the XML document:

<?xml version="1.0" encoding="UTF-8"?>

<a:root>
  <a:thing>stuff0</a:thing>
  <a:thing>stuff1</a:thing>
  <a:thing>stuff2</a:thing>
  <a:thing>stuff3</a:thing>
  <a:thing>stuff4</a:thing>
  <a:thing>stuff5</a:thing>
  <a:thing>stuff6</a:thing>
  <a:thing>stuff7</a:thing>
  <a:thing>stuff8</a:thing>
  <a:thing>stuff9</a:thing>
</a:root>

I am using Nokogiri to query the document:

doc = Nokogiri::XML(open('text.xml'))
things = doc.xpath('//a:thing')

This fails with the following error:

Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //a:thing

From my research, I found out that I could specify the namespace for the prefix in the xpath method:

things = doc.xpath('//a:thing', a: 'nobody knows')

This returns an empty array.

What would be the best way for me to get the nodes that I need?

Boris Bera asked Nov 15 '13 15:11


1 Answer

The problem is that the prefix "a" is never bound to a namespace in the XML document. As a result, Nokogiri treats the whole string "a:root" as the node name, instead of treating "a" as a namespace prefix and "root" as the node name:

require 'nokogiri'

xml = %Q{
    <?xml version="1.0" encoding="UTF-8"?>
    <a:root>
      <a:thing>stuff0</a:thing>
      <a:thing>stuff1</a:thing>
    </a:root>
}
doc = Nokogiri::XML(xml)
puts doc.at_xpath('*').node_name
#=> "a:root"
p doc.at_xpath('*').namespace
#=> nil

Solution 1 - Specify node name with colon

One solution is to search for nodes whose full name is "a:thing". You cannot use //a:thing, because XPath treats the "a" as a namespace prefix, and that prefix is not bound to anything. You can get around this by comparing against the result of name() instead, with //*[name()="a:thing"]:

require 'nokogiri'

xml = %Q{
    <?xml version="1.0" encoding="UTF-8"?>
    <a:root>
      <a:thing>stuff0</a:thing>
      <a:thing>stuff1</a:thing>
    </a:root>
}
doc = Nokogiri::XML(xml)
things = doc.xpath('//*[name()="a:thing"]')
puts things
#=> <a:thing>stuff0</a:thing>
#=> <a:thing>stuff1</a:thing>
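If several prefixed names need to be matched this way, the expression can be built rather than typed by hand. The following is only a sketch: xpath_for_qname is a hypothetical helper, not part of Nokogiri, and "a:other" is an invented element name used for illustration.

```ruby
# Hypothetical helper (not a Nokogiri API): builds an XPath expression that
# matches elements by their full prefixed name via name(), so the prefix is
# never resolved as a namespace.
def xpath_for_qname(*qnames)
  tests = qnames.map { |qname| %(name()="#{qname}") }
  "//*[#{tests.join(' or ')}]"
end

puts xpath_for_qname('a:thing')
#=> //*[name()="a:thing"]
puts xpath_for_qname('a:thing', 'a:other')
#=> //*[name()="a:thing" or name()="a:other"]
```

The resulting string can be passed to doc.xpath exactly like the literal expression above.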

Solution 2 - Modify the XML document to define the namespace

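An alternative solution is to modify the XML text before parsing it so that it declares a namespace for the prefix. The namespace URI can be any placeholder ("foo" below). Once the root element declares it, the document behaves with namespaces as expected: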

require 'nokogiri'

xml = %Q{
    <?xml version="1.0" encoding="UTF-8"?>
    <a:root>
      <a:thing>stuff0</a:thing>
      <a:thing>stuff1</a:thing>
    </a:root>
}
xml.gsub!('<a:root>', '<a:root xmlns:a="foo">')
doc = Nokogiri::XML(xml)
things = doc.xpath('//a:thing')
puts things
#=> <a:thing>stuff0</a:thing>
#=> <a:thing>stuff1</a:thing>
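The gsub! above only works when the root tag is exactly "<a:root>" with no attributes. A slightly more tolerant variant, sketched below, inserts the declaration into whatever the first start tag is. declare_prefix is a hypothetical helper, the id attribute is added here purely for illustration, and the regex assumes no comment containing "<name" appears before the root element.

```ruby
# Hypothetical helper (not a Nokogiri API): adds xmlns:<prefix>="<uri>" to the
# first element start tag in the text, unless the prefix is already declared.
def declare_prefix(xml, prefix, uri)
  return xml if xml.match?(/\bxmlns:#{Regexp.escape(prefix)}\s*=/)

  # "<?xml" and "<!DOCTYPE" are skipped because the character after "<"
  # must be a letter or underscore.
  xml.sub(/<([A-Za-z_][\w.:-]*)/) do
    tag = Regexp.last_match(1)
    %(<#{tag} xmlns:#{prefix}="#{uri}")
  end
end

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <a:root id="1">
    <a:thing>stuff0</a:thing>
  </a:root>
XML

fixed = declare_prefix(xml, 'a', 'foo')
puts fixed
```

The returned string can then be handed to Nokogiri::XML and queried with //a:thing as in the example above.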
Justin Ko answered Oct 02 '22 23:10