
namespace-unaware XPath expression fails if Saxon is on the CLASSPATH

I have the following sample XML file:

<a xmlns="http://www.foo.com">
    <b>
    </b>
</a>

Using the XPath expression /foo:a/foo:b (with 'foo' properly configured in the NamespaceContext) I can correctly count the number of b nodes and the code works both when Saxon-HE-9.4.jar is on the CLASSPATH and when it's not.

When, however, I parse the same file with a namespace-unaware DocumentBuilderFactory, the XPath expression "/a/b" correctly counts the number of b nodes only when Saxon-HE-9.4.jar is not on the CLASSPATH.

Code below:

import java.io.*;
import java.util.*;
import javax.xml.xpath.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import javax.xml.namespace.NamespaceContext;

public class FooMain {

    public static void main(String args[]) throws Exception {

        String xmlSample = "<a xmlns=\"http://www.foo.com\"><b></b></a>";
        {
            XPath xpath = namespaceUnawareXpath();
            System.out.printf("[NS-unaware] Number of 'b' nodes is: %d\n", 
                              ((NodeList) xpath.compile("/a/b").evaluate(stringToXML(xmlSample, false),
                              XPathConstants.NODESET)).getLength());
        }
        {
            XPath xpath = namespaceAwareXpath("foo", "http://www.foo.com");
            System.out.printf("[NS-aware  ] Number of 'b' nodes is: %d\n", 
                              ((NodeList) xpath.compile("/foo:a/foo:b").evaluate(stringToXML(xmlSample, true),
                               XPathConstants.NODESET)).getLength());
        }

    }


    public static XPath namespaceUnawareXpath() {
        XPathFactory xPathfactory = XPathFactory.newInstance();
        XPath xpath = xPathfactory.newXPath();
        return xpath;
    }

    public static XPath namespaceAwareXpath(final String prefix, final String nsURI) {
        XPathFactory xPathfactory = XPathFactory.newInstance();
        XPath xpath = xPathfactory.newXPath();
        NamespaceContext ctx = new NamespaceContext() {
                @Override
                public String getNamespaceURI(String aPrefix) {
                    if (aPrefix.equals(prefix))
                        return nsURI;
                    else
                        return null;
                }
                @Override
                public Iterator getPrefixes(String val) {
                    throw new UnsupportedOperationException();
                }
                @Override
                public String getPrefix(String uri) {
                    throw new UnsupportedOperationException();
                }
            };
        xpath.setNamespaceContext(ctx);
        return xpath;
    }    

    private static Document stringToXML(String s, boolean nsAware) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(nsAware);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(s.getBytes("UTF-8")));
    }


}

Running the above with:

java -classpath dist/foo.jar FooMain

... produces:

[NS-unaware] Number of 'b' nodes is: 1
[NS-aware  ] Number of 'b' nodes is: 1

Running with:

java -classpath Saxon-HE-9.4.jar:dist/foo.jar FooMain

... produces:

[NS-unaware] Number of 'b' nodes is: 0
[NS-aware  ] Number of 'b' nodes is: 1

Marcus Junius Brutus asked Jan 14 '14 16:01


2 Answers

Correct observation. Saxon doesn't work with a namespace-unaware DOM. There's no reason why it should. If you can find an XSLT/XPath processor that works with a namespace-unaware DOM, then go ahead and use it if you want, but its behaviour isn't defined by any standard.

If it were possible for Saxon to detect that the DOM is namespace-unaware, then it would throw an error rather than giving spurious results. Sadly, one of DOM's many design failings is that if you didn't create the DOM yourself, you can't tell whether it's namespace-aware or not.

Your comment "I need to be lenient on namespaces since I have to handle 3rd-party XML instances that are not always XSD valid." is a complete non-sequitur. It's true that a document can't be XSD-valid unless it is namespace-valid, but the converse is not true; loads of documents are namespace-valid without being XSD-valid.

Finally, as your experience shows, relying on the JAXP mechanism to load whatever XPath processor happens to be lying around on the classpath is very error-prone. You can't even control whether you get an XPath 1.0 or 2.0 processor by this mechanism (and again, you can't find out easily which you have got). If your code is dependent on the quirks of a particular XPath implementation then you need to load that implementation explicitly rather than relying on the JAXP search.
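
As a rough illustration of asking for one implementation explicitly, the sketch below uses XPathFactory.newDefaultInstance() and DocumentBuilderFactory.newDefaultInstance() (both available from Java 9), which return the JDK's built-in implementations without consulting the classpath. The class name ExplicitXPathDemo and the helper countNodes are made up for this example. It deliberately reproduces the question's namespace-unaware case, so the count it prints is a quirk of the JDK processor rather than behaviour defined by any standard.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of the original question's code.
class ExplicitXPathDemo {

    // Parses xml WITHOUT namespace awareness and counts the nodes selected by
    // expression, always using the JDK's own parser and XPath processor.
    static int countNodes(String xml, String expression) {
        try {
            // newDefaultInstance() skips the JAXP provider lookup, so a jar on
            // the classpath cannot silently replace the implementation.
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newDefaultInstance();
            dbf.setNamespaceAware(false);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            XPath xpath = XPathFactory.newDefaultInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            // Wrapped only to keep the example's call sites short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a xmlns=\"http://www.foo.com\"><b></b></a>";
        System.out.printf("Number of 'b' nodes is: %d%n", countNodes(xml, "/a/b"));
    }
}
```

To pin a third-party processor instead, the same idea applies: construct that vendor's XPathFactory implementation directly rather than calling XPathFactory.newInstance().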

UPDATE (Sep 2015): Saxon 9.6 no longer includes the META-INF services file that advertises it as a JAXP XPath provider. This means you will never pick up Saxon as your XPath processor simply because it is on the classpath: you have to ask for it explicitly.

Michael Kay answered Sep 21 '22 08:09


The XPath language is only defined on namespace-well-formed XML, so the behaviour of different processors on a non-namespace-aware DOM tree (even one like <a><b/></a> that, had it been parsed in a namespace-aware manner, would not actually use any namespaces) is at best implementation-specific and at worst completely undefined.
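
If the goal is to match elements regardless of their namespace, one option that stays within defined behaviour is to parse namespace-aware and test local-name() in the expression. The sketch below is only an illustration of that idea; the class name LocalNameXPathDemo and the helper countIgnoringNamespace are invented for this example, and the expression shown matches an element named a containing b in any namespace or none.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of the original question's code.
class LocalNameXPathDemo {

    // Matches /a/b by local name only, whatever namespace the elements are in.
    static final String A_B_ANY_NAMESPACE = "/*[local-name()='a']/*[local-name()='b']";

    // Parses xml WITH namespace awareness and counts the nodes selected by expression.
    static int countIgnoringNamespace(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // the tree XPath is actually defined on
            Document doc = dbf.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // No NamespaceContext is needed because the expression uses no prefixes.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            // Wrapped only to keep the example's call sites short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String withNs = "<a xmlns=\"http://www.foo.com\"><b></b></a>";
        String withoutNs = "<a><b></b></a>";
        System.out.printf("With namespace:    %d%n",
                countIgnoringNamespace(withNs, A_B_ANY_NAMESPACE));
        System.out.printf("Without namespace: %d%n",
                countIgnoringNamespace(withoutNs, A_B_ANY_NAMESPACE));
    }
}
```

Because the document is namespace-well-formed and parsed namespace-aware, the result should not depend on which conformant XPath processor JAXP happens to load.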

Ian Roberts answered Sep 18 '22 08:09