
Standard method for parsing XML documents without downloading a DTD

Tags: java, spring, xml

So, our application parses XML documents retrieved from a web service (specifically PubMed). Those documents declare a DTD (an example). By default, and contrary to my naive expectations, the XML library we use (JDom2, built on Xerces I believe) downloads that DTD before parsing the XML document. Downloads, as in makes an HTTP request over the internet to the address specified.

From reading other posts here, it's my understanding that reading the DTD is necessary because it may contain entity declarations required to parse the &foo; bits in the document (BTW, this is insanity in the XML standard, right?).

I thought that there must be some easy, standard, anyone-who-knows-what-they-are-doing-does-this way of specifying that I have the DTD locally. But all I see are mentions of setting up an XML catalog (black magic) or creating a custom EntityResolver (pain in my ass).

For other problems that I encounter, I find in Spring or some other Java library a standard way of overcoming them without a lot of boilerplate. For this one, however, I feel like I'm writing relatively sloppy, brittle code to accomplish something that every other developer must encounter.

How do you write XML applications, using well-known libraries, that don't make web requests over-and-over again to fetch files that never change?

PS: I discovered this problem because PubMed was having connectivity issues earlier today, and my unit tests (that use mocked up documents based on real queries) were failing when the XML parser couldn't retrieve the DTD.

PPS: I find it really amusing that the W3C has issues with this when they are the ones that propagated a standard that practically begs for this sort of abuse.

nstory asked Nov 03 '22 13:11



1 Answer

The best way I can think of to load the DTD from a different source is to use an EntityResolver; it shouldn't be that much of a pain in the rear. I load local XML resources using an EntityResolver for DOM4j and put the file inside my jar so it's easily accessible with the following code.

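new org.xml.sax.EntityResolver()
{
    @Override
    public InputSource resolveEntity(String publicId, String systemId)
    {
        // Serve the copy of the DTD bundled in the jar instead of fetching it over HTTP.
        if (systemId != null && systemId.equals("http://something.com/xml.dtd"))
            return new InputSource(getClass().getResourceAsStream("../xml/local.dtd"));
        // Returning null tells the parser to fall back to its default resolution.
        return null;
    }
};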

I think that is the "standard" way.
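To show the resolver actually wired into a parser, here is a minimal sketch using the JDK's built-in DocumentBuilder rather than DOM4j or JDOM2. The class name LocalDtdDemo, the example.invalid URL, and the inlined DTD text are all made up for illustration; a real application would match the URL from its own DOCTYPE and read the DTD from a classpath resource as above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalDtdDemo {
    // Hypothetical system ID; use the URL declared in your documents' DOCTYPE.
    static final String DTD_URL = "http://example.invalid/sample.dtd";

    // Stand-in for a DTD file shipped with the application (inlined so the
    // sketch is self-contained).
    static final String LOCAL_DTD =
        "<!ELEMENT note (#PCDATA)>\n<!ENTITY foo \"bar\">\n";

    // Parses the XML and returns the root element's text, resolving the DTD locally.
    static String parseNote(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();

        EntityResolver resolver = (publicId, systemId) -> {
            if (DTD_URL.equals(systemId)) {
                InputSource local = new InputSource(new StringReader(LOCAL_DTD));
                local.setSystemId(systemId);
                return local;
            }
            // Anything else falls back to the parser's default behaviour.
            return null;
        };
        builder.setEntityResolver(resolver);

        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE note SYSTEM \"" + DTD_URL + "\">\n"
            + "<note>value is &foo;</note>";
        System.out.println(parseNote(xml));
    }
}
```

As far as I know, JDOM2's SAXBuilder and DOM4j's SAXReader each have a setEntityResolver method too, so the same resolver object should plug in there without changes.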

Another way may be to modify the XML document with a string replace, swapping out the DTD reference and injecting declarations for any entities that may be used.
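A rough sketch of that string-replace idea, again with the JDK parser. The class name InlineDtdDemo and the entity declaration are hypothetical, and the regex assumes a double-quoted SYSTEM or PUBLIC DOCTYPE with no internal subset of its own, so treat it as a starting point rather than a robust rewrite.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class InlineDtdDemo {
    // Hypothetical entity declarations, copied by hand from the remote DTD.
    static final String INTERNAL_SUBSET = "<!ENTITY foo \"bar\">";

    // Swaps the external DTD reference for an internal subset that declares
    // the needed entities, keeping the root element name from the original.
    static String inlineDoctype(String xml) {
        return xml.replaceFirst(
            "<!DOCTYPE\\s+(\\S+)\\s+(?:SYSTEM|PUBLIC\\s+\"[^\"]*\")\\s+\"[^\"]*\"\\s*>",
            "<!DOCTYPE $1 [" + Matcher.quoteReplacement(INTERNAL_SUBSET) + "]>");
    }

    // Parses the rewritten document and returns the root element's text.
    static String parseText(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Fail loudly if the parser still asks for an external entity.
        builder.setEntityResolver((publicId, systemId) -> {
            throw new IOException("unexpected external lookup: " + systemId);
        });
        Document doc = builder.parse(
            new InputSource(new StringReader(inlineDoctype(xml))));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE note SYSTEM \"http://example.invalid/sample.dtd\">"
            + "<note>value is &foo;</note>";
        System.out.println(inlineDoctype(xml));
        System.out.println(parseText(xml));
    }
}
```

The downside is that the entity list has to be kept in sync with the real DTD by hand, which is why the EntityResolver route is usually less brittle.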

xer21 answered Nov 12 '22 17:11
