Disable XML Entity resolving in JDOM / DOM

I am writing a Java application for the postprocessing of XML files. These XML files come from an RDF export of a Semantic MediaWiki, so they use RDF/XML syntax.

My problem is the following: when I read the XML file, all the entities in the file are resolved to the values declared in the DOCTYPE. For example, in the DOCTYPE I have

<!DOCTYPE rdf:RDF[
<!ENTITY wiki 'http://example.org/smartgrid/index.php/Special:URIResolver/'>
..
]>

and in the root element

<rdf:RDF
xmlns:wiki="&wiki;"
..
>

This means

<swivt:Subject rdf:about="&wiki;Main_Page">

becomes

<swivt:Subject rdf:about="http://example.org/smartgrid/index.php/Special:URIResolver/Main_Page">

I have tried using both JDOM and the standard Java DOM. The code I think is relevant here is, for standard DOM:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setExpandEntityReferences(false);
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

and for JDOM

SAXBuilder builder = new SAXBuilder();
builder.setExpandEntities(false); // Retain entities
builder.setValidation(false);
builder.setFeature("http://xml.org/sax/features/resolve-dtd-uris", false);

But the entities are resolved throughout the whole XML document nonetheless. Am I missing something? Hours of searching have only led me to the 'ExpandEntities' settings, but they don't seem to work.
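
The behaviour described above can be reproduced with a small stand-alone program. The following is a minimal sketch, assuming the parser bundled with the JDK; the class name EntityExpansionDemo, the sample document, and the shortened entity value are all hypothetical. One point worth keeping in mind when reading it: the XML specification's attribute-value normalization requires a parser to expand entity references inside attribute values, so setExpandEntityReferences(false) can only be expected to affect references in element content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names and the sample document are illustrative only.
class EntityExpansionDemo {

    // Stand-in document: one internal entity, used in an attribute and in element content.
    static final String XML =
        "<!DOCTYPE root [\n"
      + "  <!ENTITY wiki 'http://example.org/wiki/'>\n"
      + "]>\n"
      + "<root about=\"&wiki;Main_Page\">&wiki;Main_Page</root>";

    // Parses the sample with entity expansion switched off and returns the root element.
    static Element parseRoot() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setExpandEntityReferences(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // The attribute value as the DOM reports it.
    static String attributeValue() {
        return parseRoot().getAttribute("about");
    }

    // True if the element content kept an EntityReference node.
    static boolean contentHasEntityReferenceNode() {
        NodeList children = parseRoot().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("attribute: " + attributeValue());
        System.out.println("entity reference node in content: "
            + contentHasEntityReferenceNode());
    }
}
```

The attribute comes back with the entity already replaced, which matches what happens to rdf:about in the export. Whether the element content keeps an EntityReference node depends on the parser and JDK version, so the second line of output is best treated as a diagnostic rather than a guarantee.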

Any hint is highly appreciated :)

StrongBad asked Jul 28 '11 15:07



1 Answer

I recommend the JDOM FAQ:

http://www.jdom.org/docs/faq.html#a0350

How do I keep the DTD from loading? Even when I turn off validation the parser tries to load the DTD file.

Even when validation is turned off, an XML parser will by default load the external DTD file in order to parse the DTD for external entity declarations. Xerces has a feature to turn off this behavior named http://apache.org/xml/features/nonvalidating/load-external-dtd and if you know you're using Xerces you can set this feature on the builder.

builder.setFeature(
  "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
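
The same feature can also be tried on the standard DocumentBuilderFactory from the question. This is a minimal sketch, assuming the JDK's bundled parser is Xerces-derived and recognizes the feature URI; the class name SkipExternalDtdDemo and the unreachable DTD address are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the DTD URL deliberately points at a host that cannot resolve.
class SkipExternalDtdDemo {

    static final String XML =
        "<!DOCTYPE root SYSTEM \"http://example.invalid/missing.dtd\">\n"
      + "<root/>";

    // Parses the sample without fetching the external DTD and returns the root element name.
    static String parseRootName() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            // Assumption: the underlying parser supports this Xerces feature.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRootName());
    }
}
```

If the feature were ignored, the parser would try to fetch the DTD and the parse would fail with an I/O error, so a successful parse is a quick check that the setting took effect.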

If you're using another parser like Crimson, your best bet is to set up an EntityResolver that resolves the DTD without actually reading the separate file.

import org.xml.sax.*;
import java.io.*;

public class NoOpEntityResolver implements EntityResolver {
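  public InputSource resolveEntity(String publicId, String systemId) {
    // Hand the parser an empty source so that no external file is read.
    // StringReader replaces the deprecated StringBufferInputStream.
    return new InputSource(new StringReader(""));
  }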
}

Then in the builder...

builder.setEntityResolver(new NoOpEntityResolver());

There is a downside to this approach. Any external entities in the document will be resolved to the empty string and will effectively disappear. If your document has entities, you need to call setExpandEntities(false) and ensure the EntityResolver only suppresses the DocType.
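
A resolver that only suppresses the DocType could look roughly like the following. This is a sketch under stated assumptions, not the FAQ's own code: it uses the JDK's DocumentBuilder instead of JDOM's SAXBuilder so that it runs without extra libraries, and the ".dtd" suffix test, the class names, and the sample document are all hypothetical. Returning null from resolveEntity tells the parser to fall back to its default handling for everything else.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names and the sample document are illustrative only.
class DtdOnlyResolverDemo {

    // External DTD that cannot be fetched, plus one entity declared in the internal subset.
    static final String XML =
        "<!DOCTYPE root SYSTEM \"http://example.invalid/missing.dtd\" [\n"
      + "  <!ENTITY wiki 'http://example.org/wiki/'>\n"
      + "]>\n"
      + "<root about=\"&wiki;Main_Page\"/>";

    // Suppresses only resources that look like DTD files (assumed heuristic: ".dtd" suffix).
    static final class DtdOnlyResolver implements EntityResolver {
        final List<String> suppressed = new ArrayList<>();

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            if (systemId != null && systemId.endsWith(".dtd")) {
                suppressed.add(systemId);
                return new InputSource(new StringReader("")); // empty stand-in DTD
            }
            return null; // let the parser resolve any other external entity itself
        }
    }

    // Parses the sample with the given resolver and returns the "about" attribute.
    static String parseAttribute(DtdOnlyResolver resolver) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(resolver);
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getAttribute("about");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DtdOnlyResolver resolver = new DtdOnlyResolver();
        System.out.println("about = " + parseAttribute(resolver));
        System.out.println("suppressed = " + resolver.suppressed);
    }
}
```

Entities declared in the internal subset, like wiki here, never pass through the EntityResolver, so they keep their declared values while the external DTD is skipped.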

Job answered Sep 24 '22 20:09
