I'm parsing (a lot of) XML files that contain entity references which i dont know in advance (can't change that fact).
For example:
xml = "<tag>I'm content with &funny; &entity; &references;.</tag>"
when i try to parse this using the following code:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
final InputSource is = new InputSource(new StringReader(xml));
final Document d = db.parse(is);
i get the following exception:
org.xml.sax.SAXParseException: The entity "funny" was referenced, but not declared.
but, what i do want to achieve is, that the parser replaces every entity that is not declared (unknown to the parser) with an empty String ''. Or even better, is there a way to pass a map to the parser like:
Map<String,String> entityMapping = ...
entityMapping.put("funny","very");
entityMapping.put("entity","important");
entityMapping.put("references","stuff");
so that i could do the following:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
final InputSource is = new InputSource(new StringReader(xml));
db.setEntityResolver(entityMapping);
final Document d = db.parse(is);
if i would obtain the text from the document using this example code i should receive:
I'm content with very important stuff.
Any suggestions? Of course, i already would be happy to just replace the unknown entity's with empty strings.
Thanks,
The StAX API has support for this. Have a look at XMLInputFactory, it has a runtime property which dictates whether or not internal entities are expanded, or left in place. If set to false
, then the StAX event stream will contain instances of EntityReference
to represent the unexpanded entities.
If you still want a DOM as the end result, you can chain it together like this:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
String xml = "my xml";
StringReader xmlReader = new StringReader(xml);
XMLEventReader eventReader = inputFactory.createXMLEventReader(xmlReader);
StAXSource source = new StAXSource(eventReader);
DOMResult result = new DOMResult();
transformer.transform(source, result);
Node document = result.getNode();
In this case, the resulting DOM will contain nodes of org.w3c.dom.EntityReference
mixed in with the text nodes. You can then process these as you see fit.
Since your XML input seems to be available as a String, could you not do a simple pre-processing with regular expression replacement?
xml = "...";
/* replace entities before parsing */
for (Map.Entry<String,String> entry : entityMapping.entrySet()) {
xml = xml.replaceAll("&" + entry.getKey() + ";", entry.getValue());
}
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
...
It's quite hacky, and you may want to spend some extra effort to ensure that the regexps only match where they really should (think <entity name="&don't-match-me;"/>
), but at least it's something...
Of course, there are more efficient ways to achieve the same effect than calling replaceAll()
a lot of times.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With