I'm trying to read a file into a DOM Document. The file contains whitespace and newlines between elements, and I want the parser to ignore them, but I can't get it to work:
DocumentBuilderFactory docfactory=DocumentBuilderFactory.newInstance();
docfactory.setIgnoringElementContentWhitespace(true);
The Javadoc says that the setIgnoringElementContentWhitespace method only takes effect when the validating flag is enabled, but I don't have a DTD or XML Schema for the document.
What can I do?
Update
I don't like the idea of adding <!ELEMENT ...> declarations myself, and I have tried the solution proposed in the forum pointed to by Tomalak, but it doesn't work. I am using Java 1.6 in a Linux environment. If nothing else is proposed, I think I will write a few methods to skip whitespace-only text nodes.
In XML documents, there are two types of whitespace: Significant whitespace is part of the document content and should be preserved. Insignificant whitespace is added for readability when editing XML documents, and is typically not meant to be part of the delivered content.
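An XML parser does not discard whitespace in element content by itself: the XML specification requires the processor to pass every character that is not markup through to the application, with only line endings normalized to a single newline. Trimming leading and trailing whitespace, turning tabs and newlines into spaces, and collapsing runs of spaces into one happen only in specific cases, such as attribute-value normalization for non-CDATA attribute types, or when a tool or schema type applies that normalization on top of the parser.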
Whitespace between tags is used in XML for readability and usually has no business meaning. Input XML can include line breaks, blank lines, and spaces between tags. If you parse XML that contains any of these, they are represented as nodes in the resulting tree (whitespace-only text nodes in a DOM).
‘IgnoringElementContentWhitespace’ is not about removing all pure-whitespace text nodes, only whitespace nodes whose parents are described in the schema as having ELEMENT content — that is to say, they only contain other elements and never text.
If you don't have a schema (DTD or XSD) in use, element content defaults to MIXED, so this parameter will never have any effect. (Unless the parser provides a non-standard DOM extension to treat all unknown elements as containing ELEMENT content, which as far as I know the ones available for Java do not.)
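To make this concrete, here is a minimal sketch of the behaviour described above. The class name NoDtdDemo, the helper countRootChildren, and the sample XML are made up for illustration. It parses a document that has no DTD with the flag switched on and counts the children of the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NoDtdDemo {
    // Hypothetical helper: parse the XML with the flag enabled and
    // return how many child nodes the root element ends up with.
    static int countRootChildren(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        // No DTD, so the parser cannot know that <a> has element-only content.
        // The whitespace before and after <b/> is kept as two text nodes.
        System.out.println(countRootChildren("<a>\n  <b/>\n</a>"));
    }
}
```

With no content model available, the two whitespace runs around the b element stay in the tree as text nodes, so the flag changes nothing.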
You could hack the document on the way into the parser to include the schema information, for example by adding an internal subset to the <!DOCTYPE ... [...]> declaration containing <!ELEMENT ...> declarations, then use the IgnoringElementContentWhitespace parameter.
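A rough sketch of that first option follows, assuming a made-up list/item vocabulary. The DOCTYPE text, the class name InternalSubsetDemo, and the parse helper are all hypothetical, and your real document would need declarations for every element it uses. It also assumes the input has no XML declaration or DOCTYPE of its own; if it has an XML declaration, the DOCTYPE has to be spliced in after it rather than prepended.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class InternalSubsetDemo {
    // Hypothetical internal subset: <list> has ELEMENT content, <item> holds text.
    static final String DOCTYPE =
          "<!DOCTYPE list [\n"
        + "  <!ELEMENT list (item*)>\n"
        + "  <!ELEMENT item (#PCDATA)>\n"
        + "]>\n";

    static Document parse(String xmlWithoutDoctype) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The Javadoc ties the whitespace flag to validating mode.
        factory.setValidating(true);
        factory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Prepend the DOCTYPE so the parser learns the content models.
        String withDoctype = DOCTYPE + xmlWithoutDoctype;
        return builder.parse(new InputSource(new StringReader(withDoctype)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<list>\n  <item>a</item>\n  <item>b</item>\n</list>");
        // Only the two <item> elements remain under <list>.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

The drawback is the one raised in the question: you have to write and maintain the declarations yourself, and with validation on, any element the subset does not declare is reported as a validity error.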
Or, possibly easier, you could just strip out the whitespace nodes, either in a post-process, or as they come in using an LSParserFilter.
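The post-process variant might look something like the sketch below. The class name WhitespaceStripper and both method names are invented for illustration. It walks the tree after parsing and removes every text node that consists only of whitespace, without needing a DTD or validation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class WhitespaceStripper {
    // Plain non-validating parse; whitespace text nodes are still present afterwards.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Recursively remove text nodes that contain nothing but whitespace.
    static void stripWhitespace(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the next sibling before a possible removal.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripWhitespace(child);
            }
            child = next;
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a>\n  <b> x </b>\n  <c/>\n</a>");
        stripWhitespace(doc.getDocumentElement());
        // <a> now has just <b> and <c/>; the text inside <b> is left alone.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Note that this is blunter than the parser flag: it also drops whitespace-only text in mixed content, for example the single space between two inline elements, so it is only safe when that whitespace really carries no meaning in your documents.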