Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to ignore whitespace while reading a file to produce an XML DOM

I'm trying to read a file to produce a DOM Document, but the file has whitespace and newlines and I'm trying to ignore them, but I couldn't:

DocumentBuilderFactory docfactory=DocumentBuilderFactory.newInstance();
docfactory.setIgnoringElementContentWhitespace(true);

I see in Javadoc that setIgnoringElementContentWhitespace method operates only when the validating flag is enabled, but I haven't the DTD or XML Schema for the document.

What can I do?

Update

I don't like the idea of introduce mySelf < !ELEMENT... declarations and i have tried the solution proposed in the forum pointed by Tomalak, but it doesn't work, i have used java 1.6 in an linux environment. I think if no more is proposed i will make a few methods to ignore whitespace text nodes

like image 675
Telcontar Avatar asked Oct 23 '08 10:10

Telcontar


People also ask

Does XML ignore white space?

In XML documents, there are two types of whitespace: Significant whitespace is part of the document content and should be preserved. Insignificant whitespace is used when editing XML documents for readability. These whitespaces are typically not intended for inclusion in the delivery of the document.

How do you handle a space in XML?

XML ignores the first sequence of white space immediately after the opening tag and the last sequence of white space immediately before the closing tag. XML translates non-space characters (tab and new-line) into a space character and consolidates all multiple space characters into a single space.

What do you mean by whitespace in XML?

White space is used in XML for readability and has no business meaning. Input XML messages can include line breaks, blanks lines, and spaces between tags (all shown in the following example). If you process XML messages that contain any of these spaces, they are represented as elements in the message tree.


1 Answers

‘IgnoringElementContentWhitespace’ is not about removing all pure-whitespace text nodes, only whitespace nodes whose parents are described in the schema as having ELEMENT content — that is to say, they only contain other elements and never text.

If you don't have a schema (DTD or XSD) in use, element content defaults to MIXED, so this parameter will never have any effect. (Unless the parser provides a non-standard DOM extension to treat all unknown elements as containing ELEMENT content, which as far as I know the ones available for Java do not.)

You could hack the document on the way into the parser to include the schema information, for example by adding an internal subset to the < !DOCTYPE ... [...] > declaration containing < !ELEMENT ... > declarations, then use the IgnoringElementContentWhitespace parameter.

Or, possibly easier, you could just strip out the whitespace nodes, either in a post-process, or as they come in using an LSParserFilter.

like image 166
bobince Avatar answered Oct 06 '22 00:10

bobince