
Removing duplicated newlines/tabs/whitespaces in XML character element

<node> test
    test
    test
</node>

I want my XML parser to read the character data in <node> and:

  1. Replace newlines and tabs with spaces and collapse multiple spaces into one. As a result, the text should look like "test test test".
  2. If the node contains XML-encoded characters, that is tabs (&#x9;), newlines (&#xA;) or spaces (&#x20;), leave them as they are.

I'm trying the code below, but it preserves the duplicated whitespace.

  DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
  dbf.setIgnoringComments( true );
  dbf.setNamespaceAware( namespaceAware );
  DocumentBuilder db = dbf.newDocumentBuilder();
  Document doc = db.parse( inputStream );

Is there any way to do what I want?

Thanks!

asked Apr 18 '14 at 15:04 by Dzmitry Bahdanovich


1 Answer

The first part - replacing multiple white-space - is relatively easy though I don't think the parser will do it for you:

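import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import org.xml.sax.InputSource;

InputSource stream = new InputSource(inputStream);
XPath xpath = XPathFactory.newInstance().newXPath();
Document doc = (Document) xpath.evaluate("/", stream, XPathConstants.NODE);

NodeList nodes = (NodeList) xpath.evaluate("//text()", doc,
    XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
  Text text = (Text) nodes.item(i);
  // \s+ (rather than \s{2,}) also turns a single tab or newline into a space
  text.setTextContent(text.getTextContent().replaceAll("\\s+", " "));
}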

// check results
TransformerFactory.newInstance()
    .newTransformer()
    .transform(new DOMSource(doc), new StreamResult(System.out));
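
One detail worth checking in the loop above is the quantifier in the regular expression. A pattern that only matches runs of two or more whitespace characters leaves a lone tab or newline untouched, whereas `\\s+` replaces every run, including a single character. A small sketch of the difference (the class and method names are made up for illustration):

```java
public class WhitespaceRegexDemo {

  // Replaces only runs of two or more whitespace characters with one space.
  static String collapseRunsOfTwoOrMore(String s) {
    return s.replaceAll("\\s{2,}", " ");
  }

  // Replaces every whitespace run, including a single tab or newline.
  static String collapseAll(String s) {
    return s.replaceAll("\\s+", " ");
  }

  public static void main(String[] args) {
    String sample = "a\tb\n    c";
    // The lone tab between "a" and "b" survives the {2,} pattern.
    System.out.println(collapseRunsOfTwoOrMore(sample).equals("a\tb c")); // prints true
    // With + every run becomes exactly one space.
    System.out.println(collapseAll(sample).equals("a b c")); // prints true
  }
}
```

Neither variant trims the result, so a leading or trailing space remains if the text node starts or ends with whitespace; call `trim()` as well if that matters.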

This is the hard part:

If the node contains XML encoded characters: tabs (&#x9;), newlines (&#xA;) or whitespaces (&#x20;) - they should be left.

The parser will always turn "&#x9;" into "\t" before your code sees the text, so a literal tab and an encoded one become indistinguishable - you may need to write your own XML parser.

According to the author of Saxon:

I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.
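
To see this expansion with the standard JAXP DOM parser, the following sketch parses one element containing a literal tab and another containing `&#x9;`, then compares the resulting text. The class name `CharRefDemo` and the helper `textOf` are hypothetical, introduced only for this illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CharRefDemo {

  // Hypothetical helper: parses the XML string and returns the text
  // content of the document element.
  static String textOf(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement().getTextContent();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String literal = textOf("<node>a\tb</node>");     // real tab character
    String encoded = textOf("<node>a&#x9;b</node>");  // character reference
    // Both arrive in the DOM as the same three characters: 'a', tab, 'b'.
    System.out.println(literal.equals(encoded)); // prints true
  }
}
```

Because the two inputs produce identical text nodes, any whitespace collapsing applied after parsing cannot spare the encoded characters.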

answered Oct 04 '22 at 04:10 by McDowell