
How to parse the full content of an XML tag in Java

I have a fairly complex XML data structure. It contains different fragments, as in the following example:

<data>
  <content-part-1>
   <h1>Hello <strong>World</strong>. This is some text.</h1>
   <h2>.....</h2>
  </content-part-1>
  ....
</data>

The h1 tag within the tag 'content-part-1' is of interest. I want to get the full content of the XML tag 'h1', including any nested tags.

In Java I used javax.xml.parsers.DocumentBuilder and tried something like this:

String my_content="<h1>Hello <strong>World</strong>. This is some text.</h1>"; 
// parse h1 tag..
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = documentBuilder.parse(new InputSource(new StringReader(my_content)));
Node node = doc.importNode(doc.getDocumentElement(), true);
if (node != null && node.getNodeName().equals("h1")) {
    return node.getTextContent();
}

But the method 'getTextContent()' will return:

Hello World. This is some text.

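The "strong" tag is missing from the result. This is the documented behavior: getTextContent() returns only the concatenated text of the node and its descendants, without any markup.

My question is: how can I extract the full content of a single XML node within an org.w3c.dom.Document, markup included, without parsing the node content any further?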

Ralph asked Nov 26 '25 04:11



1 Answer

Although the Java DOM parser can handle mixed content, in this particular case it may be more convenient to use the Jsoup library. With it, the code to extract the content of the h1 element looks like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String text = "<data>\n"
+ "  <content-part1>\n"
+ "   <h1>Hello <strong>World</strong>. This is some text.</h1>\n"
+ "   <h2></h2>\n"
+ "  </content-part1>\n"
+ "</data>";

Document doc = Jsoup.parse(text);

Elements h1Elements = doc.select("h1");

for (Element h1 : h1Elements) {
    System.out.println(h1.html());
}

The output in this case will be "Hello <strong>World</strong>. This is some text."
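
If adding a dependency is not an option, something similar can probably be done with the JDK alone, by serializing the children of the h1 node back to markup with an identity Transformer. This is only a minimal sketch, not a definitive implementation: the class name InnerXml and the helpers innerXml and innerXmlOfFirst are made up for illustration, and the serializer may normalize details such as attribute quoting or entity escaping, so the result is not guaranteed to match the original text byte for byte.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class InnerXml {

    // Hypothetical helper: writes every child of the given node (text and
    // elements alike) back out as markup and concatenates the results.
    static String innerXml(Node node) throws TransformerException {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            transformer.transform(new DOMSource(children.item(i)), new StreamResult(out));
        }
        return out.toString();
    }

    // Hypothetical convenience wrapper: parses the XML string and returns the
    // inner markup of the first element with the given tag name.
    static String innerXmlOfFirst(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node element = doc.getElementsByTagName(tagName).item(0);
            if (element == null) {
                throw new IllegalArgumentException("No <" + tagName + "> element found");
            }
            return innerXml(element);
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalStateException("Could not extract inner XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<data>\n"
                + "  <content-part-1>\n"
                + "   <h1>Hello <strong>World</strong>. This is some text.</h1>\n"
                + "   <h2></h2>\n"
                + "  </content-part-1>\n"
                + "</data>";
        System.out.println(innerXmlOfFirst(xml, "h1"));
    }
}
```

Unlike getTextContent(), this keeps the nested strong element, because each child node is serialized rather than reduced to its text.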

Daniil answered Nov 27 '25 18:11
