I get the following XML which represents a news article:
<content>
Some text blalalala
<h2>Small subtitle</h2>
Some more text blbla
<ul class="list">
<li>List item 1</li>
<li>List item 2</li>
</ul>
<br />
Even more freakin text
</content>
I know the format isn't ideal but for now I have to take it.
The Article should look like:
I parse this XML with Jsoup. I can get the text within the <content> tag with doc.ownText(), but then I have no idea where the other stuff (the subtitle) is placed; I only get one big String.
Would it be better to use an event-based parser for this (I hate them :() or is there a way to do something like doc.getTextUntilTagAppears("tagName")?
Edit: For clarification, I know how to get the elements under <content>; my problem is with getting the text within <content>, broken up every time it's interrupted by an element.
I learned that I can get all the text within content with .textNodes(). That works great, but then again I don't know which text node belongs where in my article (one at the top before the h2, the other one at the bottom).
Jsoup has a fantastic selector-based syntax. See the selector syntax documentation on the Jsoup site.
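If you want the subtitle, first parse the document. Note that Jsoup.parse(String) treats its argument as markup, not as a path, so pass a File when reading from disk:
Document doc = Jsoup.parse(new File("path-to-your-xml"), "UTF-8"); // parse the file into a Document (needs java.io.File, may throw IOException)
You know that the subtitle is in the h2 element:
Element subtitle = doc.select("h2").first(); // first h2 element that appears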
And if you'd like to have the list:
Elements listItems = doc.select("ul.list > li");
for (Element item : listItems) {
    System.out.println(item.text()); // print the list's items one after another
}
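The selectors above find the elements, but they don't tell you where the loose text sits relative to them. One way to keep that order is to walk the direct children of <content> with childNodes(), which returns text nodes and elements together in document order. The following is only a minimal sketch under some assumptions: the class name ArticleWalker, the method walk, and the "text:" / "item:" labels are made up for illustration, the XML is passed in as a String, and Jsoup's XML parser is used so the markup isn't wrapped in html/body.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

// Hypothetical helper, not part of Jsoup: lists the pieces of <content> in document order.
public class ArticleWalker {

    static List<String> walk(String xml) {
        // Parse as XML so the structure stays exactly as written.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.select("content").first();

        List<String> parts = new ArrayList<>();
        // childNodes() mixes TextNode and Element instances in source order.
        for (Node child : content.childNodes()) {
            if (child instanceof TextNode) {
                String text = ((TextNode) child).text().trim();
                if (!text.isEmpty()) {            // skip whitespace-only nodes between tags
                    parts.add("text: " + text);
                }
            } else if (child instanceof Element) {
                Element el = (Element) child;
                if (el.tagName().equals("ul")) {
                    for (Element li : el.children()) {
                        parts.add("item: " + li.text());
                    }
                } else if (!el.tagName().equals("br")) {
                    parts.add(el.tagName() + ": " + el.text());
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<content>\nSome text blalalala\n<h2>Small subtitle</h2>\n"
                + "Some more text blbla\n<ul class=\"list\">\n<li>List item 1</li>\n"
                + "<li>List item 2</li>\n</ul>\n<br />\nEven more freakin text\n</content>";
        for (String part : walk(xml)) {
            System.out.println(part);
        }
    }
}
```

Each entry in the returned list corresponds to one block of the article, so the text before the h2, the subtitle, the text after it, the list items, and the closing text come out in the same order as in the XML. From there you can map each piece to whatever view or model your article uses.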