
Parsing XML with Jsoup

Tags: java, xml, jsoup

I get the following XML which represents a news article:

<content>
   Some text blalalala
   <h2>Small subtitle</h2>
   Some more text blbla
   <ul class="list">
      <li>List item 1</li>
      <li>List item 2</li>
   </ul>
   <br />
   Even more freakin text
</content>

I know the format isn't ideal but for now I have to take it.

The article should look like this:

  • Some text blalalala
  • Small subtitle
  • List with items
  • Even more freakin text

I parse this XML with Jsoup. I can get the text directly inside the <content> tag with doc.ownText(), but that returns one big String, so I have no idea where the other pieces (such as the subtitle) belong.

Would it be better to use an event-based parser for this (I hate them :( ), or is there a way to do something like doc.getTextUntilTagAppears("tagName")?

Edit: For clarification, I know how to get the elements under <content>. My problem is getting the text within <content>, broken up every time it's interrupted by an element.

I learned that I can get all the text within <content> with .textNodes(). That works great, but I still don't know where each text node belongs in my article (one goes at the top before the h2, the other at the bottom).

fweigl asked Dec 16 '22 08:12


1 Answer

Jsoup has a fantastic selector-based syntax. See the selector syntax section of the Jsoup documentation.

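If you want the subtitle, first parse the document. Note that Jsoup.parse(String) treats its argument as markup, not as a file path, so pass it the XML itself:

// xml holds the XML as a String; Parser is org.jsoup.parser.Parser
Document doc = Jsoup.parse(xml, "", Parser.xmlParser()); // get the document node

You know that the subtitle is in the h2 element: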

Element subtitle = doc.select("h2").first();  // first h2 element that appears

And if you would like to have the list:

Elements listItems = doc.select("ul.list > li");
for(Element item: listItems)
    System.out.println(item.text());  // print list's items one after another
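
The selectors above find the elements, but they do not say where the loose text between those elements belongs, which is what the question asks about. One way to keep that order is to walk the direct children of <content> with childNodes(), which returns text nodes and elements together in document order. The following is only a minimal sketch under some assumptions: the class name ArticleParts, the helper split, and the "text:" / "item:" labels are hypothetical names chosen for illustration, and the input is assumed to be available as a String.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

public class ArticleParts {

    // Hypothetical helper: returns the parts of <content> in document order.
    public static List<String> split(String xml) {
        // Use the XML parser so no <html>/<body> wrapper is added.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.selectFirst("content");
        List<String> parts = new ArrayList<>();
        if (content == null) {
            return parts;
        }
        // childNodes() mixes TextNode and Element instances in source order.
        for (Node node : content.childNodes()) {
            if (node instanceof TextNode) {
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {            // skip whitespace-only nodes
                    parts.add("text: " + text);
                }
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("ul")) {
                    for (Element li : el.select("li")) {
                        parts.add("item: " + li.text());
                    }
                } else if (!el.text().isEmpty()) { // drops empty tags such as <br />
                    parts.add(el.tagName() + ": " + el.text());
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<content>Some text<h2>Small subtitle</h2>More text"
                + "<ul class=\"list\"><li>List item 1</li><li>List item 2</li></ul>"
                + "<br />Even more text</content>";
        split(xml).forEach(System.out::println);
    }
}
```

With this input, the text before the h2, the subtitle, the text after it, the two list items, and the trailing text should come out as six separate entries in that order, so each piece of text can be placed relative to the elements around it.
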

zEro answered Dec 30 '22 05:12