Parsing XML with Jsoup

Question

I get the following XML which represents a news article:

<content>
   Some text blalalala
   <h2>Small subtitle</h2>
   Some more text blbla
   <ul class="list">
      <li>List item 1</li>
      <li>List item 2</li>
   </ul>
   <br />
   Even more freakin text
</content>

I know the format isn't ideal but for now I have to take it.

The article should look like this:

Some text blalalala
Small subtitle
List with items
Even more freakin text

I parse this XML with Jsoup. I can get the text directly inside the <content> tag with doc.ownText(), but that gives me only one big String, so I have no idea where the other stuff (such as the subtitle) belongs.

Would it be better to use an event-based parser for this (I hate them :(), or is there a way to do something like doc.getTextUntilTagAppears("tagName")?

Edit: For clarification, I know how to get the elements under <content>. My problem is getting the text within <content>, broken up every time it's interrupted by an element.

I learned that I can get all the text within <content> with .textNodes(). That works great, but I still don't know where each text node belongs in my article (one at the top before the h2, another at the bottom).

zEro · Accepted Answer

Jsoup has a fantastic selector-based syntax (see the Jsoup selector documentation).

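If you want the subtitle, parse the XML first:

// Jsoup.parse(String) expects the markup itself, not a path. Here xml is assumed to be a String holding the XML text.
Document doc = Jsoup.parse(xml, "", Parser.xmlParser()); // get the document node, using the XML parser (org.jsoup.parser.Parser)

You know that the subtitle is in the h2 element:

Element subtitle = doc.select("h2").first();  // first h2 element that appears, or null if there is none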
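Once the subtitle element is in hand, the text around it can be reached through sibling navigation, which is roughly the "text until the tag appears" idea from the question. The following is only a sketch: the class name SubtitleNeighbours and its helper methods are made up for illustration, and it assumes the text sits directly before and after the h2 as in the sample XML.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

// Hypothetical helper class, not part of Jsoup.
final class SubtitleNeighbours {

    // Trimmed text of a node if it is a text node, otherwise an empty string.
    static String textOf(Node node) {
        return (node instanceof TextNode) ? ((TextNode) node).text().trim() : "";
    }

    // Returns { text before the h2, subtitle text, text after the h2 }.
    static String[] aroundSubtitle(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element subtitle = doc.selectFirst("h2");
        if (subtitle == null) {
            return new String[] {"", "", ""};
        }
        return new String[] {
            textOf(subtitle.previousSibling()), // node directly before <h2>
            subtitle.text(),
            textOf(subtitle.nextSibling())      // node directly after <h2>
        };
    }

    public static void main(String[] args) {
        String xml = "<content>\n"
                + "   Some text blalalala\n"
                + "   <h2>Small subtitle</h2>\n"
                + "   Some more text blbla\n"
                + "</content>";
        for (String part : aroundSubtitle(xml)) {
            System.out.println(part);
        }
    }
}
```

If the neighbouring node is another element rather than text, this sketch returns an empty string for that slot instead of looking further.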

And if you would like the list items:

Elements listItems = doc.select("ul.list > li");
for(Element item: listItems)
    System.out.println(item.text());  // print list's items one after another
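The selectors above find the elements, but they do not say where the loose text inside <content> sits relative to them. One way to keep that order, sketched here under the assumption that the article has the flat structure shown in the question, is to walk content.childNodes() and check whether each node is a TextNode or an Element. The class name ArticleWalker and the "text:", "subtitle:" and "item:" labels are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Jsoup.
final class ArticleWalker {

    // Returns the pieces of the article in document order, each with a label.
    static List<String> walk(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.selectFirst("content");
        List<String> parts = new ArrayList<>();
        if (content == null) {
            return parts;
        }
        for (Node node : content.childNodes()) {
            if (node instanceof TextNode) {
                // Loose text directly inside <content>; skip whitespace-only nodes.
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    parts.add("text: " + text);
                }
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("h2")) {
                    parts.add("subtitle: " + el.text());
                } else if (el.tagName().equals("ul")) {
                    for (Element li : el.children()) {
                        parts.add("item: " + li.text());
                    }
                }
                // Any other element (for example <br />) is ignored in this sketch.
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<content>\n"
                + "   Some text blalalala\n"
                + "   <h2>Small subtitle</h2>\n"
                + "   Some more text blbla\n"
                + "   <ul class=\"list\">\n"
                + "      <li>List item 1</li>\n"
                + "      <li>List item 2</li>\n"
                + "   </ul>\n"
                + "   <br />\n"
                + "   Even more freakin text\n"
                + "</content>";
        for (String part : walk(xml)) {
            System.out.println(part);
        }
    }
}
```

Because the nodes come back in document order, each text node lands between the right elements, so no event-based parser is needed for this case.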

Tags: java, xml, jsoup

fweigl
