Let's say I have an HTML fragment like this:
<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>
What I want to extract from it is:
foo bar foobar baz
So my question is: how can I strip all the wrapping tags from an HTML fragment and get only the text, in the same order as it appears in the HTML? As you can see in the title, I want to use jsoup for the parsing.
Example with accented HTML (note the 'á' character):
<p><strong>Tarthatatlan biztonsági viszonyok</strong></p> <p><strong>Tarthatatlan biztonsági viszonyok</strong></p>
What I want:
Tarthatatlan biztonsági viszonyok Tarthatatlan biztonsági viszonyok
This HTML is not static. In general, I just want all the text of an arbitrary HTML fragment in decoded, human-readable form, with line breaks.
With Jsoup:
final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
Document doc = Jsoup.parse(html);
System.out.println(doc.text());
Output:
foo bar foobar baz
If you want only the text of the p tag, use this instead of doc.text():
doc.select("p").text();
... or only body:
doc.body().text();
final String html = "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>"
        + "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>";
Document doc = Jsoup.parse(html);
for (Element element : doc.select("p")) {
    System.out.println(element.text());
    // e.g. you can use a StringBuilder and append lines here ...
}
Output:
Tarthatatlan biztonsági viszonyok
Tarthatatlan biztonsági viszonyok
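The StringBuilder idea mentioned in the loop comment could look roughly like the sketch below. It assumes jsoup is on the classpath; the class name ParagraphText and the method paragraphsToLines are made up for illustration, and treating each p element as one output line is only one possible reading of "with line breaks".

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not part of jsoup: joins the text of every <p> with newlines.
class ParagraphText {

    static String paragraphsToLines(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder sb = new StringBuilder();
        for (Element p : doc.select("p")) {
            if (sb.length() > 0) {
                sb.append('\n'); // separator between paragraphs, none after the last one
            }
            sb.append(p.text()); // text() drops the tags and normalizes whitespace
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>"
                + "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>";
        System.out.println(paragraphsToLines(html));
    }
}
```

Text that sits outside any p element is ignored by this sketch, so for fragments with other block-level tags the selector would need to be widened.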
Using regex:
String str = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
str = str.replaceAll("<[^>]*>", "");
System.out.println(str);
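Output (the whitespace that surrounded the tags is left in place, so the printed string contains extra spaces around the words):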
foo bar foobar baz
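If the leftover whitespace matters, the regex approach can be followed by a normalization step. The sketch below uses only the standard library; the class name TagStripper and the method stripTags are invented for this example, and the whitespace collapsing is an addition that is not part of the snippet above.

```java
import java.util.regex.Pattern;

// Hypothetical helper: regex-based tag removal plus whitespace cleanup.
class TagStripper {

    // Same pattern as in the answer above: anything between '<' and the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll("");
        // Collapse runs of whitespace to one space and drop leading/trailing blanks.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String str = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
        System.out.println(stripTags(str)); // prints: foo bar foobar baz
    }
}
```

This still does not decode entities such as &amp;amp; and can be confused by a '>' inside an attribute value or by script and comment content, so for arbitrary HTML the parser-based approach is the safer choice.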
Using Jsoup:
Document doc = Jsoup.parse(str);
String text = doc.text();