jsoup - strip all formatting and link tags, keep text only

Tags:

java

html

jsoup

Let's say I have an HTML fragment like this:

<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p> 

What I want to extract from that is:

foo bar foobar baz 

So my question is: how can I strip all the wrapping tags from an HTML fragment and get only the text, in the same order as it appears in the HTML? As the title says, I want to use jsoup for the parsing.

Example with accented HTML (note the 'á' character):

<p><strong>Tarthatatlan biztonsági viszonyok</strong></p> <p><strong>Tarthatatlan biztonsági viszonyok</strong></p> 

What I want:

Tarthatatlan biztonsági viszonyok Tarthatatlan biztonsági viszonyok 

This HTML is not static. In general, I want all the text of an arbitrary HTML fragment in decoded, human-readable form, with line breaks.

WonderCsabo asked Oct 17 '12 21:10


2 Answers

With Jsoup:

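import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
Document doc = Jsoup.parse(html);

// text() returns the combined, whitespace-normalized text of the whole document
System.out.println(doc.text());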

Output:

foo bar foobar baz 

If you want only the text of the p tags, use this instead of doc.text():

doc.select("p").text(); 

... or only the body:

doc.body().text(); 

With line breaks:

final String html = "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>"
        + "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>";
Document doc = Jsoup.parse(html);

for (Element element : doc.select("p")) {
    System.out.println(element.text());
    // e.g. you can use a StringBuilder and append lines here ...
}

Output:

Tarthatatlan biztonsági viszonyok
Tarthatatlan biztonsági viszonyok
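
The StringBuilder idea mentioned in the loop comment could look roughly like the sketch below. It assumes jsoup is on the classpath; the class name ParagraphText and the method paragraphsToLines are hypothetical names chosen for illustration, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper: collects the text of every <p> element, one per line.
final class ParagraphText {

    static String paragraphsToLines(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder sb = new StringBuilder();
        for (Element p : doc.select("p")) {
            if (sb.length() > 0) {
                sb.append('\n'); // separate paragraphs with a line break
            }
            sb.append(p.text()); // text() strips nested tags and decodes entities
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>"
                + "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>";
        System.out.println(paragraphsToLines(html));
    }
}
```

This only picks up text inside p elements. For fragments that use other block elements, the selector would have to be widened accordingly.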

ollo answered Sep 20 '22 12:09


Using regex:

String str = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
str = str.replaceAll("<[^>]*>", "");
System.out.println(str);

Output:

  foo   bar  foobar  baz  
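
The regex output keeps the whitespace that surrounded the removed tags. A minimal sketch of one way to tidy it up, using only the standard library: replace each tag with a space, collapse runs of whitespace, and trim. The class name TagStripper and the method stripTags are hypothetical. Unlike jsoup, this does not decode entities such as &amp;amp; and can misfire on comments, scripts, or a '>' inside an attribute value, so treat it as a rough fallback rather than a parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper: regex-based tag stripping with whitespace cleanup.
final class TagStripper {

    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        // Replace each tag with a space so adjacent elements do not run together
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        // Collapse whitespace runs into single spaces and trim both ends
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String str = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
        System.out.println(stripTags(str)); // prints "foo bar foobar baz"
    }
}
```

Replacing tags with a space instead of an empty string is a design choice: it keeps "<p>a</p><p>b</p>" from collapsing into "ab", at the cost of also separating text across inline tags.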

Using jsoup:

Document doc = Jsoup.parse(str);
String text = doc.text();

Rohit Jain answered Sep 18 '22 12:09