
How to extract text ONLY from html file using jsoup

Tags: html, jsoup

I have used this code:

String innerHtml = Jsoup.parse(htmlCode,"ISO-8859-1").select("body").html();

But this only removes the outer <html> tags. Any HTML tags inside the body still appear in the result.

Adham asked Mar 15 '13 16:03


2 Answers

Use .text() instead of .html() to get the combined text of the element and all of its children.
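As a minimal sketch of the difference, assuming jsoup is on the classpath: the class name TextOnlyExample, the helper extractText, and the sample markup are made up for illustration. Note that in jsoup's String overload of Jsoup.parse, the second argument is a base URI rather than a charset, so the sketch leaves it out.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example class, not part of the original question.
class TextOnlyExample {

    // Returns only the visible text of the <body>, with all tags stripped.
    static String extractText(String htmlCode) {
        Document doc = Jsoup.parse(htmlCode);
        return doc.body().text();
    }

    public static void main(String[] args) {
        String htmlCode =
            "<html><head><title>Demo</title></head>"
            + "<body><p>Hello <b>world</b></p></body></html>";

        // html() keeps the markup inside <body>; text() drops it.
        System.out.println(Jsoup.parse(htmlCode).body().html());
        System.out.println(extractText(htmlCode)); // prints "Hello world"
    }
}
```

Calling .text() on the result of select("body") should behave the same way here, since it combines the text of the matched elements.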

Matt Cain answered Oct 13 '22 00:10


Try using .text() instead of .html():

Jsoup.parse(htmlCode, "ISO-8859-1").select("body").text();

James Donnelly answered Oct 13 '22 01:10