How to Parse Only Text from HTML

Tags: java, jsoup

How can I parse only the text from a web page using jsoup in Java?

Jsevin asked Aug 17 '10 22:08



1 Answer

From the jsoup cookbook: http://jsoup.org/cookbook/extracting-data/attributes-text-html

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link."
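
Since the question asks about a whole web page rather than a fragment, here is a minimal sketch of the same idea wrapped in a runnable class. The class name `TextOnly` and the helper `extractText` are hypothetical names chosen for illustration, and the sample HTML is made up; only `Jsoup.parse`, `Jsoup.connect(...).get()`, `Document.body()` and `Element.text()` are jsoup calls. It assumes the jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TextOnly {

    // Hypothetical helper: parses an HTML string and returns the text of
    // its <body>, with tags stripped and whitespace normalized.
    static String extractText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Made-up sample page; the <title> and <script> sit in <head>,
        // so they are not part of body().text().
        String html = "<html><head><title>Demo</title>"
                + "<script>var x = 1;</script></head>"
                + "<body><h1>Heading</h1>"
                + "<p>First <b>bold</b> paragraph.</p></body></html>";

        System.out.println(extractText(html));

        // For a live page, fetch it first (needs network access and
        // handling of java.io.IOException):
        // Document page = Jsoup.connect("http://example.com/").get();
        // System.out.println(page.body().text());
    }
}
```

Calling `doc.text()` instead of `doc.body().text()` would also include the page title, so which one fits depends on whether the title counts as "text" for your purpose.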
Ryan Berger answered Oct 09 '22 01:10