Removing text enclosed between HTML tags using JSoup

When cleaning HTML, I sometimes want to retain the text enclosed between the tags (which is the default behaviour of jsoup), and sometimes want to remove the text along with the HTML tags. Can someone please shed some light on how I can remove the text enclosed between the HTML tags using jsoup?

Raghu asked Jul 18 '11 20:07



1 Answer

The Cleaner always drops disallowed tags but preserves their text. If you need to drop whole elements (i.e. the tags together with their text and nested elements), you can pre-parse the HTML, remove the elements using either remove() (which deletes the element itself) or empty() (which deletes only the element's children), then run the result through the cleaner.

For example:

String html = "Clean <div>Text dropped</div>";
Document doc = Jsoup.parse(html);
doc.select("div").remove();

// if not removed, the cleaner will drop the <div> but leave the inner text
String clean = Jsoup.clean(doc.body().html(), Whitelist.basic());

If you are using jsoup 1.14.1 or later, use Safelist instead of Whitelist: Whitelist was deprecated in 1.14.1 and removed in 1.15.1.

String clean = Jsoup.clean(doc.body().html(), Safelist.basic());
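
To make the difference between remove() and empty() concrete, here is a minimal sketch, assuming jsoup 1.14.1 or later is on the classpath. The class name DropElementsExample and the two helper methods are hypothetical names chosen for illustration; they are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

// Hypothetical helper class for illustration; not part of jsoup.
public class DropElementsExample {

    // Deletes every element matching cssQuery (tags, text and nested
    // elements), then runs the remaining HTML through the cleaner.
    public static String cleanAfterRemove(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        doc.select(cssQuery).remove();
        return Jsoup.clean(doc.body().html(), Safelist.basic());
    }

    // Keeps the matching elements but deletes all of their child nodes,
    // then runs the remaining HTML through the cleaner.
    public static String cleanAfterEmpty(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        doc.select(cssQuery).empty();
        return Jsoup.clean(doc.body().html(), Safelist.basic());
    }

    public static void main(String[] args) {
        String html = "Keep <div>drop this <b>and this</b></div>"
                + "<blockquote>quoted</blockquote>";

        // The <div> and everything inside it disappears.
        System.out.println(cleanAfterRemove(html, "div"));

        // The <blockquote> survives (it is in Safelist.basic()) but is empty;
        // the <div> tag is stripped by the cleaner while its text is kept.
        System.out.println(cleanAfterEmpty(html, "blockquote"));
    }
}
```

Use remove() when the element should vanish completely, and empty() when an allowed element should stay in the output without its contents. The exact whitespace of the cleaned output depends on jsoup's output settings, so it is not shown here.
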

Jonathan Hedley answered Oct 26 '22 14:10