I'd like to replace all the tag-looking parts in a String if those are not valid HTML tags.
A tag-looking part is anything enclosed in <> brackets, e.g. <[email protected]> or <hello>, but <br>, <div>, and so on have to be kept.
Do you have any idea how to achieve this?
Any help is appreciated!
cheers,
balázs
HTML tags can be removed from a string with the replaceAll() method of the String class and a regular expression that matches the tags. The result is the original text with the matched tags stripped out.
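For the question as asked (drop tag-looking parts but keep real tags), the regular-expression approach could be sketched roughly as follows. The class name TagFilter, the method stripUnknownTags, and the short allow-list are hypothetical names chosen for illustration, not part of any library. The pattern uses a negative lookahead so that anything in <> is removed unless it starts with an allowed tag name.

```java
import java.util.List;
import java.util.regex.Pattern;

final class TagFilter {
    // Hypothetical allow-list for illustration; extend it with the tags you want to keep.
    private static final List<String> ALLOWED = List.of("br", "div", "p", "span", "b", "i");

    // Matches "<...>" unless the text right after "<" (or "</") is an allowed
    // tag name followed by whitespace, "/" or ">".
    private static final Pattern UNKNOWN_TAG = Pattern.compile(
            "<(?!/?(?:" + String.join("|", ALLOWED) + ")(?=[\\s/>]))[^<>]*>",
            Pattern.CASE_INSENSITIVE);

    // Removes every tag-looking part whose name is not in the allow-list.
    static String stripUnknownTags(String input) {
        return UNKNOWN_TAG.matcher(input).replaceAll("");
    }
}
```

Calling TagFilter.stripUnknownTags("one <blabla> two <br>") would keep the <br> and drop <blabla>. A regex like this does not understand HTML: it would also remove plain text such as "a < b and c > d", so a real parser is the safer choice for untrusted input.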
jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
In PHP, the strip_tags() function strips HTML, XML, and PHP tags from a string. Note: HTML comments are always stripped.
You can use jsoup to clean HTML.
String cleaned = Jsoup.clean(html, Whitelist.relaxed());
You can either use one of the predefined Whitelists or create your own custom one, specifying which HTML elements the cleaner should let through. Everything else is removed. (In jsoup 1.14.1 and later the class is named Safelist; Whitelist was deprecated and later removed.)
Your specific example would be:
String html = "one two three <blabla> four <text> five <div class=\"bold\">six</div>";
String cleaned = Jsoup.clean(html, Whitelist.relaxed().addAttributes("div", "class"));
System.out.println(cleaned);
Output:
one two three four five
<div class="bold">
six
</div>