Java replace all non-HTML Tags in a String

I'd like to replace all the tag-looking parts of a String that are not valid HTML tags. A tag-looking part is anything enclosed in <> brackets, e.g. <[email protected]> or <hello>, whereas real tags such as <br>, <div>, and so on have to be kept.

Do you have any idea how to achieve this?

Any help is appreciated!

cheers,

balázs

Balázs Németh asked Jan 14 '11 13:01



1 Answer

You can use jsoup to clean the HTML:

String cleaned = Jsoup.clean(html, Whitelist.relaxed());

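You can either use one of the predefined Whitelists or create a custom one that specifies which HTML elements the cleaner lets through. Tags that are not on the list are removed, while ordinary text is kept. (In newer jsoup releases, 1.14.2 and later as far as I know, the class is called Safelist; Whitelist is the older name.)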


Your specific example would be:

String html = "one two three <blabla> four <text> five <div class=\"bold\">six</div>";
String cleaned = Jsoup.clean(html, Whitelist.relaxed().addAttributes("div", "class"));
System.out.println(cleaned);

Output:

one two three  four  five 
<div class="bold">
 six
</div>
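
If pulling in a library is not an option, here is a rough plain-JDK sketch of the same idea. It is not part of the jsoup approach above, only an illustration. The class name TagFilter, the method stripUnknownTags, and the short allow-list are all hypothetical, so extend the list to whatever you consider a valid tag. It treats anything in <...> as a candidate, keeps it only if it starts with an allowed tag name, and otherwise swaps in the replacement text. It does not cope with attribute values that contain ">" and should not be relied on as an HTML sanitizer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFilter {
    // Hypothetical allow-list (an assumption for this sketch, not a complete list of HTML tags).
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("br", "div", "p", "span", "b", "i", "a"));

    // Anything enclosed in <...> that contains no further angle brackets.
    private static final Pattern TAG_LIKE = Pattern.compile("<([^<>]*)>");

    // Optional "/", then a tag name that is followed by whitespace, "/" or the end of the content.
    private static final Pattern TAG_NAME =
            Pattern.compile("^/?([A-Za-z][A-Za-z0-9]*)(?=[\\s/]|$)");

    // Replaces every tag-looking part whose name is not in ALLOWED with the given replacement.
    public static String stripUnknownTags(String input, String replacement) {
        Matcher m = TAG_LIKE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Matcher name = TAG_NAME.matcher(m.group(1));
            boolean keep = name.find()
                    && ALLOWED.contains(name.group(1).toLowerCase(Locale.ROOT));
            // quoteReplacement stops "$" and "\" in the text from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(keep ? m.group() : replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "one two three <blabla> four <text> five <div class=\"bold\">six</div>";
        System.out.println(stripUnknownTags(html, ""));
    }
}
```

Unlike jsoup, this leaves the kept tags exactly as they were written (no re-indentation, no attribute filtering), so it only fits when the input is already trusted.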
dogbane answered Oct 06 '22 09:10