Remove HTML tags from a String

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\<.*?>", "")  

will work, but entities such as &amp; won't be converted correctly, and non-HTML text between two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).
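The two problems described above can be reproduced with the standard library alone. This is a minimal sketch; the class name `RegexStripDemo`, the method name `stripWithRegex`, and the sample strings are made up for illustration.

```java
public class RegexStripDemo {
    // The regex from the question: lazily match anything between '<' and '>'.
    static String stripWithRegex(String input) {
        return input.replaceAll("\\<.*?>", "");
    }

    public static void main(String[] args) {
        // Tags are removed, but the entity is left undecoded.
        System.out.println(stripWithRegex("<p>Tom &amp; Jerry</p>"));

        // Plain text that happens to sit between '<' and '>' is removed too.
        System.out.println(stripWithRegex("1 < 2 and 3 > 2"));
    }
}
```

The first call leaves `&amp;` in the output instead of `&`, and the second call deletes `< 2 and 3 >` even though it is not markup.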

Mason asked Oct 27 '08 16:10



1 Answer

Use an HTML parser instead of regex. This is dead simple with Jsoup.

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.
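As a rough sketch of the whitelist approach, the snippet below keeps only `<b>`, `<i>` and `<u>`. It assumes a recent jsoup release on the classpath, where the class is named `Safelist` (older releases call it `Whitelist`). The class name `JsoupCleanDemo`, the method name `keepBasicFormatting`, and the sample HTML are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupCleanDemo {
    // Start from an empty safelist and allow only <b>, <i> and <u>.
    // Other tags are dropped; the text they contain is kept.
    static String keepBasicFormatting(String html) {
        Safelist allowed = Safelist.none().addTags("b", "i", "u");
        return Jsoup.clean(html, allowed);
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>bold</b> and <a href=\"http://example.com\">a link</a></p>";
        System.out.println(keepBasicFormatting(html));
    }
}
```

Unlike `text()`, `Jsoup.clean` returns HTML rather than plain text, so the allowed tags survive in the result.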

See also:

  • RegEx match open tags except XHTML self-contained tags
  • What are the pros and cons of the leading Java HTML parsers?
  • XSS prevention in JSP/Servlet web application
BalusC answered Oct 11 '22 07:10