
Regular expression for removing HTML tags from a string

Tags: java, html, regex, jsp

I am looking for a regular expression to remove all HTML tags from a string in JSP.

Example 1

sampleString = "test string <i>in italics</i> continues";

Example 2

sampleString = "test string <i>in italics";

Example 3

sampleString = "test string <i";

The HTML tag might be complete (example 1), missing its closing tag (example 2), or itself malformed, with the closing angle bracket of the opening tag missing (example 3).

Thanks in advance

rahul asked Jun 18 '26 10:06


1 Answer

Case 3 cannot be handled reliably with either a regex or a parser. A dangling "<i" might just as well be legitimate content rather than a broken tag. So forget it.
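To illustrate why case 3 is ambiguous, here is a minimal sketch using only java.util.regex. The class NaiveTagStripper and both patterns are hypothetical, written for this illustration only, and are not a recommended way to clean HTML. The first pattern removes complete tags, which happens to cover examples 1 and 2. The second also removes an unterminated tag at the end of the input, and in doing so it destroys ordinary text that merely contains a less-than sign.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a robust HTML cleaner.
class NaiveTagStripper {
    // A complete tag such as <i> or </i>: '<', any run of non-'>' characters, then '>'.
    private static final Pattern COMPLETE_TAG = Pattern.compile("<[^>]*>");
    // An unterminated tag at the end of the input, such as "<i".
    private static final Pattern DANGLING_TAG = Pattern.compile("<[^>]*$");

    // Removes only complete tags; an unterminated "<i" is left untouched.
    static String stripCompleteTags(String input) {
        return COMPLETE_TAG.matcher(input).replaceAll("");
    }

    // Removes complete tags, then anything from a trailing '<' to the end of the input.
    static String stripDanglingTag(String input) {
        return DANGLING_TAG.matcher(stripCompleteTags(input)).replaceAll("");
    }

    public static void main(String[] args) {
        // Examples 1 and 2 from the question.
        System.out.println(stripCompleteTags("test string <i>in italics</i> continues"));
        System.out.println(stripCompleteTags("test string <i>in italics"));
        // Example 3: the dangling pattern removes "<i" ...
        System.out.println("[" + stripDanglingTag("test string <i") + "]");
        // ... but it also removes legitimate content after a plain less-than sign.
        System.out.println("[" + stripDanglingTag("price <10 dollars") + "]");
    }
}
```

The last call shows the problem: "price <10 dollars" is cut down to "price ", because the pattern has no way to tell a broken tag from a comparison sign.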

As to the concrete question, which covers cases 1 and 2, just use an HTML parser. My favourite is Jsoup.

String text = Jsoup.parse(html).text();

That's it. Jsoup also has an HTML cleaner, by the way, if that is what you're actually after.

Since you're using JSP, you could also just use JSTL <c:out> or fn:escapeXml() to prevent user-controlled HTML input from being inlined among your own HTML (which would open XSS holes).

<c:out value="${bean.property}" />
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />

HTML tags will then not be interpreted, but just displayed as plain text.
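To make the effect of escaping concrete, here is a rough sketch in plain Java of what such an escaping step does to a string. The class XmlEscapeSketch and its escapeXml method are hypothetical stand-ins written for this illustration. The numeric entities used for the quote characters are an assumption, and the exact output of the real JSTL function may differ.

```java
// Hypothetical stand-in for illustration; in a JSP, prefer <c:out> or fn:escapeXml().
class XmlEscapeSketch {
    // Replaces the five characters that are significant in HTML/XML with entities.
    static String escapeXml(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&#034;"); break; // assumed entity choice
                case '\'': out.append("&#039;"); break; // assumed entity choice
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The browser would render the result as literal text, not as italics.
        System.out.println(escapeXml("test string <i>in italics</i> continues"));
    }
}
```

Note that this approach leaves the tags in the string and only neutralises them, so even the partial "<i" of example 3 is displayed safely as text.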

BalusC answered Jun 19 '26 22:06


