I am looking for a regular expression to remove all HTML tags from a string in JSP.
Example 1
sampleString = "test string <i>in italics</i> continues";
Example 2
sampleString = "test string <i>in italics";
Example 3
sampleString = "test string <i";
The HTML tag might be complete, partial (without a closing tag), or malformed (missing the closing angle bracket of the opening tag, as in the 3rd example).
Thanks in advance
Case 3 cannot be handled reliably, neither with a regex nor with a parser: a lone "<i" might just as well be legitimate content rather than a broken tag. So forget it.
As to the concrete question, which covers cases 1 and 2: just use an HTML parser. My favourite is Jsoup.
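To illustrate why a regex is a poor fit here, below is a minimal sketch using a naive tag pattern. The class name NaiveTagStripper and the pattern <[^>]*> are only assumptions chosen for illustration, not something from the question. It copes with example 1, leaves the dangling "<i" of example 3 untouched, and silently eats legitimate text that merely happens to sit between angle brackets.

```java
import java.util.regex.Pattern;

public class NaiveTagStripper {

    // Hypothetical naive pattern: "<", then any run of non-">" characters, then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes everything that looks like a complete <...> tag.
    static String strip(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Example 1: complete tags are removed.
        System.out.println(strip("test string <i>in italics</i> continues"));
        // prints "test string in italics continues"

        // Example 3: no closing ">", so nothing matches and the input is unchanged.
        System.out.println(strip("test string <i"));
        // prints "test string <i"

        // Legitimate content between "<" and ">" is removed as if it were a tag.
        System.out.println(strip("if 1 <2 and 3> 2"));
        // prints "if 1  2"
    }
}
```

The last call shows the core problem: the text alone does not say whether "<" starts a tag or is just a less-than sign.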
String text = Jsoup.parse(html).text();
That's it. By the way, it also has an HTML cleaner, if that is what you're actually after.
Since you're using JSP, you could also just use JSTL <c:out> or fn:escapeXml() to prevent user-controlled HTML input from being inlined into your HTML (which would open XSS holes).
<c:out value="${bean.property}" />
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />
HTML tags will then not be interpreted, but just displayed as plain text.
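For use outside a JSP page, the same idea can be sketched in plain Java. This is only an illustration of the escaping approach, not the JSTL implementation: the class name XmlEscapeSketch is hypothetical, and the exact entities emitted for quotes by the real library may differ from the numeric ones assumed here.

```java
public class XmlEscapeSketch {

    // Replaces the five XML-special characters with entities so that a browser
    // displays them as text instead of interpreting them as markup.
    static String escapeXml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&#034;"); break; // assumed numeric entity
                case '\'': out.append("&#039;"); break; // assumed numeric entity
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("test string <i>in italics</i> continues"));
        // prints "test string &lt;i&gt;in italics&lt;/i&gt; continues"

        // The broken tag from example 3 is harmless once escaped.
        System.out.println(escapeXml("test string <i"));
        // prints "test string &lt;i"
    }
}
```

Escaping rather than stripping also sidesteps the ambiguity of example 3: nothing has to be guessed about whether "<i" was meant as a tag.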