
Regex to strip HTML tags

Tags: java, html, regex

I have this HTML input:

<font size="5"><p>some text</p>
<p> another text</p></font>

I'd like to use regex to remove the HTML tags so that the output is:

some text
another text

Can anyone suggest how to do this with regex?

ADIT asked Nov 02 '10 07:11


3 Answers

Since you asked, here's a quick and dirty solution:

String stripped = input.replaceAll("<[^>]*>", "");

(Ideone.com demo)
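
Applied to the input from the question, the one-liner leaves the leading space of the second paragraph in place, so the lines need trimming to get exactly the requested output. Here is a minimal sketch, assuming the input arrives as one string with \n line breaks. The class name StripTagsDemo and the helper stripTags are made up for illustration and are not part of any library:

```java
import java.util.ArrayList;
import java.util.List;

class StripTagsDemo {

    // Hypothetical helper: removes everything that looks like a tag,
    // then trims each line and drops lines that end up empty.
    // Only suitable for simple, trusted markup.
    public static String stripTags(String html) {
        String withoutTags = html.replaceAll("<[^>]*>", "");
        List<String> lines = new ArrayList<>();
        for (String line : withoutTags.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        String input = "<font size=\"5\"><p>some text</p>\n<p> another text</p></font>";
        // Prints "some text" and "another text" on separate lines.
        System.out.println(stripTags(input));
    }
}
```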

Using regular expressions to process HTML is a pretty bad idea, though. The hack above breaks on input such as

  • <tag attribute=">">Hello</tag>
  • <script>if (a < b) alert('Hello>');</script>

etc.
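
To see concretely what goes wrong, the two inputs can be run through the same pattern. This is only a sketch; the class name RegexLimitsDemo and the helper stripTags are invented for the demonstration:

```java
class RegexLimitsDemo {

    // The same naive pattern as in the one-liner above.
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The '>' inside the attribute value ends the first match too early,
        // so part of the attribute leaks into the result.
        String attr = "<tag attribute=\">\">Hello</tag>";
        System.out.println(stripTags(attr));   // prints: ">Hello

        // The '<' in the comparison is taken as the start of a tag,
        // and the match runs up to the '>' inside the string literal.
        String script = "<script>if (a < b) alert('Hello>');</script>";
        System.out.println(stripTags(script)); // prints: if (a ');
    }
}
```

In both cases the pattern has no notion of quoting or of element content, which is exactly what a real HTML parser keeps track of.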

A better approach is to use an HTML parser such as Jsoup. To remove all tags from a string, you can call Jsoup.parse(html).text().

aioobe answered Sep 19 '22 16:09


Use an HTML parser. Here's a Jsoup example.

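import org.jsoup.Jsoup;

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
// text() returns the combined, whitespace-normalized text of the parsed document
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);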

Result:

some text another text

Or if you want to preserve newlines:

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
// Parse each input line separately so that the original line breaks are kept
for (String line : input.split("\n")) {
    String stripped = Jsoup.parse(line).text();
    System.out.println(stripped);
}

Result:

some text
another text

Jsoup offers more advantages as well. You can easily extract specific parts of the HTML document using the select() method, which accepts jQuery-like CSS selectors. It only requires the document to be semantically well-formed. The presence of the <font> tag, deprecated since 1998, is not a very good sign, but if you know the HTML structure in detail beforehand, it is still doable.

See also:

  • Pros and cons of leading HTML parsers in Java

BalusC answered Sep 20 '22 16:09


You can use an HTML parser called Jericho HTML Parser.

You can download it from http://jericho.htmlparser.net/docs/index.html

Jericho HTML Parser is a Java library that allows analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognized or invalid HTML. It also provides high-level HTML form manipulation functions.

Badly formatted HTML does not interfere with the parsing.

Prabhakaran answered Sep 19 '22 16:09