I have this HTML input:
<font size="5"><p>some text</p>
<p> another text</p></font>
I'd like to use regex to remove the HTML tags so that the output is:
some text
another text
Can anyone suggest how to do this with regex?
Since you asked, here's a quick and dirty solution:
String stripped = input.replaceAll("<[^>]*>", "");
(Ideone.com demo)
Using regexps to deal with HTML is a pretty bad idea though. The above hack won't deal with stuff like
<tag attribute=">">Hello</tag>
<script>if (a < b) alert('Hello>');</script>
etc.
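To make those failure cases concrete, here is a minimal sketch using only the standard library. The class name RegexStripDemo and the helper stripTags are made up for illustration; the regex is the same one as above.

```java
// Hypothetical demo class: shows where the naive tag-stripping regex breaks.
class RegexStripDemo {

    // Removes everything between '<' and the next '>' (same regex as above).
    static String stripTags(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup: works as expected.
        System.out.println(stripTags("<p>some text</p>"));
        // prints: some text

        // A '>' inside an attribute value ends the match too early.
        System.out.println(stripTags("<tag attribute=\">\">Hello</tag>"));
        // prints: ">Hello

        // A '<' inside script code is treated as the start of a tag.
        System.out.println(stripTags("<script>if (a < b) alert('Hello>');</script>"));
        // prints: if (a ');
    }
}
```

The last two outputs are leftovers from the markup and the script rather than clean text, which is why a real parser is the safer choice.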
A better approach would be to use, for instance, Jsoup. To remove all tags from a string, you can do Jsoup.parse(html).text().
Use an HTML parser. Here's a Jsoup example.
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);
Result:
some text another text
Or if you want to preserve newlines:
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
for (String line : input.split("\n")) {
    String stripped = Jsoup.parse(line).text();
    System.out.println(stripped);
}
Result:
some text
another text
Jsoup offers more advantages as well. You can easily extract specific parts of the HTML document using the select() method, which accepts jQuery-like CSS selectors. It only requires the document to be semantically well-formed. The presence of the <font> tag, deprecated since 1998, is not a good sign, but if you know the HTML structure in detail beforehand, it's still doable.
You can use an HTML parser called Jericho HTML Parser.
You can download it from here: http://jericho.htmlparser.net/docs/index.html
Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognized or invalid HTML. It also provides high-level HTML form manipulation functions.
The presence of badly formatted HTML does not interfere with the parsing.