
How to remove HTML tag in Java [duplicate]

Tags: java, html, regex

Is there a regular expression that can completely remove an HTML tag? By the way, I'm using Java.

freddiefujiwara asked Nov 09 '09 06:11


3 Answers

There is jsoup, a Java library made for HTML manipulation. Look at the Jsoup.clean() method and the Whitelist class (renamed Safelist in recent jsoup versions). It is an easy-to-use solution!

Simon answered Nov 12 '22 18:11


You should use an HTML parser instead. I like HtmlCleaner, because it gives me a pretty-printed version of the HTML.

With HtmlCleaner you can remove an element like this:

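import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode root = htmlCleaner.clean( stream );  // stream: an InputStream holding the HTML
// XPath attribute tests need the '@' prefix
Object[] found = root.evaluateXPath( "//div[@id='something']" );
if( found.length > 0 && found[0] instanceof TagNode ) {  // check the element, not the array
    ((TagNode)found[0]).removeFromTree();
}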
tangens answered Nov 12 '22 16:11


If you just need to remove tags then you can use this regular expression:

content = content.replaceAll("<[^>]+>", "");

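This removes only the tags themselves. Other HTML constructs, such as entities like &amp; and the text inside <script> or <style> elements, are left in place. For anything more complex you should use a parser.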
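To make that limitation concrete, here is a minimal sketch that wraps the regular expression above in a helper. The class name TagStripDemo, the method name stripTags, and the sample strings are made up for illustration and are not part of the original answer.

```java
class TagStripDemo {
    // Removes anything shaped like a tag: '<', one or more non-'>' characters, then '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        // The tags disappear, but the entity &amp; is not decoded.
        System.out.println(stripTags("<p>Fish &amp; <b>chips</b></p>")); // prints: Fish &amp; chips

        // Only the <script> tags are removed; the script text itself stays.
        System.out.println(stripTags("<script>var x = 1;</script>Hi")); // prints: var x = 1;Hi
    }
}
```

If the leftover entities or script text matter for your input, that is usually the point where a parser becomes the better tool.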

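EDIT: To avoid problems with HTML comments (which may themselves contain '>' characters), strip them first. The (?s) flag lets '.' match line breaks, so comments that span several lines are removed too:

content = content.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");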
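As a rough sketch of the comments-first approach, the version below precompiles both patterns and enables DOTALL with (?s), on the assumption that comments in your input may span several lines (by default '.' in a Java regex does not match line terminators). The class name CommentStripDemo, the method name, and the sample input are hypothetical.

```java
import java.util.regex.Pattern;

class CommentStripDemo {
    // (?s) enables DOTALL so '.' also matches line breaks inside a comment.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String stripCommentsAndTags(String html) {
        // Comments go first: a '>' inside a comment would otherwise end the tag match early.
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return TAG.matcher(withoutComments).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello<!-- a > b\n spans lines --> world</p>";
        System.out.println(stripCommentsAndTags(html)); // prints: Hello world
    }
}
```

Precompiling the patterns is only a convenience for repeated calls; chaining replaceAll on the string behaves the same way.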
Andrey Adamovich answered Nov 12 '22 16:11