I would like to remove those tags with their content from source HTML.
When searching, you basically use Elements.select(selector), where selector is defined by the jsoup Selector API. However, comments are technically not elements, which can be confusing here. They are still nodes, identified by the node name #comment.
Let's see how that might work:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class RemoveComments {
    public static void main(String... args) {
        String h = "<html><head></head><body>" +
                "<div><!-- foo --><p>bar<!-- baz --></div><!--qux--></body></html>";
        Document doc = Jsoup.parse(h);
        removeComments(doc);
        doc.html(System.out);
    }

    // Recursively removes every comment node below the given node.
    private static void removeComments(Node node) {
        for (int i = 0; i < node.childNodeSize();) {
            Node child = node.childNode(i);
            if (child.nodeName().equals("#comment")) {
                // After removal the next sibling moves to index i, so i is not incremented.
                child.remove();
            } else {
                removeComments(child);
                i++;
            }
        }
    }
}
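The loop above deliberately leaves i unchanged after a removal. Here is a minimal sketch of that idiom using only the standard library; TinyNode is a hypothetical stand-in for a DOM node, not a jsoup class, so this illustrates the traversal only and not the jsoup API:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexLoopSketch {
    // Hypothetical stand-in for a DOM node: a name plus an ordered list of children.
    static final class TinyNode {
        final String name;
        final List<TinyNode> children = new ArrayList<>();

        TinyNode(String name, TinyNode... kids) {
            this.name = name;
            for (TinyNode kid : kids) {
                children.add(kid);
            }
        }
    }

    // Removes every descendant named "#comment" and returns how many were removed.
    static int removeComments(TinyNode node) {
        int removed = 0;
        for (int i = 0; i < node.children.size();) {
            TinyNode child = node.children.get(i);
            if (child.name.equals("#comment")) {
                // The next sibling shifts into slot i, so i stays where it is.
                node.children.remove(i);
                removed++;
            } else {
                removed += removeComments(child);
                i++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        TinyNode root = new TinyNode("div",
                new TinyNode("#comment"),
                new TinyNode("#comment"),
                new TinyNode("p", new TinyNode("#text"), new TinyNode("#comment")));
        int removed = removeComments(root);
        // prints "3 removed, 1 top-level child left"
        System.out.println(removed + " removed, " + root.children.size() + " top-level child left");
    }
}
```

If i were incremented after each removal, the second of two adjacent comments would be skipped, because it has already moved into the slot that was just vacated.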
With jsoup 1.11+ (possibly older versions as well), you can apply a filter:
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeFilter;

private void removeComments(Element article) {
    article.filter(new NodeFilter() {
        @Override
        public FilterResult tail(Node node, int depth) {
            if (node instanceof Comment) {
                return FilterResult.REMOVE;
            }
            return FilterResult.CONTINUE;
        }

        @Override
        public FilterResult head(Node node, int depth) {
            if (node instanceof Comment) {
                return FilterResult.REMOVE;
            }
            return FilterResult.CONTINUE;
        }
    });
}
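To try the filter approach end to end, here is a minimal sketch that wraps it in a runnable class. It assumes jsoup 1.11.1 or later is on the classpath; the class name FilterCommentsDemo and the helper stripComments are made up for illustration. This variant checks for comments only in head, on the assumption that a node removed there does not need a second check in tail:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeFilter;

public class FilterCommentsDemo {
    // Hypothetical helper: parses the HTML, drops all comment nodes,
    // and returns the inner HTML of the resulting body.
    static String stripComments(String html) {
        Document doc = Jsoup.parse(html);
        doc.filter(new NodeFilter() {
            @Override
            public FilterResult head(Node node, int depth) {
                // Remove comment nodes as soon as they are reached.
                return node instanceof Comment ? FilterResult.REMOVE : FilterResult.CONTINUE;
            }

            @Override
            public FilterResult tail(Node node, int depth) {
                return FilterResult.CONTINUE;
            }
        });
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div><!-- foo --><p>bar<!-- baz --></p></div><!--qux-->";
        System.out.println(stripComments(html));
    }
}
```

Filtering from the Document rather than from a single element also catches comments that sit outside the element you would otherwise pass in.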