Remove empty tag pairs from HTML fragment

I have a user-submitted string that contains HTML content such as

"<p></p><div></div><p>Hello<br/>world</p><p></p>"

I would like to transform this string so that empty tag pairs are removed (but self-closing tags like <br/> are retained). For example, the transformation should convert the string above to

"<p>Hello<br/>world</p>"

I'd like to use JSoup to do this, as I already have it on my classpath, and it would be easiest for me to perform this transformation on the server side.

Dónal asked Jan 03 '12 10:01



2 Answers

Here is an example that does just that (using JSoup):

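import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<p></p><div></div><p>Hello<br/>world</p><p></p>";
Document doc = Jsoup.parse(html);

// Remove every block-level element that has no text, in itself or in its descendants.
for (Element element : doc.select("*")) {
    if (!element.hasText() && element.isBlock()) {
        element.remove();
    }
}

// Print only the body contents, without the <html>/<body> wrapper that Jsoup.parse adds.
System.out.println(doc.body().html());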

The output of the code above is what you are looking for:

<p>Hello<br />world</p>
PrimosK answered Sep 17 '22 17:09



Not really familiar with jsoup, but you could do this with a simple regex replace:

String html = "<p></p><div></div><p>Hello<br/>world</p><p></p>";
html = html.replaceAll("<([^>]*)></\\1>", "");

Although with a full parser you could probably just drop empty content during processing, depending on what you're eventually going to do with it.
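If you do go the regex route, note that the one-liner above only matches a pair whose opening tag has no attributes and no whitespace between the tags, and a single pass leaves behind pairs that only become empty once their children are gone (e.g. <div><p></p></div>). Below is a rough sketch of one way to cover those cases using only java.util.regex. The class name EmptyTagStripper and the method stripEmptyPairs are made up for illustration, and the pattern assumes tag names are plain letters and digits with matching case and that attribute values contain no < or >. It is still a regex over HTML, so treat it as a stopgap for simple fragments rather than a replacement for a parser.

```java
import java.util.regex.Pattern;

final class EmptyTagStripper {

    // An opening tag (name plus optional attributes), optional whitespace,
    // then the closing tag with the same name. <br/> never matches because
    // it has no closing tag.
    private static final Pattern EMPTY_PAIR =
            Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\b[^<>]*>\\s*</\\1\\s*>");

    // Repeat the replacement until nothing changes, so that a pair which
    // becomes empty after its children are removed is stripped as well.
    static String stripEmptyPairs(String html) {
        String previous;
        String current = html;
        do {
            previous = current;
            current = EMPTY_PAIR.matcher(previous).replaceAll("");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        String html = "<p></p><div></div><p>Hello<br/>world</p><p></p>";
        System.out.println(stripEmptyPairs(html));
    }
}
```

Running main prints <p>Hello<br/>world</p> for the string from the question. Unlike a check based on text content, this leaves something like <p><img src="a.png"></p> alone, since the paragraph is not literally empty.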

Tom Elliott answered Sep 16 '22 17:09
