I am trying to parse an HTML string using jsoup:
<div class="test">
<br>From: <b class="sendername">Divya</b>
<span dir="ltr"><<a href="mailto:[email protected]" target="_blank">[email protected]</a>></span>
<br>Date: Wed, May 27, 2015 at 11:10 AM
<br>Subject: Plan for the day 27/05/2015
<br>To: Abhishek<<a href="mailto:[email protected]" target="_blank">abhishek.sharma@abc.<wbr>com</a>>,
<a href="mailto:[email protected]" target="_blank">[email protected]</a>>
<br>Cc: Ram <<a href="mailto:[email protected]" target="_blank">[email protected]</a>>
<br>
<br>
<br>
<div dir="ltr">Hi,</div>
</div>
Document doc = Jsoup.parse( mailBody.getBodyHtml().get( 0 ) );
Elements elem = doc.getElementsByClass( "test" );
int totalElements = 0;
Elements childElements = elem.get( 0 ).children(); // method name missing in the original post; children() is assumed
int brCount = 0;
for( Element childElement : childElements )
{
    totalElements++;
    if( childElement.tagName().equalsIgnoreCase( "br" ) )
    {
        brCount++;
        if( brCount == 3 )
            break;
    }
    else
        brCount = 0;
}
for( int i = 1; i <= totalElements; i++ )
{
    childElements.get( i ).remove();
}
I want to remove all content up to and including three consecutive br tags, where "consecutive" means there is no text node between them.
In the case above, everything before the greeting (HTML tags and text nodes) should be removed, and the output should be:
<div class="test">
<div dir="ltr">Hi,</div>
</div>
The structure of the HTML seems to be constant, so you can try the following CSS selector:
div.test br + br + br + div
http://try.jsoup.org/~DiBi9Q_Ye88gi6Hq29Z44ar6xus
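One caveat about this selector: the `+` combinator compares element siblings only, so text nodes sitting between the `br` tags do not prevent a match. The sketch below illustrates that with two made-up fragments; the class name `SiblingSelectorDemo` and the helper `countMatches` are hypothetical names introduced for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SiblingSelectorDemo {

    // Counts how many elements the answer's selector matches in an HTML fragment.
    static int countMatches(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        return doc.select("div.test br + br + br + div").size();
    }

    public static void main(String[] args) {
        // Three <br> tags with nothing between them.
        String bare = "<div class=\"test\"><br><br><br><div>Hi,</div></div>";
        // Three <br> tags separated by text nodes.
        String withText = "<div class=\"test\"><br>a<br>b<br>c<div>Hi,</div></div>";

        // The selector is expected to match the inner div in both fragments,
        // because "+" skips over text nodes when looking for the previous sibling.
        System.out.println(countMatches(bare));
        System.out.println(countMatches(withText));
    }
}
```

For the mail in the question this does not cause a false match, because the header lines are followed by `a` or `b` elements rather than a `div`. It does mean the selector alone cannot enforce the "no text node between them" requirement.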
String html = "<div class=\"test\">\n <br>From: <b class=\"sendername\">Divya</b> \n <span dir=\"ltr\"><<a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>></span>\n <br>Date: Wed, May 27, 2015 at 11:10 AM\n <br>Subject: Plan for the day 27/05/2015\n <br>To: Abhishek<<a href=\"mailto:[email protected]\" target=\"_blank\">abhishek.sharma@abc.<wbr>com</a>>, \n <a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>>\n <br>Cc: Ram <<a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>>\n <br>\n <br>\n <br>\n <div dir=\"ltr\">Hi,</div>\n </div>";
Document doc = Jsoup.parse(html);
Element mailBody = doc.select("div.test br + br + br + div").first();
if (mailBody == null) {
throw new RuntimeException("Unable to locate mail body.");
}
System.out.println("** BEFORE:\n" + doc);
Document tmp = Jsoup.parseBodyFragment("<div class='test'>" + mailBody.outerHtml() + "</div>");
mailBody.parent().replaceWith(tmp.select("div.test").first());
System.out.println("\n** AFTER:\n" + doc);
** BEFORE:
<html>
<head></head>
<body>
<div class="test">
<br>From:
<b class="sendername">Divya</b>
<span dir="ltr"><<a href="mailto:[email protected]" target="_blank">[email protected]</a>></span>
<br>Date: Wed, May 27, 2015 at 11:10 AM
<br>Subject: Plan for the day 27/05/2015
<br>To: Abhishek<
<a href="mailto:[email protected]" target="_blank">abhishek.sharma@abc.<wbr>com</a>>,
<a href="mailto:[email protected]" target="_blank">[email protected]</a>>
<br>Cc: Ram <
<a href="mailto:[email protected]" target="_blank">[email protected]</a>>
<br>
<br>
<br>
<div dir="ltr">
Hi,
</div>
</div>
</body>
</html>
** AFTER:
<html>
<head></head>
<body>
<div class="test">
<div dir="ltr">
Hi,
</div>
</div>
</body>
</html>
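If the requirement that no text appears between the three `br` tags has to be checked explicitly, one alternative is to walk the child nodes (elements and text nodes alike) and remove them in place. This is only a minimal sketch, not part of the original answer: the class `StripQuotedHeader` and its methods are hypothetical names, and it assumes that whitespace-only text nodes between the `br` tags are acceptable, since the pretty-printed markup contains newlines there.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class StripQuotedHeader {

    // True if the node is a <br> element.
    static boolean isBr(Node n) {
        return n instanceof Element && ((Element) n).tagName().equals("br");
    }

    // True if the node is a text node containing only whitespace.
    static boolean isBlankText(Node n) {
        return n instanceof TextNode && ((TextNode) n).isBlank();
    }

    /**
     * Removes every child node of the container up to and including the first run
     * of three <br> elements that are separated by nothing but whitespace.
     * Returns false and leaves the container untouched if no such run exists.
     */
    static boolean stripHeader(Element container) {
        // Copy the list because nodes are removed from the container below.
        List<Node> children = new ArrayList<>(container.childNodes());
        int brRun = 0;
        for (int i = 0; i < children.size(); i++) {
            Node n = children.get(i);
            if (isBr(n)) {
                brRun++;
                if (brRun == 3) {
                    for (int j = 0; j <= i; j++) {
                        children.get(j).remove();
                    }
                    return true;
                }
            } else if (!isBlankText(n)) {
                // Any other element or non-blank text breaks the run.
                brRun = 0;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String html = "<div class=\"test\"><br>From: <b>Divya</b><br>Date: Wed"
                + "<br>\n<br>\n<br>\n<div dir=\"ltr\">Hi,</div></div>";
        Document doc = Jsoup.parseBodyFragment(html);
        Element test = doc.select("div.test").first();
        if (test == null || !stripHeader(test)) {
            throw new RuntimeException("Unable to locate mail body.");
        }
        System.out.println(doc.body().html());
    }
}
```

Compared with re-parsing the matched `div` into a new fragment, this keeps the original `div.test` element and its attributes, and it works on a single container at a time.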