
Parse html with jsoup and remove the tag block

Tags:

java

jsoup

I want to remove an element together with everything between its opening and closing tags. For example:

Input:

<body>
  start
  <div>
    delete from below
    <div class="XYZ">
      first div having this class
      <div>
        waste
      </div>
      <div class="XYZ">
        second div having this class
      </div>
      waste
    </div>
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

The output should be:

<body>
  start
  <div>
    delete from below
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

Basically, I have to remove the entire block belonging to the first occurrence of <div class="XYZ">, including anything nested inside it.

Thanks,

user2200660 asked Apr 03 '13 19:04


3 Answers

It is better to iterate over all the elements found, so you can be sure that

  • a.) all matching elements are removed, and
  • b.) nothing happens if there is no matching element.

Example:

Document doc = ...

for( Element element : doc.select("div.XYZ") )
{
    element.remove();
}
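
To see the loop above applied to markup like the one in the question, here is a minimal, self-contained sketch. It assumes jsoup is on the classpath; the class name RemoveAllXyz, the method removeAll and the shortened sample HTML are made up for illustration. Note that it removes every div.XYZ, not just the first one.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of the original answer.
class RemoveAllXyz {

    // Removes every <div class="XYZ"> together with everything nested
    // inside it, and returns the HTML of the remaining body.
    static String removeAll(String html) {
        Document doc = Jsoup.parse(html);
        // select() returns its own list, so removing nodes from the
        // document while iterating over that list is fine.
        for (Element element : doc.select("div.XYZ")) {
            element.remove();
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Shortened version of the input from the question (assumed sample).
        String html = "<body>start<div>delete from below"
                + "<div class=\"XYZ\">first<div>waste</div>"
                + "<div class=\"XYZ\">second</div>waste</div>"
                + "delete till above</div>"
                + "<div>this will also remain</div>end</body>";
        System.out.println(removeAll(html));
    }
}
```

Because the second div.XYZ sits inside the first one, it disappears together with its parent, so for this particular input the result is the same as removing only the first occurrence.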

Edit:

(An addition to my comment)

Don't use exception handling where a simple null / range check is enough. Instead of relying on exception handling around this call:

doc.select("div.XYZ").first().remove();

check that the selection is not empty first:

Elements divs = doc.select("div.XYZ");

if( !divs.isEmpty() )
{
    /*
     * Here it's safe to call 'first()' since there is at least one element.
     */
    divs.first().remove();
}
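
If only the first occurrence should go, as the question asks, the check above can be wrapped up roughly like this. It is a sketch under the assumption that jsoup is on the classpath; RemoveFirstXyz and removeFirst are hypothetical names chosen for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical demo class, not part of the original answer.
class RemoveFirstXyz {

    // Removes only the first <div class="XYZ"> in document order,
    // and does nothing if the selector matches no element.
    static String removeFirst(String html) {
        Document doc = Jsoup.parse(html);
        Elements divs = doc.select("div.XYZ");
        if (!divs.isEmpty()) {
            // first() would return null on an empty selection,
            // which is why the isEmpty() check comes first.
            divs.first().remove();
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Two sibling matches: only the first one is removed.
        System.out.println(removeFirst(
                "<div class=\"XYZ\">one</div><p>middle</p><div class=\"XYZ\">two</div>"));
        // No match: the document is left alone and no exception is thrown.
        System.out.println(removeFirst("<p>none</p>"));
    }
}
```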
ollo answered Sep 20 '22 21:09


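Try this code:

import java.io.BufferedReader;
import java.io.FileReader;

StringBuilder builder = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader br = new BufferedReader(new FileReader("e://XMLFile.xml"))) {
    String data;
    while ((data = br.readLine()) != null) {
        builder.append(data);
    }
}
System.out.println(builder);
// The non-greedy ".+?" stops at the first "</div>" after the opening tag
String replaceAll = builder.toString().replaceAll("<div class=\"XYZ\".+?</div>", "");
System.out.println(replaceAll);

I read the input XML from a file line by line into a StringBuilder, and then replaced the entire tag with an empty string. Note that the non-greedy pattern ends at the first </div> after the opening tag, so this only removes the whole block when the div contains no nested div.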
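
To check how this pattern behaves on nested markup like the one in the question, here is a small sketch using only the standard library. The class name RegexPitfall, the method strip and the sample string are invented for the demonstration.

```java
// Hypothetical demo class, not part of the original answer.
class RegexPitfall {

    // Applies the same pattern as above to an in-memory string.
    static String strip(String html) {
        return html.replaceAll("<div class=\"XYZ\".+?</div>", "");
    }

    public static void main(String[] args) {
        // The XYZ div contains a nested <div>, as in the question.
        String html = "<div>a<div class=\"XYZ\">first<div>waste</div>tail</div>b</div>";
        // The match ends at the nested div's closing tag, so "tail" and
        // an unbalanced </div> are left behind.
        System.out.println(strip(html)); // prints <div>atail</div>b</div>
    }
}
```

For nested blocks an HTML parser such as jsoup is the safer choice, because it removes the element by its position in the tree rather than by matching text.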

Ankur Shanbhag answered Sep 19 '22 21:09


This may help you. It selects a set of container tags and then removes certain tags found inside them:

 /* Select some specific tags */
 String selectTags = "div,li,p,ul,ol,span,table,tr,td,address,em";
 Elements webContentElements = parsedDoc.select(selectTags);
 /* Remove some tags from within the selected elements */
 String removeTags = "img,a,form";
 webContentElements.select(removeTags).remove();
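
As a rough, self-contained illustration of what this select-then-remove pattern does, consider the sketch below. It assumes jsoup is on the classpath; StripNestedTags, strip, the shortened selector lists and the sample markup are all hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical demo class, not part of the original answer.
class StripNestedTags {

    // Removes <img>, <a> and <form> elements that sit inside a
    // <div>, <p> or <span>; matches outside those containers stay.
    static String strip(String html) {
        Document parsedDoc = Jsoup.parse(html);
        Elements webContentElements = parsedDoc.select("div,p,span");
        // Elements.select() searches within the already selected elements,
        // and Elements.remove() detaches every match from the document.
        webContentElements.select("img,a,form").remove();
        return parsedDoc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(strip(
                "<div>text <a href=\"#\">link</a><img src=\"x.png\"></div>"
                + "<a href=\"#\">outside</a>"));
    }
}
```

For the question itself, the same call shape with "div.XYZ" as the remove selector would drop every matching div inside the selected containers, not just the first one.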
Stephen answered Sep 21 '22 21:09