Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Jsoup: How to get all html between 2 header tags

Tags:

java

jsoup

I am trying to get all html between 2 h1 tags. Actual task is to break the html into frames(chapters) based of the h1(heading 1) tags.

Appreciate any help.

Thanks Sunil

like image 744
Sunil Ramu Avatar asked Oct 29 '25 07:10

Sunil Ramu


2 Answers

If you want to get and process all elements between two consecutive h1 tags you can work on siblings. Here's some example code:

public static void h1s() {
  String html = "<html>" +
  "<head></head>" +
  "<body>" +
  "  <h1>title 1</h1>" +
  "  <p>hello 1</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>1</td>" +
  "    </tr>" +
  "  </table>" +
  "  <h1>title 2</h1>" +
  "  <p>hello 2</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>2</td>" +
  "    </tr>" +
  "  </table>" +
  "  <h1>title 3</h1>" +
  "  <p>hello 3</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>3</td>" +
  "    </tr>" +
  "  </table>" +    
  "</body>" +
  "</html>";
  Document doc = Jsoup.parse(html);
  Element firstH1 = doc.select("h1").first();
  Elements siblings = firstH1.siblingElements();
  List<Element> elementsBetween = new ArrayList<Element>();
  for (int i = 1; i < siblings.size(); i++) {
    Element sibling = siblings.get(i);
    if (! "h1".equals(sibling.tagName()))
      elementsBetween.add(sibling);
    else {
      processElementsBetween(elementsBetween);
      elementsBetween.clear();
    }
  }
  if (! elementsBetween.isEmpty())
    processElementsBetween(elementsBetween);
}

private static void processElementsBetween(
    List<Element> elementsBetween) {
  System.out.println("---");
  for (Element element : elementsBetween) {
    System.out.println(element);
  }
}
like image 64
MarcoS Avatar answered Oct 31 '25 12:10

MarcoS


I don't know Jsoup that good, but a straight forward approach could look like this:

public class Test {

    public static void main(String[] args){

        Document document = Jsoup.parse("<html><body>" +
            "<h1>First</h1><p>text text text</p>" +
            "<h1>Second</h1>more text" +
            "</body></html>");

        List<List<Node>> articles = new ArrayList<List<Node>>();
        List<Node> currentArticle = null;

        for(Node node : document.getElementsByTag("body").get(0).childNodes()){
            if(node.outerHtml().startsWith("<h1>")){
                currentArticle = new ArrayList<Node>();
                articles.add(currentArticle);
            }

            currentArticle.add(node);
        }

        for(List<Node> article : articles){
            for(Node node : article){
                System.out.println(node);
            }
            System.out.println("------- new page ---------");
        }

    }

}

Do you know the structure of the articles and is it always the same? What do you want to do with the articles? Have you considered splitting them on the client side? This would be an easy jQuery Job.

like image 34
Tim Büthe Avatar answered Oct 31 '25 13:10

Tim Büthe



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!