Find nested matching HTML tags in Java

I'm working with a valid HTML String in Java (parsed with jsoup, so every tag has a closing tag and the markup is well formed), and I need to find the content of a given tag name. For example, working with the following String:

<p> hi! </p>
<p> hi again! </p>
<h1> foo </h1>
<p> bye! </p>

The results I expect, given the tag 'p', are:

1) <p> hi! </p>
2) <p> hi again! </p>
3) <p> bye! </p>

I've accomplished this by simply using the Apache Commons Lang library, with the method StringUtils.substringsBetween(String html, String openTag, String closeTag), which returns an array of Strings with the desired results. However, when I search for a tag that has the exact same tag nested within it (a common example is div), I get the wrong results (I understand why).

For example, working with:

<div>
 <p> hey there </p>
 <div>  
  <div>
   <p> asd </p>
  </div>
 </div>
</div>

I would expect 3 results:

1)

<div>
 <p> hey there </p>
 <div>  
  <div>
   <p> asd </p>
  </div>
 </div>
</div>

2)

<div>  
 <div>
  <p> asd </p>
 </div>
</div>

3)

<div>
 <p> asd </p>
</div>

However, I get only one. I know it's because of the order in which the occurrences of the tag appear in the String: the first opening tag gets paired with the first closing tag. I just don't know how to solve it. I have been struggling with this for 2 weeks now. I have tried regex with no success at all, and I've also tried splitting the HTML String into an array of lines, but that failed too.
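To make the failure concrete, here is a minimal sketch that imitates the pairing described above: each opening tag is matched with the nearest closing tag that follows it. It is a hand-written approximation in plain Java, not the Commons Lang source, and the class name NaiveBetween is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveBetween {

    // Collects the text between each `open` and the nearest following `close`.
    // Nesting is ignored, which is exactly what goes wrong with nested tags.
    public static List<String> substringsBetween(String str, String open, String close) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = str.indexOf(open, pos);
            if (start < 0) {
                break;                      // no further opening tag
            }
            start += open.length();
            int end = str.indexOf(close, start);
            if (end < 0) {
                break;                      // opening tag without a closing tag
            }
            out.add(str.substring(start, end));
            pos = end + close.length();     // resume after the consumed closing tag
        }
        return out;
    }
}
```

On the nested <div> example, the outermost <div> is paired with the innermost </div>. After that, only closing tags remain, so the loop stops with a single, truncated result.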

How would you approach this problem? I already know that there are tons of libraries that do this for you, with methods such as jsoup's getElementsByTag(tagName), but I want to do it myself. Any hints are appreciated!

asked Jun 03 '13 16:06 by antonicelli


2 Answers

You will need to make heavy use of tokenization and recursion to solve this issue. Essentially, every time a new tag opens (say, <div>), you run your processing again on the content that follows it.

Consider something like the following:

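import java.util.ArrayList;
import java.util.Scanner;

public class DivCollector {

    // Every complete <div>...</div> found so far, innermost first
    ArrayList<String> elements = new ArrayList<String>();
    Scanner scanner;

    public DivCollector(String html) {
        scanner = new Scanner(html);
    }

    // Call with "" to start; tokens are joined without their original whitespace
    public String populateDivContents(String buildingString) {

        while (scanner.hasNext()) {

            // Get the next whitespace-delimited token
            String next = scanner.next();

            // If it's a <div>, call recursively to collect the nested element
            if (next.equalsIgnoreCase("<div>")) {
                buildingString = buildingString + populateDivContents(next);
            }

            // If we've hit a closing tag, add our built String to the elements
            else if (next.equalsIgnoreCase("</div>")) {
                buildingString = buildingString + next;
                elements.add(buildingString);
                return buildingString;
            }

            // Otherwise, simply add the text to our String and keep going
            else {
                buildingString = buildingString + next;
            }
        }

        // Input exhausted: return whatever has been accumulated
        return buildingString;
    }
}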

This is a very rough sketch and has some issues, especially if your tags are not separated from their content by new lines or spaces (as they are in your examples), because the Scanner splits on whitespace. It also assumes the HTML is well-formed, as you say. But it's enough to get the idea across. The elements list will contain every <div> element together with its contents.
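One way around the whitespace limitation is to split the input on tag boundaries instead. The following is a minimal sketch of such a tokenizer, not part of the code above. The class name HtmlTokenizer is hypothetical, and it assumes well-formed HTML with no '<' or '>' characters inside text or attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlTokenizer {

    // A token is either a whole tag ("<...>") or a run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^<]+");

    // Splits html into tag tokens and text tokens, keeping all whitespace.
    public static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For example, tokenize("<div><p>hey there</p></div>") yields the five tokens <div>, <p>, hey there, </p> and </div>. Because whitespace is kept inside the text tokens, concatenating the tokens reproduces the input.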

answered Oct 15 '22 19:10 by asteri


A standard approach for this is to use a stack. That is, when you encounter an opening tag, you push it onto a stack, and whenever you encounter a closing tag, you pop the topmost item. If the String is indeed well-formed, every closing tag should pop its matching opening tag. From there, it should be a piece of cake to figure out how to get to each pair's content.
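A minimal sketch of the stack idea might look like the following. The class and method names (TagMatcher, findElements) are hypothetical. It assumes well-formed HTML, no '>' inside attribute values, and that the tag is not a void element such as br. Only tags with the requested name are pushed, and what goes on the stack is the offset of each opening tag, so the full element can be cut out of the input when its closing tag is seen.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagMatcher {

    // Returns every <tagName>...</tagName> element, nested ones included,
    // ordered by where its opening tag appears in the input.
    public static List<String> findElements(String html, String tagName) {
        // Matches "<div>", "<div class=...>" and "</div>", but not "<divider>".
        Pattern tag = Pattern.compile(
                "<(/?)" + Pattern.quote(tagName) + "(?=[\\s>])[^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = tag.matcher(html);

        Deque<Integer> openStarts = new ArrayDeque<>();   // offsets of unclosed opening tags
        TreeMap<Integer, String> found = new TreeMap<>(); // start offset -> element text

        while (m.find()) {
            if (m.group(1).isEmpty()) {
                // Opening tag: remember where it starts.
                openStarts.push(m.start());
            } else if (!openStarts.isEmpty()) {
                // Closing tag: it belongs to the most recent unclosed opening tag.
                int start = openStarts.pop();
                found.put(start, html.substring(start, m.end()));
            }
        }
        return new ArrayList<>(found.values());
    }
}
```

Calling findElements(html, "div") on the nested example from the question should return three Strings, outermost first. The TreeMap is only there to produce that ordering, since the pops themselves happen innermost first.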

answered Oct 15 '22 19:10 by Janis F