
How to extract content from a <div> tag in Java

Tags: java, html, extract

I have a serious problem. I would like to extract the content from a tag such as:

<div class="main-content">
    <div class="sub-content">Sub content here</div>
      Main content here </div>

The output I would expect is:

Sub content here
Main content here

I've tried using regex, but the result isn't so impressive. Using:

Pattern.compile("<div>(\\S+)</div>");

returns all the strings before the first </div> tag.
So, could anyone help me, please?

kyo21 asked May 17 '11 05:05


1 Answer

I'd recommend avoiding regex for parsing HTML. You can easily do what you ask by using Jsoup:

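import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DivText {
    public static void main(String[] args) {
        String html = "<html><head/><body><div class=\"main-content\">" +
                "<div class=\"sub-content\">Sub content here</div>" +
                "Main content here </div></body></html>";
        Document document = Jsoup.parse(html);
        // Select every div; results come back in document order (outer div first).
        Elements divs = document.select("div");
        for (Element div : divs) {
            // ownText() returns only the text directly inside this element, not its children's.
            System.out.println(div.ownText());
        }
    }
}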
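
To illustrate why regex is a poor fit here, the following is a minimal sketch using only java.util.regex. The class name RegexDivDemo, the helper captures, and the relaxed second pattern are assumptions made for illustration; only the first pattern comes from the question. It shows that the original pattern finds nothing (a literal "<div>" never occurs because both tags carry a class attribute), and that a pattern relaxed to allow attributes pairs the outer opening tag with the inner closing tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDivDemo {
    // The markup from the question.
    static final String HTML =
            "<div class=\"main-content\">\n"
            + "    <div class=\"sub-content\">Sub content here</div>\n"
            + "      Main content here </div>";

    // Returns capture group 1 of every match of 'regex' in 'input'.
    // DOTALL lets '.' span line breaks, since the markup is multi-line.
    static List<String> captures(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Pattern from the question: requires a bare "<div>", which is absent here.
        System.out.println(captures("<div>(\\S+)</div>", HTML));

        // Hypothetical relaxed pattern (not from the question): allows attributes,
        // but the lazy group stops at the first "</div>", which belongs to the inner div.
        System.out.println(captures("<div[^>]*>(.*?)</div>", HTML));
    }
}
```

With the relaxed pattern the single capture starts inside the outer div and ends at the inner closing tag, so "Main content here" is never captured. Regular expressions have no notion of nesting, which is what an HTML parser such as Jsoup handles for you.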

In response to comment: if you want to put the content of the div elements into an array of Strings you can simply do:

    String[] divsTexts = new String[divs.size()];
    for (int i = 0; i < divs.size(); i++) {
        divsTexts[i] = divs.get(i).ownText();
    }

In response to comment: if you have nested elements and you want to get the own text of each element, then you can use Jsoup's jQuery-like multiple selector syntax (a comma-separated list of selectors). Here's an example:

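import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NestedOwnText {
    public static void main(String[] args) {
        String html = "<html><head/><body><div class=\"main-content\">" +
                "<div class=\"sub-content\">" +
                "<p>a paragraph <b>with some bold text</b></p>" +
                "Sub content here</div>" +
                "Main content here </div></body></html>";
        Document document = Jsoup.parse(html);
        // "div, p, b" matches any element that is a div, a p, or a b.
        Elements elements = document.select("div, p, b");
        for (Element element : elements) {
            // Print only the text owned directly by each matched element.
            System.out.println(element.ownText());
        }
    }
}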

The code above will parse the following HTML:

<html>
<head />
<body>
<div class="main-content">
<div class="sub-content">
<p>a paragraph <b>with some bold text</b></p>
Sub content here</div>
Main content here</div>
</body>
</html>

and print the following output:

Main content here
Sub content here
a paragraph
with some bold text
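
Note that the output above lists the outer div's text first, whereas the question expected "Sub content here" before "Main content here". If that order matters, one option is to walk the text nodes under the outer div depth-first. The following is a rough sketch under that assumption; the class name DocumentOrderText and the helpers collect and textsInOrder are hypothetical, and it needs the Jsoup library on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

class DocumentOrderText {
    // Appends every non-blank text node below 'node' to 'out', depth-first,
    // so the texts appear in the order they occur in the markup.
    static void collect(Node node, List<String> out) {
        for (Node child : node.childNodes()) {
            if (child instanceof TextNode) {
                TextNode textNode = (TextNode) child;
                if (!textNode.isBlank()) {
                    out.add(textNode.text().trim());
                }
            } else {
                collect(child, out);
            }
        }
    }

    // Parses 'html' and collects the texts under the first element matching 'cssQuery'.
    static List<String> textsInOrder(String html, String cssQuery) {
        Document document = Jsoup.parse(html);
        Element root = document.select(cssQuery).first();
        List<String> out = new ArrayList<>();
        if (root != null) {
            collect(root, out);
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div class=\"main-content\">"
                + "<div class=\"sub-content\">Sub content here</div>"
                + "Main content here </div>";
        for (String text : textsInOrder(html, "div.main-content")) {
            System.out.println(text);
        }
    }
}
```

For the markup in the question this yields the sub-content text followed by the main-content text. Selecting by class (div.main-content) rather than by tag alone also keeps unrelated divs elsewhere on the page out of the result.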
MarcoS answered Sep 20 '22 17:09
